Deeplearning4J - Text Processing

Overview

The Textprocessing Extension for the KNIME Deeplearning4J Integration adds the Word Vector functionality of Deeplearning4J to KNIME. This means that you can create so-called Neural Word Embeddings, which can be very useful in many applications.

Installation

Installation details can be found here.

Word Embeddings

Generally, a Neural Word Embedding is a numerical representation (a vector of real numbers) of a word or even a whole document. Such embeddings can be created using methods called Word2Vec or Doc2Vec, which train a neural network that learns the structure of the text. One nice property of this network is that the size of the output layer is user-specified and directly corresponds to the size of the resulting vector. Hence, we can choose the dimensionality of the output and thereby describe text effectively in a low-dimensional vector space.
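To illustrate the idea, the sketch below represents each word as a small fixed-size vector and compares words by cosine similarity, a common way to use such embeddings. The words and vector values here are made up for the example; a real model would use the user-specified output layer size (e.g. 100 or 300 dimensions).

```java
import java.util.HashMap;
import java.util.Map;

public class EmbeddingSketch {

    // Cosine similarity between two equal-length vectors:
    // 1.0 means identical direction, 0.0 means unrelated.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Hypothetical 3-dimensional embeddings for illustration only.
        Map<String, double[]> model = new HashMap<>();
        model.put("king",  new double[]{0.90, 0.80, 0.10});
        model.put("queen", new double[]{0.85, 0.75, 0.20});
        model.put("apple", new double[]{0.10, 0.20, 0.90});

        // Semantically related words end up with more similar vectors.
        System.out.printf("king~queen: %.3f%n",
                cosine(model.get("king"), model.get("queen")));
        System.out.printf("king~apple: %.3f%n",
                cosine(model.get("king"), model.get("apple")));
    }
}
```

In a trained Word2Vec model, this similarity emerges from the training data rather than being assigned by hand.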

Nodes

The following Nodes are contained in the Textprocessing Extension for the KNIME Deeplearning4J Integration:

Word Vector Learner

The Word Vector Learner Nodes are used to train a Word2Vec or Doc2Vec model. The input of the Node can either be a String column or a Document column from KNIME Textprocessing. The Doc2Vec Node additionally expects a String column containing a class attribute for each document. The output of such a node is a trained word vector model which can be used by other nodes of this extension. All learning-specific parameters, like the size of the output layer, can be adjusted in the Node Dialog.

[KNIME 3.2 - 3.3] Word Vector Learner

[KNIME 3.4 - 3.5] DL4J Word2Vec Learner, Doc2Vec Learner

Vocabulary Extractor

The Vocabulary Extractor Node takes in a word vector model and outputs the learned words with corresponding word vectors as a table.

Word Vector Apply

The Word Vector Apply Node is used to apply a learned word vector model to documents, i.e. to convert a document into a word vector representation using a previously trained model. By default, the Node replaces each word contained in the input document by the corresponding word vector from the model. Words which are not contained in the model will be skipped. This results in a collection of word vectors for each input document. Additionally, the Node has an option to calculate the mean of all word vectors, which results in one single vector that represents the document.
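The mean-vector behaviour described above can be sketched as follows. The model contents and the document are hypothetical; the node performs the equivalent computation inside the workflow:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class WordVectorApplySketch {

    // Average the vectors of all known words in a document.
    // Words missing from the model are skipped, as the node does.
    static double[] meanVector(String[] words, Map<String, double[]> model, int dim) {
        double[] mean = new double[dim];
        int known = 0;
        for (String w : words) {
            double[] v = model.get(w);
            if (v == null) continue; // word not in model: skip it
            for (int i = 0; i < dim; i++) mean[i] += v[i];
            known++;
        }
        if (known > 0) {
            for (int i = 0; i < dim; i++) mean[i] /= known;
        }
        return mean;
    }

    public static void main(String[] args) {
        // Hypothetical 2-dimensional model for illustration.
        Map<String, double[]> model = new HashMap<>();
        model.put("good",  new double[]{1.0, 0.0});
        model.put("movie", new double[]{0.0, 1.0});

        // "unseen" is not in the model and is therefore skipped.
        double[] doc = meanVector(new String[]{"good", "movie", "unseen"}, model, 2);
        System.out.println(Arrays.toString(doc)); // [0.5, 0.5]
    }
}
```

Without the mean option, the node instead emits the full collection of per-word vectors for each document.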

Word Vector Model I/O

The Word Vector Model Writer and Reader Nodes allow a trained Word Vector Model to be saved to disk and shared for use in another workflow. The Writer Node saves models in a KNIME-wrapped DL4J format (see the node description for details).

The Reader is able to read the models saved by the Writer, as well as external pretrained models in specific formats (see the node description of the Reader for details). Some supported pretrained models can be found here:

Google News Vectors (from: https://code.google.com/archive/p/word2vec/) Note: Very large model, may take some time to read.

GloVe Models (from: https://nlp.stanford.edu/projects/glove/) Note: These models are in plain text format.
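In the GloVe plain text format, each line holds a word followed by its vector components, separated by spaces. A minimal parser for one such line might look like the sketch below; the example line and its values are illustrative, not taken from an actual model file:

```java
import java.util.Arrays;
import java.util.Map;

public class GloveLineSketch {

    // Parse one line of the GloVe plain text format:
    // "<word> <v1> <v2> ... <vd>", separated by whitespace.
    static Map.Entry<String, double[]> parseLine(String line) {
        String[] parts = line.trim().split("\\s+");
        double[] vec = new double[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            vec[i - 1] = Double.parseDouble(parts[i]);
        }
        return Map.entry(parts[0], vec);
    }

    public static void main(String[] args) {
        // Illustrative 3-dimensional entry; real GloVe files
        // typically use 50 to 300 dimensions.
        Map.Entry<String, double[]> e = parseLine("the 0.418 0.24968 -0.41242");
        System.out.println(e.getKey() + " -> " + Arrays.toString(e.getValue()));
    }
}
```

The Reader Node handles this parsing internally; the sketch only illustrates why plain text models are larger and slower to load than binary ones.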

Example Screenshots

Examples

Example Workflows can be found on the public Example Server.

Known Issues

Deeplearning4J Integration - Textprocessing and GPU

Due to the implementation of the text processing functionality in the Deeplearning4J library, it is recommended to execute all Deeplearning4J Integration - Textprocessing nodes on the CPU. Performance on GPU may be significantly lower.

Licence

The KNIME Deeplearning4J Integration is available under the same Licence as KNIME.