Sentiment analysis of free-text documents is a common task in the field of text mining. In sentiment analysis, predefined sentiment labels, such as "positive" or "negative", are assigned to texts. The texts (here called documents) can be product or movie reviews, articles, etc.
In this blog post we show an example of assigning predefined sentiment labels to documents, using the KNIME Text Processing extension in combination with traditional KNIME learner and predictor nodes.
A set of 2,000 documents has been sampled from the training set of the Large Movie Review Dataset v1.0, which contains 50,000 English movie reviews along with their associated sentiment labels "positive" and "negative". For details about the data set, see http://ai.stanford.edu/~amaas/data/sentiment/. We sampled 1,000 documents from the positive group and 1,000 documents from the negative group. The goal here is to assign the correct sentiment label to each document.
The workflow associated with this post is available for download in the attachment section of this post, or on the EXAMPLES server under
The KNIME Text Processing extension makes it possible to read, process, mine, and visualize textual data in KNIME. It provides functionality from natural language processing, text mining, and information retrieval. For more information, see http://tech.knime.org/knime-text-processing.
Reading Text Data
The workflow starts with a "File Reader" node, which reads a CSV file containing the review texts, their associated sentiment labels, the IMDb URLs of the corresponding movies, and their indices in the Large Movie Review Dataset v1.0. The columns relevant to this blog post are the text column and the sentiment column. In the first meta node, "Document Creation", document cells are created from the string cells using the "Strings to Document" node. The sentiment labels are stored in the category field of each document so that the category can be extracted later on. Finally, all columns except the column containing the document cells are filtered out.
The output of the first meta node "Document Creation" is a data table with only one column containing the document cells.
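To make this step concrete outside of KNIME, here is a small Python sketch of what the "Document Creation" meta node does: read the CSV file and keep only the text and its sentiment label. The in-memory CSV and its column names ("text", "sentiment") are assumptions for illustration, not the actual file layout.

```python
import csv
import io

# A tiny in-memory CSV standing in for the review file; the column
# names "text" and "sentiment" are assumed, not the real headers.
raw = io.StringIO(
    "text,sentiment\n"
    "A wonderful movie with great acting.,positive\n"
    "A complete waste of time.,negative\n"
)

documents = []
for row in csv.DictReader(raw):
    # Keep only the review text and its sentiment label, mirroring the
    # column filtering done in the "Document Creation" meta node.
    documents.append((row["text"], row["sentiment"]))

print(documents)
```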
The textual data is preprocessed by various nodes provided by the KNIME Text Processing extension. All preprocessing steps are applied in the second meta node "Preprocessing", as shown in the following figure.
First, punctuation marks are removed by the "Punctuation Erasure" node, numbers and stop words are filtered, and all terms are converted to lower case. Then the word stem is extracted for each word using the "Snowball Stemmer" node. Stemming is useful here because "selection", "selecting", and "select" all refer to the same lexical concept and carry the same information in a document classification or topic detection context. Besides English texts, the Snowball Stemmer node can be applied to texts in various other languages, e.g. German, French, Italian, Spanish, etc. The node uses the Snowball stemming library.
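The same preprocessing chain can be sketched in a few lines of Python. Note that the tiny stop word list and the naive suffix stripper below are deliberately simplistic stand-ins for KNIME's stop word filter and the Snowball algorithm:

```python
import re

# A minimal stop word list and a naive suffix stripper -- simplified
# stand-ins for KNIME's stop word filter and the Snowball stemmer.
STOP_WORDS = {"a", "the", "is", "of", "and", "in", "it", "this"}

def naive_stem(word):
    # Strip a few common suffixes; a real Snowball stemmer is far
    # more sophisticated than this.
    for suffix in ("ing", "ion", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Remove punctuation and numbers, lower-case, filter stop words,
    # then stem -- the same order of steps as in the meta node.
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Selecting this movie was a bad selection!"))
```

Both "Selecting" and "selection" collapse onto the same stem, which is exactly the point of stemming for classification.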
Feature Extraction and Vector Creation
After all this pre-processing, we reach a central point of this analysis: extracting the terms that we want to use as features in the document vectors, and thus take into account for the subsequent classification of the documents. Before document vectors can be created with the "Document vector" node, a bag-of-words (BoW) data table has to be created using the "BoW Creator" node. The "Document vector" node requires a BoW as input data table and takes into account all terms contained in this BoW to create the document vectors.
After the BoW has been created, we filter out all terms that occur in fewer than 20 documents. We do this by grouping by the terms, counting the unique documents containing each term, filtering this list of terms, and finally filtering the BoW with the "Reference Row Filter" node. Thereby we reduce the feature space from 22,105 distinct words to 1,500. The feature extraction is part of the "Preprocessing" meta node and can be seen in the next figure.
We set the minimum number of documents to 20 since we assume that a term has to occur in at least 1% of all documents (20 of 2,000) in order to represent a useful feature for classification later on. This is a rule of thumb, and the threshold can of course be adjusted.
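The document-frequency filter can be sketched as follows. The toy corpus and the threshold of 2 (standing in for 20 of 2,000) are illustrative assumptions:

```python
from collections import Counter

# Toy corpus of tokenized documents; the 1%-of-documents rule from the
# post becomes "at least 2 of these 4 documents" here.
docs = [
    ["bad", "film", "wast"],
    ["good", "film", "great"],
    ["bad", "act"],
    ["film", "great", "plot"],
]
MIN_DF = 2  # minimum number of documents a term must appear in

# Count each term at most once per document (document frequency,
# not raw term frequency).
df = Counter(term for doc in docs for term in set(doc))
vocabulary = sorted(term for term, n in df.items() if n >= MIN_DF)
print(vocabulary)
```

Rare terms such as "plot" or "act" are dropped, shrinking the feature space just as the "Reference Row Filter" step does in the workflow.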
Based on these extracted words (features), document vectors are then created. The document vectors are numerical representations of the documents and are subsequently used for classification by a decision tree classifier. The "Document vector" node allows for the creation of bit vectors or numerical vectors. As numerical values, previously calculated scores or frequencies can be used, e.g. those computed by the "TF" or "IDF" nodes. In our case, bit vectors are created.
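A bit vector simply records, for each term of the vocabulary, whether it occurs in the document. A minimal sketch of what the "Document vector" node produces in bit-vector mode (vocabulary and documents are made up for illustration):

```python
# Build one bit vector per document over a fixed vocabulary,
# mirroring the "Document vector" node in bit-vector mode.
vocabulary = ["bad", "film", "great", "wast"]

def to_bit_vector(tokens, vocab):
    present = set(tokens)
    # 1 if the vocabulary term occurs in the document, else 0.
    return [1 if term in present else 0 for term in vocab]

print(to_bit_vector(["bad", "film", "wast"], vocabulary))
print(to_bit_vector(["good", "film", "great"], vocabulary))
```

Note that "good" is ignored in the second document because it is not part of the (filtered) vocabulary; only extracted features contribute to the vector.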
For the classification we can use any of the traditional mining algorithms available in KNIME (e.g. decision trees, tree ensembles, support vector machines, and many more). As in all supervised mining algorithms, we need a target variable. In our example the target is the sentiment label, which is stored as the category in the documents. The target or class column is extracted from the documents and appended as a string column using the "Category to class" node. This category can then be used as the target class for the classification procedure. Based on the category, a color is assigned to each document by the "Color Manager" node. Documents labeled "positive" are colored green; documents labeled "negative" are colored red.
As the classification algorithm we used a decision tree, trained on 70% of the data and tested on the remaining 30%. The accuracy of the decision tree is 93.667%; the corresponding ROC curve is shown in the following figure.
The next figure shows a view of the first two levels of the decision tree. The most discriminative terms w.r.t. the separation of the two classes are "bad", "wast", and "film". If the term "bad" occurs in a document, the document is likely to have a negative sentiment. If "bad" does not occur but "wast" (the stem of "waste") does, the document is again likely to be negative, and so on.
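To see how a single split like this works, here is a one-level "decision stump" over bit vectors. This is a drastic simplification of KNIME's Decision Tree Learner, using made-up training data, but it shows why "bad" ends up at the root: it is the term whose presence best separates the two classes.

```python
# A one-level "decision stump" over bit-vector features -- a toy
# stand-in for a full decision tree learner.
vocabulary = ["bad", "film", "great"]
X = [  # bit vectors over the vocabulary above
    [1, 1, 0],  # "bad film"
    [0, 1, 1],  # "great film"
    [1, 0, 0],  # "bad"
    [0, 0, 1],  # "great"
]
y = ["negative", "positive", "negative", "positive"]

def best_stump(X, y, vocab):
    best = None
    for j, term in enumerate(vocab):
        # Rule: predict "negative" when the term is present, else
        # "positive"; score the rule on the training data.
        preds = ["negative" if row[j] else "positive" for row in X]
        acc = sum(p == t for p, t in zip(preds, y)) / len(y)
        if best is None or acc > best[1]:
            best = (term, acc)
    return best

term, acc = best_stump(X, y, vocabulary)
print(term, acc)
```

A real decision tree learner repeats this kind of split selection recursively (and uses impurity measures rather than raw accuracy), which is how the deeper levels with "wast" and "film" arise.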
Of course other learners, such as the "Tree Ensemble Learner", "Naive Bayes Learner", or "SVM Learner", could be applied as well, and the preprocessing chain could be optimized for other classification algorithms. Instead of bit vectors, numerical vectors could also be created by the "Document vector" node. Furthermore, n-gram features could be used in addition to single words in order to take negations, such as "not good" or "not bad", into account.
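The n-gram idea can be sketched as follows: alongside the single words, each pair of adjacent words becomes a feature of its own, so that "not good" is distinguishable from "good". This is an illustrative sketch, not KNIME's actual n-gram node:

```python
# Add bigrams to the unigram features so that negations such as
# "not good" become features in their own right.
def ngram_features(tokens, n=2):
    unigrams = list(tokens)
    bigrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return unigrams + bigrams

print(ngram_features(["not", "good", "at", "all"]))
```

With such features, a classifier can learn that the bigram "not good" signals a negative review even though the unigram "good" alone would suggest the opposite.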