Text Classifier

Overview

The Palladian Text Classifier node collection provides a dictionary-based classifier for text documents. Using a set of labeled sample documents, one can build a dictionary and use it to classify uncategorized documents. Typical use cases for text classification include automated email spam detection, language identification, and sentiment analysis. The Palladian classifier won the first Research Garden competition, where the goal was to classify product descriptions into eight different categories. See the press release (via archive.org).

Nodes

Following the KNIME conventions, the Palladian Text Classifier consists of a learner and a predictor node, complemented by utility nodes for pruning, reading, writing, and inspecting models:

  • TextClassifierLearner: Creates a model using a pre-categorized list of text documents. The model is a weighted term lookup table which indicates how probable each feature (see below) is for a given category. This model is used by the corresponding predictor node.
  • TextClassifierPredictor: Using a previously built model, this node predicts the categories for uncategorized text documents by looking up the relevance scores in the model.
  • TextClassifierModelPruner: Applies different pruning strategies to a model to reduce its size.
  • TextClassifierModelReader: This node allows the deserialization of an existing Text Classifier model.
  • TextClassifierModelWriter: This node allows serializing a trained Text Classifier model, so that it can be reused later, either in different KNIME workflows, or programmatically within Palladian.
  • TextClassifierModelToTable: Transforms the content of a model into a KNIME table.
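
To illustrate the idea behind the learner and predictor nodes, here is a minimal sketch of a dictionary-based classifier in plain Java. It is not Palladian's implementation or API: the class name DictionaryClassifierSketch, the whitespace tokenization, and the scoring rule (summing, per category, the share of a term's occurrences that fall into that category) are all assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; names and scoring are assumptions, not Palladian's API.
class DictionaryClassifierSketch {

    // The "dictionary": term -> (category -> number of occurrences in training data).
    private final Map<String, Map<String, Integer>> dictionary = new HashMap<>();

    // Assumption: terms are lower-cased, whitespace-separated words.
    static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }

    // Learner step: count each term of a labeled document for its category.
    void learn(String text, String category) {
        for (String term : tokenize(text)) {
            dictionary.computeIfAbsent(term, t -> new HashMap<>())
                      .merge(category, 1, Integer::sum);
        }
    }

    // Predictor step: look up every known term and add up its relative
    // frequency per category; unknown terms are ignored.
    Map<String, Double> score(String text) {
        Map<String, Double> scores = new HashMap<>();
        for (String term : tokenize(text)) {
            Map<String, Integer> counts = dictionary.get(term);
            if (counts == null) {
                continue;
            }
            int total = 0;
            for (int count : counts.values()) {
                total += count;
            }
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                scores.merge(entry.getKey(), entry.getValue() / (double) total, Double::sum);
            }
        }
        return scores;
    }

    // Returns the highest-scoring category, or empty if no term is known.
    Optional<String> predict(String text) {
        return score(text).entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        DictionaryClassifierSketch classifier = new DictionaryClassifierSketch();
        classifier.learn("cheap pills buy now", "spam");
        classifier.learn("buy cheap watches now", "spam");
        classifier.learn("meeting agenda for monday", "ham");
        classifier.learn("lunch on monday", "ham");
        System.out.println(classifier.predict("buy cheap pills"));
        System.out.println(classifier.predict("agenda for lunch"));
    }
}
```

In this sketch, a term that occurs in only one category contributes its full weight to that category, while a term spread over several categories contributes proportionally less to each. Pruning, as offered by the TextClassifierModelPruner, would correspond to removing entries from the lookup table.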

Feature settings

Features are the input for a classifier. In text classification, the input is a long string from which features need to be derived during preprocessing. Palladian's text classifier works with n-grams. An n-gram is a sequence of n consecutive tokens (characters or words), created by sliding a "window" of length n over the given text. The TextClassifierLearner node can create features using character- or word-based n-grams. As an example, consider the text "the quick brown fox":

  • The set of word-based 2-grams would contain the following entries: {"the quick", "quick brown", "brown fox"}.
  • The set of character-5-grams consists of the following entries: {"the q", "he qu", "e qui", " quic", "quick", …}.
  • It is possible to combine n-grams of different lengths. For example, the set of character-4-6-grams contains the union of the sets of 4-, 5-, and 6-grams.
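
The sliding-window extraction described above can be sketched in a few lines of Java. This is a hypothetical illustration, not the code used by the Palladian nodes: the class NGramSketch and its method names are invented here, and word boundaries are assumed to be plain whitespace.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of Palladian.
class NGramSketch {

    // Character n-grams: slide a window of n characters over the text.
    static Set<String> charNGrams(String text, int n) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    // Combined character n-grams: union of the sets for minN..maxN.
    static Set<String> charNGrams(String text, int minN, int maxN) {
        Set<String> grams = new LinkedHashSet<>();
        for (int n = minN; n <= maxN; n++) {
            grams.addAll(charNGrams(text, n));
        }
        return grams;
    }

    // Word n-grams: split on whitespace (an assumption), then slide a
    // window of n words over the resulting token array.
    static Set<String> wordNGrams(String text, int n) {
        Set<String> grams = new LinkedHashSet<>();
        if (text.trim().isEmpty()) {
            return grams;
        }
        String[] words = text.trim().split("\\s+");
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return grams;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox";
        System.out.println(wordNGrams(text, 2));    // [the quick, quick brown, brown fox]
        System.out.println(charNGrams(text, 5));    // starts with "the q", "he qu", "e qui"
        System.out.println(charNGrams(text, 4, 6).size()); // 45 distinct 4-, 5-, and 6-grams
    }
}
```

A LinkedHashSet is used so that duplicates are dropped (the features form a set) while the n-grams still print in the order in which the window produced them.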

Example workflows

Palladian polyglot: Building a language detector using a massive parallel corpus

The following example workflow creates a language detector using the JRC-Acquis corpus. The corpus consists of over 4 million documents in 22 of the official languages of the European Union. The workflow uses some custom code in a Java Snippet node to parse the input data, and then builds a model from half of the corpus documents with the TextClassifierLearner node. The resulting model is reduced using the TextClassifierModelPruner. Using the remaining half of the documents, we evaluate the classification accuracy.

While the results show very good accuracy on the JRC documents (99.98%), we also want to find out how the classifier performs on documents from a different domain. Therefore, we create a separate test set using randomly selected articles from different language editions of Wikipedia. We use Palladian's HttpRetriever, HtmlParser, and ContentExtractor nodes to build the test set. The Scorer node shows an accuracy of 98.27% for the classification of the Wikipedia pages. As a third evaluation step, we want to determine how the language detection performs on very short strings. We use the sentence 'My hovercraft is full of eels' translated into different languages. Classifying very short sequences of text is naturally much harder, yet we can correctly classify 90% of the samples.

In case you want to experiment with the dataset on your own, you can download the corpus files here. Simply pick the languages which you want to classify (.tgz files), extract the archives, and put all resulting directories into an enclosing directory, which you then select in the initial List Files nodes. The workflow is available for download as well.

Besides that, we provide a ready-to-use language classification model, which was created using the depicted workflow. Simply use a TextClassifierModelReader to read the model (you can directly drag a palladianDictionaryModel file to your KNIME workspace to insert a readily configured model reader), and connect it to a TextClassifierPredictor to perform language detection on your documents.