Document Classification Example

Requirements:

KNIME Textprocessing version 2.9 or later is required to load and execute this workflow.

Description:

The workflow starts with a list of documents that were downloaded from PubMed, parsed, and saved as a data table beforehand. The data is available as a drop file in the corresponding drop directories.

The documents are assigned to two categories and split into two sets according to their category. The first set consists of documents about humans and AIDS; the second set consists of documents about mice and cancer.

The textual data is preprocessed by various filters and a stemmer node. Then the most important keywords are extracted and, based on these keywords, the documents are transformed into document vectors.
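The preprocessing and vectorization steps can be approximated outside KNIME with a few lines of Python. This is only a minimal sketch and not what the KNIME nodes do internally: the stop-word list, the crude suffix stripper standing in for a stemmer, the document-frequency ranking of keywords, and the sample sentences are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real filter would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "to", "with"}

def stem(token):
    """Crude suffix stripper, standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, tokenize, drop stop words and very short tokens, then stem."""
    tokens = re.findall(r"[a-z0-9-]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS and len(t) > 2]

def top_keywords(docs_tokens, k):
    """Return the k terms found in the most documents (ties broken alphabetically)."""
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

def to_vector(tokens, keywords):
    """Binary document vector: 1 if the keyword occurs in the document, else 0."""
    present = set(tokens)
    return [1 if kw in present else 0 for kw in keywords]

# Made-up sample documents, not taken from the PubMed data set.
docs = [
    "Tumors in mice are growing",
    "The tumor of a mouse",
    "HIV infections in human patients",
]
tokenized = [preprocess(d) for d in docs]
keywords = top_keywords(tokenized, 3)
vectors = [to_vector(t, keywords) for t in tokenized]
```

Each row of `vectors` has one entry per keyword, so all documents end up with vectors of the same length regardless of how long the original text was.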

The document vectors are a numerical representation of the documents and are then used for classification with a decision tree, a support vector machine, and a k-nearest-neighbor classifier.
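To illustrate how a classifier operates on such vectors, here is a minimal k-nearest-neighbor sketch in plain Python. It is an assumption-laden toy, not the KNIME learner: the keyword columns, the training vectors, the class labels, and the choice of Hamming distance are all hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions in which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_vectors, train_labels, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: hamming(train_vectors[i], query),
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training data; columns stand for hypothetical keywords
# ("tumor", "mice", "hiv", "patient").
train_vectors = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 1, 1],
]
train_labels = [
    "mouse-cancer", "mouse-cancer", "mouse-cancer",
    "human-aids", "human-aids", "human-aids",
]

prediction = knn_predict(train_vectors, train_labels, [1, 1, 0, 1], k=3)
```

The same vectors could equally be fed to a decision tree or a support vector machine; k-nearest-neighbor is used here only because it fits in a few lines.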


Figure: Text Processing classification example workflow

The decision tree view shows the terms with the highest information gain, e.g. IL-18, tumor, and mice.

Figure: Textprocessing classification decision tree view
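Information gain, the criterion mentioned above, measures how much knowing whether a term occurs reduces the uncertainty about a document's category. The following sketch computes it for binary term features; the labels and term columns are invented for illustration and are not taken from the workflow's data.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Hypothetical example: four documents, two per category.
labels = ["mouse-cancer", "mouse-cancer", "human-aids", "human-aids"]
tumor_present = [1, 1, 0, 0]   # separates the categories perfectly
study_present = [1, 0, 1, 0]   # carries no information about the category

gain_tumor = information_gain(tumor_present, labels)
gain_study = information_gain(study_present, labels)
```

A term that splits the categories cleanly gets the maximum gain, which is why a decision tree would place such a term near the root.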