Processing texts with the KNIME Text Processing plugin usually involves six steps:
- IO: reading and parsing
- Enrichment: named entity recognition
- Preprocessing: filtering and manipulation
- Frequencies: word counting and keyword extraction
- Transformation: bag of words (BoW) and vector representation
- Visualization: tag cloud
Each of these steps has a corresponding category folder in the Text Processing node repository (except for the visualization nodes, which are located in Misc).
The figure below shows an example workflow in which PubMed is first queried and the resulting documents are downloaded and parsed. In the enrichment step, a part-of-speech tagger assigns part-of-speech tags to each term, and named entity recognition identifies gene and protein names and tags the corresponding terms. Afterwards, the documents are transformed into a bag of words, to which various preprocessing nodes are then applied, such as stop word filtering or stemming. Once preprocessing is done, frequencies can be computed and important terms can be extracted. Based on these terms, a binary or numerical vector can be computed for each document (or, alternatively, for each term). Regular KNIME nodes can then be applied to these vectors in order to mine the now numerical data. Furthermore, visualization nodes such as the tag cloud can be used to visualize the extracted terms. In the following, the six steps are described in detail.
A technical report about the KNIME Text Processing feature is available here.
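To make the data flow of the example workflow more concrete, the following is a minimal sketch in plain Python of what the bag of words, preprocessing, frequency, and document vector steps compute conceptually. It is not KNIME code and does not use any KNIME API: the toy corpus, the stop word list, and all function names are assumptions made for illustration, and the enrichment step (part-of-speech tagging, named entity recognition) as well as stemming are left out.

```python
import re
from collections import Counter

# Hypothetical toy corpus standing in for the parsed documents (IO step).
documents = {
    "doc1": "The protein binds the receptor and the receptor activates the gene.",
    "doc2": "Gene expression is regulated by the protein.",
}

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "and", "is", "by", "a", "of"}


def bag_of_words(text):
    """Split a document into lowercase terms (transformation into a bag of words)."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(terms):
    """Drop stop words (preprocessing); stemming is omitted for brevity."""
    return [t for t in terms if t not in STOP_WORDS]


def term_frequencies(terms):
    """Count absolute term frequencies per document (frequencies step)."""
    return Counter(terms)


def document_vectors(freqs_by_doc, binary=False):
    """Build one vector per document over the shared, sorted vocabulary."""
    vocabulary = sorted({t for freqs in freqs_by_doc.values() for t in freqs})
    vectors = {}
    for doc_id, freqs in freqs_by_doc.items():
        if binary:
            vectors[doc_id] = [1 if t in freqs else 0 for t in vocabulary]
        else:
            vectors[doc_id] = [freqs.get(t, 0) for t in vocabulary]
    return vocabulary, vectors


freqs_by_doc = {
    doc_id: term_frequencies(preprocess(bag_of_words(text)))
    for doc_id, text in documents.items()
}
vocabulary, vectors = document_vectors(freqs_by_doc)

for doc_id, vector in vectors.items():
    print(doc_id, dict(zip(vocabulary, vector)))
```

Passing `binary=True` to `document_vectors` yields presence/absence vectors instead of counts, which mirrors the choice between binary and numerical vectors mentioned above. The resulting rows are ordinary numerical data and could be handed to any standard mining method.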