KNIME Textprocessing version 2.9 or later is required to load and execute this workflow.
The workflow starts with a list of documents that were downloaded from PubMed, parsed beforehand, and saved as a data table. The data is available as a drop file in the corresponding drop directory.
The documents are assigned to two categories and, based on these category assignments, split into two sets. The first set consists of documents about humans and AIDS; the second consists of documents about mice and cancer.
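The split by category can be pictured with a minimal Python sketch. This is not how KNIME implements it; the dictionary layout, the `category` key, and the category labels are assumptions made purely for illustration.

```python
from collections import defaultdict


def split_by_category(documents):
    """Group documents into one list per category label."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc["category"]].append(doc)
    return dict(groups)


# Hypothetical documents; titles and category labels are placeholders.
documents = [
    {"title": "Doc A", "category": "human AIDS"},
    {"title": "Doc B", "category": "mouse cancer"},
    {"title": "Doc C", "category": "human AIDS"},
]

sets = split_by_category(documents)
for category, docs in sets.items():
    print(category, [d["title"] for d in docs])
```

Each resulting list corresponds to one of the two document sets that the rest of the workflow processes.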
Part-of-speech tags and gene names are recognized and assigned by the corresponding taggers (the POS tagger and the ABNER tagger), so that a color can be assigned later based on the tag type.
The textual data is preprocessed by various filters and a stemmer node. The assigned tags are then extracted for each term and transformed into strings (for coloring purposes). Afterwards, the term frequencies are computed and a color is assigned to each term based on its tag. Finally, the Tag Cloud node is used to visualize the terms.
Tag Cloud of the most frequent terms. Words are colored green if they represent genes or proteins, otherwise blue.
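The frequency and coloring steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the workflow's node configuration: the `GENE_TAGS` set, the `(term, tag)` tuple layout, and the sample terms are all hypothetical.

```python
from collections import Counter

# Assumption: tag values treated as "gene or protein".
# The tag set actually used in the workflow may differ.
GENE_TAGS = {"PROTEIN", "DNA", "RNA"}


def color_for_tag(tag):
    """Green for gene/protein tags, blue for everything else."""
    return "green" if tag in GENE_TAGS else "blue"


def tag_cloud_rows(tagged_terms, top_n=10):
    """Return (term, frequency, color) for the top_n most frequent terms."""
    counts = Counter(tagged_terms)  # counts identical (term, tag) pairs
    return [
        (term, count, color_for_tag(tag))
        for (term, tag), count in counts.most_common(top_n)
    ]


# Hypothetical preprocessed terms with their tags already converted to strings.
terms = [
    ("p53", "PROTEIN"), ("tumor", "NN"), ("p53", "PROTEIN"),
    ("mouse", "NN"), ("tumor", "NN"), ("p53", "PROTEIN"),
]

rows = tag_cloud_rows(terms)
for term, count, color in rows:
    print(term, count, color)
```

In a tag cloud, the frequency would typically drive the font size of each term, while the color column carries the tag information.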