KNIME Textprocessing version 2.9 or later and the Distance Matrix feature are required to load and execute this workflow.
The workflow starts with a list of documents that were downloaded from PubMed, parsed beforehand, and saved as a data table. The data is available as a drop file in the corresponding drop directories.
The documents are assigned to two categories and, based on these assignments, split into two sets. The first set consists of documents about human and AIDS, the second of documents about mouse and cancer.
The textual data is preprocessed by various filters and a stemmer node. Then the most important keywords are extracted and, based on these keywords, the documents are transformed into document vectors.
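The preprocessing and vector creation described above can be sketched outside of KNIME in a few lines of Python. This is only an illustration under stated assumptions: the stop word list, the suffix-stripping `crude_stem` function, the document-frequency keyword ranking, and the four sample texts are all hypothetical stand-ins, not the filters, stemmer, or keyword extractor used by the workflow's nodes.

```python
import re
from collections import Counter

# Hypothetical stop word list; the workflow's filter nodes use their own.
STOP_WORDS = {"the", "of", "in", "and", "a", "is", "are", "with", "to", "for"}


def crude_stem(token):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize, drop stop words and short tokens, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS and len(t) > 2]


def document_vectors(texts, n_keywords=5):
    """Pick the terms occurring in the most documents and count them per document."""
    docs = [preprocess(t) for t in texts]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Rank by frequency (descending), ties broken alphabetically.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    keywords = [term for term, _ in ranked[:n_keywords]]
    vectors = [[doc.count(k) for k in keywords] for doc in docs]
    return keywords, vectors


# Made-up sample texts, not taken from the PubMed data set.
texts = [
    "Tumors in mice: cancer cells spreading in the mouse liver",
    "Cancer treatment tested in mouse models",
    "AIDS patients and human immune cells",
    "Human AIDS infection rates",
]
keywords, vectors = document_vectors(texts, n_keywords=5)
print(keywords)
for vector in vectors:
    print(vector)
```

Each row of `vectors` is one document, each column one keyword, so documents about the same topic end up with similar rows.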
The document vectors are numerical representations of the documents and are then used for hierarchical clustering based on Manhattan and Euclidean distance measures.
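To make the two distance measures and the clustering step concrete, here is a minimal Python sketch. The toy vectors are hypothetical keyword counts, and average linkage is an assumption chosen for illustration; the linkage type configured in the workflow's clustering node is not stated here.

```python
import math


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(vectors, distance, n_clusters=2):
    """Average-linkage agglomerative clustering.

    Starts with one cluster per row and repeatedly merges the two closest
    clusters until n_clusters remain. Returns the final clusters (lists of
    row indices) and the merge history, which is what a dendrogram draws.
    """
    clusters = [[i] for i in range(len(vectors))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                total = sum(
                    distance(vectors[p], vectors[q])
                    for p in clusters[i]
                    for q in clusters[j]
                )
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((tuple(clusters[i]), tuple(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges


# Hypothetical document vectors: rows 0-1 resemble one topic, rows 2-3 the other.
toy_vectors = [
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
]

manhattan_clusters, manhattan_merges = agglomerate(toy_vectors, manhattan)
euclidean_clusters, euclidean_merges = agglomerate(toy_vectors, euclidean)
print(manhattan_clusters)
print(euclidean_clusters)
```

On data this small both measures produce the same grouping; on real document vectors the merge order and merge heights can differ between the two, which is why the workflow compares them.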
The following pictures illustrate the dendrogram and the hierarchically clustered data points (mouse cancer in red, human AIDS in blue).
[Figures: Euclidean distances | Manhattan distances]