This page provides example workflows, each with a detailed description, workflow annotations, and the necessary data. The workflows cover standard text mining tasks, such as classification and clustering of documents, named entity recognition, and creation of tag clouds, as well as other tasks, such as computing co-occurrences of terms or named entities in combination with an appropriate visualization. The examples also show how web sites, and RSS feeds in particular, can be downloaded, mined, and visualized.
Most workflows use a data set consisting of two sets of PubMed documents. The first set contains documents about human and AIDS, and the second set contains documents about mouse and cancer. The data sets can be downloaded from the attachment section below. Reading the data requires the Table Reader node and KNIME Textprocessing version 2.9 or later.
For further examples of the Textprocessing feature in combination with other KNIME features, such as the network mining feature, see the KNIME white papers section, where both the papers and the workflows can be downloaded:
- Usable Customer Intelligence from Social Media Data: Network Analytics meets Text Mining
- Usable Customer Intelligence from Social Media Data: Clustering the Social Community
- The KNIME Textprocessing Feature: An Introduction
Examples of the DML and SDML formats are available below. The main difference between these two XML-based document formats is that in SDML the section segments can contain complete sentences or chapters as plain text. The text within an SDML section segment is tokenized on the fly during parsing, unlike the text of a DML file. DML sections contain paragraphs, sentences, terms, and finally words, so the text has to be tokenized before it can be represented in DML format. The root element of DML is "document" (singular), whereas the root element of SDML is "documents" (plural). This means that a valid and well-formed DML file can contain only one document, whereas an SDML file can contain several documents.
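The root-element distinction described above can be checked programmatically. The following is a minimal Python sketch, not part of KNIME: the helper names `local_name` and `detect_format` are hypothetical, and the sample strings are empty placeholders rather than real DML or SDML files. Only the root element names "document" and "documents" are taken from the description above.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def detect_format(xml_text: str) -> str:
    """Guess the format from the root element name only.

    Returns 'DML' for a root named 'document', 'SDML' for a root named
    'documents', and 'unknown' otherwise. No schema validation is done.
    """
    root = ET.fromstring(xml_text)
    name = local_name(root.tag)
    if name == "document":
        return "DML"
    if name == "documents":
        return "SDML"
    return "unknown"


# Placeholder inputs (assumption): real files contain nested content
# that is not reproduced here.
dml_sample = "<document></document>"
sdml_sample = "<documents></documents>"

print(detect_format(dml_sample))
print(detect_format(sdml_sample))
```

A check like this only inspects the root element, so it can sort files before they are handed to the appropriate parser, but it does not confirm that a file is valid DML or SDML.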