In the preprocessing step, terms are filtered and manipulated in order to remove terms that carry no content, such as stop words, numbers, punctuation marks, or very short words, and to strip inflectional endings arising from declension or conjugation by applying stemming. The bag of words is usually cleaned in the preprocessing step, and only the remaining terms are afterwards used to create a numerical vector or visualized in a tag cloud.
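The filtering and stemming described above can be sketched in a few lines of Python. This is a toy illustration only: the stop word list is made up for the example, and the crude suffix stripper merely stands in for a real stemmer.

```python
# Toy preprocessing pipeline: stop word, number, punctuation, and
# short-word filtering, plus a crude suffix stripper standing in
# for a real stemmer (illustrative only, not a real algorithm).
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "and"}

def preprocess(terms, min_length=3):
    result = []
    for term in terms:
        t = term.lower()
        # Drop punctuation, numbers, stop words, and very short words.
        if not t.isalpha() or t in STOP_WORDS or len(t) < min_length:
            continue
        # Crude stemming: strip one common English suffix.
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= min_length:
                t = t[: -len(suffix)]
                break
        result.append(t)
    return result

print(preprocess(["The", "runners", "ran", "42", "laps", "quickly", "!"]))
# → ['runner', 'ran', 'lap', 'quickly']
```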

Besides the regular preprocessing nodes, such as stemming and stop word filtering, the Text Processing plugin offers various other preprocessing nodes to manipulate and filter terms. For each tag type (tagger) there is a corresponding filter node that removes terms with certain tag values assigned. The tag values can be specified in the dialog of the node. The Stanford tagger, for example, assigns part-of-speech tags of the STTS tag set to German texts; correspondingly, the STTS filter removes terms that have certain STTS tags assigned. This combination of tagging and filtering allows for powerful identification and extraction of named entities of different types. It is, for example, easily possible to identify and extract gene names from PubMed articles and visualize them in a tag cloud.
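The tag-then-filter pattern can be sketched as follows. The tag assignments here are hard-coded stand-ins for what a tagger node would produce; the tag type and value names are invented for the example.

```python
# Toy illustration of the tag-then-filter pattern: each term carries
# a (tag_type, tag_value) pair, and a filter keeps only terms whose
# tag value of a given type is in a whitelist.
tagged_terms = [
    ("BRCA1", ("NE", "GENE")),
    ("mutation", ("POS", "NN")),
    ("causes", ("POS", "VBZ")),
    ("cancer", ("POS", "NN")),
]

def filter_by_tag(terms, tag_type, allowed_values):
    """Keep terms whose tag of the given type is in allowed_values."""
    return [term for term, (ttype, tval) in terms
            if ttype == tag_type and tval in allowed_values]

print(filter_by_tag(tagged_terms, "NE", {"GENE"}))  # → ['BRCA1']
print(filter_by_tag(tagged_terms, "POS", {"NN"}))   # → ['mutation', 'cancer']
```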

Other very powerful preprocessing nodes are the RegEx Filter and the Replacer, which are both based on regular expressions that can be specified in the dialog of the node. The Snowball Stemmer node allows for the stemming of texts in various languages, such as English, German, Spanish, Italian, and many more. The Dict Replacer node replaces certain terms, which are specified in a dictionary, with other terms that are also specified in the dictionary. This node allows, e.g., the replacement of certain names by synonyms.
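The behavior of a regex-based filter and a dictionary replacer can be sketched with Python's standard `re` module. The function names and the sample dictionary are invented for this illustration.

```python
import re

def regex_filter(terms, pattern):
    """Drop terms that fully match the regular expression."""
    regex = re.compile(pattern)
    return [t for t in terms if not regex.fullmatch(t)]

def dict_replace(terms, dictionary):
    """Replace terms according to a lookup dictionary; others pass through."""
    return [dictionary.get(t, t) for t in terms]

terms = ["IBM", "42", "acetylsalicylic", "acid"]
terms = regex_filter(terms, r"\d+")  # drop purely numeric terms
terms = dict_replace(terms, {"acetylsalicylic": "aspirin"})
print(terms)  # → ['IBM', 'aspirin', 'acid']
```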


Data table

All preprocessing nodes require a bag of words as input data table, which needs to be created by the BoW creator node based on a list of documents. After the enrichment step is complete, the BoW creator node is usually used to transform the documents into a bag of words, to which the preprocessing nodes are then applied. A bag-of-words data table consists of a term column, containing terms, and a document column, containing documents. A row in such a data table, e.g. consisting of term “t1” and document “d2”, represents the fact that term “t1” is contained in document “d2”. Thus, for each occurrence of a term in a document there exists a row in the corresponding bag-of-words data table.
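The structure of such a bag-of-words table can be sketched as a list of (term, document) rows, one per term occurrence. The document IDs and terms here are made up for the example.

```python
# Building a bag-of-words table: one (term, document) row per term
# occurrence, mirroring the structure produced by the BoW creator node.
documents = {
    "d1": ["gene", "expression", "gene"],
    "d2": ["protein", "gene"],
}

def bag_of_words(docs):
    rows = []
    for doc_id, terms in docs.items():
        for term in terms:
            rows.append((term, doc_id))
    return rows

bow = bag_of_words(documents)
print(bow)
# → [('gene', 'd1'), ('expression', 'd1'), ('gene', 'd1'),
#    ('protein', 'd2'), ('gene', 'd2')]
```

Note that “gene” appears twice for “d1”: the table records every occurrence, not just distinct terms.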


The dialogs of all preprocessing nodes contain a “Preprocessing” tab, in which three settings can be specified:

  1. Deep preprocessing: the preprocessing of a term, e.g. stemming, is applied to each term contained in the term column of the bag-of-words data table. In addition, it can be applied to the terms contained in the documents of the document column, which is called “deep preprocessing”. This is useful if, e.g., frequencies have to be computed after preprocessing: if the terms were preprocessed only in the term column but not in the corresponding documents, the preprocessed terms could not be found in the documents afterwards and thus could not be counted. Deep preprocessing is applied by default.
  2. Appending: if deep preprocessing is applied, terms in the documents are filtered and manipulated as well. Sometimes it is useful to additionally keep the original documents. This option, which is also set by default, appends the original, unchanged documents as an additional column in the bag-of-words data table.
  3. Unmodifiable policy: by default, preprocessing nodes do not apply preprocessing to terms that have been set unmodifiable beforehand by a tagger node. With this option, the unmodifiable flag can be ignored.
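The deep preprocessing and appending behavior described above can be sketched as follows. The data structures and function names are invented for this illustration: the bag of words is a list of (term, document) rows, documents are term lists, and a predicate stands in for an arbitrary preprocessing step.

```python
# Sketch of "deep preprocessing": a filter applied to the term column
# is also applied inside each document, so that frequencies computed
# afterwards stay consistent. Optionally the original document is kept.
def deep_preprocess(bow_rows, documents, keep, append_original=True):
    """bow_rows: list of (term, doc_id); documents: doc_id -> term list."""
    new_rows = [(t, d) for t, d in bow_rows if keep(t)]
    new_docs = {}
    for doc_id, terms in documents.items():
        processed = [t for t in terms if keep(t)]
        # Keep the original, unchanged document as an extra "column".
        new_docs[doc_id] = (processed, terms) if append_original else (processed,)
    return new_rows, new_docs

rows = [("gene", "d1"), ("the", "d1"), ("protein", "d1")]
docs = {"d1": ["gene", "the", "protein"]}
keep = lambda t: t != "the"  # stop word filter as the preprocessing step
new_rows, new_docs = deep_preprocess(rows, docs, keep)
print(new_rows)          # → [('gene', 'd1'), ('protein', 'd1')]
print(new_docs["d1"][0]) # → ['gene', 'protein']
```

Because the filter is applied inside the document as well, every term remaining in the term column can still be found, and thus counted, in its document.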