KNIME Textprocessing version 2.9 or later and the R integration feature are required to load and execute this workflow.
This workflow starts with a list of genes. In a loop, it queries PubMed with each gene name and downloads the documents returned for that name. The Document Grabber node requires an existing, empty directory for each query in which to save the documents. These directories are created and emptied beforehand by Java Snippet nodes (change the path in the dialog of the Create Temp Dir node). The query, containing one gene name at a time, is passed to the Document Grabber node via a flow variable. The Document Grabber queries PubMed, then downloads and parses the documents, which are represented in KNIME as DocumentCells. The node also assigns the query to the resulting documents as their document category. This information can be extracted afterwards to find out which document was found by which query gene.
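The directory preparation described above can be sketched outside KNIME as follows. This is only an illustration in Python, not the code of the Java Snippet nodes in the workflow; the function name `prepare_query_dirs` and the gene names are hypothetical, and it assumes each gene name is a valid directory name.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_query_dirs(base_dir, genes):
    """Create one existing, empty directory per gene query.

    Returns a dict mapping each gene name to its directory.
    Assumption: gene names are safe to use as directory names.
    """
    base = Path(base_dir)
    dirs = {}
    for gene in genes:
        target = base / gene
        if target.exists():
            # Discard documents left over from an earlier run.
            shutil.rmtree(target)
        target.mkdir(parents=True)
        dirs[gene] = target
    return dirs


if __name__ == "__main__":
    # Hypothetical gene list; the real workflow reads it from a table.
    with tempfile.TemporaryDirectory() as tmp:
        for gene, path in prepare_query_dirs(tmp, ["BRCA1", "TP53"]).items():
            print(gene, "->", path)
```

Removing and recreating the directory guarantees both conditions at once: the directory exists and it is empty, whether or not an earlier run left files behind.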
The Dictionary Tagger node is then used to recognize the terms of interest. The tagger assigns a specified tag (in this case “LOCATION”) to the recognized terms. The Standard Named Entity Filter node is then used to filter the named entities, in this case keeping only the terms with a “LOCATION” tag assigned. The terms are then converted to lower case and a bag of words is created.
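The tagging, filtering, and bag-of-words steps can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the behavior of the KNIME nodes: it handles single-word dictionary terms only, matches case-insensitively, and uses a hypothetical dictionary and sentence. The helper names `tag_terms` and `bag_of_words` are made up for this example.

```python
import re
from collections import Counter


def tag_terms(text, dictionary, tag="LOCATION"):
    """Return (term, tag) pairs; the tag is None for terms not in the dictionary.

    Assumption: the dictionary holds lower-case, single-word entries.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [(t, tag if t.lower() in dictionary else None) for t in tokens]


def bag_of_words(tagged, tag="LOCATION"):
    """Keep only terms carrying the given tag, lower-case them, and count them."""
    return Counter(term.lower() for term, t in tagged if t == tag)


if __name__ == "__main__":
    # Hypothetical dictionary and sentence, for illustration only.
    dictionary = {"nucleus", "cytoplasm"}
    text = "BRCA1 is found in the Nucleus and, less often, outside the nucleus."
    print(bag_of_words(tag_terms(text, dictionary)))
```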
In a further step, the Category to class node extracts the category of each document and adds it as a string in an additional column. Since the Document Grabber node assigned to each document, as its category, the query by which the document was found, this category is the name of a gene occurring in the document (otherwise the document would not be in the PubMed result list). The data can now simply be grouped over the extracted category and the remaining important terms (all other terms have been filtered out) to compute the number of documents in which a certain gene (query term) and an important term co-occur. The grouping is done with the GroupBy node. Afterwards, the columns are filtered and the remaining columns renamed, missing values are replaced by 0, and the co-occurrence frequencies are normalized. Finally, the R View (Local) node is used to create a heat map based on these frequencies.
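The grouping and normalization steps can be sketched in Python as follows. This is a minimal illustration, not the workflow's GroupBy configuration. The input rows, the function name `cooccurrence_matrix`, and in particular the normalization (dividing by the largest count) are assumptions, since the description does not state which normalization the workflow applies.

```python
from collections import defaultdict


def cooccurrence_matrix(rows):
    """Count documents per (gene, term) pair and normalize the counts.

    rows: iterable of (doc_id, gene, term) tuples, one per bag-of-words entry.
    Returns (genes, terms, matrix), where matrix[i][j] belongs to genes[i] and terms[j].
    Assumption: normalization divides by the largest count.
    """
    docs = defaultdict(set)
    for doc_id, gene, term in rows:
        docs[(gene, term)].add(doc_id)  # a set counts each document once

    genes = sorted({g for g, _ in docs})
    terms = sorted({t for _, t in docs})

    # Pairs that never co-occur are missing; they become 0.
    counts = [[len(docs.get((g, t), ())) for t in terms] for g in genes]

    peak = max((c for row in counts for c in row), default=0)
    matrix = [[c / peak if peak else 0.0 for c in row] for row in counts]
    return genes, terms, matrix


if __name__ == "__main__":
    # Hypothetical rows, for illustration only.
    rows = [
        (1, "BRCA1", "nucleus"),
        (2, "BRCA1", "nucleus"),
        (3, "TP53", "nucleus"),
        (3, "TP53", "cytoplasm"),
    ]
    genes, terms, matrix = cooccurrence_matrix(rows)
    for gene, row in zip(genes, matrix):
        print(gene, dict(zip(terms, row)))
```

The resulting gene-by-term matrix has the shape a heat map function expects: one row per gene, one column per term, and values between 0 and 1.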
The heat map of the gene-term co-occurrences.