Author: Jeanette Prinz
In a previous blog post, I discussed visualizations in KNIME Analytics Platform. Having recently moved to Berlin, I have been paying more attention to street graffiti. So today, we will be learning how to tag.
...just kidding. Sort of.
Our focus will be on tagging, but the text-mining (rather than street art) variety: We will learn how to automatically tag disease names in biomedical literature.
The rapid growth in the amount of biomedical literature becoming available makes it impossible for humans alone to extract and exhaust all of the useful information it contains. There is simply too much there. Despite our best efforts, many things would fall through the cracks, including valuable disease-related information. Hence, automated access to disease information is an important goal of text-mining efforts1. This enables, for example, the integration with other data types and the generation of new hypotheses by combining facts that have been extracted from several sources2.
In this blog post, we will use KNIME Analytics Platform to create a model that learns disease names in a set of documents from the biomedical literature. The model has two inputs: an initial list of disease names and the documents. Our goal is to create a model that can tag disease names that are part of our input as well as novel disease names. Hence, one important aspect of this project is that our model should be able to autonomously detect disease names that were not part of the training.
To do this, we will automatically extract abstracts from PubMed and use these documents (the corpus) to train our model starting with an initial list of disease names (the dictionary). We then evaluate the resulting model using documents that were not part of the training. Additionally, we test whether the model can extract new information by comparing the detected disease names to our initial dictionary.