This workflow demonstrates the seamless cooperation between multiple KNIME plug-ins in order to analyze data from DrugBank, a publicly available database that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. Each drug is described by a drug card with more than 150 data fields, of which only some are used in this workflow for demonstration. All necessary files are available in the download section.

Advanced Analysis (01203001_drugBank)

Advanced Network Analysis Workflow

The workflow demonstrates the integration of heterogeneous information into a single information network and the subsequent analysis of the integrated data. Drugs are represented as nodes, and so are their features, such as the category a drug belongs to, its targets, and the genes, proteins etc. that are mentioned in the description of a drug. Edges connect features with the drug they describe, resulting in a network with 20,000 nodes and 160,000 edges. The following sections describe the different parts of the workflow in detail.

Network creation

This section is the backbone of the workflow: it fuses all the extracted information into a single homogeneous network structure. It consists of several "Object Inserter" nodes that connect the drugs with the corresponding information. Edge ids are always created based on the names of the nodes they connect. Each node is also assigned to the partition that describes its origin, such as the drug partition that contains all drugs or the target partition that contains all target-related nodes. The final network consists of more than 20,000 nodes representing drugs and their features and 200,000 edges connecting drugs with their describing features.
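The two conventions described above (edge ids derived from node names, and one partition per node origin) can be illustrated with a minimal Python sketch. This is not how the KNIME "Object Inserter" node is implemented; the `Network` class, the `insert_links` helper, the `source|target` id scheme and all node names are hypothetical and only serve to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Network:
    """Tiny in-memory stand-in for the information network (illustrative only)."""
    partitions: dict = field(default_factory=dict)  # node name -> partition name
    edges: dict = field(default_factory=dict)       # edge id -> (source, target)

    def add_node(self, name, partition):
        # A node keeps the first partition it was assigned to.
        self.partitions.setdefault(name, partition)

    def add_edge(self, source, target):
        # Assumed id scheme: the edge id is built from the names of its two nodes.
        edge_id = f"{source}|{target}"
        self.edges[edge_id] = (source, target)
        return edge_id


def insert_links(network, links, source_partition, target_partition):
    """Insert (drug, feature) pairs, loosely mimicking one inserter step."""
    for drug, feature in links:
        network.add_node(drug, source_partition)
        network.add_node(feature, target_partition)
        network.add_edge(drug, feature)


# Placeholder names, not taken from DrugBank.
network = Network()
insert_links(network, [("DrugA", "CategoryX"), ("DrugB", "CategoryX")], "drug", "category")
insert_links(network, [("DrugA", "TargetX")], "drug", "target")
print(len(network.partitions), len(network.edges))  # 4 3
```

Because the edge id depends only on the two node names, inserting the same drug-feature pair twice yields the same edge rather than a duplicate.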

XML data extraction

The XML data extraction section reads in an XML file with drug information. Once the file is loaded into KNIME, the XML Processing nodes, in particular the XPath node, can be used to extract various pieces of information from it. The XML data extraction is hidden in the different Meta nodes such as the "Extract Drugs" node. The Row Filter ensures that detailed information is only extracted for the approved drugs.
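Outside of KNIME, the same extract-then-filter idea could look roughly like the following Python sketch, which uses the standard library's limited XPath support. The element names (`drug`, `name`, `groups`, `group`) and the sample records are simplified assumptions made for illustration; they are not a verified copy of the DrugBank schema, and the workflow itself uses the KNIME XPath and Row Filter nodes instead.

```python
import xml.etree.ElementTree as ET

# Invented sample data with an assumed, simplified structure.
SAMPLE_XML = """
<drugs>
  <drug>
    <name>DrugA</name>
    <groups><group>approved</group></groups>
  </drug>
  <drug>
    <name>DrugB</name>
    <groups><group>experimental</group></groups>
  </drug>
  <drug>
    <name>DrugC</name>
    <groups><group>approved</group><group>investigational</group></groups>
  </drug>
</drugs>
"""


def approved_drug_names(xml_text):
    """Return the names of all drugs that carry the 'approved' group."""
    root = ET.fromstring(xml_text.strip())
    names = []
    for drug in root.findall("./drug"):
        # Path query for the group labels of this drug.
        groups = [group.text for group in drug.findall("./groups/group")]
        # This check plays the role of the Row Filter.
        if "approved" in groups:
            names.append(drug.findtext("name"))
    return names


print(approved_drug_names(SAMPLE_XML))  # ['DrugA', 'DrugC']
```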

The "Extract ATC codes" Meta node first uses the XML nodes to extract the ATC codes for each drug. The ATC code classifies drugs into different groups at 5 different levels. We then use the web retrieval nodes from the Palladian feature to extract the information for the first three levels of each ATC code from the home page of the WHO Collaborating Centre for Drug Statistics Methodology.
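A full ATC code has seven characters: one letter (level 1), two digits (level 2), one letter (level 3), one letter (level 4) and two digits (level 5). The following sketch shows how the prefixes for each level can be derived from a full code; the function name `atc_levels` is hypothetical and the snippet is independent of how the Meta node performs this step.

```python
import re

# letter, two digits, letter, letter, two digits
ATC_PATTERN = re.compile(r"^([A-Z])(\d{2})([A-Z])([A-Z])(\d{2})$")


def atc_levels(code):
    """Split a full ATC code into its five cumulative level prefixes."""
    match = ATC_PATTERN.match(code)
    if match is None:
        raise ValueError(f"not a full 7-character ATC code: {code!r}")
    parts = match.groups()
    return ["".join(parts[:level]) for level in range(1, 6)]


levels = atc_levels("N02BE01")
print(levels)      # ['N', 'N02', 'N02B', 'N02BE', 'N02BE01']
print(levels[:3])  # the first three levels: ['N', 'N02', 'N02B']
```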

In the "Extract Text Features" Meta node we extract further textual features from the XML file such as the affected organisms and the manufacturer. This information, together with other free text fields such as description and pharmacology, is analyzed in the Text mining part of the workflow, which is described in the next section.

Text mining of free form text fields such as description

The Text mining section converts the free form texts of each drug that should be considered into a document cell using the "Strings To Document" node. Each document represents one drug, and the concatenated free form texts form the content of the document. Once we have a document cell we can start to analyze each document using the nodes from the Text Processing plug-in.

Prior to converting the document into a bag of words representation we use several tagger nodes within the "Term Tagging" Meta node. The first tagger we use is the POS tagger to identify the different parts of speech such as nouns and verbs. The "Abner tagger" is used to identify biomedical named entities such as proteins, cells, DNA and RNA. Finally, we use the "Dictionary tagger" to identify, in the free form texts, the names of all drugs that we have extracted from the DrugBank XML file.

The resulting table of the "Term Tagging" node is then filtered based on the tags. Each filter result is then converted into a link table with the drug the document represents as source node and the terms detected within the document text (drugs, biomedical entities or nouns) as target nodes. The link tables are divided into drug-drug, drug-biomedical entity and drug-noun interactions.
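To make the drug-drug link table more concrete, here is a minimal Python sketch of dictionary-based tagging. It is a strong simplification of what the KNIME "Dictionary tagger" and the subsequent filtering do: matching is a plain case-insensitive whole-word search, self-mentions are dropped (an assumption, not something the workflow description states), and the function name and sample texts are invented.

```python
import re


def drug_drug_links(documents, drug_names):
    """Build (source drug, mentioned drug) pairs.

    documents: dict mapping a drug name to its concatenated free form text.
    drug_names: the dictionary of known drug names to look for.
    """
    links = []
    for source, text in documents.items():
        lowered = text.lower()
        for candidate in drug_names:
            if candidate == source:
                continue  # assumed: a drug mentioning itself is not a link
            pattern = r"\b" + re.escape(candidate.lower()) + r"\b"
            if re.search(pattern, lowered):
                links.append((source, candidate))
    return links


# Placeholder documents, one per drug.
documents = {
    "DrugA": "DrugA is often combined with DrugB.",
    "DrugB": "An inhibitor of ProteinX.",
}
print(drug_drug_links(documents, ["DrugA", "DrugB"]))  # [('DrugA', 'DrugB')]
```

Each resulting pair corresponds to one row of a link table, with the first element as source node and the second as target node.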

Molecular Fragment Mining

This section uses the chemistry analysis power of KNIME in order to identify molecular fragments that occur frequently in the approved drugs. It first reads in the SDF file provided in the download section of DrugBank that contains the structure of each approved drug.

The "Molecular Fragment Mining" Meta node uses the MoSS (Molecular substructure search) node to find the frequent molecular fragments in all approved drug molecule structures. Each fragment is assigned a canonical SMILES name using the OpenBabel node. The Meta node further calculates an appropriate size for each fragment depending on the number of atoms it consists of. The size is later used in the Network Viewer to scale the node images according to the complexity of the structure they depict.
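The exact size formula used by the Meta node is not given here. As a purely illustrative sketch, a linear mapping from atom count to image size with clamping could look as follows; the function name `fragment_size` and all parameter values (`min_size`, `max_size`, `max_atoms`) are assumptions, not values taken from the workflow.

```python
def fragment_size(atom_count, min_size=20, max_size=100, max_atoms=30):
    """Map an atom count to an image size by linear interpolation.

    Fragments with one atom get min_size; fragments with max_atoms or more
    atoms get max_size. All defaults are hypothetical.
    """
    clamped = max(1, min(atom_count, max_atoms))
    fraction = (clamped - 1) / (max_atoms - 1)
    return round(min_size + fraction * (max_size - min_size))


print(fragment_size(1))   # 20
print(fragment_size(30))  # 100
print(fragment_size(50))  # 100, larger fragments are capped
```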

Basic network analysis and visualization

This section demonstrates various ways to detect interesting patterns or extract useful information from the created information network. It also demonstrates how molecule structures can be visualized in the "Network Viewer" node.

Network Viewer Example



Memory Requirements

The workflow uses resource-intensive operations such as XML XPath queries and text mining operations that require at least 2 GB of main memory. For a description of how to increase the Java Heap Space of KNIME, see the FAQ section.

Required Features

This workflow uses the following plug-ins to extract information from heterogeneous data sources. All features can be installed via the update mechanism of KNIME. For a detailed description of how to install new features see here.

XML Processing
Allows the extraction of information from XML files and is available via the standard KNIME update site.
Base Chemistry Types & Nodes
Provides basic chemistry types such as SDF and the corresponding nodes, and is also available via the standard KNIME update site.
Chemistry Add-Ons
Provides several KNIME chemistry nodes such as MoSS (Molecular substructure search) and OpenBabel for converting various chemistry formats. It is also available via the standard KNIME update site.
Nodes to create KNIME Quick Forms
This feature is also available via the standard KNIME update site and provides nodes that allow, among other things, the parameterization of Meta nodes.
Network Mining
Provides nodes and infrastructure to create, process, analyze and view information networks within KNIME. This feature is available via the KNIME Labs update site.
Text Processing
The KNIME text processing feature makes it possible to read, process, mine and visualize textual data in a convenient way. This feature is also available via the KNIME Labs update site.
Palladian
Palladian is a Java-based toolkit which provides functionality to perform typical Internet Information Retrieval tasks. The feature is part of the KNIME community contributions and is available via the KNIME community contributions update site.