
How KNIME helps identify new drug candidates for COVID-19

Life Sciences (Pharma & Biotech), R&D, Diagnostics & Drug Discovery
Uppsala University
Remdesivir identified, plus other drug candidates
Automated retrieval of data from chemical databases
Standardization of molecular structures automated

A year of pandemic: Identifying novel candidate molecules with COVID-19 as use case

The current timeline for a new drug to get regulatory approval ranges from 12 to 15 years. To accelerate the process of novel drug discovery, Uppsala University, Sweden’s leading public research university, added drug repurposing strategies to its pipeline to find drugs effective for COVID-19 treatment. One of the drugs identified using this approach was remdesivir, which continues to be actively used to treat COVID-19 patients with the highest risk of becoming seriously ill.

Drug repurposing (also known as drug repositioning) is the re-evaluation of an existing drug to test its potential in the treatment of a new disease. This approach significantly reduces research and development costs and shortens the drug development timeline, because the compound has already been tested for safety in humans, which can eliminate the need for Phase 1 clinical trials.

While drug repurposing is an effective way to speed up drug development, its data-intensive nature brings its own challenges. To start with, researchers need to combine a wide variety of data types, such as tissue expression data, genes, drug-target interactions, disease datasets, and phenotype data, to understand the effect a drug produces in the body. Combining data from multiple databases is key to increasing the diversity of the physico-chemical properties of the final compound sets. Working with small-molecule data poses further challenges, such as abstract chemical representations and the heterogeneity of data formats in the public domain.

All of this complex and diverse chemical data then requires large-scale data analysis to reveal hidden patterns for early-stage drug development. This involves advanced data analytics techniques such as text mining, network analyses, machine (deep) learning models to predict drug-target-disease relationships, and structure- (protein-) based modeling methods.

KNIME integrates data from multiple sources, offers advanced analytics capabilities, and brings the advantages of automation, flexibility, reusability, and transparency. This made it the perfect fit for Uppsala University’s requirements to evaluate COVID-19 drug candidates.

Uppsala University's strategy was based on the molecular similarity principle: structurally similar molecules tend to possess similar biological activities. To this end, they built a workflow in KNIME to perform ligand-based in silico drug repurposing.

Programmatic data access to life sciences databases

With KNIME, researchers at the university could programmatically connect with various publicly available chemical databases, such as UniProt, Protein Data Bank (PDB), ChEMBL, Guide-To-Pharmacology (IUPHAR), PubChem, and DrugBank, to access structural and bioactivity data for ligands associated with protein targets that are involved in COVID-19.

Instead of manually querying the databases, they simply had to execute a workflow in KNIME that specified an API request, retrieved data from web services, and extracted relevant information from the received files. In the first instance, protein target identifiers of the Open Targets platform were mapped to their corresponding UniProt IDs. These retrieved UniProt IDs served as a starting point to retrieve protein–ligand structural data from PDB, as well as ligand bioactivity data from ChEMBL, IUPHAR, and PubChem.
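The request-and-extract pattern described above can be pictured with a short Python sketch. It is not the Uppsala workflow, which runs as KNIME nodes. The ChEMBL URL pattern and filter name, the JSON field names, and the helper names are assumptions for illustration. The response is a canned sample with made-up identifiers, so no network call is made.

```python
import json
from typing import List
from urllib.parse import urlencode

# Assumed base URL of the ChEMBL web services; verify against current docs.
CHEMBL_BASE = "https://www.ebi.ac.uk/chembl/api/data"


def build_target_query(uniprot_id: str) -> str:
    """Build a request URL asking for targets linked to a UniProt accession.

    The filter name `target_components__accession` is an assumption here.
    """
    params = {"target_components__accession": uniprot_id, "format": "json"}
    return f"{CHEMBL_BASE}/target?{urlencode(params)}"


def extract_target_ids(payload: str) -> List[str]:
    """Pull target identifiers out of a JSON response body."""
    data = json.loads(payload)
    return [t["target_chembl_id"] for t in data.get("targets", [])]


# Canned response standing in for a web-service reply (hypothetical IDs).
SAMPLE_RESPONSE = json.dumps(
    {
        "targets": [
            {"target_chembl_id": "CHEMBL_EXAMPLE_1"},
            {"target_chembl_id": "CHEMBL_EXAMPLE_2"},
        ]
    }
)

# "P0DTC2" is used only as an example UniProt accession.
url = build_target_query("P0DTC2")
ids = extract_target_ids(SAMPLE_RESPONSE)
print(url)
print(ids)
```

In a real pipeline the canned string would be replaced by the body of an HTTP response, and the extracted identifiers would feed the next query, for example a bioactivity lookup.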

Standardizing molecular structures

A prerequisite for merging ligand data from diverse sources is to standardize molecular structures. To ensure unified chemical data representation, researchers only had to execute another workflow in KNIME. The workflow took care of the complex steps of removing compound stereochemistry, stripping salts by forwarding a predefined set of salts and salt mixtures, and listing all stripped salt components in a clear output table. Next, the workflow neutralized charges, checked for possible atomic clashes, filtered data by checking for specific elements, and generated the necessary InChI, InChIKey, and canonical SMILES representations.
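To make the salt-stripping step concrete, here is a deliberately naive Python sketch that works on SMILES strings alone. It assumes salts appear as dot-separated fragments that exactly match a small, hypothetical list. It does not canonicalize fragments, remove stereochemistry, neutralize charges, or generate InChI. Those steps need a cheminformatics toolkit, as in the KNIME workflow.

```python
from typing import List, Tuple

# Hypothetical, tiny salt list for illustration; a real workflow would
# forward a curated set of salts and salt mixtures.
KNOWN_SALTS = {"[Na+]", "[K+]", "[Cl-]", "Cl", "[Br-]", "Br"}


def strip_salts(smiles: str) -> Tuple[str, List[str]]:
    """Separate known salt fragments from a dot-separated SMILES string.

    Returns (parent_smiles, stripped_components). Matching is purely
    textual, so a salt written in a different form would be missed.
    """
    fragments = smiles.split(".")
    stripped = [f for f in fragments if f in KNOWN_SALTS]
    kept = [f for f in fragments if f not in KNOWN_SALTS]
    if not kept:
        # Every fragment is a listed salt: keep the longest one as parent.
        parent = max(fragments, key=len)
        stripped = list(fragments)
        stripped.remove(parent)
        kept = [parent]
    return ".".join(kept), stripped


# Output table: (input, parent structure, stripped salt components).
records = ["CC(=O)Oc1ccccc1C(=O)[O-].[Na+]", "CN.Cl", "c1ccccc1"]
table = [(s, *strip_salts(s)) for s in records]
for row in table:
    print(row)
```

Returning the stripped components alongside the parent structure mirrors the output table described above, so nothing removed during standardization is lost from the record.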

Substructure searches to identify candidates for drug repurposing

Once the molecular structures were standardized, the research team used machine learning models in KNIME to identify enriched molecular (sub)structures. These were then used in substructure searches of datasets of available drugs to find new, potentially active drug candidates.

To achieve this, the molecular structures were first reduced to their Bemis-Murcko scaffolds. Then, using the generated scaffolds and associated UniProt IDs as input, KNIME helped them calculate the molecular distances for the retained scaffolds. The clustering model in KNIME created hierarchical clusters to group the scaffolds. The output of the model included UniProt IDs, associated scaffolds, and cluster IDs. For each target, researchers used KNIME to loop over distinct clusters of associated scaffolds to create a maximum common substructure.
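The distance and clustering steps can be sketched in plain Python. The source does not say which distance measure or linkage was used, so Tanimoto distance over feature sets and single-linkage clustering with a distance cutoff are assumptions here. The fingerprints are made-up integer sets standing in for real scaffold fingerprints. Scaffold generation and maximum common substructure extraction need a cheminformatics toolkit and are not shown.

```python
from itertools import combinations
from typing import Dict, Set


def tanimoto_distance(a: Set[int], b: Set[int]) -> float:
    """Return 1 - |A & B| / |A | B| for two feature sets (0.0 = identical)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def single_linkage_clusters(
    fingerprints: Dict[str, Set[int]], cutoff: float
) -> Dict[str, int]:
    """Agglomerative single-linkage clustering cut at a distance threshold.

    Merges the two closest clusters repeatedly until the smallest
    inter-cluster distance exceeds `cutoff`. Returns {name: cluster_id}.
    """
    clusters = [{name} for name in sorted(fingerprints)]

    def linkage(c1: Set[str], c2: Set[str]) -> float:
        # Single linkage: distance between the closest pair of members.
        return min(
            tanimoto_distance(fingerprints[x], fingerprints[y])
            for x in c1
            for y in c2
        )

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[i], clusters[j]) > cutoff:
            break
        clusters[i] |= clusters[j]
        del clusters[j]

    ordered = sorted(clusters, key=min)
    return {name: cid for cid, members in enumerate(ordered) for name in members}


# Hypothetical fingerprints for four scaffolds of one target.
fingerprints = {
    "scaffold_A": {1, 2, 3, 4},
    "scaffold_B": {1, 2, 3, 5},
    "scaffold_C": {10, 11, 12},
    "scaffold_D": {10, 11, 13},
}
cluster_ids = single_linkage_clusters(fingerprints, cutoff=0.6)
print(cluster_ids)
```

Each resulting cluster ID would then define one group of scaffolds to loop over when deriving a maximum common substructure per target.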

The generated substructures (in SMARTS) were then used as queries to find hits in DrugBank. The structures from DrugBank were standardized and filtered to perform substructure searches. The identified hits were provided in a table with molecule names, associated targets, SMARTS keys, and chemical structures along with highlighted substructures in SVG format.

Researchers at Uppsala University were able to automate the entire process using KNIME. Because a KNIME workflow is reusable and easily reproducible, they could adapt it to individual project needs.

Remdesivir and other drug candidates for COVID-19

The application programming interface and the integration of cheminformatics libraries and external software in KNIME enabled Uppsala University to tackle the problems in a clear, efficient, and reproducible way.

Substructure searches with KNIME helped identify 7,836 compounds from DrugBank and 36,521 compounds from the CAS data set. Of those hits, 135 compounds were retrieved from both DrugBank and the CAS data set. The identified drugs include remdesivir, fludarabine, riboprine, rupintrivir, indinavir, darunavir, and telaprevir. Several of the identified hits underwent clinical trials, with remdesivir receiving emergency approval for COVID-19 treatment from regulatory authorities in several countries.
