What is TeachOpenCADD KNIME and how does it differ from Jupyter Notebooks?

TeachOpenCADD KNIME is a visual alternative to Jupyter Notebooks for learning computer-aided drug design. It consists of eight interconnected workflows that provide a more visual approach suitable for users without programming backgrounds, while Jupyter Notebooks require basic coding expertise.

What are the eight workflows included in the TeachOpenCADD KNIME pipeline?

The eight workflows cover: W1 - Acquiring compound data from ChEMBL, W2 - Filtering datasets by ADME criteria, W3 - Setting alerts for unwanted substructures, W4 - Screening compounds by similarity, W5 - Grouping compounds by similarity, W6 - Finding maximum common substructures, W7 - Screening using machine learning methods, and W8 - Acquiring structural data from PDB.

What are the system requirements for running TeachOpenCADD workflows?

The workflows have been updated to run on KNIME v5.2 (as of June 2024). They require RDKit KNIME integration by NIBR and Vernalis KNIME nodes by Vernalis Research, in addition to some extensions provided by the KNIME team.

Where can I download the TeachOpenCADD KNIME workflows?

The complete bundled project including all workflows is freely available on KNIME Hub. Individual workflows can also be downloaded separately from the Hub's Life Sciences/Cheminformatics/Teaching section.

What target protein is used as an example in the TeachOpenCADD tutorials?

The pipeline uses the epidermal growth factor receptor (EGFR) as the default example target, but the workflows can easily be applied to other targets of interest for drug design research.

Tutorials for Computer Aided Drug Design in KNIME

Update: These workflows have been updated (June 2024) to run on KNIME v5.2

Jupyter Notebooks offer an incredible potential to disseminate technical knowledge thanks to its integrated text plus live code interface. This is a great way of understanding how specific tasks in the Computer-Aided Drug Design (CADD) world are performed, but only if you have a basic coding expertise. While users without a programming background can simply execute the code blocks blindly, this rarely provides any useful feedback on how a particular pipeline works. Fortunately, more visual alternatives like KNIME workflows are better suited for this kind of audience.

In this blog post we want to introduce our new collection of tutorials for computer-aided drug design (Sydow and Wichmann et al., 2019). Building on our Notebook-based TeachOpenCADD platform (Sydow et al., 2019), our TeachOpenCADDKNIME pipeline consists of eight interconnected workflows (W1-8), each containing one topic in computer-aided drug design.

Our team put together these tutorials for (a) ourselves as scientists who want to learn about new topics in drug design and how to actually apply them practically to data using Python/KNIME, (b) new students in the group who need a compact but detailed enough introduction to get started with their project, and (c) for the classroom where we can use the material directly or build on top of it.

Fig. 1: The visual capabilities of the KNIME Platform are evident. This is not a diagram of the TeachOpenCADD KNIME workflows, but the actual project as rendered in KNIME itself. Each box can be accessed individually for further configuration and workflow details.

The pipeline is illustrated using the epidermal growth factor receptor (EGFR), but can easily be applied to other targets of interest. Topics include how to fetch, filter and analyze compound data associated with a query target. The bundled project including all workflows is freely available on KNIME Hub. The Hub also lists the individual workflows for separate downloads if desired. Further details are given in the following sections.

Note: The screenshots shown below are taken from the individual workflows, which resemble the complete workflow but have different input and output sources. Double click the screenshots to see a larger display of the image.

Workflow 1: Acquire compound data from ChEMBL

Information on compound structure, bioactivity, and associated targets are organized in databases such as ChEMBL, PubChem, or DrugBank. Workflow W1 shows how to obtain and preprocess compound data for a query target (default target: EGFR) from the ChEMBL web services.

Workflow 2: Filter datasets by ADME criteria

Not all compounds are suitable starting points for drug development due to undesirable pharmacokinetic properties, which for instance negatively affect a drug's absorption, distribution, metabolism, and excretion (ADME). Therefore, such compounds are often excluded from data sets for virtual screening. Workflow W2 shows how to remove less drug-like molecules from a data set using Lipinski's rule of five.

Workflow 3: Set alerts based on unwanted substructures

Compounds can contain unwanted substructures that may cause mutagenic, reactive, or other unfavorable pharmacokinetic effects or that may lead to non-specific interactions with assays (PAINS). Knowledge on unwanted substructures in a data set can be integrated in cheminformatics pipelines to either perform an additional filtering step before screening or - more often - to set alert flags to compounds being potentially problematic (for manual inspection by medicinal chemists). Workflow W3 shows how to detect and flag such unwanted substructures in a compound collection.

Workflow 4: Screen compounds by compound similarity

In virtual screening (VS), compounds similar to known ligands of a target under investigation often build the starting point for drug development. This approach follows the similar property principle stating that structurally similar compounds are more likely to exhibit similar biological activities. For computational representation and processing, compound properties can be encoded in the form of bit arrays, so-called molecular fingerprints, e.g. MACCS and Morgan fingerprints. Compound similarity can be assessed by measures such as the Tanimoto and Dice similarity. Workflow W4 shows how to use these encodings and comparison measures. VS is here conducted based on a similarity search.

Workflow 5: Group compounds by similarity

Clustering can be used to identify groups of similar compounds, in order to pick a set of diverse compounds from these clusters for e.g. non-redundant experimental testing or to identify common patterns in the data set. Workflow W5 shows how to perform such a clustering based on a hierarchical clustering algorithm.

Workflow 6: Find the maximum common substructure in a collection of compounds

In order to visualize shared scaffolds and thereby emphasize the extent and type of chemical similarities in a compound cluster, the maximum common substructure (MCS) can be calculated and highlighted. In Workflow W6, the MCS for the largest cluster from previously clustered compounds (W5) is calculated using the FMCS algorithm.

Workflow 7: Screen compounds using machine learning methods

With the continuously increasing amount of available data, machine learning (ML) gained momentum in drug discovery and especially in ligand-based virtual screening to predict the activity of novel compounds against a target of interest. In Workflow W7, different ML models (RF, SVM and NN) are trained on the filtered ChEMBL dataset to discriminate between active and inactive compounds with respect to a protein target.

Workflow 8: Acquire structural data from PDB

The PDB database holds 3D structural data and meta information on experimentally resolved proteins. Workflow W8 shows how structural data can be automatically fetched from the PDB and processed.

Requirements

All the workflows have been tested on KNIME v4 and v4.1. In addition to some extensions provided by the KNIME team, TeachOpenCADD also requires:

RDKit KNIME integration, by NIBR
Vernalis KNIME nodes, by Vernalis Research.

For a full list of requirements, please check our project on the KNIME Hub.

Update: the workflows have been updated to run on KNIME v5.2

Feedback

If you are using the workflows and you like them, drop us a line at this thread in KNIME forums or in the issues section of the TeachOpenCADD repository.