How & why a cheminformatician built new functionality with KNIME & Python

May 26, 2023 — by Takayuki Serizawa

Standardizing compound structure data with KNIME & Python

Cheminformaticians deal with humongous volumes of chemical data. Before we can analyze the data and extract useful information, it has to be standardized in order to be stored accurately and consistently. This painfully slow and cumbersome process is made a lot easier and faster with tools like the ChEMBL structure pipeline, which automate normalizing molecules. 

But in multidisciplinary teams, not everyone codes. That’s why open data science tools, which allow everyone to collaborate and share solutions easily and flexibly, are so useful: they let us all benefit.

At the recent KNIME Spring Summit, I attended an online workshop about building custom functionality in KNIME with Python. The functionality I get from the ChEMBL structure pipeline isn’t available in KNIME Analytics Platform, so I wanted to see if I could develop a custom KNIME node that cleans my chemical compound data using the Python ChEMBL library.

I wrote about how I built the node on my blog, which we’ve reposted below. My new ChEMBL structure pipeline node takes SMILES strings as an input, standardizes molecules from SMILES and generates a molecular hash as an output.

How a cheminformatician used KNIME and Python
Fig. 1: The workflow at the top of the screenshot takes SMILES strings as an input, standardizes the molecules, and generates a molecular hash as an output.

To sum up my first impression of developing a KNIME node in Python, I would like to mention three main benefits:

1. Easier managing of Python dependencies

Keeping track of Python environments and their dependencies can be a pain, especially when you want to share work with colleagues. I find KNIME’s Conda Environment Propagation useful, as it snapshots the details of the environment you’re using and propagates them to any new execution location.

2. Easier sharing of logic

For cheminformaticians who write code, like me, the source of my new ChEMBL structure pipeline node can be managed in a version control system like GitHub. I can share the logic with my colleagues and they can easily modify the code.

3. Easier sharing of functionality

Colleagues who use KNIME but are non-coders can now add my node to their own workflows, or even download the entire workflow to use themselves.

Share and Improve Functionality with the Open Source Community

As a passionate RDKit user, I appreciate how open source communities develop and share new functionality. KNIME has a nice user community. I believe that sharing these kinds of activities promotes the “give and take” philosophy. When I share the node on the KNIME Community Hub, other people can download it and use it themselves. But sharing functionality isn’t only giving. It can also mean receiving useful feedback from the community, which helps to improve our solutions. I like this structure.

Read on to see my article, originally published on my blog Is life worth living.

Developing a new KNIME node with Python

I recently watched the KNIME Spring Summit to hear more about the new features in the software. It’s really cool. I’m interested in node development with Python. In previous versions of KNIME, you could develop your own nodes, but this required using Java. Now we can develop custom functionality with Python instead of Java.

Python-based node development is supported from KNIME version 4.6 onward. The details are described in the article Four Steps for Your Python Team to Develop KNIME Nodes.

I read the blog post and set about developing my own KNIME cheminformatics node. I wanted to build a node to standardize molecules with the chembl_structure_pipeline. This library is really useful for normalizing molecules.

The following is my log of the process.

First, I downloaded the template code.

The structure of the zip file is below.

(base) iwatobipen@penguin:~/dev/knime_dev/basic$ tree
├── config.yml
├── Example_with_Python_node.knwf
├── my_conda_env.yml
└── tutorial_extension
    ├── icon.png
    ├── knime.yml

I modified config.yml and my_conda_env.yml as shown below.

org.tutorial.first_extension: # {group_id}.{name} from the knime.yml
  src: /home/iwatobipen/dev/knime_dev/basic/tutorial_extension # Path to folder containing the extension files
  conda_env_path: /home/iwatobipen/miniconda3/envs/my_python_env # Path to the Python environment to use
  debug_mode: true # Optional line, if set to true, it will always use the latest changes of execute/configure, when that method is used within the KNIME Analytics Platform

name: my_python_env
channels:
  - knime
  - conda-forge
dependencies:
  - python=3.9
  - knime-extension=4.7
  - knime-python-base=4.7
  - rdkit
  - chembl_structure_pipeline

How to define config.yml is well documented in the blog article Four Steps for Your Python Team to Develop KNIME Nodes.

After defining my_conda_env.yml, I created the conda environment from the yml file.

$ conda env create -f my_conda_env.yml
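Before writing any node code, it can help to confirm that the new environment really contains the packages the node will import. The snippet below is a minimal sketch of such a check, not part of the KNIME template: the helper name missing_modules is hypothetical, and the module names are taken from the imports used later in this post. It assumes you run it with the Python interpreter of my_python_env.

```python
import importlib.util

# Module names taken from the imports used by the node code in this post.
REQUIRED = ["knime.extension", "rdkit", "chembl_structure_pipeline"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            # find_spec locates a module without importing the module itself
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised, for example, when the parent package of a dotted name is absent
            found = False
        if not found:
            missing.append(name)
    return missing


# An empty list means every required module was found.
print("Missing modules:", missing_modules(REQUIRED))
```

If the list is not empty, recreating the environment from the yml file is usually simpler than patching it by hand.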

After creating the environment, I wrote the code for the KNIME node. My node takes SMILES strings as input, standardizes the molecules parsed from the SMILES, and generates a molecular hash (molhash) as output.

The code is below. Decorators define the node's input and output ports. The following code defines one input port and one output port. You can add more ports with additional @knext.input_table and @knext.output_table decorators.

import logging

import knime.extension as knext
from rdkit import Chem
from rdkit.Chem import rdMolHash
from chembl_structure_pipeline import standardize_mol
from chembl_structure_pipeline import get_parent_mol

LOGGER = logging.getLogger(__name__)


@knext.node(name="chembl structure pipeline", node_type=knext.NodeType.MANIPULATOR, icon_path="demo.png", category="/")
@knext.input_table(name="SMILES column", description="read smiles")
@knext.output_table(name="Output Data", description="rdkit mol which is standardized with chembl structure pipeline")
class TemplateNode:
    """Standardize molecules with the ChEMBL structure pipeline.

    This is a sample node which is implemented with chembl structure pipeline.
    Input data should be SMILES.
    """

    column_param = knext.ColumnParameter(label="label", description="description", port_index=0)

    def std_mol(self, smiles):
        # Parse the SMILES; RDKit returns None for invalid input
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return None
        stdmol = standardize_mol(mol)
        # get_parent_mol returns the parent molecule plus a flag, which is ignored here
        pm, _ = get_parent_mol(stdmol)
        return pm

    def get_mol_hash(self, rdmol):
        # Rows whose SMILES could not be parsed get no hash
        if rdmol is None:
            return None
        return rdMolHash.MolHash(rdmol, rdMolHash.HashFunction.HetAtomTautomer)

    def configure(self, configure_context, input_schema_1):
        # Output schema: the input columns plus the standardized molecule and its hash
        return input_schema_1.append(knext.Column(Chem.rdchem.Mol, "STD_ROMol")).append(knext.Column(knext.string(), "MolHash"))

    def execute(self, exec_context, input_1):
        input_1_pandas = input_1.to_pandas()
        # The SMILES column name "column1" is hard-coded; column_param is defined but not used yet
        input_1_pandas["STD_ROMol"] = input_1_pandas["column1"].apply(self.std_mol)
        input_1_pandas["MolHash"] = input_1_pandas["STD_ROMol"].apply(self.get_mol_hash)
        return knext.Table.from_pandas(input_1_pandas)

After writing the code, I added the following line to knime.ini, which is located in the KNIME installation folder: -Dknime.python.extension.config=/home/iwatobipen/dev/knime_dev/basic/config.yml
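Editing knime.ini by hand works fine, but the step is easy to script. The following is a minimal sketch under stated assumptions: the helper add_extension_config is hypothetical (it is not a KNIME tool), it simply appends the flag as the last line of the file, and the demo runs against a throwaway file in a temporary directory rather than a real KNIME installation.

```python
import tempfile
from pathlib import Path


def add_extension_config(knime_ini: Path, config_yml: str) -> bool:
    """Append the Python-extension config flag to knime.ini if it is missing.

    Returns True if the file was modified, False if the flag was already there.
    """
    flag = f"-Dknime.python.extension.config={config_yml}"
    text = knime_ini.read_text(encoding="utf-8")
    if flag in text.splitlines():
        return False  # already configured, leave the file untouched
    if text and not text.endswith("\n"):
        text += "\n"  # make sure the flag starts on its own line
    knime_ini.write_text(text + flag + "\n", encoding="utf-8")
    return True


# Demo on a temporary stand-in for knime.ini (the contents are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    demo_ini = Path(tmp) / "knime.ini"
    demo_ini.write_text("-vmargs\n-Xmx2g\n", encoding="utf-8")
    changed = add_extension_config(demo_ini, "/home/iwatobipen/dev/knime_dev/basic/config.yml")
    print("modified:", changed)
    print(demo_ini.read_text(encoding="utf-8"))
```

Calling the helper a second time with the same path leaves the file unchanged, so it is safe to rerun.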

Next, I launched KNIME and could see my own KNIME node. You can see the simple workflow I made in figure 1, above, and see how it shows my newly developed ChEMBL structure pipeline node. I added some SMILES from the Table Creator node. Now, when I run the node, I get standardized molecules as the output.

The workflow not only standardizes molecules but also generates a molhash. You can see the output below. The Count column shows how many rows share each molhash. The count is 2 for 2-hydroxypyridine and pyridone because they are tautomers of each other and therefore get the same hash.

See the output of my workflow: From SMILES to standardized molecules.
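The tautomer grouping can also be reproduced outside KNIME. The snippet below is a minimal sketch that uses RDKit only: it skips the ChEMBL standardization step, the helper name tautomer_hash is hypothetical, and the three SMILES (2-hydroxypyridine, 2-pyridone, and benzene as a control) are illustrative inputs rather than the ones from my workflow. The counting with collections.Counter stands in for the Count column.

```python
from collections import Counter

from rdkit import Chem
from rdkit.Chem import rdMolHash


def tautomer_hash(smiles):
    """Return the HetAtomTautomer hash for a SMILES, or None if it cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return rdMolHash.MolHash(mol, rdMolHash.HashFunction.HetAtomTautomer)


# 2-hydroxypyridine, 2-pyridone, and benzene as an unrelated control
smiles_list = ["Oc1ccccn1", "O=c1cccc[nH]1", "c1ccccc1"]
hashes = [tautomer_hash(smi) for smi in smiles_list]

# Count how many input rows share each hash
counts = Counter(hashes)

for smi, mol_hash in zip(smiles_list, hashes):
    print(smi, mol_hash, counts[mol_hash])
```

The two pyridine tautomers end up with the same hash and a count of 2, while benzene keeps its own hash.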

Developing this new node with Python was a useful experience for me. And I can now share the functionality with non-coders!
