Vernalis Cheminformatics Nodes

 

Detailed node descriptions

Below are details of the nodes provided by the Vernalis trusted community contribution.  The nodes and categories are described in the order in which they appear in the Node Repository browser.

 

European PubMed Central

 European PubMed Central Advanced Query

This node provides an interface that emulates the European PubMed Central mirror advanced query page.  Results are returned as XML Cells which can be further processed with the XPath node.

Flow Control

Flow Variable

This group of nodes adds switching capability to flow variable ports, analogous to that for data tables.  The nodes are all cross-compatible with the data table versions, e.g. an active branch started with a flow variable CASE Switch node can end with a data table End CASE node and vice versa.  When closing an inactive branch with a flow variable End IF/End CASE node, pay careful attention to the flow variable behaviour: at the present time, flow variables present before the start of an inactive branch will retain their values after the end node if there is an inactive branch 'higher' up the inport side of these nodes than the first active branch.  This is due to a known feature within the KNIME core.  Therefore, if you need the value of a flow variable generated before one of these nodes but after the corresponding IF or CASE Switch node, either ensure that it has a new name, or use a Variable to Table Row node to convert the variable value to a data table cell, and end the branch with a core End IF/End CASE node.

 IF/CASE Switch nodes for flow variables

These nodes provide flow variable equivalents of the IF Switch and CASE Switch nodes in the core KNIME product, and give the same functionality for controlling active branches.  The nodes are cross-compatible with the core End IF/End CASE nodes - i.e. a branch starting with a flow variable IF Switch can end with a core data table End IF node.

 IF Switch (Flow Variable Value) node for flow variables

This node allows the user to select an active flow variable branch by comparing a flow variable value with a specified value using one of a variety of operators (>, >=, =, <=, <, !=).  If the condition is 'true' then the top port is active, otherwise the bottom port is active.  The timed nodes page contains an example of the use of this node.

 

Again, the node is cross-compatible with the core End IF/End CASE nodes - i.e. a branch starting with a flow variable IF Switch can end with a core data table End IF node.
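The comparison described for this node can be illustrated with a minimal Python sketch.  This is not the node's implementation; the function name active_port and the 'top'/'bottom' return values are hypothetical and chosen only to mirror the description above.

```python
import operator

# The operators listed in the node description, mapped to Python equivalents.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "=": operator.eq,
    "<=": operator.le,
    "<": operator.lt,
    "!=": operator.ne,
}


def active_port(variable_value, op, comparison_value):
    """Return 'top' when the condition holds, otherwise 'bottom' (hypothetical helper)."""
    condition = OPERATORS[op](variable_value, comparison_value)
    return "top" if condition else "bottom"
```

For example, active_port(5, ">", 3) selects the top port, whereas active_port(5, "<=", 3) selects the bottom port.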

 End IF/End CASE nodes for flow variables

These nodes provide the corresponding 'End' nodes for terminating inactive branches through flow variable ports.  The nodes are cross-compatible with the core IF/CASE Switch nodes - i.e. a branch starting with a core IF Switch can end with a flow variable End IF node.  It is important to note that at the present time, flow variables present before the start of an inactive branch will retain their values after the end node if there is an inactive branch 'higher' up the inport side of these nodes than the first active branch.  This is due to a known feature within the KNIME core.  Therefore, if you need the value of a flow variable generated before one of these nodes but after the corresponding IF or CASE Switch node, either ensure that it has a new name, or use a Variable to Table Row node to convert the variable value to a data table cell, and end the branch with a core End IF/End CASE node.  The timed nodes page contains an example of the use of this node.

 

Database

Again, these nodes add switching capability to the database port type (NB - this is the older database connection port type, which contains both connection details and a data table).  Again, the nodes are fully cross-compatible with the other switching types.

 

 IF/CASE Switch nodes for database ports

These nodes provide database equivalents of the IF Switch and CASE Switch nodes in the core KNIME product, and give the same functionality for controlling active branches.  The nodes are cross-compatible with the core End IF/End CASE nodes - i.e. a branch starting with a database IF Switch can end with a core data table End IF node.

 

 IF Switch (Flow Variable Value) node for database ports

This node allows the user to select an active database port branch by comparing a flow variable value with a specified value using one of a variety of operators (>, >=, =, <=, <, !=).  If the condition is 'true' then the top port is active, otherwise the bottom port is active.

 

Again, the node is cross-compatible with the core End IF/End CASE nodes - i.e. a branch starting with a database IF Switch can end with a core data table End IF node.

 

 End IF/End CASE nodes for database ports

These nodes provide the corresponding 'End' nodes for terminating inactive branches through database ports.  The nodes are cross-compatible with the core IF/CASE Switch nodes - i.e. a branch starting with a core IF Switch can end with a database End IF node.

 

Loops

The nodes in this section add further looping functionality to KNIME, extending that of the core product.

Timed Loops

These nodes allow a loop to execute either until a certain time period has elapsed ('Run-for-time' nodes) or until a certain time has passed ('Run-to-time' nodes).  The loop start nodes can be used with the conventional loop end nodes, or with the Timed Loop End nodes, which make any unprocessed rows remaining at the end of the loop available in an additional out port, along with flow variables indicating the time at which loop execution stopped.  Additionally, the loop start nodes allow the user to set the initial value for the loop iteration counter - this does not affect the rows processed, but allows the user to manually set a different start value for the counter.  See the timed nodes page for further details and examples.

 

Chunk Loop Run-to-Time/Run-for-Time Loop Start

These nodes provide a timed version of the KNIME core product Chunk Loop Start.  The input table is passed into the loop in 'chunks', until the specified time criterion has been met or there are no more input rows available.

 

Table Row To Variable Run-to-Time/Run-for-Time Loop Start

These nodes provide a timed version of the KNIME core product Table Row to Variable Loop Start.  The input table is passed into the loop row-wise, converting the data cells to flow variables, until the specified time criterion has been met or there are no more input rows available.

 

Timed Loop End (1 port / 2 ports / 3 ports / upto 4 ports)

These loop end nodes will only work if paired with one of the above timed loop start nodes.  In addition to the output tables generated during the loop (in the standard KNIME loop end 'concatenate' style), a final port makes available any rows left unprocessed because the loop terminated for timing reasons before the input table had been fully processed.  The 'upto 4 ports' variant allows for optional input ports - the behaviour of the corresponding output ports depends on the user setting (either an empty, columnless table results, or an inactive branch).
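The interplay between a timed chunk loop start and a Timed Loop End can be sketched in Python as follows.  This is only an illustration of the behaviour described above, not the nodes' implementation; the function names, the injectable clock parameter and the use of a monotonic clock are all assumptions made for the example.

```python
import time


def run_to_time(rows, chunk_size, deadline, process, clock=time.monotonic):
    """Process rows in chunks until the deadline passes or the rows run out.

    Returns (results, unprocessed_rows), mirroring the concatenated loop
    output and the extra 'unprocessed rows' port of a Timed Loop End.
    """
    results = []
    position = 0
    # The time is checked once per iteration, before each chunk is processed.
    while position < len(rows) and clock() < deadline:
        chunk = rows[position:position + chunk_size]
        results.extend(process(chunk))
        position += chunk_size
    return results, rows[position:]


def run_for_time(rows, chunk_size, duration, process, clock=time.monotonic):
    """Run-for-time is run-to-time with a deadline relative to 'now'."""
    return run_to_time(rows, chunk_size, clock() + duration, process, clock)
```

In this sketch a chunk that has started is always completed, so the loop may overrun the deadline by up to one iteration; whether the real nodes behave identically is not stated in the description above.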

 

 Variable Timed Loop End

This node provides a Timed Loop End equivalent of the core KNIME Variable Loop End node, which collects flow variables from each loop iteration and outputs them as a data table.  As for the other Timed Loop End nodes, an additional output port returns rows not processed during loop execution.

 

 

Loop End (3 ports / upto 4 ports / upto 6 ports)

These are standard Loop End nodes, which extend the KNIME core product Loop End (2 ports) to 3, upto 4 or upto 6 ports.  As for the corresponding Timed Loop End variant, the 'upto 4 ports' and 'upto 6 ports' variants allow for optional input ports - the behaviour of the corresponding output ports depends on the user setting (either an empty, columnless table results, or an inactive branch).

 

Delays

These two nodes allow a time delay to be introduced before subsequent nodes are executed.  This might be useful either to make a workflow wait until a "quiet time" (e.g. evenings, lunchtimes) before executing a particularly processor-intensive section, or to allow time for reconnection to a server.  The settings are very similar to those of the Timed Loop Start nodes.  Again, the timed nodes example page may be helpful.  The nodes are only provided with flow variable ports, as this allows them to be connected between any 2 nodes via the hidden 'Mickey Mouse ear' flow variable ports on those nodes.  The end time of node execution is provided as a flow variable in the node's output.

 

Wait-for-time

This node waits until a specified time period has elapsed.

 

 Wait-to-time

This node waits until a specified time has been reached.   The timed nodes page contains an example of the use of this node.

 

 

 IF Switch (Flow Variable Value)

This node provides simple switching based on flow variable values to a standard data table port type, by comparison of a flow variable value with a specified value using a variety of operators (>, >=, =, <=, <, !=).  If the condition is 'true' then the top port is active, otherwise the bottom port is active.

 

Again, the node is cross-compatible with all the other port type End IF/End CASE nodes.

 

IO

The nodes in this category provide flow variable port equivalents of the core KNIME read/write Table nodes, allowing flow variable values to be saved to and loaded from disk.  The nodes currently accept local file paths or URIs in the form 'file:/.....', but not the 'knime:' relative paths.   The timed nodes page contains an example of the use of these nodes.

 

Read Variables

The variables will be read from the *.variables file at the specified location.  The node offers a number of options for how variables of the same name from disk and existing within the workflow should be handled.

 

Write Variables

The variables will be saved to disk at the specified location.  If the file path does not exist, the node will attempt to create it. "Special" variables (i.e. those whose name starts 'knime.') will be ignored.

 

Matched Molecular Pairs (MMPs) (UPDATED)

These nodes provide access to the Hussain and Rea algorithm for Matched Molecular Pair generation (see Computationally efficient algorithm to identify matched molecular pairs (MMPs) in large datasets, J. Chem. Inf. Model., 2010, 50, 339-348; DOI: 10.1021/ci900450m).  The nodes use the RDKit toolkit to perform the fragmentation step, and allow up to 10 'cuts' to be made.  A number of pre-defined cut types are available, and there is an option to specify a custom reaction SMARTS pattern.  H -> R changes are available when 1 cut is made by selecting the 'Add H's prior to fragmentation' option in the advanced settings tab.  No activity data is required to be associated with the compounds.

Filtering

The nodes in this category provide filtering functionality.

 

 MMP Molecule Filter

MMP Molecule Splitter

These nodes do not perform any fragmentation, but simply filter those molecules which can be fragmented according to the supplied fragmentation schema.  Molecules which cannot be parsed by RDKit will also be removed.

Fragmentation

The nodes in this category provide fragmentation capability based on the Hussain / Rea algorithm.

 

 MMP Molecule Fragment (3rd Gen)

MMP Molecule Multi-cut Fragment (3rd Gen)

In these nodes, only the fragmentation step is carried out, generating 'key-value' pairs (in which the 'key' is the constant, unchanging part, and the 'value' the changing part).  The output can then be converted to matched pairs using the Fragments to MMPs node.  Using these nodes, fragmentations can be stored, and new pairs resulting from new molecules can be generated without re-fragmenting the entire set.  The nodes have been heavily re-worked to fix memory and performance issues.  The multi-cut node performs all cuts from 1 to the number specified.

Pair Generation

The node in this category converts fragmented key/value pairs to Matched Molecular Pairs.  More nodes will be added to this category in future versions.

 

 Fragments to MMPs

This node converts the key-value pairs generated by the MMP Molecule Fragment node to Matched Molecular Pair transforms.  The node has been updated to provide additional options, and will work on tables with mixed numbers of cuts.
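The idea behind converting key-value pairs to matched pairs can be sketched in Python: fragments sharing the same key (the constant part) are grouped, and every two different values within a group form a pair.  This is a simplified illustration under stated assumptions, not the node's implementation; the function name, the tuple layout and the SMILES-like strings in the usage note are hypothetical, and only one direction of each transform is reported.

```python
from collections import defaultdict
from itertools import combinations


def fragments_to_pairs(fragments):
    """Group (molecule_id, key, value) fragments by key and pair up the values.

    Returns a list of (left_id, right_id, key, left_value, right_value).
    """
    by_key = defaultdict(list)
    for mol_id, key, value in fragments:
        by_key[key].append((mol_id, value))

    pairs = []
    for key, members in by_key.items():
        for (id_a, val_a), (id_b, val_b) in combinations(members, 2):
            # A pair needs two different molecules with different 'values'.
            if id_a != id_b and val_a != val_b:
                pairs.append((id_a, id_b, key, val_a, val_b))
    return pairs
```

With three hypothetical fragments, ("mol1", "[*:1]c1ccccc1", "[*:1]C"), ("mol2", "[*:1]c1ccccc1", "[*:1]Cl") and ("mol3", "[*:1]C1CCCCC1", "[*:1]C"), only mol1 and mol2 share a key, so a single pair is produced.  Because stored fragments are simply indexed by key, fragments from new molecules can be added to the index without re-fragmenting the existing set.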

 

 MMP Calculate Maximum Cuts

This node calculates the maximum number of cuts which can be made for a given input molecule according to the specified cut type.

 

 MMP Fragmentation Type Loop Start

This node loops through one or more of the pre-defined bond match cut types, providing the input table unchanged at the output port, and the cut type as a flow variable, allowing a table to be cut according to multiple types.

 

 Uniquify IDs

This node uniquifies molecule IDs prior to fragmentation. 

Principal Moments of Inertia (PMIs)

 

 Align to Inertial Principal Axes

This node aligns molecules to the inertial reference frame (or principal axes).

 

 PMI Calculation

This node calculates the Principal Moments of Inertia (PMIs; I1, I2, I3) and the normalised PMIs (nPMIs; npr1 = I1/I3, npr2 = I2/I3).
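The normalisation step can be sketched in Python as follows, assuming the usual convention that the moments are ordered I1 <= I2 <= I3 and that npr1 = I1/I3 and npr2 = I2/I3.  The function name is hypothetical, and the calculation of the moments themselves from atomic coordinates is not shown.

```python
def normalised_pmis(i1, i2, i3):
    """Return (npr1, npr2) for three principal moments of inertia.

    The inputs are sorted first, so they may be supplied in any order.
    """
    smallest, middle, largest = sorted((i1, i2, i3))
    if largest <= 0:
        raise ValueError("The largest principal moment must be positive")
    return smallest / largest, middle / largest
```

Under this convention an ideal rod gives (0, 1), an ideal disc gives (0.5, 0.5) and an ideal sphere gives (1, 1), which are the corners of the familiar triangular nPMI plot.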

 

RCSB PDB Tools

 

 PDB Connector [1]

This source node emulates the full advanced query and reporting features presented on the PDB website (www.rcsb.org) through the RESTful web services interface.  It provides two output tables:

  • A list of the PDB Structure IDs resulting from the query
  • The Reports generated in the format specified by the query for those structures.

The node now uses the POST reporting webservice by default, and returns the XML query generated in the node in a flow variable, which can be used to generate a second report format in the new PDB Connector (XML Query String) node.

  PDB Connector (XML Query String)

This source node replicates the functionality in the PDB Connector node, but allows the user to supply the PDB XML query directly - either via a flow variable (from a preceding PDB Connector or PDB Connector (XML Query String) node), or by pasting from the XML shown in the "Query Details" link on the PDB website (www.rcsb.org).  See here for an example of how to do this.

 

PDB Downloader / PDB Downloader (Source)

Two nodes which download files associated with the provided PDB Structure ID(s) from the PDB (www.rcsb.org)  to the appropriate column types in the output KNIME table.  The "(Source)" version requires the PDB Structure ID(s) to be entered manually in the configuration dialogue, whereas in the other ('manipulator') version a column is selected from the input table.  Available file formats are PDB, mmCIF, Structure Factors, PDBML/XML ("PDBx") and FASTA.

 

Local PDB Tools

 

PDB Loader

Load local copies of PDB format files into a KNIME table as PDB cell types.  The user selects a column in the input table containing paths to the files, and enters a name for the column containing the loaded files in the output table.

 

PDB Saver

Node to save a column of PDB cell type within a KNIME table to local PDB format files.  Options are provided to overwrite or skip files which already exist, and a Boolean Cell column is added to the output table to indicate whether or not the file was successfully written.

 

 PDB Property Extractor

Node to extract various properties from a PDB cell type column and append them to the user table.  The properties extracted can be selected in the configuration dialogue, and are as follows: PDB ID, Title, Experimental Method, Resolution, Number of Models, R, RFree, Space Group and REMARKS 1 - 3.

 

Sequence Tools

 

 FASTA Sequence Extractor

This node extracts the sequences for all chains listed in the FASTA file.  For multi-chain FASTA files, a new row will be added for each chain.  A number of columns will be added according to the source type selected in the drop-down (see the Node Description for details).  FASTA files can be retrieved for PDB entries using the PDB Downloader nodes.  No sequence parsing is implemented, and the processing is type-agnostic (protein, nucleotide etc.).
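The 'one row per chain' behaviour can be sketched in Python as a minimal FASTA splitter.  This is an illustration only, not the node's implementation: the function name is hypothetical, the header line is returned unparsed (the node's source-type-specific columns are not reproduced), and, as in the description above, the sequence itself is not interpreted.

```python
def fasta_records(text):
    """Split FASTA text into a list of (header, sequence) rows, one per record."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated without any validation.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

A two-chain file therefore yields two rows, each holding the header text and the sequence joined across its line breaks.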

 

 PDB Sequence Extractor

This node extracts sequences for all the chains in a PDB Cell column.  Where multiple models are present, these can be separated.  Sequences can be derived from the PDB SEQRES block, and / or the co-ordinates blocks, and can be returned as 1 or 3-letter codes.  Amino-acids and nucleotides can optionally be "sanitized" to their parent amino acid (e.g. phosphoserine -> serine, D-serine -> serine) or RNA nucleotide.

 

Fingerprints

 

Fingerprint Properties

Extracts the fingerprint cell type (SparseBitVector or DenseBitVector), the fingerprint length (i.e. total number of bits in the fingerprint) and the fingerprint cardinality (i.e. the number of set bits in the fingerprint).
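The length and cardinality can be illustrated with a short Python sketch, in which a fingerprint is represented as a string of '0' and '1' characters.  This representation and the function name are assumptions made for the example; they are not the KNIME bit vector cell types.

```python
def fingerprint_properties(bits):
    """Return (length, cardinality) for a fingerprint given as a '0'/'1' string."""
    if set(bits) - {"0", "1"}:
        raise ValueError("Fingerprint must contain only '0' and '1'")
    length = len(bits)             # total number of bits
    cardinality = bits.count("1")  # number of set bits
    return length, cardinality
```

For the 10-bit fingerprint "0101100000" this gives a length of 10 and a cardinality of 3.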

 

Miscellaneous

Text-based IO

 Text-based load/save nodes

These two nodes allow loading and saving of multiple text-based files (e.g. .txt, .py, .java, .mol, .mol2 etc.) into a workflow table string column (1 file per table row) given a column of file paths or URLs.  The entire file is loaded into a single table cell.

 

 List Folders

This source node complements the ‘List Files’ node provided in the core KNIME package.  It allows the user to specify one or more locations for searching, and lists all the folders (or directories) contained within those locations.  Optionally, subfolders are also searched.  The node was updated in version 1.1.3 to optionally return the path/URL of the containing folder, the name of the folder, and the visibility and last modified date/time of the folder.

 Random Numbers Generator

This source node generates an output table with a single column containing the specified number of random numbers.  The configuration dialogue allows the user to specify the column name, along with the number of values to be generated, the minimum and maximum values, whether the values are to be integers or doubles, and whether they are to be unique or not.
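The options in the configuration dialogue map naturally onto a small Python sketch.  This is not the node's implementation: the function and parameter names are hypothetical, and treating the maximum as inclusive for integers is an assumption.

```python
import random


def random_column(count, minimum, maximum, integers=True, unique=False, seed=None):
    """Generate a list of random values between minimum and maximum."""
    rng = random.Random(seed)
    if integers:
        if unique:
            # sample() raises ValueError if the range holds fewer than count values.
            return rng.sample(range(minimum, maximum + 1), count)
        return [rng.randint(minimum, maximum) for _ in range(count)]
    if unique:
        values = set()
        while len(values) < count:
            values.add(rng.uniform(minimum, maximum))
        return list(values)
    return [rng.uniform(minimum, maximum) for _ in range(count)]
```

Requesting unique integers from a range that is too small cannot succeed, which is why the sketch lets the underlying error surface rather than returning fewer values than requested.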

 

SMARTSViewer

Provides a SMARTSviewer visualisation of a column of SMARTS strings (SMARTS or Smiles cell types accepted as inputs) using the service provided by the University of Hamburg (http://www.smartsview.de/).  The user selects a column in the input table and its values are posted (unencrypted) to the remote webservice.  The output is returned as a PNGCell column appended to the input table.

Testing

Benchmarking

The Benchmarking nodes allow timing of the execution of a node or nodes in a benchmarking loop.  Optionally, multiple loop executions can be run, and a time limit set after which no further executions occur.  The benchmarking end nodes collect the individual iteration timings, and provide summary data in flow variables, along with the last iteration input tables.

Currently, the benchmarking nodes are only available with 'standard' buffered data table ports.  If you need something else, please let us know!

 

 Benchmark Start / Benchmark Start (2 ports/3 ports)

The start nodes pass through the input tables unchanged, and start the clock for timing execution of subsequent nodes.  The user can specify multiple loop executions and an optional timeout.

 

 Benchmark End / Benchmark End (2 ports/3 ports)

The end nodes gather the timings of each individual loop iteration, and supply them as the first output data port.  Additionally, the fastest, slowest, average and last execution time are provided as flow variables.  All flow variables present within the loop body are passed out of the loop by the loop end.  (Incoming benchmarking statistics - e.g. from a nested or preceding benchmarking loop - are overwritten).
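The start/end pairing can be sketched in Python as a timing loop followed by a summary step.  This is an illustration of the behaviour described above rather than the nodes' implementation; the function names and the dictionary keys standing in for flow variable names are hypothetical.

```python
import time


def benchmark(task, iterations, timeout_s=None):
    """Time repeated executions of task; stop early once timeout_s has elapsed."""
    timings = []
    started = time.perf_counter()
    for _ in range(iterations):
        # No further executions are started after the optional time limit.
        if timeout_s is not None and time.perf_counter() - started >= timeout_s:
            break
        t0 = time.perf_counter()
        task()
        timings.append(time.perf_counter() - t0)
    return timings


def summarise(timings):
    """Summary statistics analogous to the flow variables of the end node."""
    return {
        "fastest": min(timings),
        "slowest": max(timings),
        "average": sum(timings) / len(timings),
        "last": timings[-1],
    }
```

Here benchmark plays the role of the start node plus loop body, and summarise the role of the end node: the list of timings corresponds to the first output table and the dictionary to the summary flow variables.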

 

 

 

 

Release Notes

Release notes and changelog can be found here

Source Code

The source code can be accessed at https://anonymous:knime@community.knime.org/svn/nodes4knime/trunk/com.vernalis.

License

The Vernalis nodes are released under GPLv3.

 

[1]Node developed in conjunction with Enspiral Discovery Ltd, and previously released on both Vernalis and Enspiral Discovery websites.