The cheminformatics workflow performs a substructure count on all molecules in a training set, then uses these counts in a loop to build a classification model for each group defined by a particular nominal column. The ML models are saved to disk.
I now want to run a substructure search on a test set and apply the ML models to millions of molecules. Since there are a few hundred nominal classes, prediction takes a few minutes for a single molecule. What would be the best way to turn this predictive workflow into a high-throughput one?
The idea is to screen, say, 10 million molecules within a reasonable time frame: a few weeks at most, ideally days. Is this even possible in KNIME, using either free nodes or a commercial license?
The other option is to convert it to Hadoop/big data, but that's a separate project in itself. How can I convert a cheminformatics KNIME workflow to a high-throughput one?
If I understand correctly, that would require around 10,000,000 minutes (at 1 compound/min) / 60 (minutes per hour) / 24 (hours per day) ≈ 7,000 CPU days of compute time. So to finish the computation in one week you'd want to split the calculation across roughly 1,000 CPUs.
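The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and the 1 compound/min rate is the assumption from the reply, not a measured figure.

```python
import math

def cpu_days(n_molecules, minutes_per_molecule):
    """Total single-CPU compute time in days."""
    return n_molecules * minutes_per_molecule / 60 / 24

def cpus_needed(n_molecules, minutes_per_molecule, deadline_days):
    """CPUs required to finish within the deadline, assuming perfect scaling."""
    return math.ceil(cpu_days(n_molecules, minutes_per_molecule) / deadline_days)

# 10 million molecules at an assumed 1 compound per minute
print(round(cpu_days(10_000_000, 1.0)))   # 6944, i.e. roughly 7,000 CPU days
print(cpus_needed(10_000_000, 1.0, 7))    # 993, i.e. roughly 1,000 CPUs for one week
```

If prediction really takes a few minutes per molecule, as described in the question, change `minutes_per_molecule` accordingly; the CPU count scales linearly with it.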
One option would be to use the KNIME Cluster Executor extension (if you have access to an SGE-based cluster, i.e. one running Sun Grid Engine).
Equally, you may want to first look into how far you can optimize your workflow. I'd recommend trying to minimize the use of loops and using the KNIME Streaming Executor.
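To illustrate the general idea behind reducing loops, here is a minimal Python sketch, not KNIME code. It assumes the expensive step (substructure counting) can be done once per batch of molecules and the resulting counts reused by every per-class model, instead of looping molecule by molecule and model by model. The names `chunked`, `score_batch`, `featurize`, and the toy models are all hypothetical stand-ins.

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def score_batch(batch, featurize, models):
    """Compute features once per batch, then apply every model to them."""
    features = [featurize(mol) for mol in batch]  # done once, shared by all models
    return {name: model(features) for name, model in models.items()}

# Toy stand-ins (assumptions): "substructure counts" are just character counts
# in a SMILES string, and each "model" is a simple threshold rule.
def featurize(smiles):
    return {"C": smiles.count("C"), "N": smiles.count("N")}

models = {
    "class_a": lambda rows: [row["C"] >= 2 for row in rows],
    "class_b": lambda rows: [row["N"] >= 1 for row in rows],
}

smiles_list = ["CCO", "CN", "CCN", "O", "CCCC"]
for batch in chunked(smiles_list, 2):
    print(batch, score_batch(batch, featurize, models))
```

Each batch is independent of the others, so batches could also be handed to separate workers or cluster jobs.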
A substantial part of the workflow's runtime is accounted for by the Indigo substructure count node, followed by the random forest or Bayesian model nodes.
Rajarshi Guha did some work using Hadoop for substructure search, which may be an option, along with machine learning algorithms.
Thanks a lot for your ideas and reply.