The benefits of using predictive analytics are now a given, and the data scientist who delivers them is highly regarded. But our daily work is full of contrasts. On the one hand, you get to work with data, tools, and techniques to really dive in and understand your data and what it can do for you. On the other hand, there is usually quite a bit of administrative work around accessing data, massaging it, and then putting that new insight into production - and keeping it there.
In fact, many surveys say that at least 80% of any data science project is spent on those administrative tasks. One popular urban legend says that, within a commercial organization trying to leverage analytics, the full-time job of one data scientist can be described as building and maintaining a maximum of four (yes, 4) models in production - regardless of the brilliance of the toolset used. There is a desperate need to automate and scale the modeling process, not just because it would be good for business (after all, if you could use 29,000 models instead of just 4, you would want to!) but also because otherwise we data scientists are in for a tedious life.
At the recent KNIME Spring Summit in Berlin, one of the best-received presentations was the one on the KNIME Model Process Factory, designed to provide you with a flexible, extensible, and scalable application for running and monitoring very large numbers of model processes efficiently.
The KNIME Model Factory is composed of a white paper, an overall orchestration workflow, tables that manage all activities, and a series of workflows, examples, and data for learning to use the Factory.
Video 1. The Model Factory in Action! Here is the orchestrating workflow triggering dependent workflows during execution.
A few highlights include:
- Workflow Orchestration. A workflow acts as the art director of the whole process, organizing, monitoring, triggering, and automating - that is, orchestrating - all workflows involved in the model process factory.
- Model Monitoring. The KNIME Model Factory includes a number of workflows for initializing, loading, transforming, modeling, scoring, evaluating, deploying, monitoring, and retraining data analytics models.
- Reuse Best Practices. The workflows and the white paper also show common best practices for packaging sub-workflows for quick, controlled, and safe reuse by other workflows.
- Call Remote Workflows. The whole orchestration factory relies heavily on calling remote workflows - that is, on the Call Remote Workflow node.
- Triggering Model Retraining. An important part of model monitoring is knowing exactly when to start the retraining procedure. A few workflows in the KNIME Model Process Factory are dedicated to checking whether model performance has fallen below a specified accuracy threshold and to triggering retraining if needed (see the sketch after this list).
- Full Working Examples. As usual, we provide full working example workflows - including data - to show how to handle typical modeling process tasks and conditions.
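To make the retraining trigger concrete, here is a minimal sketch in Python of the kind of check such a monitoring workflow performs. The threshold value and all function names are hypothetical placeholders; in the Model Process Factory itself this logic lives in KNIME workflows, not in code.

```python
# Minimal, hypothetical sketch of a retraining trigger: score each
# production model on fresh labeled data and retrain it when accuracy
# falls below a threshold. All names and values are placeholders.

ACCURACY_THRESHOLD = 0.85  # assumed minimum acceptable accuracy

def evaluate_accuracy(model_id):
    """Stand-in for scoring a model on recent labeled data."""
    scores = {"churn_model": 0.91, "propensity_model": 0.78}
    return scores.get(model_id, 0.0)

def retrain(model_id):
    """Stand-in for rerunning the training workflow for one model."""
    print(f"retraining {model_id} ...")

def monitor(model_ids):
    """Check every production model and retrain those that have degraded."""
    for model_id in model_ids:
        accuracy = evaluate_accuracy(model_id)
        if accuracy < ACCURACY_THRESHOLD:
            print(f"{model_id}: accuracy {accuracy:.3f} below threshold")
            retrain(model_id)
        else:
            print(f"{model_id}: accuracy {accuracy:.3f} still acceptable")

monitor(["churn_model", "propensity_model"])
```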
Anyone using KNIME can take advantage of the KNIME Model Factory. It is available on the KNIME Hub and runs on KNIME Analytics Platform, which means it is open source and free. Pairing the KNIME Model Factory with KNIME Server brings major additional benefits in terms of automation and interfacing.
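As a rough illustration of that interfacing, the snippet below triggers a deployed workflow over KNIME Server's REST API from Python. The server URL, repository path, and credentials are placeholders, and the exact endpoint may differ between server versions, so treat this as a sketch rather than a recipe.

```python
# Hypothetical call to a workflow deployed on KNIME Server via its REST API.
# Server URL, repository path, and credentials are placeholders; the exact
# endpoint form may vary with the server version.
import requests

SERVER = "https://knime-server.example.com"
WORKFLOW_PATH = "/ModelFactory/Orchestration"  # hypothetical repository path

response = requests.post(
    f"{SERVER}/rest/v4/repository{WORKFLOW_PATH}:execution",
    auth=("analyst", "secret"),  # placeholder credentials
    json={},                     # input parameters for the workflow, if any
    timeout=600,
)
response.raise_for_status()
print(response.json())  # results returned by the executed workflow
```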
There is a tremendous amount of information available - far too much for a blog entry - and we would encourage you to look at the white paper and the example workflows on the KNIME Hub.
* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)