Welcome to the Integrated Deployment Blog Series, a series of articles focusing on solving the challenges around productionizing data science.
In the past two episodes we have seen how it is possible to use KNIME's Integrated Deployment approach to deploy a model. We have shown how to manually train a model and deploy it automatically before moving on to looking at automating the retraining of a previously selected model - leading to automated deployment.
Our previous integrated deployment examples are excellent for productionizing one single use case, i.e. one specified dataset with one specified categorical target. If you were to adapt our examples to your use case, it would work in production when the data and kind of prediction stay the same. However, if you were to add more data or even change what you are trying to predict you would find yourself having to change the workflow to reconfigure some of the node settings.
Wouldn’t it be great to have an Integrated Deployment strategy that automatically adapts itself when something changes?
A rhetorical question, of course it would be nice!
Automatic adaption to change
We are in fact talking about an automated machine learning (AutoML) solution for training and deploying machine learning classification models. Despite their popularity, AutoML solutions can be time consuming to implement and are only effective for a subset of machine learning problems.
In 2018, we released a complex Guided Automation workflow group designed to provide detailed coverage of the various aspects of AutoML - for example the user journey, complex interactive views, feature engineering, parameter optimization and machine learning interpretability. By the way - future plans for this large Guided Automation workflow include updating it to include Integrated Deployment strategies, too. But first, let’s address the problem that AutoML solutions are often complex to implement.
New AutoML component eases implementation of AutoML solutions
We have now developed a single component which flexibly automates the training, validation, and deployment of up to nine machine learning algorithms, combining them with any other required node. The new AutoML component, publicly available on KNIME Hub and part of a collection of Verified Components, can be easily dragged and dropped to your installation of KNIME Analytics Platform 4.2 or higher.
Figure 1: Animation showing the AutoML Verified Component being used.Training several models in KNIME Analytics Platforms has never been easier. After configuring and executing, inspect the results in an interactive view to monitor progress. The selected trained model can be used on a different dataset for additional testing via the Workflow Executor node. In this example, on a small dataset, the component automatically trains and outputs a Keras Deep Learning model on the fly for simple demo purposes.
Open source users can simply drag and drop the new component to KNIME Analytics Platform from the KNIME Hub and configure it just like any other Learner node (Fig. 1). The executing component goes through the various steps of the AutoML process based on the user settings specified in the component dialogue. At the end, the component automatically selects the best available model and it exports it as a deployment workflow via an Integrated Deployment connection.
Best practice: understanding the output
To manually use the component’s output as a black-box model, the user can simply add a Workflow Executor node and provide data. It is good practice to decipher the created machine learning black-box by combining the Workflow Executor node with other nodes from the Machine Learning Interpretability Extension and compute charts and explanations for the new predictions.
A more manual way to understand the component output is available by manually inspecting the actual workflow behind the model. This can be done with a Workflow Writer node. Optionally the user can already deploy the model by connecting a Deploy Workflow to Server node.
The component also offers an optional interactive view (Fig.2) which allows you to not only change the model selection but also inspect the performance of all the other models that were trained.
Figure 2: The interactive view generated by the AutoML component: In the top part of the view an overview bar chart of computed performance metrics is shown and a table listing all successfully trained models. Both the bar chart and table are sorted using the user defined performance metrics, in this case “Accuracy”. The model automatically selected is the top one in the table, in this case “XGBoost Trees”. To change the model exported by the component the user has to perform a selection on the table and select “Apply” and “Close” in the bottom right hand corner of the view. Below a more advanced visualization is provided with ROC Curves and confusion matrices for each model (only available for binary classification).
Inspecting the workflow inside the AutoML component
The workflow inside the component can always be inspected and if needed customized. To organize its complexity we used so-called Nested Components for each different stage of the AutoML process.
The workflow inside the AutoML component starts with Configuration nodes. These are needed to expose settings and parameters to the user via the component dialogue (Fig. 3: A). After joining all the different flow variable settings in a single stream, the AutoML DataPrep (Fig. 3: B) handles automating missing value imputation, normalization and encoding of categorical features into numericals.
Based on the domain of the raw data and the user settings the data are prepared correctly and captured for later with Integrated Deployment. As a final step in the data preparation stage the dataset is split into train and test partitions. The prepared train data are passed to the AutoML Learner which trains the user-defined Machine Learning algorithms (Fig. 3: C1) with hyperparameter tuning using cross validation. If one of the models fails to train, by using Try and Catch Errors nodes (Fig. 3: C2), the Component discards them and keeps on training the remaining models. Once all models have been trained and captured via Integrated Deployment, the AutoML Learner exports a table storing the models. The said table is passed into the top port of the AutoML Predictor (Fig. 3: D); its second port receives the test data from AutoML Data Prep.
Figure 3: Illustrating the workflow inside of the AutoML component for training classifier models: The component dialogue (A) is generated by Configuration nodes; data preparation is automated depending on the proposed dataset and user settings in AutoML DataPrep Nested Component (B); the training of 9 different machine learning algorithms takes place in the AutoML Learner (C1) where each model parameter optimization takes place in a dedicated Meta Learner Sub-Nested Component (C2); trained models are applied to test data (D), scored against ground truth (E) and the best one is automatically selected (F). The final model is exported together with additional required workflow segments (e.g. required data preparation) as a deployment workflow thanks to Integrated Deployment.
Each of the models is then applied to the previously prepared test data and the predictions are passed from the AutoML Predictor to the AutoML Scorer so that the performance of each model can be measured (Fig. 3: E). Based on the metric selected by the user the best model is selected by AutoML Best Model (Fig. 3: F) and exported. Before exporting the model from the main AutoML component, a Workflow Combiner node is applied. We use Integrated Deployment to enhance the machine learning model by adding pre-process and post-process models to create a perfect deployment workflow.
The workflow exported by the AutoML component behaves like a black-box machine learning model: raw data goes in, predictions and raw data come out. The AutoML component output model (or workflow depending on the point of view) always guarantees the “black-box requirements”: no matter what was selected by the user in the component dialogue (e.g. a different model or a different kind of data preparation) the exported model is always able to process the raw data and append predictions at its output. This is possible thanks to the Integrated Deployment approach, which captures and combines the necessary pieces of workflow as needed. For example the AutoML component is now capable of training a Deep Learning model with the KNIME Deep Learning - Keras Integration. Exporting such a complex model as a deployment workflow (Fig. 4) is enabled with Integrated Deployment - capturing the Keras Network Executor node alongside any other nodes thus guaranteeing the black-box requirements.
Figure 4: Keras Network Deployment workflow: The workflow depicted was automatically created by the AutoML component. It predicts churn in an example customer dataset. The workflow was written by the Workflow Writer node connected to the output of the AutoML Component. Providing a similar result manually would require a deep learning expert - and way more than just a few minutes.
The AutoML component currently addresses supervised learning for binary and multiclass classification topics. In future, we plan to update it and to release more compatible components that use Integrated Deployment.
Tip: When you use the AutoML component yourself keep it linked to make sure it is always updated with most recent version on KNIME Hub. Note that any new component that is released as part of our collection of Verified Components is a component developed by KNIME experts not merely as simple examples but with reliable functionalities to be used in the same way as you would use other standard KNIME nodes.
In the next episode in this blog series, we are going to use the AutoML component to create a Guided Analytics application via KNIME WebPortal. The application empowers anyone with a web browser to interactively control the settings of the AutoML component without requiring any knowledge of KNIME nodes and KNIME workflows. Stay tuned for our next episode!
Authors: Paolo Tamagnini & Mahantesh Pattadkal (KNIME)
The Integrated Deployment KNIME Blog Series
Authors: Paolo Tamagnini & Mahantesh Pattadkal (KNIME)
- Episode 1: An Introduction to Integrated Deployment
- Episode 2: Continuous Deployment
- Episode 3: Automated Machine Learning