Some time ago, we set our minds to solving a popular Kaggle challenge offered by a Japanese restaurant chain: predict how many future visitors a restaurant will receive.
This is a classic demand prediction problem: how much energy will be required in the next N days, how many milk boxes will be in demand tomorrow, how many customers will visit our restaurants tonight? We already know how to use KNIME Analytics Platform to solve this kind of time series analytics problem (see the whitepaper on energy prediction). So this time we decided to try something different: a mixed approach.
Thanks to the open architecture of KNIME Analytics Platform, we can plug in almost any open source analytics tool, such as Python, R, and Weka, to name just three very prominent examples, and more recently also H2O.
We already developed a cross-platform ensemble model to predict flight delays (another popular challenge). Here, cross-platform means that we trained a model with KNIME, a model with Python, and a model with R. These models from different platforms were then blended together as an ensemble model in a KNIME workflow. Indeed, one of KNIME Analytics Platform’s many qualities is its ability to blend data sources, data, models, and, yes, also tools.
For this restaurant demand prediction challenge we decided to raise the bar and develop a solution using the combined power of KNIME Analytics Platform and H2O.
The KNIME H2O Extension
The integration of H2O in KNIME offers an extensive set of nodes that encapsulate functionalities of the H2O open source machine learning libraries. This makes it easy to use H2O algorithms from a KNIME workflow without touching any code: each of the H2O nodes looks and feels just like a normal KNIME node, but the workflow reaches out to the high-performance libraries of H2O during execution.
Figure 1. All available nodes in the KNIME H2O extension
The Data from the Kaggle Challenge
Eight different datasets are available in this Kaggle challenge.
Three of the datasets come from the so-called AirREGI (air) system, a reservation control and cash register system. Two datasets are from Hot Pepper Gourmet (hpg), another reservation system. Another dataset contains the store IDs from the air and the hpg systems, which allows you to join the data together.
Another dataset gives basic information about the calendar dates. At first I wondered what this might be good for, but the fact that it flags public holidays came in quite handy.
Last but not least, there is a file that contains instructions for the work submission. Here, you must specify the dates and stores for your model predictions. More information on the datasets can be found on the challenge web page.
Combining the power of KNIME and H2O in a single workflow
To solve the challenge, we implemented a classic best model selection framework according to the following steps:
- Data preparation, i.e. reading, cleaning, joining data, and feature creation all with native KNIME nodes
- Creation of a local H2O context and transformation of a KNIME data table into an H2O frame
- Training of three different H2O-based machine learning models (Random Forest, Gradient Boosting Machine, Generalized Linear Model). The training procedure also includes cross-validation and parameter optimization loops, built by mixing and matching native KNIME nodes and KNIME H2O extension nodes.
- Selection of the best model in terms of RMSLE (Root Mean Squared Logarithmic Error) as required by the Kaggle Challenge.
- Deployment, i.e. converting the best model into an H2O MOJO (Model ObJect, Optimized) object and running it on the test data to produce the predictions to submit to the Kaggle competition
Figure 2. The KNIME workflow implemented as a solution to the Kaggle restaurant competition. Notice the mix of native KNIME nodes and KNIME H2O extension nodes. The KNIME H2O extension nodes encapsulate functionalities from the H2O library.
Let’s take a look at these steps one by one.
- Data Preparation
The workflow starts by reading seven of the datasets available on the Kaggle challenge page.
The metanode named “Data preparation” flags weekend days vs. business days, joins the reservation items, and aggregates visitor numbers (mean, max, and min) by group, e.g. by restaurant genre and/or geographical area.
The dataset contains a column indicating the number of visitors for a particular restaurant on a given day. This value will be used as the target variable to train the predictive models later on in the workflow. At the end of the data preparation phase, the dataset is split into two parts: one containing the rows with a non-missing value for the field “number of visitors” and one containing the remaining records, for which the number of visitors is missing. The latter represents the test set upon which the predictions to submit to the Kaggle competition will be calculated.
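For readers who think in code, here is a minimal Python sketch of two of these preparation steps: the weekend flag and the split by missing target. It is only an illustration of the logic, not what the KNIME nodes execute internally; the field names (`store_id`, `visit_date`, `visitors`, `is_weekend`) and the sample rows are hypothetical.

```python
from datetime import date

# Hypothetical sample rows; "visitors" is None where the value is missing,
# i.e. for the rows we will later have to predict.
rows = [
    {"store_id": "air_A", "visit_date": date(2017, 4, 21), "visitors": 25},
    {"store_id": "air_A", "visit_date": date(2017, 4, 22), "visitors": 40},
    {"store_id": "air_A", "visit_date": date(2017, 4, 23), "visitors": None},
]


def add_weekend_flag(row):
    """Return a copy of the row with an is_weekend flag (Saturday or Sunday)."""
    flagged = dict(row)
    # date.weekday() returns 0 for Monday ... 5 for Saturday, 6 for Sunday.
    flagged["is_weekend"] = row["visit_date"].weekday() >= 5
    return flagged


def split_by_target(rows, target="visitors"):
    """Split rows into (training, submission) by whether the target is present."""
    training = [r for r in rows if r[target] is not None]
    submission = [r for r in rows if r[target] is None]
    return training, submission


flagged_rows = [add_weekend_flag(r) for r in rows]
train_rows, submission_rows = split_by_target(flagged_rows)
```

In the workflow itself, the same result is obtained with native KNIME nodes, without writing any code.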
Figure 3. This is the sub-workflow contained in the “Data preparation” metanode. It implements weekend vs. business day flagging, data blending via joining, as well as a few aggregations by restaurant group.
As you can see from the screenshot in Figure 3, the data processing part was implemented solely with native KNIME nodes, so as to have a nicely blended, feature enriched dataset in the end.
- Creation of Local H2O Context
To be able to use the H2O functionalities, you need to start an H2O environment. The H2O Local Context node does the job for you. Once you’ve created the H2O context, you can convert data from your KNIME data tables into H2O frames and train H2O models on these data.
- Training Three Models
As the prediction problem was a regression task, I chose to train the following H2O models: Random Forest, Generalized Linear Model, and Gradient Boosting Machine algorithm.
The H2O models were trained and optimized inside the corresponding metanodes in Figure 2. Let’s take a look for example at the metanode named “Gradient Boosting Machine” (Fig. 4). Inside the metanode you’ll see the classic Learner-Predictor motif, but this time the two nodes rely on H2O-based code. The “Scoring (RMSLE)” metanode calculates the error measure. We repeat this operation five times using a cross-validation framework.
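As a reminder of what the error measure computes, RMSLE is the square root of the mean squared difference between log(prediction + 1) and log(actual + 1). The sketch below is a plain Python version of that definition; the function name `rmsle` is our own choice, and this is not the content of the “Scoring (RMSLE)” metanode.

```python
import math


def rmsle(actual, predicted):
    """Root Mean Squared Logarithmic Error for non-negative values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equally long")
    # log1p(x) computes log(1 + x), so zero visitors is handled without error.
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))
```

Because the error is computed on a logarithmic scale, being off by 10 visitors matters more for a small restaurant than for a large one.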
The cross-validation framework is interesting. It starts with an H2O node, the H2O Cross-Validation Loop Start node, and it ends with a native KNIME Loop End node. The H2O Cross-Validation Loop Start node implements the H2O cross-validation procedure, extracting a different random validation subset at each iteration. The Loop End node collects the error measure from each cross-validation result. The two nodes blend seamlessly, even though they refer to two different analytics platforms.
On top of all that, an optimization loop finds the parameters of the specific model that yield the smallest average RMSLE. This loop is completely controlled via native KNIME nodes. The best parameters are selected via the Element Selector node and the model is trained again on all training data with the optimal parameters.
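To make the idea of "a different validation subset at each iteration" concrete, here is a small Python sketch of a random five-fold partitioning. It illustrates the general technique only; we do not claim that the H2O Cross-Validation Loop Start node assigns rows to folds in exactly this way, and the function name and seed are hypothetical.

```python
import random


def k_fold_indices(n_rows, n_folds=5, seed=42):
    """Return a list of (training_indices, validation_indices), one per fold."""
    if n_folds < 2 or n_folds > n_rows:
        raise ValueError("n_folds must be between 2 and n_rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # random but reproducible assignment
    # Every n_folds-th shuffled index goes to the same fold.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits
```

Each row ends up in the validation subset exactly once, so the five error values collected at the loop end are computed on disjoint data.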
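The logic of such an optimization loop can be sketched in a few lines of Python: try every parameter combination, average the per-fold errors, and keep the combination with the smallest average. The parameter names (`n_trees`, `max_depth`) and the toy scoring function below are purely hypothetical stand-ins; in the workflow, the scores come from the cross-validated H2O models.

```python
from itertools import product
from statistics import mean


def select_best_parameters(param_grid, score_folds):
    """Return (best_params, best_mean_score) with the smallest mean fold score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        average = mean(score_folds(params))  # average error over all folds
        if average < best_score:
            best_params, best_score = params, average
    return best_params, best_score


def toy_fold_scores(params):
    """Hypothetical stand-in for training and scoring a model on five folds."""
    base = 0.5 + 0.01 * abs(params["max_depth"] - 6) + 0.0001 * abs(params["n_trees"] - 100)
    return [base + offset for offset in (-0.02, -0.01, 0.0, 0.01, 0.02)]


grid = {"n_trees": [50, 100, 200], "max_depth": [4, 6, 8]}
best_params, best_score = select_best_parameters(grid, toy_fold_scores)
```

Once the best combination is known, the model is retrained on all training data with those settings, as described above.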
As you can see, the mix and match of native KNIME nodes and H2O functionalities is not only possible, but actually quite easy and efficient.
Figure 4. Content of the “Gradient Boosting Machine” metanode, including model training and model prediction, cross-validation loop, and optimization loop. Notice the H2O Cross-validation Loop Start node blends seamlessly with the native KNIME Loop End node.
- Selecting the Best Model
As a result of the previous step I have a table with three different models with their respective RMSLE scores. The workflow then selects the model that scored best with the Element Selector node, in the metanode named “Select best model”.
Afterwards the model is transformed into an H2O MOJO (Model ObJect, Optimized) object. This step is necessary in order to use an H2O model outside of an H2O context and to use the general H2O MOJO Predictor node.
- Predictions to Kaggle
Remember that the blended dataset was split into two partitions? The second partition, without the number of visitors, is the submission dataset.
The MOJO model that was just created is applied to the submission dataset. The submission dataset, this time with predictions, is then transformed into the required Kaggle format and sent to Kaggle for evaluation.
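As an illustration of this last formatting step, the Python sketch below turns predictions into a two-column CSV. We assume here that the submission file expects an `id` built from the store ID and the visit date joined by an underscore, plus a `visitors` column; please check the challenge web page for the authoritative format. The function name and the sample values are hypothetical.

```python
import csv
import io


def to_submission_csv(predictions):
    """Format (store_id, visit_date, predicted_visitors) tuples as CSV text.

    Assumed layout: header "id,visitors", with id = "<store_id>_<visit_date>".
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "visitors"])
    for store_id, visit_date, visitors in predictions:
        # A visitor count cannot be negative, so clip regression output at zero.
        writer.writerow([f"{store_id}_{visit_date}", max(0.0, float(visitors))])
    return buffer.getvalue()


csv_text = to_submission_csv(
    [("air_store_1", "2017-04-23", 12.5), ("air_store_2", "2017-04-24", -0.3)]
)
```

Clipping at zero is our own precaution for regression models that can produce slightly negative values; it also keeps the logarithm in the RMSLE well defined.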
The final workflow can be found on the KNIME EXAMPLES server under 04_Analytics/15_H2O_Machine_Learning/07_Customer_prediction_with_H2O* and is described on this page of the Node Guide.
We did it!
We built a workflow to solve the Kaggle challenge.
The workflow blended native KNIME nodes and KNIME H2O Extension nodes, thus combining the power of KNIME Analytics Platform and H2O under the same roof.
The mix and match operation was the easiest part of this whole project. Indeed, both the KNIME and H2O open source platforms have proven to work well together, complementing each other nicely.
We built this workflow not with the idea of winning (the submission deadline was over before we even got our hands on this anyway), but to showcase the openness of KNIME Analytics Platform and how easily and seamlessly the KNIME H2O extension integrates H2O in a KNIME workflow. The workflow can certainly still be improved, maybe with additional machine learning models, with more sophisticated feature engineering, or with the adoption of an ensemble model rather than the single best model.
Well now, would you like to know how we scored on Kaggle? Our submission had an RMSLE of 0.515, which puts it in the top 3% of more than 2000 submissions. Considering that we spent just a few days on this project, we are quite satisfied with this result.
Do you think you could do better? Give it a try! Download our KNIME template from the example server, make the changes needed to improve the score and let us know about your success at firstname.lastname@example.org.
* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)