Lesson 4. Advanced Machine Learning


In this lesson we introduce advanced data mining algorithms, such as tree ensemble models. They tend to give more accurate and robust results than simple models, but they also expose more hyperparameters to configure. The search for the best performing hyperparameter settings can be automated with a parameter optimization loop.

This lesson includes exercises. The corresponding data files, solution workflows, and prebuilt, empty exercise workflows with instructions are available in the L2-DS KNIME Analytics Platform for Data Scientists - Advanced folder in the E-Learning repository on the KNIME Hub.

Ensemble Trees

Ensemble trees generate predictions using multiple decision or regression trees. The final prediction is either an aggregation of the predictions of models trained independently of each other (bagging), or a weighted combination of the predictions of models trained in cascade, each one correcting the errors of the previous ones (boosting).
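In KNIME both strategies are available as dedicated nodes and require no coding. Purely as an illustration of the two strategies, here is a minimal Python sketch (scikit-learn, with a toy dataset and arbitrary settings chosen for this example) that trains a bagging ensemble and a boosting ensemble of trees and compares their cross-validated accuracy.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: trees trained independently on bootstrap samples of the data;
# their predictions are aggregated (majority vote).
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: trees trained in sequence, each one focusing on the errors of the
# previous ones; their predictions are combined with weights.
boosting = GradientBoostingClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(name, "mean cross-validated accuracy:", round(score, 3))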

Random Forest

A Random Forest is a supervised classification algorithm that builds n slightly differently trained Decision Trees, each on a bootstrap sample of the data and with a random subset of the features considered at each split, and aggregates their predictions. In the videos accompanying this lesson we introduce the theory behind the Random Forest algorithm and show how to use the Random Forest Learner and Random Forest Predictor nodes.
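The nodes themselves require no coding. Purely to illustrate the idea, the sketch below (Python/scikit-learn, with a toy dataset and arbitrary settings) trains a small forest and shows how the individual trees' predictions relate to the aggregated forest prediction.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Train n = 5 slightly different trees: each one sees a bootstrap sample of
# the rows and a random subset of the features at every split.
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

sample = X[:1]
# One prediction per tree; note that scikit-learn aggregates by averaging
# class probabilities, which for this illustration behaves like a majority vote.
votes = [int(tree.predict(sample)[0]) for tree in forest.estimators_]
print("individual tree predictions:", votes)
print("aggregated forest prediction:", int(forest.predict(sample)[0]))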


A reference workflow Training a Random Forest is available on the KNIME Hub.


Exercise: Random Forest

1) Read the file letter-recognition.csv available in the data folder on the KNIME Hub. This dataset was downloaded from the UC Irvine Machine Learning Repository.

Here we have an image recognition problem. Each image shows a letter of the alphabet and is described by various measures. The Col0 column contains the target class (the letter); all other columns are input features describing the image.

2) Train a Random Forest model to predict the alphabet letter in the Col0 column:

  • Partition the dataset into a training set (80%) and a test set (20%). Perform stratified sampling on the target column.
  • Train a Random Forest model on the training set to predict values in the target column. Train 5 trees with minimum node size 2.
  • Apply the trained model to the test set.
  • Evaluate the accuracy of the model with the scoring metrics for a classification model.

3) OPTIONAL: Train a Random Forest with 100 trees, and compare the performance of the two models.


The empty exercise workflow 06_Random_Forest is available in the KNIME Hub course repository.


Solution: Random Forest

1-2) Download the letter-recognition.csv file from the data folder on the KNIME Hub and read it with the File Reader node. Next, use the Partitioning node: set the relative size to 80 %, select "Stratified sampling", and choose the "Col0" column in the menu. Connect a Random Forest Learner node to the top output of the Partitioning node (the training set) and select the "Col0" column as the target column. Include all other columns as predictors, enable the "Minimum node size" option and set it to 2, and set the "Number of models" option to 5. Connect the model output and the bottom output of the Partitioning node (the test set) to a Random Forest Predictor node. Finally, connect a Scorer or Scorer (JavaScript) node to the Random Forest Predictor node, and select the "Col0" column as the first/actual column and the "Prediction (Col0)" column as the second/predicted column. The overall accuracy of the model is 89.20 %.
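For reference only, a rough Python equivalent of the same steps is sketched below with pandas and scikit-learn. It assumes letter-recognition.csv is available locally without a header row, and it uses min_samples_leaf as an approximate counterpart of the "Minimum node size" option, so the resulting accuracy will not match the KNIME number exactly.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Col0 = target letter, Col1..Col16 = image measures (the file has no header row)
columns = ["Col" + str(i) for i in range(17)]
data = pd.read_csv("letter-recognition.csv", header=None, names=columns)
X, y = data.drop(columns="Col0"), data["Col0"]

# Partitioning node: 80 % training set, 20 % test set, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Random Forest Learner: 5 trees, minimum node size 2 (approximated here)
model = RandomForestClassifier(n_estimators=5, min_samples_leaf=2, random_state=0)
model.fit(X_train, y_train)

# Random Forest Predictor + Scorer
print("overall accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 4))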

OPTIONAL: Set the "Number of models" option in the Random Forest Learner node to 100. The overall accuracy of the model is now 95.78 %.


The solution workflow 06_Random_Forest - Solution is available in the KNIME Hub course repository.


Model Optimization and Validation

Some algorithms are sensitive to their configuration, and finding the best performing values manually can be tedious. A parameter optimization loop trains a number of models with different hyperparameter values and returns the best performing values according to an objective function, for example the overall accuracy.
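As a language-based illustration of such a loop, the minimal Python sketch below (scikit-learn, toy data, arbitrary candidate values) does the same thing by hand: train a model for every hyperparameter combination, score it, and keep the best one.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

best_score, best_params = -1.0, None
for n_trees in (5, 25, 100):              # candidate values for the number of trees
    for min_node_size in (1, 2, 5):       # candidate values for the minimum node size
        model = RandomForestClassifier(
            n_estimators=n_trees, min_samples_leaf=min_node_size, random_state=0
        )
        # Objective function: mean cross-validated overall accuracy
        score = cross_val_score(model, X, y, cv=5).mean()
        if score > best_score:
            best_score = score
            best_params = {"n_estimators": n_trees, "min_samples_leaf": min_node_size}

print("best accuracy:", round(best_score, 3), "with", best_params)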


A reference workflow 2 Examples for Parameter Optimization Loops is available on the KNIME Hub. 

This concludes the [L2-DS] KNIME Analytics Platform for Data Scientists: Advanced course. Well done! You can practice deployment and special analytics types in the courses at the L3 and L4 levels.
