In this lesson we introduce advanced data mining algorithms, such as tree ensemble models. They tend to give more accurate and robust results than simple models, though they expose more hyperparameters to configure. The search for the best-performing hyperparameter setting can be automated with a parameter optimization loop.
This lesson includes exercises. The corresponding data files, solution workflows, and prebuilt, empty exercise workflows with instructions are available in the L2-DS KNIME Analytics Platform for Data Scientists - Advanced folder in the E-Learning repository on the KNIME Hub.
Tree ensembles generate predictions using multiple decision or regression trees. The final prediction is either an aggregation of the predictions of individually trained models (bagging), or a weighted combination of models trained in sequence, where each model focuses on the errors of its predecessors (boosting).
A Random Forest is a supervised classification algorithm that builds n decision trees, each trained on a slightly different view of the data, and aggregates their predictions. In the videos below we introduce the theory behind the Random Forest algorithm and show how to use the Random Forest Learner and Random Forest Predictor nodes.
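In KNIME the algorithm runs inside the Random Forest Learner and Predictor nodes, but its two core mechanics, bootstrap sampling and majority voting, can be sketched in plain Python. This is a toy illustration only; the `trees` below are hypothetical stand-ins for trained decision trees:

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    # Each tree in a random forest is trained on a bootstrap sample:
    # n rows drawn from the training set with replacement.
    return [rng.choice(rows) for _ in rows]

def forest_predict(trees, sample):
    # Aggregation step: every tree votes, and the majority class wins.
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Toy "trees": trivial classifiers standing in for trained decision trees.
trees = [
    lambda s: "A" if s[0] > 0.5 else "B",
    lambda s: "A" if s[1] > 0.5 else "B",
    lambda s: "A",
]

print(bootstrap_sample([1, 2, 3, 4], random.Random(0)))
print(forest_predict(trees, (0.7, 0.2)))  # votes A, B, A -> "A"
```

Because each tree sees a different bootstrap sample (and, in a real random forest, a random subset of features at each split), the trees disagree in different places, and the majority vote averages out their individual errors.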
A reference workflow Training a Random Forest is available on the KNIME Hub.
Here we have an image recognition problem. Each image contains an alphabet letter and is described by various measures. The column Col0 contains the target class (the letter); all other columns are input features measured from the image.
2) Train a Random Forest model to predict the alphabet letter in the Col0 column:
- Partition the dataset into a training set (80%) and a test set (20%). Perform stratified sampling on the target column.
- Train a Random Forest model on the training set to predict values in the target column. Train 5 trees with minimum node size 2.
- Apply the trained model to the test set.
- Evaluate the model with the scoring metrics for a classification model, such as overall accuracy.
3) OPTIONAL: Train a Random Forest with 100 trees and compare the performance of the two models.
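For orientation, the stratified partitioning and accuracy scoring in the steps above can be sketched outside KNIME in a few lines of Python. This is a toy illustration of the two operations, not the exercise solution; the label list is made up:

```python
import random
from collections import defaultdict

def stratified_split(n_rows, labels, test_frac=0.2, seed=42):
    """Partition row indices while preserving class proportions, which is
    what the stratified sampling option of the Partitioning node does."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(len(idxs) * test_frac)
        test.extend(idxs[:cut])
        train.extend(idxs[cut:])
    return train, test

def accuracy(y_true, y_pred):
    """Overall accuracy: the fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

labels = ["A"] * 10 + ["B"] * 5          # made-up target column
train, test = stratified_split(len(labels), labels)
print(len(train), len(test))             # 12 training rows, 3 test rows
```

The split keeps roughly 20% of each class in the test set, so rare letters are represented in both partitions, which is why stratified sampling is requested on the target column.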
Empty exercise workflow 06_Random_Forest in the KNIME Hub course repository.
OPTIONAL: Set the Number of models setting in the Random Forest Learner node to 100. The overall accuracy of the model is now 95.78%.
Solution workflow 06_Random_Forest - Solution in the KNIME Hub course repository.
Model Optimization and Validation
Some algorithms are sensitive to their configuration, and finding the best-performing values manually can be tedious. A parameter optimization loop trains a number of models with different hyperparameter values and returns the best values according to a cost function, for example overall accuracy.
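In KNIME this is handled by the Parameter Optimization Loop Start and Parameter Optimization Loop End nodes; the underlying idea is simply a loop over candidate values that keeps the best score. A minimal sketch in Python, where `train_and_score` is a made-up stand-in for training and scoring one model:

```python
def train_and_score(num_trees):
    # Hypothetical cost function; in a real workflow this would train a
    # model with `num_trees` trees and return its accuracy on a test set.
    return 1.0 - 1.0 / num_trees

best_params, best_score = None, float("-inf")
for num_trees in [5, 25, 50, 100]:      # candidate hyperparameter values
    score = train_and_score(num_trees)  # train and evaluate one model
    if score > best_score:              # keep the best-performing setting
        best_params, best_score = num_trees, score

print(best_params, best_score)  # -> 100 0.99
```

Real loops often search several hyperparameters at once (e.g. number of trees and minimum node size) and use smarter strategies than brute force, such as hillclimbing or Bayesian optimization, both of which the KNIME loop nodes support.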
A reference workflow 2 Examples for Parameter Optimization Loops is available on the KNIME Hub.
This concludes the [L2-DS] KNIME Analytics Platform for Data Scientists: Advanced course. Well done! You can practice deployment and specialized analytics techniques in the L3 and L4 level courses.