Random Forest

The Random Forest model evolved from the simple Decision Tree model to meet the need for more robust classification performance.

A Random Forest is a supervised classification algorithm that trains N Decision Trees, each in a slightly different way, and combines their predictions to obtain more accurate and more robust results.

The advantage of this strategy is clear. Predictions from a single tree are highly sensitive to noise in the training set, whereas the majority vote of many trees is not, provided the trees are not correlated. Bootstrap sampling is a way to decorrelate the trees: each tree is trained on a different training set, drawn at random with replacement from the original one.
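
As a rough illustration of the two ingredients described above, the following Python sketch draws bootstrap samples and combines per-tree votes by majority. It is not how KNIME implements the Random Forest nodes; the helper names `bootstrap_sample` and `majority_vote`, the toy row indices, and the hard-coded votes are assumptions made only for this example.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random WITH replacement (hypothetical helper)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the most frequent label among the per-tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
training_rows = list(range(10))   # toy stand-in for the training set rows

# One resampled training set per tree; each differs slightly from the others.
samples = [bootstrap_sample(training_rows, rng) for _ in range(5)]

# Assumed outputs of five trees for one test row (illustrative values only).
tree_votes = ["A", "B", "A", "A", "C"]
forest_prediction = majority_vote(tree_votes)
print(forest_prediction)
```

Because each sample is drawn with replacement, some rows appear more than once and others are left out, which is what makes the individual trees differ.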

In the first video, we briefly explain the theory behind the Random Forest model.

In the second video, we briefly explain how to configure the Random Forest Learner and Predictor nodes according to the parameters of a Random Forest model.

The workflow shown in the two videos in this section can be found on the KNIME Hub under this link.

Exercise

Read the letter-recognition.csv dataset. This dataset was downloaded from the UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Letter+Recognition

This is an image recognition problem. Each image contains one letter of the alphabet, described by various measures. Col0 contains the target class (the letter). All other columns are input features measured on the image.

Train a Random Forest model to predict the alphabet letter in column Col0.

  • Partition the dataset into a training set (80%) and a test set (20%). Perform stratified sampling on the target column.
  • Train a Random Forest model on the training set to predict values in the target column. Train 5 trees with minimum node size 2.
  • Apply the trained model to the test set.
  • Evaluate the model using scoring metrics for classification, such as the overall accuracy.
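
For readers who want to see what the partitioning and scoring steps above compute, here is a minimal Python sketch of a stratified 80/20 split and an overall accuracy calculation. It is an illustration under stated assumptions, not the KNIME implementation: the function names `stratified_split` and `accuracy`, the toy rows, and the fixed seed are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, target_index=0, train_fraction=0.8, seed=42):
    """Split rows so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target_index]].append(row)

    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)  # rows of this class for training
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


def accuracy(actual, predicted):
    """Fraction of rows whose predicted class equals the actual class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Toy data: the target class sits in the first column, as Col0 does in the exercise.
rows = [("A", i) for i in range(10)] + [("B", i) for i in range(20)]
train, test = stratified_split(rows)
print(len(train), len(test))
```

Splitting class by class is what keeps rare letters represented in both the training set and the test set.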

Optionally:

  • Train one Random Forest with 100 trees and another with 5 trees, and compare their performance.

Solution

  • Read the letter-recognition.csv file using the File Reader node.
  • Partition the dataset into a training set and a test set using the Partitioning node. Enter 80 in the “Relative[%]” field. Check “Stratified sampling”, and select the target column “Col0” in the menu.
  • Connect the top output of the Partitioning node (the training set) to the Random Forest Learner node. In the configuration dialog of the Random Forest Learner node, select “Col0” as the target column. Set the “Minimum node size” setting to 2 and “Number of models” setting to 5.
  • Connect the bottom output port of the Partitioning node (the test set) to the Random Forest Predictor node. Connect the model output of the Random Forest Learner node to the model input of the Random Forest Predictor node.
  • Connect the Scorer (JavaScript) node to the output of the Random Forest Predictor node. Define “Col0” as the actual column and “Prediction (Col0)” as the predicted column.
  • Inspect the performance of the classification model in the interactive view of the Scorer (JavaScript) node. Right-click the node and select “Interactive view: Confusion Matrix” in the menu. The overall accuracy of the model is 75.7%.
  • Optional Task:

    • Use the same training and test set as for the previous model.
    • In step 3 (configuring the Random Forest Learner node), set the “Number of models” setting to 100.
    • Complete steps 4 to 6 (prediction and scoring) as for the previous model. The overall accuracy of the model is 83.6%.
    • The solution workflow is shown below and available for download at https://kni.me/w/siQVYhmV0okIE8Gb

     

    [Figure: solution workflow, KNIME E-Learning Chapter 6.3 Random Forest]

    Performance comparison in overall accuracy:

    [Figure: performance comparison, KNIME E-Learning Course Chapter 6.3 Random Forest]
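
The gap between the 5-tree and the 100-tree forest reported in the solution is in line with what majority voting predicts in an idealized setting. The Python sketch below is illustrative only and is not part of the KNIME workflow. It assumes a two-class problem and fully independent trees that are each correct with the same probability `p`; neither assumption holds for the 26-class letter data, so the numbers it prints are not expected to match the accuracies above. The function name `majority_vote_accuracy` is hypothetical, and 101 trees are used instead of 100 to avoid tied votes.

```python
from math import comb


def majority_vote_accuracy(n_trees, p):
    """Probability that a strict majority of n_trees independent trees is correct.

    Assumes a binary problem and that every tree is correct with probability p,
    independently of the others (an idealization).
    """
    return sum(
        comb(n_trees, k) * p**k * (1 - p) ** (n_trees - k)
        for k in range(n_trees // 2 + 1, n_trees + 1)
    )


# Odd tree counts avoid ties in the vote.
for n in (1, 5, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions the ensemble accuracy rises with the number of trees whenever `p` is above 0.5. Correlation between real trees weakens this effect, which is why decorrelating them matters.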