Decision Tree

The decision tree is a classic predictive analytics algorithm for solving binary and multinomial classification problems.

One of the first widely-known decision tree algorithms was published by R. Quinlan as C4.5 in 1993 (Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993). Over time, the original algorithm has been improved for better accuracy by adding new quality measures and pruning techniques, and parallelized for faster execution.

The prediction results produced by a decision tree algorithm can be improved by changing settings like the quality measure, the splitting criterion, minimum number of records per node, or pruning technique.

To optimize these parameters, it helps to understand the theory behind the algorithm: for example, how a different splitting criterion affects the prediction, or what actually happens during the pruning phase.
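To make the role of the quality measure concrete, here is a minimal sketch of the two impurity measures commonly offered as splitting criteria (Gini index and information gain/entropy), computed for a hypothetical split of 10 records. The numbers are illustrative only and do not come from any KNIME workflow.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini impurity of a class distribution given as counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Hypothetical parent node: 10 records, 5 per class (maximal impurity)
parent = [5, 5]
# A candidate split sends the records into two child nodes
left, right = [4, 1], [1, 4]

n = sum(parent)
# Information gain = parent entropy minus the weighted entropy of the children;
# the split with the highest gain (or lowest weighted Gini) is chosen
gain = entropy(parent) - sum(
    sum(child) / n * entropy(child) for child in (left, right)
)
print(f"parent entropy: {entropy(parent):.3f}")  # 1.000
print(f"parent gini:    {gini(parent):.3f}")     # 0.500
print(f"information gain of split: {gain:.3f}")
```

Both measures rank splits similarly in practice, which is why switching the quality measure usually changes the tree only at a few borderline splits.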

The videos below first explain the theory behind the decision tree and then show how to build a decision tree model in KNIME Analytics Platform.

The KNIME implementation of the decision tree is based on the following publications: "C4.5: Programs for Machine Learning" by J. R. Quinlan and "SPRINT: A Scalable Parallel Classifier for Data Mining" by J. Shafer, R. Agrawal, and M. Mehta.

 

The workflow shown in the "Decision Tree Learner Node: Algorithm Settings" video can be found on the EXAMPLES server under 04_Analytics/04_Classification_and_Predictive_Modelling/07_Decision_Tree*.

Exercise

Read the adult.csv dataset.

Train a Decision Tree to predict whether or not a person earns more than 50k per year.

  • Partition the dataset into a training set (75%) and a test set (25%) using the Partitioning node with the stratified sampling option on the column “Income”.
  • Use the Decision Tree Learner Node to train the model on the training set and the Decision Tree Predictor Node to apply the model to the test set.
  • Use the Scorer node to evaluate the accuracy of the model.

Optionally:

  • Try out other parameter settings to get a higher accuracy. For example, change the quality measure, pruning method, or minimum number of records.

 

Solution
With the decision tree model we can reach an accuracy of 84% on the test set. The picture below shows the workflow used to solve the exercise.

* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform 3.2.0 or higher, installed via the Installer).