# Lesson 4. Machine Learning & Data Export

Once you have cleaned and prepared your data, you can move on to training a machine learning model. KNIME Analytics Platform offers many machine learning algorithms: classic and modern, supervised and unsupervised, from the field of statistics and from the machine learning community, algorithms that predict numeric values or nominal classes, and algorithms that explore patterns, requiring past time series or just a random sample of data.

After selecting the algorithm you need, you can start the model building process. Here we guide you through it, from a process overview to building basic classification and regression models, and from the theory behind the algorithms to evaluating how they perform.

This lesson includes exercises, and the data files, solution workflows, and prebuilt, empty exercise workflows with instructions are available in the L1-DS KNIME Analytics Platform for Data Scientists - Basics folder in the E-Learning repository on the KNIME Hub.

## Basic Concepts of Data Science

Let’s start with the main concepts behind training and applying a machine learning model.

### The Learner-Predictor Construct

For supervised algorithms, a training phase, a test phase, and optionally an optimization phase are implemented before transferring the model into production. This video shows how to cover these phases in KNIME Analytics Platform through a Learner-Predictor construct.
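For readers who also code, the Learner-Predictor construct maps naturally onto the fit/predict pattern found in many libraries. The sketch below uses Python and scikit-learn purely as an illustration (the course itself uses KNIME nodes, and the toy data here is an assumption):

```python
# A minimal code analogue of KNIME's Learner-Predictor construct,
# sketched with scikit-learn on toy data.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 10   # toy features
y = [0, 0, 1, 1] * 10                        # toy target (equals first feature)

# Training phase: the "Learner" node fits a model on the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Test phase: the "Predictor" node applies the fitted model to unseen rows.
predictions = model.predict(X_test)
```

The optional optimization phase would wrap this fit/predict cycle in a parameter search before the model goes into production.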

### Example of Training and Testing a Machine Learning Model

Building a machine learning model is more than just training the algorithm: it is the whole process from raw data to model evaluation and deployment. In the video below we summarize the most common steps required to produce a valid machine learning model.

## Regressions

Regression is one of the oldest self-learning methods used in predictive analytics, predicting either nominal classes (logistic regression) or numeric values (linear and polynomial regression).

### Logistic Regression

Logistic regression is a regression model used to predict nominal classes. Classic logistic regression works for binary classification problems, but it has been extended to predict more than two classes with multinomial logistic regression. If the output classes are ordered, we talk about ordinal logistic regression.
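To make the binary vs. multinomial distinction concrete, here is a hedged sketch in Python with scikit-learn (an illustration only, not part of the KNIME course material; the feature values and class labels are invented):

```python
# Binary vs. multinomial logistic regression with scikit-learn.
from sklearn.linear_model import LogisticRegression

# Binary problem: exactly two output classes.
X_bin = [[0.0], [0.2], [0.8], [1.0]]
y_bin = ["white", "white", "red", "red"]
binary_model = LogisticRegression().fit(X_bin, y_bin)

# Multinomial problem: the same estimator generalizes to more than two classes.
X_multi = [[0.0], [0.5], [1.0]] * 5
y_multi = ["low", "medium", "high"] * 5
multi_model = LogisticRegression().fit(X_multi, y_multi)
```

Ordinal logistic regression, which exploits the ordering of the classes, needs a dedicated implementation and is not shown here.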

A reference workflow Logistic Regression is available on the KNIME Hub.

Exercise: Logistic Regression

1) Read the wine.csv file available in the data folder on the KNIME Hub

2) Train a Logistic Regression Model to predict whether a wine is red or white:

• Use the Normalizer (PMML) node to z-score normalize all numerical columns
• Partition the dataset into a training set (80%) and a test set (20%). Apply stratified sampling on the color column.
• Train a logistic regression model on the training set, and apply the model to the test set
• OPTIONAL: use the Scorer node to evaluate the accuracy of the model

Empty exercise workflow 17_Logistic_Regression in the KNIME Hub course repository.

Solution: Logistic Regression

Download the wine.csv file from the data folder on the KNIME Hub, and read the file with the File Reader node. Use the Normalizer (PMML) node, include all columns, and select “Z-Score Normalization”. Next, use the Partitioning node, set 80 as the relative size, select “Stratified sampling”, and select the “color” column in the menu. Connect a Logistic Regression Learner node to the top output of the Partitioning node (the training set), and select the “color” column as the target column. Include all columns as predictor columns. Connect the model output and the bottom output of the Partitioning node (the test set) to a Logistic Regression Predictor node.

OPTIONAL: Connect a Scorer or Scorer (JavaScript) node to the Logistic Regression Predictor node, and select the “color” column as the first/actual column and the “Prediction (color)” column as the second/predicted column.
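The solution steps above — z-score normalization, stratified 80/20 partitioning, training, prediction, and scoring — can be sketched in Python with scikit-learn. This is an illustrative stand-in only: the synthetic data below replaces wine.csv, and the two-class labels mimic the “color” column.

```python
# Illustrative scikit-learn analogue of the KNIME logistic regression solution.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for wine.csv: two well-separated classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(2, 1, (100, 3))])
y = np.array(["white"] * 100 + ["red"] * 100)

# Normalizer (PMML) node: z-score normalization of all numeric columns.
X = StandardScaler().fit_transform(X)

# Partitioning node: 80/20 split with stratified sampling on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=1)

# Learner + Predictor: fit on the training set, apply to the test set.
model = LogisticRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# Scorer node: overall accuracy.
accuracy = accuracy_score(y_test, pred)
```

Stratified sampling guarantees that both classes keep their original proportions in the training and test sets.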

Solution workflow 17_Logistic_Regression - Solution in the KNIME Hub course repository.

## Decision Tree Family

Another big family of classifiers consists of decision trees and their modern ensemble evolutions. A decision tree splits the original dataset into two or more subsets at each step so as to better isolate the target classes. Each split can be graphically represented as a tree node and translated into a rule. The sequences of splits, i.e. the tree branches, define a set of rules that lead to isolated target classes.
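The split-to-rule translation described above can be made visible in code. Here is a hedged sketch with scikit-learn's `export_text`, which prints a fitted tree as nested rules (an illustration only; the KNIME Decision Tree Learner node has its own tree view, and the toy data is invented):

```python
# A decision tree's splits rendered as readable rules.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 0], [30, 1], [45, 0], [50, 1]]      # toy features: age, flag
y = ["<=50K", "<=50K", ">50K", ">50K"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Each split becomes a rule; each branch is a sequence of rules
# leading to an isolated target class.
rules = export_text(tree, feature_names=["age", "flag"])
print(rules)
```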

One of the first widely known decision tree algorithms was published by R. Quinlan as C4.5 in 1993 (Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993). Over time, the original algorithm has been improved for better accuracy by adding new quality measures and pruning techniques, and parallelized for faster execution.

The KNIME implementation of the decision tree follows “C4.5: Programs for Machine Learning” by J. R. Quinlan and “SPRINT: A Scalable Parallel Classifier for Data Mining” by J. Shafer, R. Agrawal, and M. Mehta.

A reference workflow Training a Decision Tree is available on the KNIME Hub.

Exercise: Decision Tree Family

1) Read the file adult.csv available in the data folder on the KNIME Hub. The data are provided by the UCI Machine Learning Repository.

2) Train a Decision Tree to predict whether or not a person earns more than 50K per year:

• Partition the dataset into a training set (75%) and a test set (25%). Apply stratified sampling on the income column.
• Train a Decision Tree model on the training set, and apply the model to the test set

OPTIONAL:

• Use the Scorer node to evaluate the accuracy of the model
• Try out other parameter settings to get a higher accuracy. For example, change the quality measure, pruning method, or minimum number of records.

Empty exercise workflow 18_Decision_Tree in the KNIME Hub course repository.

Solution: Decision Tree Family

Download the adult.csv file from the data folder on the KNIME Hub, and read the file with the File Reader node. Next, use the Partitioning node, set 75 as the relative size, select “Stratified sampling”, and select the “income” column in the menu. Connect a Decision Tree Learner node to the top output of the Partitioning node (the training set), and select the “income” column as the target column. Connect the model output and the bottom output of the Partitioning node (the test set) to a Decision Tree Predictor node.

OPTIONAL: Connect a Scorer or Scorer (JavaScript) node to the Decision Tree Predictor node, and select the “income” column as the first/actual column and the “Prediction (income)” column as the second/predicted column. Try, for example, to increase the minimum records per node from the default value 2 to 4, and inspect the scoring metrics again.

Solution workflow 18_Decision_Tree - Solution in the KNIME Hub course repository.

## Model Evaluation

After training a classification or prediction model, the next step is to evaluate its performance. The model evaluation step consists of two parts: applying the model to a test set, and comparing the predicted values and actual values.

### Confusion Matrix and Class Statistics

A confusion matrix shows the numbers of correct and incorrect predictions in the target classes, and you can use it to calculate the different class statistics and overall accuracy statistics.
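As a hedged illustration of the statistics behind the Scorer node, the sketch below computes a confusion matrix and overall accuracy with scikit-learn on a handful of toy labels (the labels mirror the income classes used later in this lesson; the counts are invented):

```python
# Confusion matrix and overall accuracy on toy actual/predicted labels.
from sklearn.metrics import accuracy_score, confusion_matrix

actual    = ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"]
predicted = ["<=50K", ">50K",  ">50K", ">50K", "<=50K", "<=50K"]

# Rows = actual classes, columns = predicted classes.
cm = confusion_matrix(actual, predicted, labels=["<=50K", ">50K"])

# Overall accuracy = correct predictions / all predictions.
accuracy = accuracy_score(actual, predicted)
```

The diagonal of the matrix holds the correct predictions; everything off the diagonal is an error, broken down by which class was confused with which.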

Exercise: Confusion Matrix and Class Statistics

1) Read the file predicted_income.csv available in the data folder on the KNIME Hub. The original adult.csv dataset is provided by the UCI Machine Learning Repository.

The “income” column contains people’s actual income class values. The “Prediction (income)” column contains their predicted income class values produced by some classification model based on the other information available in the dataset. The income class has two values: “<=50K” and “>50K”.

2) Evaluate the accuracy of the income class prediction using the Scorer (JavaScript) node. Execute the node, and open the interactive view.

- What is the overall accuracy of the model?

3) Open the configuration dialog of the Scorer (JavaScript) node and exclude the following statistics from the class prediction statistics table:

True positives, False positives, True negatives, False negatives

4) Open the interactive view again, and display the number of rows in the confusion matrix. What is the number of rows in the dataset?

Empty exercise workflow 19_Confusion_Matrix_and_Class_Statistics in the KNIME Hub course repository.

Solution: Confusion Matrix and Class Statistics

1 - 2) Download the predicted_income.csv file from the data folder on the KNIME Hub, and read the file with the File Reader node. Next, use the Scorer (JavaScript) node, and select “income” as the actual column and “Prediction (income)” as the predicted column. Open the interactive view output of the node. The overall accuracy is 83.68 %.

3) Open the Statistics Options tab in the configuration dialog, check the “Display class statistics table” option, and uncheck the four statistics.

4) Click on the menu icon in the top right corner, and check the “Display number of rows” option. The number of rows is 6513.

Solution workflow 19_Confusion_Matrix_and_Class_Statistics - Solution in the KNIME Hub course repository.

### ROC Curve

Another very powerful technique for measuring the quality of a classification model is the Receiver Operating Characteristic (ROC) curve.
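To illustrate what the ROC Curve node computes, here is a hedged Python sketch with scikit-learn: the false and true positive rates at varying probability thresholds, plus the area under the curve. The labels and probabilities are invented toy values shaped like the gender-prediction exercise in this lesson:

```python
# ROC curve and AUC from toy class labels and predicted probabilities.
from sklearn.metrics import auc, roc_curve

actual   = ["Female", "Female", "Male", "Male", "Female", "Male"]
p_female = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7]   # predicted P(sex=Female)

# Sweep the decision threshold; collect false/true positive rates.
fpr, tpr, thresholds = roc_curve(actual, p_female, pos_label="Female")

# Area under the ROC curve: 1.0 is a perfect ranking, 0.5 is random.
auc_value = auc(fpr, tpr)
```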

A reference workflow Evaluating Classification Model Performance is available on the KNIME Hub.

Exercise: ROC Curve

1) Read the file predicted_gender.csv available in the data folder on the KNIME Hub. The original adult.csv dataset is provided by the UCI Machine Learning Repository.

The “sex” column contains people’s actual gender: Female or Male. The “Prediction (sex) ...” columns contain their gender values predicted by two different classification models - a decision tree (DT) and logistic regression model (LR). The “P(sex=Female)...” columns contain the predicted probabilities of being female produced by the two models.

2) Evaluate the performance of the decision tree model using the ROC curve node

- Set the Class column, Positive class value, and Columns containing the positive class probabilities in the configuration dialog

- Execute the node and open the interactive view

- What is the area under the curve for the decision tree model?

3) Compare the performance of the decision tree and logistic regression models by plotting their ROC curves in the same graph

- Open the configuration dialog of the ROC Curve node

- Add the relevant columns to the Columns containing the positive class probabilities

- Which of the models performs better?

- What is the area under the curve for the logistic regression model?

OPTIONAL: open the interactive view again and change the title of the view to “Performance of Decision Tree and Logistic Regression Models in Predicting Gender”.

Empty exercise workflow 20_ROC_Curve in the KNIME Hub course repository.

Solution: ROC Curve

1 - 2) Download the predicted_gender.csv file from the data folder on the KNIME Hub, and read the file with the File Reader node. Next, use the ROC Curve node, and select “sex” as the class column, “Female” as the positive class value, and include the “P (sex=Female)DT” column. Open the interactive view output of the node. The AUC value is 0.849.

3) Also include the “P (sex=Female)LR” column in the configuration dialog of the ROC Curve node. The logistic regression model performs better; its AUC value is 0.931.

OPTIONAL: click on the menu icon in the top right corner of the interactive view, and write the title in the Chart Title field.

Solution workflow 20_ROC_Curve - Solution in the KNIME Hub course repository.

### Error Metrics

When you evaluate a numeric prediction, for example by using a regression model or a time series model, you need error metrics that tell you the size and/or direction of the prediction error.
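Two such metrics, the mean absolute error and R-squared, appear in the exercise in this section. As a hedged illustration (toy values, scikit-learn rather than the KNIME Numeric Scorer node), they can be computed like this:

```python
# Mean absolute error and R-squared on toy numeric predictions.
from sklearn.metrics import mean_absolute_error, r2_score

actual    = [12.0, 24.0, 36.0, 48.0]
predicted = [10.0, 26.0, 33.0, 50.0]

mae = mean_absolute_error(actual, predicted)   # average absolute error
r2  = r2_score(actual, predicted)              # proportion of variance explained
```

MAE reports the error size in the target's own unit (months, in the credit-duration exercise), while R-squared reports how much of the target's variance the model explains.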

A reference workflow Evaluating the Performance of a Regression Model is available on the KNIME Hub.

Exercise: Regression Model Evaluation

1) Read the file german-credit-scoring.csv available in the data folder on the KNIME Hub. The data are provided by the UCI Machine Learning Repository.

2) Partition the data into a training set (75 %) and test set (25 %). Draw randomly.

3) Train a linear regression model on the training set to predict the duration of a credit. Use all other columns for the prediction.

4) Apply the model to the test set

5) Evaluate the performance of the linear regression model with the Numeric Scorer node. Which proportion of the variance of the credit duration does the model explain? How many months is the mean absolute error of the model?

Empty exercise workflow 21_Regression_Model_Evaluation in the KNIME Hub course repository.

Solution: Regression Model Evaluation

1-2) Download the german-credit-scoring.csv file from the data folder on the KNIME Hub, and read the file with the File Reader node. Next, use the Partitioning node, set 75 as the relative size, and select “Draw randomly”.

3-4) Connect a Linear Regression Learner node to the top output of the Partitioning node (the training set), and select the “Duration in months” column as the target column. Include all other columns as predictor columns. Connect the model output and the bottom output of the Partitioning node (the test set) to a Regression Predictor node.

5) Connect a Numeric Scorer node to the Regression Predictor node. Select the “Duration in months” column as the reference column and the “Prediction (Duration in months)” as the predicted column. The proportion of the variance explained is about 50 % as given by the R-squared metric. The mean absolute error (MAE) of the model is about 7 months.

Solution workflow 21_Regression_Model_Evaluation - Solution in the KNIME Hub course repository.

## Write to a File

KNIME Analytics Platform provides different deployment options: exporting data in different file formats, writing models in PMML format, integrating an external reporting tool, creating a REST API, and building an Analytics Application accessible via the KNIME WebPortal. The different file formats have their own writer nodes. Here we introduce one of them, the CSV Writer node.
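As a minimal code analogue of the CSV Writer node, the sketch below writes a small table to a CSV file with pandas (the table contents and file name are invented for illustration; the course itself uses the KNIME node):

```python
# Writing a small result table to a CSV file with pandas.
import pandas as pd

table = pd.DataFrame({"workclass": ["Private", "State-gov"],
                      "count": [100, 25]})

# index=False keeps the row index out of the exported file.
table.to_csv("example_output.csv", index=False)
```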

Exercise: Write to a File

1) Read the adult.csv file available in the data folder on the KNIME Hub. The data are provided by the UCI Machine Learning Repository.

2) Calculate the total number of rows and average age for all women with income >50K per year

3) Write the resulting table as a CSV file into the data folder using the knime:// protocol

Empty exercise workflow 22_Write_Data_to_File in the KNIME Hub course repository.

Solution: Write to a File

Download the adult.csv file from the data folder on the KNIME Hub, and read the file with the File Reader node. Use, for example, the Rule-based Row Filter node with the following expression to filter the data:

$sex$ = "Female" AND $income$ = ">50K" => TRUE

Use the GroupBy node to calculate the average age and row count in the filtered data.
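The filter-and-aggregate steps above can be sketched with pandas on a tiny stand-in table (the real adult.csv stays on the KNIME Hub; the rows below are invented for illustration):

```python
# Rule-based Row Filter + GroupBy, sketched with pandas.
import pandas as pd

df = pd.DataFrame({
    "sex":    ["Female", "Female", "Male", "Female"],
    "income": [">50K", "<=50K", ">50K", ">50K"],
    "age":    [40, 30, 50, 44],
})

# Rule-based Row Filter: keep women earning >50K.
women = df[(df["sex"] == "Female") & (df["income"] == ">50K")]

# GroupBy node: row count and average age of the filtered rows.
row_count = len(women)
average_age = women["age"].mean()
```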

Export the aggregated table with the CSV Writer node. Start the file path with “knime://”, and continue with the path from the currently active workflow to the data file, for example “knime://knime.workflow/../../data/women_aggregated.csv”.

Solution workflow 22_Write_Data_to_File - Solution in the KNIME Hub course repository.

## Export Data to a Report

KNIME Analytics Platform does not offer a native reporting solution, but it integrates with a number of reporting platforms, such as BIRT, Tableau, Spotfire, and Power BI.

### Export Data into a BIRT Report

BIRT (Business Intelligence and Reporting Tools) is a reporting solution with an open source component. The open source component is integrated within KNIME Analytics Platform via the KNIME Report Designer extension.

Reference workflows are available in the Examples/05_Reporting/01_BIRT repository on the KNIME Hub.

Exercise: Export Data into a BIRT Report

1) Read the adult.csv file available in the data folder on the KNIME Hub. The data are provided by the UCI Machine Learning Repository.

2) Create a pivot table, by workclass, that counts the number of records for each income class

3) Sort the pivot table by workclass

4) Send the data to BIRT for use in a report

5) In the BIRT Report editor, use a grid to lay out the following elements:

- A report title

- A formatted pivot table

BONUS) Create a bar chart of counts with workclass on the x-axis, and income classes in two different series on the y-axis

Empty exercise workflow 23_Export_to_BIRT_Report in the KNIME Hub course repository.

Solution: Export Data into a BIRT Report

2-3) Create the pivot table with the Pivoting node. Select “workclass” as the group column, and “income” as the pivot column. Apply the aggregation method “Count” to any column. Sort the pivot table with the Sorter node. Select “workclass” as the sorting column.
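The Pivoting and Sorter steps above can be sketched with pandas on stand-in data (an illustration only; the course uses the KNIME nodes, and the rows below are invented):

```python
# Pivoting node (group by workclass, pivot on income, count) + Sorter node.
import pandas as pd

df = pd.DataFrame({
    "workclass": ["Private", "Private", "State-gov", "Private"],
    "income":    ["<=50K", ">50K", "<=50K", "<=50K"],
})

# crosstab counts records per (workclass, income) cell;
# sort_index sorts the result by workclass.
pivot = pd.crosstab(df["workclass"], df["income"]).sort_index()
```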

4-5) Use the Data to Report node to send the pivot table to a BIRT report. Open the report editor. Open the Master Page tab, and drag a Grid into the Header field. Drag a Label into the Grid, and write the title of your report in the field that activates. Open the Layout tab, and drag again a Grid into the report. Drag a Table into a Grid cell. In the dialog that opens, select your pivot table as the dataset.

BONUS) Use the Bar Chart node, select “workclass” as the category column and both income columns as the y-axis columns. Apply sum or average as the aggregation method. Check “Generate image” on top of the configuration dialog. Connect the node’s image output to an Image to Report node. In the report editor, drag an Image into a Grid cell. In the dialog that opens, select “Dynamic image” > “Select Image Data…”. In the next dialog that opens, select the bar chart image as the dataset, and check the “Image” item in the list below.

Solution workflow 23_Export_to_BIRT_Report - Solution in the KNIME Hub course repository.

### Export Data into a Tableau Report

Tableau is a popular reporting solution. It is neither free nor open source. If you have a Tableau license, you can use the nodes from the KNIME Tableau integration to export data directly into a Tableau TDE file or into a Tableau Server.

Reference workflows are available in the Examples/05_Reporting/02_Tableau repository on the KNIME Hub.

We conclude here the [L1-DS] KNIME Analytics Platform for Data Scientists: Basics course. Well done! You can practice advanced data manipulation and machine learning in the [L2-DS] KNIME Analytics Platform for Data Scientists: Advanced course.