Lesson 3. Bringing Things Together

KNIME-Data-Wranglers-L1-Lesson1

Aggregations are the key to the calculations of KPI features, data preparation, and feature generation. In this lesson, we’ll show you how to implement a few types of data aggregation operations. In the end we’ll also show how to clean up and organize your workflow by collecting nodes in metanodes.

This lesson includes exercises, and the data files, solution workflows, and prebuilt, empty exercise workflows with instructions are available in the L1-DS KNIME Analytics Platform for Data Scientists - Basics folder in the E-Learning repository on the KNIME Hub.

Jump to the following main sections:

Aggregations

Join and Concatenation

Cleaning Up Workflows with Metanodes

Aggregations

Sometimes the level of detail in your data doesn’t show you the information you need. For example, you need total sales but you only have individual orders, or you want to find out the average monthly temperature from the daily data you have. What you need to do is aggregate the data, i.e. calculate summary statistics. These statistics are often calculated separately for subgroups of the data, such as different months.

Classic Aggregations with the GroupBy node

When you create an aggregated table, simple or complicated, you often need the GroupBy node.

 

 

Reference workflows Basic Examples for using the GroupBy Node and GroupBy Advanced Examples are available on the KNIME Hub.

 

Exercise: Aggregating Data with the GroupBy node

1) Read the adult.csv file available in the data folder on the KNIME Hub. The data are provided by the UCI Machine Learning Repository.

2) Calculate the count and average age of women with income >50K

3) Calculate the averages of all numerical columns for each one of the 4 groups defined by sex and income values

4) Calculate 

  • the number of missing values in the occupation column
  • the number of non-missing rows in the occupation column
  • the number of rows in the occupation column
  • the number of rows in the marital-status column 

Notice that the last two aggregations should provide the same numbers!

Empty exercise workflow 13_Classic_Aggregations_with_GroupBy_Node in the KNIME Hub course repository

 

Solution: Aggregating Data with the GroupBy node

1-2) Download the adult.csv file from the data folder on the KNIME Hub. Read the file with the File Reader node. Use the GroupBy node, and select the gender and income columns as the grouping columns. Select age as one aggregation column, and apply mean as the aggregation method. Select any column as another aggregation column, and apply count as the aggregation method. Remember to check the “Missing” option for the count operation.

3) Start a new workflow branch from the File Reader node with a GroupBy node. Select gender and income as the grouping columns. Select integer and double in the type based aggregation tab, and apply the aggregation method mean.

4) Start a new workflow branch again with a GroupBy node. Don’t select any grouping columns. Select the occupation column as the aggregation column three times, and apply the aggregation methods missing value count, count without checking the “Missing” option, and count with the “Missing” option checked. Select the marital status column as the fourth aggregation column, apply the aggregation method count, and check the “Missing” option.

Solution workflow 13_Classic_Aggregations_with_GroupBy_Node - Solution in the KNIME Hub course repository

 

Extending the Classic Aggregations with Pivoting

A pivot table aggregates data by two dimensions, rows and columns.

 

 

 

A reference workflow Basic and Advanced Usage of the Pivoting Node is available on the KNIME Hub.

 

Exercise: Creating a Pivot Table

1) Read the adult_binned.csv file available in the data folder on the KNIME Hub. The original adult.csv dataset is provided by the UCI Machine Learning Repository.

2) Calculate the number of people in groups according to their workclass and age bin

  • What is the most common combination of age bin and workclass?
  • How many people belong to this group?

3) Add income to the grouping columns, and take a look at the income class of the people in the biggest group you calculated in the previous step. Do most of them earn more or less than 50K per year?

4) Remove “income” from the grouping columns again and change the aggregation method

  • What is the most widespread education level according to age bin and workclass?
  • In which workclasses is the most widespread education level of the youngest bin higher than the education of the oldest bin?

Empty exercise workflow 14_Pivoting in the KNIME Hub course repository

 

Solution: Creating a Pivot Table

1-3) Download the adult_binned.csv file from the data folder on the KNIME Hub. Read the file with the File Reader node. Use the Pivoting node, and select age bin as the grouping column and workclass as the pivoting column (or vice versa). Apply the aggregation method count to any column. The biggest subgroup (10936) contains the people who are younger than 35 and work in the private sector. 9842 of these people earn less than 50K, while only 1094 earn more.

4) Apply the aggregation method mode to the education-num column. The most widespread education level is 9 (HS-Grad). The youngest age bin have a higher education level than the oldest age bin, when their workclass is federal-gov, local-gov, or state-gov.

Solution workflow 14_Pivoting - Solution in the KNIME Hub course repository

 

Join and Concatenation

The next step could be to bring data together into one table. Here we show you two ways of blending data: side by side by joining the rows on a common identification value (the key), and on top of each other by concatenating the columns with a common column header.

Join: inner join, right outer join, left outer join, full outer join

Let’s start with the joining operation. In the videos below we introduce the different join modes and explain how to use the Joiner node.

 

 

 

A reference workflow Join example workflow is available on the KNIME Hub.

 

Exercise: Joining Tables

1) Read the adult.csv file available in the data folder on the KNIME Hub. The data are provided by the UCI Machine Learning Repository.

2) Calculate the average age and count for each one of the 4 groups defined by sex and income values

3) Join the two aggregated values to the original table

Empty exercise workflow 15_Join in the KNIME Hub course repository

 

Solution: Joining Tables

1-2) Download the adult.csv file from the data folder on the KNIME Hub. Read the file with the File Reader node. Use the GroupBy node, and select sex and income as the grouping columns. Select age as one aggregation column, and apply the aggregation method mean. Select any column as another aggregation column, and apply the aggregation method count.

3) Join the aggregated values to the original table with the Joiner node. Select sex and income as the joining columns. 

Solution workflow 15_Join - Solution in the KNIME Hub course repository

 

Concatenate

Here we explain the concatenate operation and show you how to concatenate tables using the Concatenate node. Notice that from the KNIME Analytics Platform version 4.1 forward, the Concatenate node has optional input ports, so you can concatenate two or more tables with one node.

 

 

A reference workflow Concatenation example is available on the KNIME Hub.

 

Exercise: Concatenating Tables

1) Read the adult.csv file available in the data folder on the KNIME Hub. The data are provided by the UCI Machine Learning Repository.

2) Extract people with age between 20 and 40 (both included) and working in a workclass starting with "S"

3) Extract people with age between 40 and 60 (both included) and working in a workclass starting with "P"

4) Concatenate both subsets into a single data table

Empty exercise workflow 16_Concatenate in the KNIME Hub course repository

 

Solution: Concatenating Tables

Download the adult.csv file from the data folder on the KNIME Hub. Read the file with the File Reader node. Use the Rule-based Row Filter node to extract the first subset. Start another workflow branch from the File Reader node, and use again the Rule-based Row Filter node to extract the second subset. Concatenate the two subsets with the Concatenate node. 

Solution workflow 16_Concatenate - Solution in the KNIME Hub course repository

 

Cleaning Up Workflows with Metanodes

When you want to clean your workflow by collecting individual nodes into a subworkflow, you create a metanode of them. At any point, you can add or remove nodes inside the metanode, add inputs and outputs, and expand the metanode back to individual nodes. In the video below we show how you can perform these and other useful operations on metanodes.

 

LinkedInTwitterShare