Lesson 2. Data Cleaning & Visualization

KNIME-Data-Wranglers-L1-Lesson1

Now that you’ve accessed the data from one or more sources, you need to clean, transform and explore them. These operations bring the data into the appropriate shape for further analysis tasks, and let you gain better understanding of the features and relationships in the data. Useful data cleaning operations include row and column filtering, feature transformation and standardization, feature generation, and more. Useful visualizations vary from simple univariate visualizations to interactive dashboards with multiple views and widgets. 

This lesson includes exercises, and the data files, solution workflows, and prebuilt, empty exercise workflows with instructions are available in the L1-DS KNIME Analytics Platform for Data Scientists - Basics folder in the E-Learning repository on the KNIME Hub.

Jump to the following main sections:

Row and Column Filtering

Data Transformation and Standardization

Data Preparation for Time Series Analysis

Data Visualization

Row and Column Filtering

A good start is to exclude unnecessary information. Rows can be removed, for example, if they are duplicates, have missing values, or if they just don’t belong to the subset of interest. Also uninformative or redundant can be removed.

Row Filters

Here we show you a few basic and advanced options for filtering rows: by a string pattern, numerical range, RowID, row number, and rows with missing values.

 

 

 

 

A reference workflow Row Filter is available on the KNIME Hub.

 

 

 

Exercise: Basic Row Filtering

1) Read the adult.csv file available in the data folder on the KNIME Hub. The data are provided by the UCI Machine Learning Repository.

2) Filter out rows where the marital status is missing

3) Extract rows where 

  • the marital status is divorced
  • the marital status is never married and age is between 20 and 40 (both included)
  • the workclass starts with "S"

Empty exercise workflow 06_Basic_Row_Filter in the KNIME Hub course repository

 

Solution: Basic Row Filtering

Download the adult.csv file from the data folder on the KNIME Hub. Read the data with the File Reader node, and start a new branch with a Row Filter node for each filtering task. Select the column for filtering in the “Column to test” section, and define the filtering criterion as a string pattern, numeric range, or missing values match. Select the “Exclude rows by attribute value” option, when you filter rows by a missing values match.

Solution workflow 06_Basic_Row_Filter - Solution in the KNIME Hub course repository

 

Exercise: Advanced Row Filtering

1) Read the adult.csv file available in the data folder on the KNIME Hub. The data are provided by the UCI Machine Learning Repository.

2) Extract rows where the marital status is divorced or separated using

  • Nominal Value Row Filter node
  • Reference Row Filter node

3) Extract rows where the marital status is never married and age is between 20 and 40 (both included) using the Rule-Based Row Filter node

Empty exercise workflow 07_Advanced_Row_Filter in the KNIME Hub course repository.

 

Solution: Advanced Row Filtering

1-2) Download the adult.csv file from the data folder on the KNIME Hub. Read the data with the File Reader node. Select the “marital-status” column in the Nominal Row Filter node, and only include the values “divorced” and “separated”. 

Start a new workflow branch from the File Reader node, and connect it to the top input of a Reference Row Filter node. Create a reference table that contains a column with two values “divorced” and “separated”. Use this table as the bottom input of the Reference Row Filter node. 

3) Start again a new workflow branch from the File Reader node, and connect it to a Rule-based Row Filter node. Write the following expression in its expression editor: 

Advanced Row Filtering

Check the “Include TRUE matches” option.

Solution workflow 07_Advanced_Row_Filter - Solution in the KNIME Hub course repository

Column Filters

You can select columns manually, by a string pattern matching the column name, or by the data type.

 

 

Reference workflows Column Filter and Workflow to show Column Filter examples are available on the KNIME Hub.

 

Exercise: Column Filtering

1) Read the adult.csv file available in the data folder on the KNIME Hub. The data are provided by the UCI Machine Learning Repository.

2) Remove the marital status column

3) Remove all but the marital status column

4) Remove all but string columns

5) Remove all but the marital status column using the Reference Column Filter node

Empty exercise workflow 08_Column_Filter in the KNIME Hub course repository.

 

Solution: Column Filtering

1-3) Download the adult.csv file from the data folder on the KNIME Hub. Read the data with the File Reader node. Use the Column Filter node, and only include the marital status column in its manual selection tab. Start a new branch from the File Reader node with another Column Filter node, and only exclude the marital status column. 

4) Start again a new workflow branch from the File Reader node with a Column Filter node. Select string in its type selection tab.

5) Start a new workflow branch from the File Reader node, and connect it to the top port of a Reference Column Filter node. Use a Table Creator node to create a reference table that contains a “marital-status” column (the table can be empty). Use the reference table as the bottom input of the Reference Column Filter node.

Solution workflow 08_Column_Filter - Solution in the KNIME Hub course repository

 

Data Transformation and Standardization

Before you move on to visualizing, modeling, or reporting data, you’ll need to make sure that the data are in the correct form: the data types are correct, data are normalized where necessary, string values trimmed, numeric values rounded, and new features generated from existing ones where it’s useful.

Data Manipulation: Numbers, Strings, and Rules

When you apply data transformation operations, these very versatile nodes come in handy: the String Manipulation, Math Formula, and Rule Engine nodes. The first video below shows their similarities and differences. The second video shows how you can perform multiple data manipulation tasks at once with the Column Expressions node.

 

 

Reference workflows String Manipulation, Math Formula, and Rule Engine Example and Column Expressions for Data Manipulation are available on the KNIME Hub.

 

Exercise: Data Transformation

1) Read the sales.csv file available in the data folder on the KNIME Hub

2) Create a new column “currency” with value “USD” for the orders from the USA, and “EUR” for the orders from Germany

3) Create a new column “conversion” with value 1 if currency is “EUR”, and 0.88 if currency is “USD” (we refer to the exchange rate of Nov-04-2018)

4) Calculate values in a new column named “amount-in-EUR” by multiplying the value in column “amount” by the value in column “conversion”

5) The product names are of type “prod_1”. Change the product names to “1”, “2”, and so on.

6) Calculate the amount in euros with the Column Expressions node

7) Write an expression that extracts the first three letters of country names and converts them to upper case letters. Append a new column and name it “Country_Code”.

8) Write an expression that assigns the value “N” to the missing values in the “card” column. Replace the “card” column.

Empty exercise workflow 09_Transformation in the KNIME Hub course repository.

 

Solution: Data Transformation

1-2) Download the sales.csv file from the data folder on the KNIME Hub. Read the file with the File Reader node. Use the Rule Engine node with the following expression:


$country$ = "USA" => "USD"

$country$ = "Germany" => "EUR"

3) Continue the workflow with another Rule Engine node, and write the following expression:

$currency$ = "EUR" => 1

$currency$ = "USD" => 0.88

4) Use next a Math formula node with the following expression:

$amount$ * $conversion$

Remember to append the results as new columns!

5) Use the String Manipulation node with the following expression:

replace($product$, "prod_", "")

6-8) Start a new workflow branch, and write the following expressions in the Column Expressions node:

  • Create the amount-in-EUR column:

currency = "USD"

if (column("country") == "Germany")

currency = "EUR"

conversion = 0.88

if (currency == "EUR")

conversion = 1

amount_in_EUR = column("amount") * conversion

  • Create the country_code column:

upperCase(substr(column("country"), 0, 3))

  • Manipulate the card column:

card = "Y"

if (isMissing(column("card")))

card = "N"

Solution workflow 09_Transformation - Solution in the KNIME Hub course repository

 

Data Preparation for Time Series Analysis

Here we introduce operations and techniques that can help you with dates, times, signal processing, and general data preparation for time series analysis. The Lag Column node is a very simple and yet a powerful node, since it allows you to do time series analysis. Strictly speaking, all it does is make a copy of the selected input data columns and shifts the cells down n number of steps. However, if the data cells are ordered in time, moving them n steps down is equivalent to moving them into the future. The result is a new column representing past values.

 

A reference workflow Build Vector of Past Samples is available on the KNIME Hub.

 

Exercise: Lag Column

1) Read the file clustered hourly_values_all.csv available in the data folder on the KNIME Hub

2) Replace the row ID values in the dataset with the values in the “row ID” column

3) Transform the “row ID” column into a Date&Time type, and append these values as a new column in the dataset

4) Sort the data in ascending order according to the column of type Date&Time

5) Filter the data to only include one of the clustered time series

6) Use the Lag Column node to build the vector of 10 past samples for each time, t

7) Plot the time series and the past 10 samples for each time, t

Empty exercise workflow 10_Lag_Column in the KNIME Hub course repository

 

Solution: Lag Column

1-3) Download the clustered hourly_values_all.csv file from the data folder on the KNIME Hub. Read the data with the File Reader node. Use the RowID node, check the “Replace RowID with selected column values or create a new one” option, and select the “row ID” column in the “New RowID column” menu. 

3-4) Use the String to Date&Time node, select Date&time as the new type, and write “yyyy-MM-dd'_'HH” as the date format. Use next the Sorter node, and sort the data in ascending order according to the newly generated Date&Time column. 

5) Use the Column Filter node, exclude other columns but the Date&Time column and one cluster column, for example cluster_3

6) Use the Lag Column node, select the cluster column, and set the lag to 10 and lag interval to 1

7) Use the Line Plot node, select the Date&Time column as the x-axis column. Select the cluster column and all lag columns as the y-axis columns. 

Solution workflow 10_Lag_Column - Solution in the KNIME Hub course repository

 

Data Visualization

Data exploration helps you to better understand your task and data. Here you’ll learn how to meaningfully visualize and inspect data in interactive dashboards.

Interactive Univariate Visual Exploration

There’s a lot you can already learn from your data by simply exploring the statistical properties of the columns. The Data Explorer node produces an interactive view displaying some statistical measures of the data. In this view you can apply your domain expertise and remove irrelevant columns.

 

A reference workflow Univariate Visual Exploration with Data Explorer node is available on the KNIME Hub.

Note! The KNIME workflow in this video contains data from WeatherUnderground.com, from the Austin KATT station, which is released under GPLv2.

Data source: https://www.wunderground.com/history/airport/KATT/

Data license: https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

 

Exercise: Data Explorer

1) Read the adult.csv file available in the data folder on the KNIME Hub. The data are provided by the UCI Machine Learning Repository.

2) Inspect the properties of the data with the Data Explorer node. How many different education levels are represented in the data?

3) In the interactive view, exclude the columns containing missing values. Which of the columns contain missing values? How many missing values each? 

Note! The Data Explorer node is part of the KNIME JavaScript Views (Labs) Extension

Empty exercise workflow 11_Data_Explorer in the KNIME Hub course repository.

 

Solution: Data Explorer

Download the adult.csv file from the data folder on the KNIME Hub, read the file with the File Reader node, and use next the Data Explorer node. In its interactive view output you can see that 16 different education levels are represented in the data. The columns containing missing values are native-country (583), occupation (1843), and workclass (1836). Exclude these columns by first selecting them in the view, and then clicking “Apply” and “Close”.

Solution workflow 11_Data_Explorer - Solution in the KNIME Hub course repository.

 

Interactive Bivariate Visual Exploration

When you’re interested in the relationships between columns, you can do multivariate data exploration, like for example, plot columns in pairs, and calculate correlation metrics. Scatter plots, conditional boxplots, sunburst charts, and parallel coordinate plots are some examples of powerful visualizations for bivariate and multivariate analysis.

 

A reference workflow Bivariate Visual Exploration with a Scatter Plot is available on the KNIME Hub.

 

Exercise: Scatter Plot

1) Read the wine.csv file available in the data folder on the KNIME Hub

2) Assign colors to the rows based on the color of the wine

3) Draw a scatter plot of alcohol vs. density. Do you observe any particular relationship between these two columns? Switch to mouse mode “Select” in the interactive view, and select the outlier point(s) in the plot, that is the most distant data point(s) from the main cloud of points. Are the selected data points red wines or white wines?

Empty exercise workflow 12_Scatter_Plot in the KNIME Hub course repository.

 

Solution: Scatter Plot

Download the wine.csv file from the data folder on the KNIME Hub, and read the file with the File Reader node. Assign the colors with the Color Manager node. Use next the Scatter Plot node, and select “alcohol” and “density” as the x- and y-axis columns. There is a slight negative correlation between alcohol and density for both red and white wines. Red wines have slightly higher density. The most distant outlier point has a very high density and is a white wine.

Solution workflow 12_Scatter_Plot - Solution in the KNIME Hub course repository.

 

Composite Views

A composite view is an interactive dashboard composed of multiple simpler views. You can select and filter data in all views at the same time, and thus gain a more comprehensive view of your data. The dashboard consists of the interactive views of individual nodes inside a component.

 

A reference workflow Create an interactive dashboard in 3 steps: Netflix dataset is available on the KNIME Hub.

LinkedInTwitterShare