Lesson 3. Data Cleaning

KNIME-Data-Wranglers-L1-Lesson3

Now that you’ve accessed the data from one or more sources, you probably need to clean and transform them a bit. These operations bring the data into the appropriate shape to prepare the data for further analysis, display information more clearly, and archive the data. Useful operations include row and column filtering, feature transformation and standardization, and probably more.

This lesson includes exercises, and the data files, solution workflows, and prebuilt, empty exercise workflows with instructions are available in the L1-DW KNIME Analytics Platform for Data Wranglers - Basics folder in the E-Learning repository on the KNIME Hub.

Jump to the following main sections:

Row and Column Filtering

Data Transformation and Standardization

Row and Column Filtering

A good start is to exclude unnecessary information. Rows can be removed, for example, if they are duplicates, have missing values, or if they just don’t belong to the subset of interest. Also uninformative or redundant columns can be removed.

Row Filters

Here we show you a few basic and advanced options for filtering rows: by a string pattern, numerical range, RowID, row number, and rows with missing values.

 

 

 

 

A reference workflow Row Filter is available on the KNIME Hub.

 

 

A reference workflow Advanced Row Filtering is available on the KNIME Hub.

 

Exercise: Basic Row Filtering

1) Read the adult.csv file available in the data folder on the KNIME Hub. The data are provided by the UCI Machine Learning Repository.

2) Filter out rows where the marital status is missing

3) Extract rows where 

  • the marital status is divorced
  • the marital status is never married and age is between 20 and 40 (both included)
  • the workclass starts with "S"

Empty exercise workflow 06_Basic_Row_Filter in the KNIME Hub course repository

 

Solution: Basic Row Filtering

Download the adult.csv file from the data folder on the KNIME Hub. Read the data with the File Reader node, and start a new branch with a Row Filter node for each filtering task. Select the column for filtering in the “Column to test” section, and define the filtering criterion as a string pattern, numeric range, or missing values match. Select the “Exclude rows by attribute value” option, when you filter rows by a missing values match.

Solution workflow 06_Basic_Row_Filter - Solution in the KNIME Hub course repository

 

Exercise: Advanced Row Filtering

1) Read the adult.csv file available in the data folder on the KNIME Hub. The data are provided by the UCI Machine Learning Repository.

2) Extract rows where the marital status is divorced or separated using

  • Nominal Value Row Filter node
  • Reference Row Filter node

3) Extract rows where the marital status is never married and age is between 20 and 40 (both included) using the Rule-Based Row Filter node

Empty exercise workflow 07_Advanced_Row_Filter in the KNIME Hub course repository

 

Solution: Advanced Row Filtering

1-2) Download the adult.csv file from the data folder on the KNIME Hub. Read the data with the File Reader node. Select the “marital-status” column in the Nominal Row Filter node, and only include the values “divorced” and “separated”. 

Start a new workflow branch from the File Reader node, and connect it to the top input of a Reference Row Filter node. Create a reference table that contains a column with two values “divorced” and “separated”. Use this table as the bottom input of the Reference Row Filter node. 

3) Start again a new workflow branch from the File Reader node, and connect it to a Rule-based Row Filter node. Write the following expression in its expression editor: 

$marital-status$ = "Never-married" AND ($age$ >= 20 AND $age$ <= 40) => TRUE

Check the “Include TRUE matches” option.

Solution workflow 07_Advanced_Row_Filter - Solution in the KNIME Hub course repository.

 

Column Filters

You can select columns manually, by a string pattern matching the column name, or by the data type./p>

 

 

Reference workflows Column Filter and Workflow to show Column Filter examples are available on the KNIME Hub.

 

Exercise: Column Filtering

1) Read the adult.csv file available in the data folder on the KNIME Hub. The data are provided by the UCI Machine Learning Repository.

2) Remove the marital status column

3) Remove all but the marital status column

4) Remove all but string columns

5) Remove all but the marital status column using the Reference Column Filter node

Empty exercise workflow 08_Column_Filter in the KNIME Hub course repository.

 

Solution: Column Filtering

1-3) Download the adult.csv file from the data folder on the KNIME Hub. Read the data with the File Reader node. Use the Column Filter node, and only include the marital status column in its manual selection tab. Start a new branch from the File Reader node with another Column Filter node, and only exclude the marital status column. 

4) Start again a new workflow branch from the File Reader node with a Column Filter node. Select string in its type selection tab.

5) Start a new workflow branch from the File Reader node, and connect it to the top port of a Reference Column Filter node. Use a Table Creator node to create a reference table that contains a “marital-status” column (the table can be empty). Use the reference table as the bottom input of the Reference Column Filter node.

Solution workflow 08_Column_Filter - Solution in the KNIME Hub course repository.

Data Transformation and Standardization

Before you move on to analyzing or reporting data, you’ll need to make sure that the data are in the correct form: the data types are correct, data are normalized where necessary, string values trimmed, numeric values rounded, and new features generated from existing ones where it’s useful.

Data Manipulation: Numbers, Strings, and Rules

When you apply data transformation operations, these very versatile nodes come in handy: the String Manipulation, Math Formula, and Rule Engine nodes. The first video below shows their similarities and differences. The second video shows how you can perform multiple data manipulation tasks at once with the Column Expressions node.

 

 

Reference workflows String Manipulation, Math Formula, and Rule Engine Example and Column Expressions for Data Manipulation are available on the KNIME Hub.

 

Exercise: Data Transformation

1) Read the sales.csv file available in the data folder on the KNIME Hub

2) Create a new column “currency” with value “USD” for the orders from the USA, and “EUR” for the orders from Germany

3) Create a new column “conversion” with value 1 if currency is “EUR”, and 0.88 if currency is “USD” (we refer to the exchange rate of Nov-04-2018)

4) Calculate values in a new column named “amount-in-EUR” by multiplying the value in column “amount” by the value in column “conversion”

5) The product names are of type “prod_1”. Change the product names to “1”, “2”, and so on.

6) Calculate the amount in euros with the Column Expressions node

7) Write an expression that extracts the first three letters of country names and converts them to upper case letters. Append a new column and name it “Country_Code”.

8) Write an expression that assigns the value “N” to the missing values in the “card” column. Replace the “card” column.

Empty exercise workflow 09_Transformation in the KNIME Hub course repository.

 

Solution: Data Transformation

1-2) Download the sales.csv file from the data folder on the KNIME Hub. Read the file with the File Reader node. Use the Rule Engine node with the following expression:


$country$ = "USA" => "USD"

$country$ = "Germany" => "EUR"

3) Continue the workflow with another Rule Engine node, and write the following expression:

$currency$ = "EUR" => 1

$currency$ = "USD" => 0.88

4) Use next a Math formula node with the following expression:

$amount$ * $conversion$

Remember to append the results as new columns!

5) Use the String Manipulation node with the following expression:

replace($product$, "prod_", "")

6-8) Start a new workflow branch, and write the following expressions in the Column Expressions node:

  • Create the amount-in-EUR column:

currency = "USD"

if (column("country") == "Germany")

currency = "EUR"

conversion = 0.88

if (currency == "EUR")

conversion = 1

amount_in_EUR = column("amount") * conversion

  • Create the country_code column:

upperCase(substr(column("country"), 0, 3))

  • Manipulate the card column:

card = "Y"

if (isMissing(column("card")))

card = "N"

Solution workflow 09_Transformation - Solution in the KNIME Hub course repository.

 

Data Manipulation: Date&Time Values

Date and time values have their dedicated data type Date&Time. Date&Time values can be filtered, aggregated, and manipulated with the specific Date&Time manipulation nodes. In the videos below we show you a few examples of them: how to change the data type from string to the dedicated data type Date&Time, how to extract granularities from a timestamp into multiple columns, and how to modify timestamps to show the relevant digits.

 

 

 

A reference workflow Manipulating Date&Time Values is available on the KNIME Hub.

 

Exercise: Date&Time Manipulation

1) Read the daily_sales.csv file available in the data folder on the KNIME Hub

2) Convert the order times from string to Date&Time. Replace the string column.

3) Add the date “2019-09-03” to all time values. Create a new column called “order_time_date”.

4) Extract the hours of the orders. Are there especially many orders occurring in a certain hour of the day?

Empty exercise workflow 10_DateTime_Manipulation in the KNIME Hub course repository.

 

Solution: Date&Time Manipulation

1-2) Download the daily_sales.csv file from the data folder on the KNIME Hub. Read the file with the File Reader node. Use the String to Date&Time node to convert the “order_time” column to Date&Time. You can save the individual configuration steps by pressing the “Guess data type and format” button in its configuration dialog. 

3) Use next the Modify Date node, and select “2019-09-03” in the field that activates, when you select the “Append date” option. Write “_time” as a suffix of appended columns. 

4) Continue the workflow with an Extract Date&Time Fields node, and check the “Hour” field. When you check the output table, you can find many orders between 2 p.m. and 3 p.m.

Solution workflow 10_DateTime_Manipulation - Solution in the KNIME Hub course repository.

 

LinkedInTwitterShare