Lesson 1. Visualization


Data exploration is the first step in understanding your task and dataset before proceeding with the analysis. In this lesson you’ll learn how to meaningfully visualize your data and inspect it in interactive dashboards.

This lesson includes exercises. The corresponding data files, solution workflows, and prebuilt, empty exercise workflows with instructions are available in the L2-DW KNIME Analytics Platform for Data Wranglers - Advanced folder in the E-Learning repository on the KNIME Hub.

Interactive Univariate Visual Exploration

There’s a lot you can already learn from your data by simply exploring the statistical properties of the columns.

The Data Explorer Node

The Data Explorer node produces an interactive view displaying some statistical measures of the data. In this view you can apply your domain expertise and remove irrelevant columns.
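The kind of per-column measures shown in such a view can be approximated outside KNIME with a few lines of Python. The sketch below is only an illustration, not the node's implementation: the function name summarize_column, the use of None to mark a missing cell, and the sample values are all assumptions made for this example.

```python
import statistics


def summarize_column(values):
    """Summarize one numeric column; None marks a missing cell (assumption).

    Assumes at least two non-missing values, as required by statistics.stdev.
    """
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "std_dev": statistics.stdev(present),  # sample standard deviation
    }


# Hypothetical temperature readings, not taken from the course data.
temperatures = [21.0, 23.5, None, 19.0, 22.5]
summary = summarize_column(temperatures)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A column whose summary shows many missing cells or almost no variation is a typical candidate for removal before the analysis.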

A reference workflow Univariate Visual Exploration with Data Explorer node is available on the KNIME Hub.

Note! The KNIME workflow in this video contains data from WeatherUnderground.com, from the Austin KATT station, which is released under GPLv2.

Data source: https://www.wunderground.com/history/airport/KATT/

Data license: https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

 

Exercise: Data Explorer

1) Read the adult.csv file available in the data folder on the KNIME Hub. The data are provided by the UCI Machine Learning Repository.

2) Inspect the properties of the data with the Data Explorer node. How many different education levels are represented in the data?

3) In the interactive view, exclude the columns containing missing values. Which of the columns contain missing values? How many missing values does each of them contain?

Note! The Data Explorer node is part of the KNIME JavaScript Views (Labs) Extension.

Empty exercise workflow 01_Data_Explorer in the KNIME Hub course repository.

 

Solution: Data Explorer

Download the adult.csv file from the data folder on the KNIME Hub, read the file with the File Reader node, and then use the Data Explorer node. In its interactive view you can see that 16 different education levels are represented in the data. The columns containing missing values are native-country (583), occupation (1843), and workclass (1836). Exclude these columns by first selecting them in the view, and then clicking “Apply” and “Close”.
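The same three questions (how many distinct values, how many missing values, which columns to exclude) can be sketched in plain Python. This is a hedged illustration rather than what the node does internally: the sample rows are invented and much smaller than adult.csv, the "?" missing marker is an assumption about how missing cells are encoded, and profile_columns is a hypothetical helper name.

```python
import csv
import io

MISSING_MARKER = "?"  # assumption: how missing cells are encoded in this sample

# Hypothetical rows in the spirit of adult.csv, not the real file.
SAMPLE_CSV = """education,workclass,occupation
Bachelors,Private,Sales
HS-grad,?,Sales
Masters,Private,?
HS-grad,?,?
"""


def profile_columns(rows):
    """Return {column: (number of distinct non-missing values, missing count)}."""
    profile = {}
    for column in rows[0].keys():
        cells = [row[column] for row in rows]
        missing = sum(1 for cell in cells if cell in ("", MISSING_MARKER))
        distinct = {cell for cell in cells if cell not in ("", MISSING_MARKER)}
        profile[column] = (len(distinct), missing)
    return profile


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
profile = profile_columns(rows)

# Keep only the columns without any missing cell, as in step 3 of the exercise.
complete_columns = [name for name, (_, missing) in profile.items() if missing == 0]
filtered_rows = [{name: row[name] for name in complete_columns} for row in rows]

for name, (distinct, missing) in profile.items():
    print(f"{name}: {distinct} distinct values, {missing} missing")
```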

Solution workflow 01_Data_Explorer - Solution in the KNIME Hub course repository.

 

Interactive Bivariate Visual Exploration

When you’re interested in the relationships between columns, you can explore the data in a bivariate or multivariate way: for example, plot columns in pairs and calculate correlation metrics. Scatter plots, conditional box plots, sunburst charts, and parallel coordinates plots are examples of powerful visualizations for bivariate and multivariate analysis.
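As one example of a correlation metric, the Pearson correlation coefficient measures the strength of the linear relationship between two numeric columns, from -1 (perfectly decreasing) to +1 (perfectly increasing). The sketch below is a minimal plain-Python version under stated assumptions: the function name pearson_correlation and both sample columns are hypothetical and not taken from the course data.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long numeric columns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two columns of equal length with at least 2 rows")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical paired measurements: the second column falls as the first rises.
column_a = [9.0, 10.0, 11.0, 12.0, 13.0]
column_b = [0.998, 0.997, 0.996, 0.994, 0.992]
r = pearson_correlation(column_a, column_b)
print(f"r = {r:.3f}")
```

A value close to zero does not rule out a relationship; it only says there is no clear linear one, which is why plotting the pair remains worthwhile.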

Scatter Plot

The scatter plot visualizes data by pairs of values and shows possible relationships between columns. If needed, you can add a third dimension by assigning a color to each pair with the Color Manager node.

A reference workflow Bivariate Visual Exploration with a Scatter Plot is available on the KNIME Hub.

Exercise: Scatter Plot

1) Read the wine.csv file available in the data folder on the KNIME Hub.

2) Assign colors to the rows based on the color of the wine.

3) Draw a scatter plot of alcohol vs. density. Do you observe any particular relationship between these two columns? Switch to mouse mode “Select” in the interactive view, and select the outlier point(s) in the plot, that is, the data point(s) most distant from the main cloud of points. Are the selected data points red wines or white wines?

Empty exercise workflow 02_Scatter_Plot in the KNIME Hub course repository.

 

Solution: Scatter Plot

Download the wine.csv file from the data folder on the KNIME Hub, and read the file with the File Reader node. Assign the colors with the Color Manager node. Next, use the Scatter Plot node, and select “alcohol” and “density” as the x- and y-axis columns.

There is a slight negative correlation between alcohol and density for both red and white wines. Red wines have slightly higher density. The most distant outlier point has a very high density and is a white wine.
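Picking the most distant point by eye can be cross-checked numerically. One simple approach, sketched here as an assumption rather than as what the Scatter Plot node does, is to standardize both axes and take the point farthest from the centroid. The rows below are invented for illustration and are not taken from wine.csv; most_distant_point is a hypothetical helper name.

```python
import math
import statistics


def most_distant_point(points):
    """Index of the (x, y) point farthest from the centroid after z-scoring each axis."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    std_x, std_y = statistics.stdev(xs), statistics.stdev(ys)

    def distance(point):
        # Standardizing keeps the axis with the larger numeric range from dominating.
        return math.hypot((point[0] - mean_x) / std_x, (point[1] - mean_y) / std_y)

    return max(range(len(points)), key=lambda i: distance(points[i]))


# Hypothetical (alcohol, density, wine color) rows, not taken from wine.csv.
rows = [
    (9.5, 0.9970, "red"),
    (10.2, 0.9962, "red"),
    (11.0, 0.9940, "white"),
    (12.1, 0.9925, "white"),
    (11.7, 1.0390, "white"),  # deliberately extreme density
]
index = most_distant_point([(alcohol, density) for alcohol, density, _ in rows])
print(rows[index])
```

Standardizing matters for this pair of columns because alcohol varies over several units while density varies only in the second or third decimal place.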

Solution workflow 02_Scatter_Plot - Solution in the KNIME Hub course repository.

Composite Views

A composite view is an interactive dashboard composed of multiple simpler views. You can select and filter data in all views at the same time, and thus gain a more comprehensive view of your data. The dashboard consists of the interactive views of individual nodes inside a component.
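The idea of selecting data in all views at the same time can be pictured as several views observing one shared selection. The following sketch only illustrates that pattern; it is not KNIME's API, and the names SelectionModel, View, and the row IDs are hypothetical.

```python
class SelectionModel:
    """Holds the selected row IDs and notifies every subscribed view."""

    def __init__(self):
        self._selected = set()
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def select(self, row_ids):
        self._selected = set(row_ids)
        for listener in self._listeners:
            listener(self._selected)


class View:
    """A stand-in for one chart or table inside the dashboard."""

    def __init__(self, name, model):
        self.name = name
        self.highlighted = set()
        model.subscribe(self._on_selection)

    def _on_selection(self, row_ids):
        # Copy the IDs so each view keeps its own highlighted set.
        self.highlighted = set(row_ids)


model = SelectionModel()
scatter = View("scatter plot", model)
table = View("table", model)

# Selecting rows once updates every view that shares the model.
model.select({"Row3", "Row7"})
print(sorted(scatter.highlighted), sorted(table.highlighted))
```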

 

A reference workflow Create an interactive dashboard in 3 steps: Netflix dataset is available on the KNIME Hub.
