Before diving deeper into the analysis, you can already learn a lot from your data simply by exploring the statistical properties of the different input columns.
By interacting with the Data Explorer view, you can gain insight into the columns' statistical properties and apply your domain expertise to remove irrelevant columns. The output data table of the Data Explorer node contains only the columns that remain after the selection made in the node view.
The workflow used in this video is available on the public KNIME EXAMPLES server under
To install a KNIME Extension, follow the instructions in this video: https://youtu.be/8HMx3mjJXiw
The KNIME workflow in this video contains data from WeatherUnderground.com, from the Austin KATT station, which is released under GPLv2.
Data source: https://www.wunderground.com/history/airport/KATT/
Data license: https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
- Read the adult.csv dataset.
- How many different education levels are represented in the data?
- Which of the columns contain missing values? How many missing values does each contain?
- In the interactive view, exclude the columns that contain missing values.
In the data, there are 16 unique values in the “education” column. The “workclass” column contains 1,836 missing values, the “occupation” column contains 1,843 missing values, and the “native-country” column contains 583 missing values. Exclude these columns by first selecting them in the interactive view and then clicking “Apply” and “Close”. Then check that the output table does not contain these columns.
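For readers who want to cross-check the exercise outside KNIME, the same three steps (count unique values, count missing values per column, drop the affected columns) can be sketched in plain Python. This is only an illustration, not how the Data Explorer node works internally. The sample rows below are made up and stand in for adult.csv, the helper names are hypothetical, and treating an empty cell or a lone "?" as missing is an assumption about how the file encodes missing values.

```python
import csv
import io

# Hypothetical miniature sample standing in for adult.csv. The values are
# invented for illustration and are NOT the real dataset.
SAMPLE_CSV = """age,workclass,education,occupation,native-country
39,State-gov,Bachelors,Adm-clerical,United-States
50,,Bachelors,Exec-managerial,United-States
38,Private,HS-grad,,
53,Private,11th,Handlers-cleaners,United-States
28,,Masters,,Cuba
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def is_missing(value):
    """Assumption: a cell is missing if it is empty or a lone '?'."""
    return value is None or value.strip() in ("", "?")


def count_unique(rows, column):
    """Count distinct non-missing values in one column."""
    return len({row[column] for row in rows if not is_missing(row[column])})


def missing_counts(rows):
    """Return a dict mapping each column name to its number of missing cells."""
    columns = rows[0].keys() if rows else []
    return {col: sum(is_missing(row[col]) for row in rows) for col in columns}


def drop_columns_with_missing(rows):
    """Keep only the columns that have no missing cells."""
    counts = missing_counts(rows)
    keep = [col for col, n in counts.items() if n == 0]
    return [{col: row[col] for col in keep} for row in rows]


rows = load_rows(SAMPLE_CSV)
n_education = count_unique(rows, "education")
counts = missing_counts(rows)
cleaned = drop_columns_with_missing(rows)

print("Unique education levels:", n_education)
print("Missing values per column:", counts)
print("Remaining columns:", list(cleaned[0].keys()))
```

To run the sketch against the real file, replace `load_rows(SAMPLE_CSV)` with a read of your local copy, for example `list(csv.DictReader(open("adult.csv", newline="")))`. The path is an assumption, and the counts will match the solution above only if the file encodes missing values the way `is_missing` expects.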