Just KNIME It!
Prove your KNIME knowledge and practice your workflow building skills by solving our weekly challenges.
Challenge 29: Comparing Distributions between Groups
Description: Imagine that you want to compare student test scores to find out whether there are any differences between the students' performances in 2020 (group 1) compared to 2019 (group 0). For example, you might want to find out whether there is an unusually high number of very good scores compared to the other year, which could be a sign of cheating. Each student participated in the same three tests and received three test scores (Score 1, Score 2, and Score 3). How similar are the distributions of the three scores between the two groups? Which score distribution differs the most? The output should contain a visualization of the conditional distributions and a statistical test for the equality of mean and variance between the groups.
Hint: Check out the verified components for visualization on the KNIME Hub.
Author: Maarit Widmann
Our solution will appear here next Tuesday. In the meantime, feel free to discuss your work on the KNIME forum or on social media using the hashtag #justknimeit.
Remember to upload your solution with the tag justknimeit-29 to your public space on the KNIME Hub. To increase the visibility of your solution, also post it to this challenge thread on the KNIME Forum.
Just KNIME It! Leaderboard
KNIME community members are working hard to solve the latest "Just KNIME It!" challenge - and some of you have solved dozens of them already! Who are the KNIME KNinjas who have completed the most challenges? Click over to the leaderboard on the KNIME Forum to find out! How many challenges have you solved?
Description: In this challenge, you will do a little bit of logic building and produce the star triangle below. Each row of the pattern should appear in the corresponding row of the output table.
P.S.: Do not hardcode the pattern.
Bonus Challenge: Do not use scripting nodes or components.
Author: Ali Marvi
Solution Summary: We started by creating a table containing one white space character. We then looped over the table, iteratively adding star characters to the left of the string so that its length equals the current iteration number, and replaced the white space character with a star character. We collected all rows and lastly removed the first row from the table, as it is a duplicate.
Solution Details: We started by using the Table Creator node to create a table containing one white space character. We then looped over the table using a Counting Loop Start node and a Loop End node that collects the intermediate results row-wise. In the Counting Loop Start node, we set the number of loops to 7. Within the loop body, we used the String Manipulation node to build the star pattern: in each iteration, we added star characters to the left of the string so that its length equals the current iteration number (padLeft function), and replaced the white space character with a star character (replace function). Lastly, we used the Row Filter node to remove the first row from the table, as it is a duplicate.
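For readers who prefer code, the same padLeft-and-replace logic can be sketched in a few lines of Python (an illustration of the loop's logic, not of the KNIME workflow itself):

```python
# Build the 7-row star triangle the way the loop body does: start from a
# single character and left-pad with stars until the string length equals
# the current iteration number.
def star_triangle(rows=7):
    pattern = []
    for i in range(1, rows + 1):
        # rjust plays the role of the padLeft function; the initial white
        # space character has already been replaced by a star here.
        pattern.append("*".rjust(i, "*"))
    return pattern

for row in star_triangle():
    print(row)
```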
Description: You have a dataset containing information on US citizens who donated blood in the last year, including addresses and blood types. The O- blood type, also known as "universal donor", is perhaps the most valuable blood in the world because it can be transfused to nearly any blood type holder. Your goal here is to help a group of researchers find the number of citizens with O- blood type per US state. Unfortunately, the address column comes in a single line, so to extract the state information you will have to perform some data wrangling. They also asked you to create a choropleth map of the US to visualize the results.
Author: Ahmad Varasteh
Solution Summary: After reading the dataset on rare blood types, we first extracted the state from the address line. We then removed the unnecessary address column and renamed the name and state columns for better readability. We then filtered the data to only keep samples where the blood type equals "O-" and grouped the data by state. Lastly, we used the Choropleth Map component to visualize the results on a map.
Solution Details: After using the CSV Reader node to read the dataset on rare blood types, we used the Regex Split node to extract the state from the address line into a separate column called "split_0". With the help of the Table Manipulator node, we removed the address column and renamed the columns “name” and “split_0”. Then, we used the Row Filter node to only keep samples where the blood type equals "O-". To calculate the number of citizens per state we used the GroupBy node, grouped by state, and used Count as an aggregation method. Lastly, we used the Choropleth Map component to visualize the results, which produced an interactive map of the United States.
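The same wrangling can be sketched in Python with pandas; the rows and column names below are made up for illustration, and the real dataset's schema may differ:

```python
import pandas as pd

# Hypothetical sample of the dataset; real column names may differ.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cay"],
    "address": ["12 Oak St, Austin, TX 73301",
                "9 Elm Ave, Reno, NV 89501",
                "3 Pine Rd, Austin, TX 73344"],
    "blood_type": ["O-", "A+", "O-"],
})

# Extract the two-letter state code from the single-line address
# (the Regex Split step of the workflow).
df["state"] = df["address"].str.extract(r",\s*([A-Z]{2})\s+\d{5}$", expand=False)

# Keep only universal donors and count them per state (Row Filter + GroupBy).
counts = df[df["blood_type"] == "O-"].groupby("state").size()
print(counts)
```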
Description: To wrap up our series of data classification challenges, consider again the following churning problem: a telecom company wants you to predict which customers are going to churn (that is, going to cancel their contracts) based on attributes of their accounts. The target class to be predicted in the test data is Churn (value 0 corresponds to customers that do not churn, and 1 corresponds to those who do). You have already found a good model for the problem and have already engineered the training data to increase the performance a bit. Now, your task is to communicate the results you found visually. Concretely, build a dashboard that:
shows performance for both classes (you can focus on any metrics here, e.g., precision and recall)
ranks features based on how important they were for the model
Author: Aline Bessa
Solution Summary: To tackle this challenge, we created (1) an interactive visualization to compare different performance metrics for both classes; (2) used the Global Feature Importance verified component to check which features were most indicative of churning for the test set; and (3) created a component based on the Local Explanation View verified component to better understand why our model generated a given false positive (the user did not churn but the model predicted churning) and a given false negative (the user churned but the model did not predict it).
Solution Details: After oversampling the minority class in the training data, and using AutoML to find a suitable ML model for our problem, we created a component to interactively compare different metrics (e.g., recall and precision) for classes Churn = 0 and Churn = 1. This comparison is visual and is based on the Bar Chart node. Next, we used the Global Feature Importance verified component to check which features were most indicative of churning — the top 2 were DayMins (total number of calling minutes during the day) and CustServ Calls (number of calls placed to customer service). Finally, we created a component to visualize local explanations for one false negative and one false positive example. These explanations were generated with the Local Explanation View verified component. In the case of the false negative example, the desired class was Churn = 1, and in our execution we noticed that the instance’s value for Night Calls (billed cost for nighttime calls) was likely responsible for the misclassification. As for the false positive, the desired class was Churn = 0 — here, the value for CustServ Calls probably contributed substantially to the misclassification.
Description: In this challenge series, the goal is to predict which customers of a certain telecom company are going to churn (that is, going to cancel their contracts) based on attributes of their accounts. Here, the target class to be predicted is Churn (value 0 corresponds to customers that do not churn, and 1 corresponds to those who do).
After automatically picking a classification model for the task, you achieved an accuracy of about 95% for the test data, but the model does not perform uniformly for both classes. In fact, it is better at predicting when a customer will not churn (Churn = 0) than when they will (Churn = 1). This imbalance can be verified by looking at how precision and recall differ for these two classes, or by checking how the Cohen’s kappa metric is a bit lower than 80% despite a very high accuracy. How can you preprocess and re-sample the training data in order to make the classification a bit more powerful for class Churn = 1? Note 1: Need more help to understand the problem? Check this blog post out. Note 2: This problem is hard: do not expect to see a major performance increase for class Churn = 1. Also, verifying if the performance increase is statistically significant will not be trivial. Still... give this challenge your best try!
Author: Aline Bessa
Solution Summary: To tackle this challenge, we again relied on our AutoML component to pick a suitable model, and also played with different strategies to oversample the minority class using our SMOTE node. We were able to increase Cohen's kappa to 81.1% while maintaining basically the same accuracy we had before. The best solution we found again involved Gradient Boosted Trees -- this time, with 90 trees -- and using SMOTE with the 10 nearest neighbors to oversample the minority class. Note: We did not assess whether this improvement in Cohen's kappa is statistically significant.
Solution Details: After using two instances of the CSV Reader node to read both training and test datasets, we decided to use the SMOTE node to oversample the minority class in the training data. We then fed the oversampled data to the AutoML component and, just like last week, sent the best found model and the test data to the Workflow Executor node, which generated churn predictions. We then used the Scorer node to check how well the model performed over the test data. Since the SMOTE node has a parameter to control the number of nearest neighbors used to generate new samples, we decided to experiment with it. To this end, we used the Table Creator node to come up with a range of numbers of nearest neighbors, and then placed the body of this workflow (SMOTE/learning/predicting/scoring) in a loop that would generate solutions for these different numbers of nearest neighbors. The loop used the Table Row to Variable Loop Start and Loop End nodes. After executing this loop, we used the Top k Selector node to get the solution with the best Cohen's Kappa value, and then only kept informative columns associated with this solution by using the Column Filter node.
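For intuition, here is a toy, from-scratch stand-in for what SMOTE does — interpolating between a minority sample and one of its k nearest neighbours. It is not the SMOTE node's implementation, just a sketch of the idea behind the nearest-neighbours parameter we looped over:

```python
import numpy as np

def smote_like(X_min, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    a random minority point and one of its k nearest neighbours."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to all points
        neighbours = np.argsort(d)[1:k + 1]           # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                            # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Four minority samples in the unit square; create six synthetic ones.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like(X_min, n_new=6, k=2, rng=42)
print(X_new.shape)  # (6, 2)
```

Because each synthetic point is a convex combination of two existing minority points, it always lies between them — the same property that makes SMOTE's oversampled data plausible.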
Description: Just like in last week’s challenge, a telecom company wants you to predict which customers are going to churn (that is, going to cancel their contracts) based on attributes of their accounts. One of your colleagues said that she was able to achieve a bit over 95% accuracy for the test data without modifying the training data at all, and using all given attributes exactly as they are. Again, the target class to be predicted is Churn (value 0 corresponds to customers that do not churn, and 1 corresponds to those who do). What model should you train over the training dataset to obtain this accuracy over the test dataset? Can this decision be automated? Note 1: A simple, automated solution to this challenge consists of a mix of 1 component and 4 nodes. Note 2: In this challenge, do not change the statistical distribution of any attribute or class in the datasets, and use all available attributes. Note 3: Need more help to understand the problem? Check this blog post out.
Author: Aline Bessa
Solution Summary: We used the AutoML verified component to experiment with a variety of classifiers (and hyperparameter values) in order to find the best suited model for this problem. The chosen model was Gradient Boosted Trees, in a technical tie with XGBoost Trees and H2O’s GBM. This model achieved an accuracy of 95.1% over the test data. Note: We would need more data than what we currently have to robustly verify if Gradient Boosted Trees is statistically better than the other models. Small variations on the training dataset, for example, could lead to a different model choice.
Solution Details: After reading the training and test datasets with two instances of the CSV Reader node, we used the AutoML verified component to detect the best suited model based on the training dataset at hand. Next, we used the Workflow Executor node to apply the chosen model over the test data. Finally, we used the Scorer node to check how well the model classified the test instances.
Description: A telecom company wants you to predict which customers are going to churn (that is, are going to cancel their contracts) based on attributes of their accounts. To this end, you are expected to use a decision tree classifier. The company gives you two datasets (training and test), both with many attributes and the class ‘Churn’ to be predicted (value 0 corresponds to customers that do not churn, and 1 corresponds to those who do). You should train the decision tree classifier with the training data, and assess its quality over the test data (calculate the accuracy, precision, recall, and confusion matrix for example). Note 1: This challenge is a simple introduction to predictive problems, focusing on classification. You are expected to just apply a decision tree classifier (and get an accuracy of about 92%). A simple solution should consist of 5 nodes. Note 2: In this challenge, do not change the statistical distribution of any attribute or class in the datasets, and use all available attributes. Note 3: Need more help to understand the problem? Check this blog post out.
Author: Aline Bessa
Solution Summary: Using the learner-predictor paradigm, we trained a decision tree classifier over the training data and assessed its performance over the test data. When training the decision tree, we used Gini index as a metric for the quality of the decision tree, pruned it using the MDL method, and kept at least 6 records per node. By doing this, we achieved an accuracy of about 94%.
Solution Details: After reading the training and test datasets with two instances of the CSV Reader node, we used the Decision Tree Learner node to train a decision tree classifier, and the Decision Tree Predictor node to apply it over the test data in order to assess its performance. Finally, we used the Scorer node to check how well the model classified the test instances. Note: Decision tree models have a number of parameters that can be tuned in order to generate better models. We'll be discussing parameter tuning and model selection later in this series of challenges.
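A rough scikit-learn analogue of the Learner/Predictor pair is sketched below. The churn dataset is not bundled here, so toy separable data stands in, and MDL pruning has no direct scikit-learn equivalent; min_samples_leaf=6 loosely mirrors the "at least 6 records per node" setting:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy stand-in for the churn data: class 1 when x + y > 0.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(100, 2))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

# Gini index as the split criterion, at least 6 records per leaf.
model = DecisionTreeClassifier(criterion="gini", min_samples_leaf=6,
                               random_state=0)
model.fit(X_train, y_train)                           # the Learner node
acc = accuracy_score(y_test, model.predict(X_test))   # Predictor + Scorer
print(round(acc, 3))
```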
Description: In this challenge you will create visualizations to inspect different sexual orientation laws around the world, as reported by the State-Sponsored Homophobia report. The purpose of this report is to work as a useful tool for LGBTQIA+ human rights defenders. Here are a few questions that your visualizations should help answer:
1. In what decade did most countries decriminalize homosexuality?
2. Is the decriminalization of homosexuality becoming more common over time?
3. In what countries is same-sex marriage legal?
4. Is there a continent in which the legality of same-sex marriage is more common? And how about civil unions?
Author: Aline Bessa
Solution Summary: After reading the dataset on sexual orientation laws, we binned the temporal information associated with the decriminalization of homosexuality, and then grouped countries by period. This information was then used to analyze whether the decriminalization of homosexuality is becoming more common over time. The data indicates that the 20th and 21st centuries concentrate most of the decriminalizations -- in particular, the period from 1990 to 1999. We also used maps to better understand the current legality of same-sex marriages and civil unions. Both are still illegal in most of the world, and legality seems to be more common in Oceania and in the Americas.
Solution Details: After using the CSV Reader node to read the dataset for the challenge, we used the Rule Engine node to bin the decriminalization information per decade of the 20th century, and then used the GroupBy node to group countries by decriminalization period. Next, we used the Row Filter node to remove periods associated with no decriminalization, and then used the Top k Selector node to sort the grouped data: periods with more decriminalization ranked higher in this sorted dataset. We also used a sequence of Rule Engine and GroupBy nodes to group countries by larger periods of time -- the goal was to visualize less granular trends in decriminalization. Here, we needed to use the String to Number node in the beginning of the pipeline for formatting. Finally, we used the Column Expressions node to create different codes for countries, depending on the legality of same-sex marriages or civil unions in them. In the end, we sent these three different sources of information to the Visualize LGBTQIA+ Rights component, which implements bar plots for decriminalization periods and choropleth maps for the legality of same-sex marriages and civil unions across the world.
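The bin-then-group step translates to a few lines of pandas; the country rows and column names below are invented for illustration, not taken from the report:

```python
import pandas as pd

# Hypothetical excerpt; the report's real column names differ.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "decriminalized": [1971, 1994, 1993, 2003, 1969],
})

# Bin the decriminalization year into decades (the Rule Engine step) ...
df["decade"] = (df["decriminalized"] // 10) * 10

# ... then count countries per decade and rank (GroupBy + Top k Selector).
per_decade = df.groupby("decade")["country"].count().sort_values(ascending=False)
print(per_decade.index[0])  # decade with the most decriminalizations
```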
Description: You are interning for a travel agency that wants to know about their top spenders' eating habits. As a trial run, you are given a tiny dataset with 6647 rows containing information on 1011 unique citizens traveling with different purposes. In this challenge you will:
Find the top 10 participants spending the highest amount of money on eating.
Find out whether the people who spend the most money on eating are the same people who spend the most time eating.
Note: Sometimes the ending balance is more than the starting balance. Assume this was a mistake when calculating money spent.
Author: Ahmad Varasteh
Dataset: Eating Out Data in the KNIME Hub
Solution Summary: Using date and time manipulation, we were able to calculate the time spent per activity as well as the amount of money spent. We then aggregated our data and kept only the entries related to eating. We then took the top 10 money spenders and the top 10 time spenders and compared their ids with a join. The results of the join showed that spending a lot of time eating does not necessarily mean spending a lot of money on it.
Solution Details: Critical to this analysis was to change the checkInTime and checkOutTime columns to the datetime type. To this end, we used the String to Date&Time node -- but note that we could also have changed the type of the columns within the CSV Reader node by clicking on the Transformation tab and then selecting the appropriate type. Once we had our data in a date format, we were able to use the Date&Time Difference node, which made it possible to calculate the difference between check-in and check-out times. With the Math Formula node, we were able to subtract the balance columns to see how much money was spent on each visit. We noticed that some of these events led to monetary gains (which was probably a mistake, so we used the "abs" function to derive the absolute value of the spent amount). Next, we used the GroupBy node to get the sum of money and time spent per customer. Then we used the Nominal Value Row Filter node to limit our data to eating events only. The Top k Selector node then allowed us to get a list of the top 10 money/time spenders. Finally, using the Joiner node we compared whether the participant ids matched. We noted there was little overlap, suggesting that spending more time eating does not necessarily mean that more money will be spent.
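In pandas, the core of this analysis — time difference, absolute amount spent, per-person totals, and the top-k comparison — can be sketched like this; the mini-table below is invented for illustration:

```python
import pandas as pd

# Hypothetical mini-dataset; the real one has 6647 rows and more columns.
df = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "activity": ["eating", "eating", "eating", "shopping"],
    "checkInTime": pd.to_datetime(["2022-01-01 12:00", "2022-01-02 12:00",
                                   "2022-01-01 12:00", "2022-01-01 09:00"]),
    "checkOutTime": pd.to_datetime(["2022-01-01 13:30", "2022-01-02 12:20",
                                    "2022-01-01 12:40", "2022-01-01 10:00"]),
    "startBalance": [50.0, 30.0, 40.0, 90.0],
    "endBalance": [40.0, 35.0, 10.0, 60.0],
})

# Time spent per visit, and amount spent (abs handles the balance mistakes).
df["minutes"] = (df["checkOutTime"] - df["checkInTime"]).dt.total_seconds() / 60
df["spent"] = (df["startBalance"] - df["endBalance"]).abs()

# Keep eating events only, total per person, then compare the top entries.
eating = df[df["activity"] == "eating"]
per_person = eating.groupby("id")[["minutes", "spent"]].sum()
top_time = per_person.nlargest(1, "minutes").index   # top time taker(s)
top_money = per_person.nlargest(1, "spent").index    # top money spender(s)
print(top_time[0], top_money[0])
```

In this toy example the top time taker and the top money spender are different people, echoing the workflow's finding.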
Description: You work for a hospital and they have data for each time a patient was seen. In this challenge, you will calculate the difference between each time a patient was seen excluding weekends (called "network days"). Once you calculate the network days, calculate the average network days per patient. For the challenge, experiment with the input and output below.
Patient  Date        Network Days  Mean
Aline    11/01/2022  ?             23.333
Aline    12/02/2022  24            23.333
Aline    25/02/2022  10            23.333
Aline    15/04/2022  36            23.333
Victor   05/02/2022  ?             13.333
Victor   25/02/2022  15            13.333
Victor   15/03/2022  13            13.333
Victor   30/03/2022  12            13.333
Note: if you simply use the Date&Time Difference node, you will mix dates across patients and will also end up counting weekends.
Bonus Challenge: Create a solution without using loops.
Author: Victor Palacios
Solution Summary: First, we changed our string column to the datetime format so that we could calculate temporal differences. To avoid loops, we then used lag columns to tag where the patient changes. Afterwards, we used another lag column to calculate the differences between days. Next, we used a component from the KNIME Forum that allowed us to calculate network days using math formulas. Finally, we calculated averages with a group-by and joined the result with our data.
Solution Details: Transforming our data with the String to Date&Time node allowed us to use operational nodes like Date&Time Difference. The key to this challenge was to make use of the Lag Column node to avoid loops. The Lag Column node generates a duplicate column that is off by one or more rows, which allows for direct comparison between a column's data and its previous values. That is, we were able to calculate the difference between visits by comparing each date seen with the previous date seen. Not only that, but the lagged column can also pinpoint where the patient changes, using a Rule Engine node that simply compares the current row with the lagged row. While some participants used another programming language for the network-day calculation, we found a component on the KNIME Forum with all the mathematical rules for calculating network days. This component used several nodes, such as Date&Time Difference, Math Formula, and Rule Engine. To keep patient data from being mixed, we then used another Rule Engine node to remove differences calculated across patients. With the GroupBy node, we were then able to find the mean network days per patient. Finally, we joined the grouped data with our original data and used the Column Filter node to show only the relevant columns in the final output.
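NumPy ships a business-day counter that compactly reproduces the numbers in the example table; the sketch below follows the Excel NETWORKDAYS convention of counting both endpoints (adding one day to the end date), which is what matches those values, and uses a per-patient shift as the lag:

```python
import numpy as np
import pandas as pd

# Visit dates for one patient, dd/mm/yyyy as in the example table.
df = pd.DataFrame({
    "patient": ["Aline"] * 4,
    "date": pd.to_datetime(
        ["11/01/2022", "12/02/2022", "25/02/2022", "15/04/2022"],
        dayfirst=True),
})

# Lag the date within each patient (the Lag Column step) ...
df["prev"] = df.groupby("patient")["date"].shift(1)
mask = df["prev"].notna()

# ... then count weekdays between consecutive visits. busday_count excludes
# the end date, so adding one day makes both endpoints inclusive.
df.loc[mask, "network_days"] = np.busday_count(
    df.loc[mask, "prev"].values.astype("datetime64[D]"),
    (df.loc[mask, "date"] + pd.Timedelta(days=1)).values.astype("datetime64[D]"),
)
print(df["network_days"].dropna().tolist())  # [24.0, 10.0, 36.0]
```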
Description: A survey was done to ask employees to rate themselves on a scale of 1-5 on different technologies and skills. We would like you to analyze this data and present it in a list where we are providing primary, secondary, and tertiary skills. Primary skills can be defined as skills rated 5; secondary skills rated 3 and 4; and tertiary skills rated 2. The final list should appear as it does on the KNIME forum where this challenge was suggested. Note: There may be cases in which employees have not rated themselves more than 3 on any of the skills (i.e., they have no primary skills).
Dataset: Skill Survey Data in the KNIME Hub
Solution Summary: We started our solution by restructuring the input dataset, turning the skill columns into values of a single column. Next, we filtered the data to only keep the primary skills (rated 5) for each employee. We did something similar for secondary (rated 3 or 4) and tertiary skills, and then joined this information at the end. This led to a dataset in which it is possible to quickly see all primary, secondary, and tertiary skills of each employee.
Solution Details: After reading the challenge’s dataset with the Excel Reader node, we used the Unpivoting node to turn each skill column into a value of a single column. This facilitated the grouping of skills per employee. To perform this grouping for primary skills, for example, we used (1) the Range Filter node to only keep the employees’ skills that were rated 5, (2) the GroupBy node to concatenate all employees’ primary skills into a single row, and then (3) the Column Rename node to rename the concatenated column as “Primary Skills”. We used the same sequence of nodes for secondary and tertiary skills, ending up with columns “Secondary Skills” and “Tertiary Skills” respectively. We then used two Joiner nodes to join all skill tables into a single dataset, such that all primary, secondary, and tertiary skills of each employee are listed in a single row.
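The unpivot-filter-concatenate pattern translates directly to pandas; the survey rows below are invented, and the real file's columns differ:

```python
import pandas as pd

# Hypothetical survey answers; real column names differ.
wide = pd.DataFrame({
    "employee": ["Ana", "Ben"],
    "Python": [5, 3],
    "SQL": [4, 2],
    "Excel": [2, 5],
})

# Unpivot skill columns into rows (the Unpivoting step).
long = wide.melt(id_vars="employee", var_name="skill", value_name="rating")

def skills_for(ratings, label):
    """Filter by rating tier and concatenate skills per employee."""
    sub = long[long["rating"].isin(ratings)]
    return sub.groupby("employee")["skill"].agg(", ".join).rename(label)

primary = skills_for({5}, "Primary Skills")
secondary = skills_for({3, 4}, "Secondary Skills")
tertiary = skills_for({2}, "Tertiary Skills")

# Join the three tables; an employee may lack a tier entirely (NaN).
result = pd.concat([primary, secondary, tertiary], axis=1)
print(result)
```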
Description: A common problem in mechanical and medical data is associating notes with a category. In this challenge, you will automate the process of sorting mechanical notes into their correct category. Here’s an example:
--List of Categories--
1. The product was defective.
2. A crack was caused by client.
3. Many scratches noted.
--Categorized Notes (Expected Output)--
1. The product was defective. Defect
2. A crack was caused by client. Crack
3. Many scratches noted. Scratch
Don't worry about using fancy machine learning or natural language processing models. This problem can be handled reasonably well using a total of 5 nodes (simple solution), and a more refined solution involves just 8 nodes (complex solution). Also don't worry about getting 100% accuracy.
Author: Victor Palacios
Solution Summary: After reading the inspection notes and the categories, a simple solution consisted of running a similarity search between categories and inspection notes to find which of the former best corresponded to each of the latter. A more complex solution involved lowercasing both categories and notes to improve matching, and then running a regular expression matcher to find all categories that correspond to each note (instead of just one category). Note: We did not implement any spellchecking in our solution, which would further increase matching quality.
Solution Details: (Simple solution) After reading the inspection notes and categories with two Excel Reader nodes, we used the Similarity Search node to find the category that best matched each note. Next, we used the Joiner node to actually perform the match between most similar categories and notes, and then used the Column Filter node to remove temporary columns. (Complex Solution) We started by reading and lowercasing the inspection notes and categories with two Excel Reader and two String Manipulation nodes. Note that lowercasing both inputs was a cheap way of finding more matches between categories and notes. Next, we used the GroupBy node to create a concatenated regular expression containing all categories, and used the Table Row to Variable node to convert this regular expression into a variable. We then used the Regex Find All shared component to find all categories that corresponded to each inspection note, and finally used the Split Collection Column node to put each matched category into a separate column.
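The complex solution's regex idea can be sketched in Python; the category stems and notes below are illustrative, not the challenge's actual data:

```python
import re

categories = ["defect", "crack", "scratch"]          # already lowercased stems
notes = ["The product was defective.",
         "A crack was caused by client.",
         "Many scratches noted."]

# Concatenate all categories into one alternation regex (the GroupBy +
# Table Row to Variable steps), then find every matching category per note.
pattern = re.compile("|".join(categories), flags=re.IGNORECASE)

for note in notes:
    print(note, "->", pattern.findall(note))
```

Matching stems rather than full sentences is what lets "defective" and "scratches" hit their categories without any machine learning.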
Description: Housing prices in America are a bit out of control. In this challenge, you will see how housing prices have changed through time. Here are the tasks for this challenge:
1. Read the dataset for this challenge directly from its link with the CSV Reader node, without downloading any files to your local machine.
2. Using the monthly data, calculate yearly data and visualize the 3 RegionNames with the most drastic changes. Note: Consider the year the data was collected until the most recent year to calculate change; feel free to ignore missing values.
3. Find the state (regardless of region) with the lowest average home prices currently.
4. According to Wikipedia, "The bankruptcy of Lehman Brothers on September 15, 2008 was the climax of the subprime mortgage crisis." What visualizations can show the effect this had on the housing market? We used our answer to question 2 to keep our visualizations clutter-free.
Author: Victor Palacios
Solution Summary: After reading the housing prices dataset, and reshaping it such that each row had a region, price value, and date, we created three components to answer questions 2 to 4. To answer question 2, we grouped the dataset by region and calculated the mean housing price per year. Next, we looped over the regions getting the housing price differences between the first and the last years in the dataset. We then selected the 3 regions with the most drastic changes and plotted them with a bar chart. To answer question 3, we got the mean housing price per state in the last year and then selected the state that currently has the lowest average housing price. Finally, to answer question 4 we got the average housing price per year for the 3 regions that changed the most (output of question 2), and then plotted the prices as line plots.
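A pandas sketch of the question-2 logic — yearly means, then the change between the first and last year per region; region names, dates, and prices are made up:

```python
import pandas as pd

# Hypothetical long-format data, as after reshaping: one row per
# region, date, and price.
df = pd.DataFrame({
    "region": ["A"] * 4 + ["B"] * 4,
    "date": pd.to_datetime(["2000-01-31", "2000-02-29",
                            "2020-01-31", "2020-02-29"] * 2),
    "price": [100, 110, 400, 420, 100, 100, 150, 160],
})

# Mean housing price per region and year.
df["year"] = df["date"].dt.year
yearly = (df.groupby(["region", "year"])["price"].mean()
            .reset_index().sort_values("year"))

# Change between the first and the last year in the data, per region.
deltas = (yearly.groupby("region")["price"].last()
          - yearly.groupby("region")["price"].first()).sort_values(ascending=False)
print(deltas.head(3))  # the 3 regions with the most drastic changes
```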
Description: You were asked to split a single sales CSV file into smaller ones based on groups, named according to the group names. As an example, if the original file had the following data:
You would generate two files: one named Group_a, and one named Group_b. They would have the following structure:
Your solution to this task should be generic — that is, it should work for any number of groups, and the names of the groups should not matter.
Author: Victor Palacios
Solution Summary: We addressed this challenge by looping over the sales groups and writing a smaller CSV file with all the sales information associated with each group.
Solution Details: After reading the sales file with the CSV Reader node, we used the Group Loop Start node to start iterating over the groups. Next, we used the Table Row to Variable node to get each group's name, one at a time, as a flow variable. We then used the String Manipulation (Variable) node to create a filename containing the group name. Next, we used the String to Path (Variable) node to create a relative path location for each filename, and finally wrote the smaller CSV files to those locations. We closed our loop with the Variable Loop End node.
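The same split-by-group logic, sketched with pandas (writing into a temporary directory here instead of the workflow's relative paths; the sales table is invented):

```python
import tempfile
from pathlib import Path
import pandas as pd

# Hypothetical sales table; the real file would be read from CSV.
sales = pd.DataFrame({"group": ["a", "a", "b"],
                      "amount": [10, 20, 5]})

out_dir = Path(tempfile.mkdtemp())

# One CSV per group, named after the group (the Group Loop Start idea):
# groupby yields each group name together with its rows.
for name, chunk in sales.groupby("group"):
    chunk.to_csv(out_dir / f"Group_{name}.csv", index=False)

print(sorted(p.name for p in out_dir.iterdir()))  # ['Group_a.csv', 'Group_b.csv']
```

Because the filenames come from the group values themselves, this works for any number of groups with any names.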
Description: Given a text-based PDF document with a table, can you partially extract the table into a KNIME data table for further analysis? For this challenge we will extract the table from this PDF document and attempt to partially reconstruct it within KNIME. The corresponding KNIME table should contain the following columns: Day, Max, Min, Norm, Depart, Heat, and Cool. Note 1: Your final output should be a table, not a single row with all the relevant data. Note 2: The Tika Parser node is better suited for this task than the PDF Parser node. We completed this task without components, regular expressions, or code-snippet nodes. In fact, our solution has a total of 10 nodes, but labeling the columns required a bit of manual effort.
Author: Victor Palacios
Solution Summary: After extracting the content of the PDF with one of KNIME’s text-parsing functionalities, we processed it in order to get the table data in the right format (cell by cell, with the right number of columns and rows) and to extract the table’s column names. We then re-formatted the column names, merged them with the table’s body, and removed the columns that were not specified in the challenge.
Solution Details: We started our solution by using the Tika Parser node to read the given PDF file. Next, we used the Cell Splitter node to split the content at newlines, which already helped separate the table’s rows. Since the output of the Cell Splitter node is a table with a single, very long row (with each cell corresponding to a line), we used the Transpose node to turn each column into a row, facilitating the postprocessing. Next, we used another instance of the Cell Splitter node to further split the data by space, further separating the cells that are present in the PDF table. This processing led to a table with messy Row IDs, so we used the RowID node to reset them. Next, we used the Row Filter node to isolate those rows that corresponded to the table in the PDF, and then used the Extract Table Spec node to get the table’s column names. Finally, we used the Insert Column Header node to combine the output of the Row Filter node with the fixed column names (copied and pasted within an instance of the Table Creator node), and then used the Column Filter node to remove the columns that were not specified in the challenge.
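The split-at-newlines-then-split-at-spaces idea looks like this in plain Python; the text snippet stands in for the Tika Parser output, and real climate-report lines are messier:

```python
# Stand-in for the text extracted from the PDF; the column names are the
# ones requested by the challenge, assigned from the header line.
text = """Day Max Min Norm Depart Heat Cool
1 40 26 33 0 32 0
2 38 25 31 -1 34 0"""

lines = text.splitlines()                 # the first Cell Splitter (newlines)
header = lines[0].split()                 # column names
rows = [dict(zip(header, line.split()))   # the second Cell Splitter (spaces)
        for line in lines[1:]]
print(rows[0])
```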
Description: You are working with audio recognition data tables such that each table’s initial and final rows contain zeroes. These beginning and ending zeroes are irrelevant noise that must be removed. Here is an example of such a data table:
Your goal is to remove only these unnecessary starting and trailing zeroes. You must keep the zeroes in the middle of the data, which do not constitute noise. Also, the position of the starting and trailing zeroes differs per data table, so a good solution probably requires flow variables for flexibility (although there may be other solutions that do not involve flow variables).
Author: Victor Palacios
Solution Summary: Our solution to this challenge was split into two parallel steps: first, we identified the RowID number of the first non-zero entry in the data; in a parallel branch of the workflow, we reversed the RowID numbers to find the one that corresponded to the last non-zero entry in the data. These two special RowID numbers, which were captured by two flow variables, were used to remove the unnecessary starting and trailing zeroes in the data.
Solution Details: After reading the example data for the challenge with the CSV Reader node, we used the String Manipulation node to create a temporary column named row_number, which contained the RowID numbers, and the String to Number node to convert the values in row_number into integers. This column came in handy later on in the solution, making the detection of rows with starting and trailing zeroes a bit more convenient. After this preprocessing, we used the Row Filter node to remove all rows whose data entries corresponded to zero, just to facilitate the detection of the first RowID number that was linked to a non-zero data entry. To perform this detection we used the Table Row to Variable node, which stored this RowID number into a flow variable. This flow variable and the output of the String to Number node were then passed to another instance of the Row Filter node, which combined the information to remove all data entries with starting zeroes. In parallel, the data with no zero entries (output of the first instance of the Row Filter node) was sorted to reverse the RowID numbers. Next, a second combo of Table Row to Variable and Row Filter nodes was used to identify the last RowID number tied to a non-zero data entry, and to remove all trailing zeroes from the data. Finally, we used the Column Filter node to remove the temporary row_number column.
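Outside KNIME, the same trimming idea (drop leading and trailing zeroes, keep interior ones) can be sketched in Python:

```python
def trim_zeros(values):
    """Remove leading and trailing zeroes, keeping interior zeroes intact."""
    nonzero = [i for i, v in enumerate(values) if v != 0]
    if not nonzero:          # all-zero input: nothing but noise
        return []
    # Slice from the first to the last non-zero entry, inclusive.
    return values[nonzero[0]:nonzero[-1] + 1]

print(trim_zeros([0, 0, 5, 0, 3, 0, 0]))  # → [5, 0, 3]
```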
Description: Your company keeps data related to online and onsite transactions in a tabular dataset with the following format:
Index  Online  Onsite
7      F5454   7736-01
In this challenge, you are asked to extract digits from the transactions (which are related to the bought products) given the following guidelines: (1) if the onsite transaction starts with “L”, then take its first 12 digits; otherwise, take its first 6 digits; and (2) if the onsite transaction has a missing value, then take the string from the online transaction.
What is the most efficient way to perform this task? For the example above, you should produce the following output column:
Author: Victor Palacios
Dataset: For this challenge, create the dataset illustrated above as input with the Table Creator node.
Solution Summary: To tackle this challenge we started by handling missing values, which helped determine whether column Online or Onsite should be used. Next, we implemented rules to extract the right number of digits from each onsite transaction. We generated a few temporary columns in the process, which we then removed at the end of our solution.
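For reference, the extraction rules can be sketched in Python. The helper below is illustrative only, and assumes that "digits" means the digit characters of the string:

```python
def extract_code(online, onsite):
    """Apply the challenge's rules: missing onsite → use the online string;
    onsite starting with 'L' → first 12 digits; otherwise → first 6 digits."""
    if onsite is None:                       # rule (2)
        return online
    digits = "".join(ch for ch in onsite if ch.isdigit())
    n = 12 if onsite.startswith("L") else 6  # rule (1)
    return digits[:n]

print(extract_code("F5454", "7736-01"))  # → 773601
```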
Description: You have been working for a Life Sciences company for a month as a data wrangler. Several coworkers from the Biology department would like to obtain a list of human genes related to specific hormones, but they do not know how to use REST services, GET requests, etc. Your task is to use the REST service provided by MyGene.info to obtain a list of human genes related to a list of hormones provided to you by your coworkers. Next, you should parse the JSON response into a table that is easy to read.
For example, if you use "http://mygene.info/v3/query?q=summary:" and append "insulin", then your request would return a JSON structure with 10 hits -- each one of them with the following fields: "_id", "_score", "entrezgene", "name", "symbol", and "taxid".
You should then parse this JSON into a table with columns "_id", "_score", "entrezgene", "name", "symbol", and "taxid". If the list provided by your coworkers contains more than one hormone, all the parsed information should be aggregated into a single table. Also, sometimes your request may return a response in XML instead of JSON. How could you include a way to also parse XML responses?
Need a tip or two? See our YouTube video on REST APIs.
Author: Victor Palacios
Solution Summary: To tackle this challenge, we started by reading a list of hormones and then, for each hormone, we formatted and executed a GET request. The user should then inspect the result, determine whether the responses were in XML or JSON, and then execute a component we created to control the execution of the rest of the workflow. We then implemented solutions to parse responses of both types (XML and JSON).
Solution Details: We started our solution by using the CSV Reader node to read the list of hormones. Next, we used the String Manipulation node to create a GET request URL per hormone, and then the GET Request node to execute them. We then created a component named Pick Body Type (JSON or XML) so that the user could control what parsing solution should be used over the GET response. If the user types 0, the JSON parsing is invoked; if 1, the XML parsing. For the JSON parsing, we used the JSON Path node followed by the Ungroup node to get row-wise records. For the XML parsing, we used the JSON to XML node to simulate an XML response (none of the hormones in the dataset returned XML) followed by the XPath node to directly obtain row-wise records.
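The request-and-parse logic can also be sketched in Python. The snippet only assumes the query endpoint and fields mentioned in the challenge; `query_gene` needs network access, while `parse_hits` works on any response text:

```python
import json
import urllib.request

BASE = "http://mygene.info/v3/query?q=summary:"
FIELDS = ["_id", "_score", "entrezgene", "name", "symbol", "taxid"]

def parse_hits(response_text):
    """Flatten the 'hits' array of a MyGene.info JSON response into rows."""
    hits = json.loads(response_text).get("hits", [])
    return [{f: hit.get(f) for f in FIELDS} for hit in hits]

def query_gene(hormone):
    """Execute the GET request for one hormone (requires network access)."""
    with urllib.request.urlopen(BASE + hormone) as resp:
        return parse_hits(resp.read().decode())
```

Aggregating several hormones is then just a matter of concatenating the row lists returned by `query_gene`.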
Description: Your support team would like to get some insight into the quality of their service and they ask you to analyze their support data. As a first step, they would like to find out how the time to answer a support ticket is distributed, and what the mean of this distribution is. Help your support team by using the dataset included in this challenge to (1) plot this time distribution as a histogram, and to (2) calculate its mean.
Author: Kathrin Melcher
Solution Summary: Our solution to this challenge was pretty straightforward. After reading the ticket data, we calculated the difference between the creation and the closing timestamps for each ticket (in minutes). Next, we computed the mean time and the time distribution to close a ticket focusing on those tickets that were already closed.
Solution Details: We started by reading the data with the Table Reader node. Next, in order to compute time differences, we used the String to Date&Time node to convert creation and closing timestamps into Date&Time cells. We then used the Date&Time Difference node to calculate the time that it took to close each ticket — naturally, tickets that were still open have this value undefined. Since our focus was on tickets that have this value defined (closed tickets), we used the Row Splitter node to separate them. Next, we sent the closed tickets to the Math Formula and the Histogram nodes to calculate the mean time and the time distribution to close tickets respectively.
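The underlying computation (time difference per closed ticket, then the mean) can be sketched in Python; the timestamps below are invented:

```python
from datetime import datetime
from statistics import mean

# Hypothetical ticket data: (created, closed) timestamps; None = still open.
tickets = [
    ("2022-01-01 09:00", "2022-01-01 09:45"),
    ("2022-01-01 10:00", "2022-01-01 10:30"),
    ("2022-01-02 08:00", None),               # open ticket, excluded
]

fmt = "%Y-%m-%d %H:%M"
# Minutes to close, computed only for tickets that are already closed.
minutes = [
    (datetime.strptime(closed, fmt) - datetime.strptime(created, fmt)).total_seconds() / 60
    for created, closed in tickets if closed is not None
]
print(mean(minutes))  # → 37.5
```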
Description: Your coworker turns to you and says that she is going to retire. You laugh because she is 30 years old. She is serious. To understand how she got to this decision, you will create a KNIME component named “Financial Tracker_YOURNAME” (replace YOURNAME with your name). The component should use widgets to get the following input:
a person's monthly expenditure amount
their target age to retire
The output of the component should be how much money they need to have in order to retire at the target age. For simplicity, use this formula in your component:
amount_to_retire = (100 - target_age) * monthly_expenditure_amount * 12
Use your component to figure out if 2,000,000 dollars is enough for your coworker to retire, given that she spends 4,000 dollars per month. To keep this challenge simple, do not consider inflation, compounding interest, or part-time work in retirement.
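As a quick sanity check, the formula is easy to evaluate in Python (the target age of 65 below is purely illustrative; the challenge leaves it up to the user):

```python
def amount_to_retire(target_age, monthly_expenditure):
    """Money needed to retire at target_age, per the challenge's formula."""
    return (100 - target_age) * monthly_expenditure * 12

# Spending 4,000/month and retiring at 65:
print(amount_to_retire(65, 4000))  # → 1680000
```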
Are you interested in how we came up with this formula? Check the Trinity study out. In this study, participants needed roughly 25 times whatever they spent yearly to survive for 30 years with a 95% success rate.
Author: Victor Palacios
Solution Summary: To solve this challenge, we created a component in which we first use widgets to capture monthly expense and target age for retirement. Next, we calculated the amount of money required to retire at the target age, given the monthly expense, following the formula given in the challenge. We used another widget to show the calculated value, and also added a “Recalculate” option for users to explore different retirement setups.
Solution Details: We started our solution (component Financial Tracker_KNIME) by using two Integer Widget nodes that capture monthly expense and target age for retirement. Through an interactive interface, users can input values for these two variables. These values are then sent to the Math Formula (Variable) node to calculate the amount of money required for retirement. The output of this node is then sent to a Text Output Widget node, which shows the calculated amount of money in an interactive view. To allow for recalculations, we also used the Refresh Button Widget node in this component.
Description: You would like to post a question on the KNIME forum, but you have confidential data that you cannot share. In this challenge you will create a workflow which removes (or transforms) any columns that reveal anything confidential in your data (such as location, name, gender, etc.). After that, you should shuffle the remaining columns' rows such that each numeric column maintains its original statistical distribution but does not have a relationship with any other column. Rename these columns as well, such that in the end of your workflow they do not have any specific meaning. Let's see an example:
Row  Name    Fav_Num  Muscle_Mass
0    Victor  7        10
1    Aline   3        20
2    Scott   42       30
After anonymization, the table could look like this:
Row  column  column (#1)
0    3       30
1    42      10
2    7       20
Feel free to see our resources on data anonymization for inspiration, but note that the task here is much simpler! For reference, our solution only uses 7 nodes to anonymize the data and 3 additional nodes to make sure the data truly was anonymized.
Author: Victor Palacios
Solution Summary: After reading the FIFA dataset, we removed all columns of type string because they may contain personally identifying information. Next, we looped over the dataset’s columns, shuffling them one by one in a pseudo-randomized fashion. We then gathered the shuffled columns into a single table and anonymized its column identifiers by replacing them with meaningless names. We also compared the data before and after shuffling/anonymizing to make sure that the anonymization worked.
Solution Details: We started our solution by reading the FIFA dataset with the CSV Reader node, followed by the removal of all string columns with the Column Filter node. The goal here was to remove any information that could help users identify the players. Next, we used the Column List Loop Start node to begin looping over the dataset’s columns. Within the loop, we used the Table Row to Variable node to get the name of the column that was currently being processed, and then we sent it to the Target Shuffling node. At this point of the loop, the column’s values were randomly shuffled. We finished the loop with the Loop End (Column Append) node, which aggregates all the shuffled columns into a single table. To finish the anonymization, we used the Column Rename (Regex) node to replace the columns’ identifiers with meaningless names. To make sure that the anonymization worked (that is, that the order of values in the columns was completely and randomly changed in the process), we sent the data before and after anonymization to two Row Filter nodes. Here, we extracted their first columns and compared their values with the Table Difference Finder node. The output of this node shows how the original value order was destroyed in the data processing.
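The shuffling idea itself is simple; here is a hedged Python sketch of the same per-column shuffle with meaningless column names (the data is the toy example from the description, and the naming scheme mimics the one shown there):

```python
import random

def anonymize(table):
    """Shuffle each column independently and strip the column names."""
    shuffled = {}
    for idx, (name, values) in enumerate(table.items()):
        col = values[:]
        random.shuffle(col)  # destroys row-wise relationships between columns
        new_name = "column" if idx == 0 else f"column (#{idx})"
        shuffled[new_name] = col
    return shuffled

data = {"Fav_Num": [7, 3, 42], "Muscle_Mass": [10, 20, 30]}
out = anonymize(data)
print(sorted(out["column"]))  # → [3, 7, 42]  (same distribution, new order)
```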
See our solution in the KNIME Hub
Description: The challenge requires you to recreate the Wordle bar chart for three players. If you have never played Wordle, check this brief introduction to the game. In this challenge, you are expected to do the following:
For the bar chart, each possible number of guesses (1 through 6) must be represented, as well as the number of victories by each player. Check this example with just a single player's data.
Transform the number of guesses [1, 6] into a score [1, 10], but remember that a higher number of guesses should result in a lower score. A missing guess means that the player did not manage to guess the word and should therefore receive a score of 0.
Next, calculate the average of the three players. Who has the best average?
Now consider the difficulty of the words. Assign a weight W to the words in the list of the 306 most difficult words in the English language (file “difficult words.txt”), and recalculate the average of the three players for W=2 and for W=0. Has the player with the best score changed?
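One possible reading of the scoring scheme, a linear rescaling of guesses onto [1, 10], can be sketched in Python; both the rescaling choice and the weighting helper are assumptions for illustration, not the official solution:

```python
def score(guesses):
    """Map a guess count in [1, 6] to a score in [1, 10]; missing → 0.
    A linear rescaling (1 → 10, 6 → 1) is assumed here."""
    if guesses is None:
        return 0.0
    return 10 - (guesses - 1) * 9 / 5

def weighted_average(results, difficult, w):
    """results: list of (word, guesses); difficult: set of hard words;
    hard words get weight w, all others weight 1."""
    weights = [w if word in difficult else 1 for word, _ in results]
    scores = [score(g) for _, g in results]
    return sum(s * wt for s, wt in zip(scores, weights)) / sum(weights)

print(score(1), score(6))  # → 10.0 1.0
```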
Author: Rosaria Silipo
Solution Summary: After reading the players’ data and the list of difficult words, we performed a left outer join between these two datasets using the words as keys. We used this joined data to plot the performance bar chart for the three users, unpivoting and pivoting the data on different columns first. We also used the joined data to compute the simple and the weighted average performance per player. In all cases, player 2 was the winner.
Solution Details: To tackle this challenge, we started by reading the players’ data with the Excel Reader node and the word difficulty data with the CSV Reader node. In both cases, we used the Constant Value Column node to set a standard weight (weight = 1) for words in the players’ data, and to set a specific weight for the difficult words (which can be weight = 0 or weight = 2). After lowercasing the words in the players’ data with the String Manipulation node, we left-outer-joined both datasets (Joiner node) using the words as keys. We started the plotting part of our solution by unpivoting the joined dataset on columns player 1, player 2, and player 3 (Unpivoting node). This gave us the numbers of word attempts per player, and we removed the rows with attempts missing with the Missing Value node. We then re-pivoted the data per player and per number of attempts, ending up with how many words each player guessed right per number of attempts (Pivoting node). After some post-processing, which included replacing missing values in the pivoted table with 0, and setting different colors for each player, we plotted their performances in a bar chart (component Bar Chart in green). In parallel, we also used the joined data to compute the simple and the weighted average performance per player. First, we rescaled their numbers of word attempts to interval [0, 10], such that 0 and 10 were the worst and best scores respectively (component Rescale). Next, we executed the GroupBy node over the rescaled data to calculate the average player performance. Inside component Weighted Average, we processed the rescaled data a bit to use the correct weights depending on the difficulty of the words, and also used the GroupBy node to obtain the weighted average performance. In all cases, player 2 was the winner. Note: Execute the workflow with weight = 0 and weight = 2 for difficult words to obtain both required weighted averages.
We did not add a configuration widget to parametrize weights to keep the solution simple.
Description: In this challenge you will create one or more visualizations to inspect the percentage share of seats held by women in local government, as reported by the United Nations Development Programme: Human Development Reports. The purpose of this report is to “ensure women's full and effective participation and equal opportunities for leadership at all levels of decision-making in political, economic and public life.” Since this is a real dataset there will be missing values, and we expect you to augment the data via your own creativity. In particular, we ask that you not only report country-specific data but also continent data. The true challenge here will be how you transform this incomplete dataset into one or more visualizations, and which visualizations you deem appropriate for this challenge. The challenge outline is as follows:
Download the dataset and decide the most efficient way to bring it into KNIME
Add a continent column using whatever method you think is most efficient
Visualize the data (the KNIME solution will visualize both country and continent related data)
Author: Victor Palacios
Solution Summary: To tackle this challenge, we gathered data on continents and their countries and joined it with the dataset of the challenge, ending up with a dataset that had (1) a country, (2) its continent, and (3) the share of seats held by women per row. Next, we aggregated this dataset to visualize the participation of women in local governments per continent, with a bar chart. Finally, we used the original dataset of the challenge to visualize the participation of women in local governments per country with a choropleth map.
Solution Details: We started our solution by copying and pasting the dataset of the challenge into KNIME with the Table Creator node. The dataset was then immediately fed into the Enhanced Choropleth Map component, exposing how much women participate in local governments country-by-country. Next, we used the CSV Reader node to read a dataset that listed every country in each continent, and joined this data with the initial one with the Joiner node (the join keys were the countries). After that, we used the GroupBy node to aggregate the resulting dataset per continent, extracting the corresponding median share of seats held by women in government. We decided to use the median instead of the mean because the former is more robust against outliers, leading to a more faithful representation of how women participate in governments in each continent. We then post-processed the aggregated data by sorting it in descending order with the Sorter node, and by removing groups with missing values with the Nominal Value Row Filter node. These steps were necessary to generate an adequate bar chart visualization with the Bar Chart node, indicating the share of women participation in government per continent.
Description: A research group is running a study to understand how news consumption has changed lately. They run a survey on 3,000 people, asking where they usually read their news. Participants can indicate up to three options among “Social Media”, “Online Newspapers”, “Printed Newspapers”, and “Radio or Television”, recorded with a numeric code in a CSV file. The respective descriptive names are provided in a small table. Your task is (1) to replace the codes in the survey data with their corresponding descriptive names, and (2) to create an interactive dashboard that shows how many times each option has been selected. The dashboard should allow users to filter for the options they are interested in. Note: Feel free to use this challenge as an opportunity to explore different visualization nodes!
Author: Emilio Silvestri
Solution Summary: We started by reading the survey data and the code dictionary. Next, we processed and grouped the survey data to get how many times each news source was mentioned. We then replaced the codes in this grouped data with their corresponding descriptive names, sorted it based on news source popularity, and moved on to the visualizations. To this end, we used a component to create a dashboard that allows users to select which news sources they want to visualize, and then plotted the sources' frequencies as a donut chart and as a bar chart. Note: Both charts basically convey the same type of information and are redundant. We used them here for didactic purposes, making the dashboard a bit more complex from an implementation viewpoint but still simple as a baseline solution.
Solution Details: We started by reading the survey data with the CSV Reader node, and the code dictionary with the Table Reader node. In order to get how many times each news source was cited in the survey, we started by transforming the string column Source into a list with the Cell Splitter node. Next, we used the Ungroup node to create one row per entry in the list, and then grouped the rows by source type with the GroupBy node. We then used the Cell Replacer node to replace the codes in the grouped data with their corresponding descriptive names, and finally sorted this data based on news source popularity (ascending order). After these steps, we created a dashboard to filter and visualize the news sources' frequencies. To filter the news source options in the dashboard, we started by creating a list of all the different sources with the GroupBy node. Next, we transformed this list into a flow variable with the Table Row to Variable node and fed it into the Multiple Selection Widget node, which used the list as options for visualization. The selection list output by this node was then joined with the news sources' frequencies (input of the dashboard) using the Joiner node. We then plotted the frequencies for the selected news sources using the Pie/Donut Chart node, after making the chart colorblind-friendly with the Color Manager node. Finally, we also plotted these frequencies with the Bar Chart node for didactic purposes.
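The counting step (split the codes, count them, map codes to names) can be sketched in Python with invented survey codes:

```python
from collections import Counter

# Hypothetical survey rows: each respondent lists up to three source codes.
responses = ["1,2", "2", "1,3,4", "2,4"]
names = {"1": "Social Media", "2": "Online Newspapers",
         "3": "Printed Newspapers", "4": "Radio or Television"}

# Split each row into individual codes, then count by descriptive name.
codes = [c for row in responses for c in row.split(",")]
counts = Counter(names[c] for c in codes)
print(counts["Online Newspapers"])  # → 3
```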
Description: In the accounting firm you work for, you are given contracts which were executed on different dates. Your goal is to create a method that can label each of these contracts with their corresponding fiscal year. Note: A fiscal year is a period of 12 months that is used in government accounting, usually for budget purposes and financial reporting. Its definition varies from country to country, so your solution should be flexible and include flow variables for both the start and the end dates of the fiscal year. As an example, the federal fiscal year in the United States is the 12-month period that begins on October 1st and ends on September 30th of the following year.
Author: Victor Palacios
Solution Summary: We started by reading the contract dates and the lookup tables with the defined fiscal years. Next, for each fiscal year, we selected those contracts whose dates fell within it and labeled them accordingly. The process of identifying which contracts fall within a given fiscal year was made flexible with the use of flow variables and the Date&Time-based Row Filter node. After all contract dates were labeled with their fiscal years, we sorted them by date to facilitate the understanding of the data.
Solution Details: We started by reading the contract dates and the lookup tables with the defined fiscal years with two Excel Reader nodes. Next, we converted the contract dates to Date&Time with the String to Date&Time node to facilitate the identification of contracts tied to a given fiscal year downstream. We then started looping over the fiscal years, generating flow variables for their start and end dates, with the Table Row to Variable Loop Start node. Inside the loop, these flow variables were used to select the contracts whose dates fell within their range. To this end, we used the Date&Time-based Row Filter node. Next, the Constant Value Column node was used inside the loop to label all selected contract dates with the current fiscal year label. We closed the loop with the Loop End node, which collected and concatenated the selected contracts at every iteration. We then finished the workflow by sorting the labeled data by contract date, facilitating interpretability. Note: In this solution, contracts that do not fall within the range of any fiscal year get filtered out.
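The date-range check behind the labeling can be sketched in Python. The convention below, that a fiscal year is named after the calendar year in which it ends, is an assumption for illustration:

```python
from datetime import date

def fiscal_year(d, start_month=10, start_day=1):
    """Label a date with its fiscal year; the US federal FY starts Oct 1.
    Dates on or after the start date belong to the FY ending next year."""
    if (d.month, d.day) >= (start_month, start_day):
        return d.year + 1
    return d.year

print(fiscal_year(date(2021, 11, 15)))  # → 2022
print(fiscal_year(date(2022, 3, 1)))    # → 2022
```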
Description: You are using KNIME to monitor the daily price of a product online. After using the Line Plot node to visualize the daily prices you have already gathered, you notice that they are often constant for a certain number of days before changing again. You want to create a new column in the price data you have at hand, named "Change", such that its value is 1 if a daily price changed with respect to the previous day, or 0 if it remained unchanged. For the first daily price in the data, the "Change" value should be 1. As an example, if the initial daily prices look like:
You should end up with data in the following format:
Date        Price  Change
2015-01-01  10     1
2015-01-02  10     0
2015-01-03  11     1
Author: Emilio Silvestri
Dataset: Daily Prices in the KNIME Hub
Solution Summary: To create column "Change" according to the challenge's description, we first created a temporary column, named "Price(-1)", holding the price value for the previous day (that is, a column just like "Price", but with a 1-day lag). Next, we compared the lagged and current prices to determine whether there was a change in value or not, leading to the creation of column "Change". Finally, we removed the temporary column "Price(-1)" from the final dataset.
Solution Details: We started by reading the dataset with the CSV Reader node, and then plotted the values against the dates with the Line Plot node just to get an idea of what the data looks like. Next, we used the Lag Column node, which is the core of the solution, to create column "Price(-1)" based on column "Price": the former holds the values of the latter, but lagged by one day. With this new column at hand, we used the Rule Engine node to generate column "Change" by comparing the lagged values in "Price(-1)" with the current values in "Price". If the values were equal, the value in "Change" would be 0; otherwise, it would be 1. We then ended the workflow by removing the temporary column "Price(-1)" with the Column Filter node.
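The lag-and-compare logic is easy to sketch in plain Python:

```python
prices = [10, 10, 11, 11, 9]

# Compare each price with a 1-step lag; the first row has no predecessor
# and is flagged as a change by definition.
change = [1] + [int(p != prev) for prev, p in zip(prices, prices[1:])]
print(change)  # → [1, 0, 1, 0, 1]
```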
Description: You received the 2017 cancer data from the CDC for inspection, and your goal is to answer the following questions: (1) What are the top-5 most frequent cancer types occurring in females? (2) What are the top-5 most frequent cancer types occurring in males? (3) Which US state has the highest cancer incidence rate (that is, the highest number of cancer cases normalized by the size of its population)?
Author: Janina Mothes
Solution Summary: To find the top-5 most frequent cancer types occurring in (fe)males in the US in 2017, we preprocessed the data to remove aggregated cancer sites that could lead to wrong counts, grouped the data by cancer type, and pivoted by sex. Next, we sorted the data to find the top-5 most frequent cancer types. As for the highest normalized incidence of cancer, we grouped the CDC data by state, read the states population data, and then joined these two datasets on state. We then normalized the number of cancer cases per state by the corresponding population, and sorted the resulting data to find the state with the largest incidence.
Solution Details: We started by reading the CDC data with the CSV Reader node, and noticed that a few cancer sites were aggregated (e.g., "All Invasive Cancer Sites Combined"). Since this could lead to a less refined understanding of what cancer types were most prominent, we filtered them out with a regular expression in the Row Filter node. For simplicity, we did not explore cancer site codes in our solution. Next, we used the Pivoting node to group the data by cancer sites and to pivot it by sex. This generated a table with a breakdown of cancer types and frequencies per sex. To answer the first question, we fed this table to the Sorter node and sorted it in descending order with respect to column Female+Sum(Count), which was previously created by the Pivoting node. We then used the Column Filter node, followed by the Row Filter node, to get the top-5 rows of the resulting table. This corresponds to the most common cancer sites (potentially including some aggregated sites) in females. The solution to the second question is very similar: after using the Pivoting node, we sorted the resulting table in descending order with respect to column Male+Sum(Count) with the Sorter node, and used the Column Filter and Row Filter nodes to get the top-5 most common cancers in males. To answer the third question, we used an Excel Reader node to read the 2017 US states population data, cleaned the names of the states with the String Manipulation node, aggregated the CDC data by state with the GroupBy node, used the Joiner node to combine the grouped CDC data with the US states population data, normalized the number of cancer cases per state by their population with the Math Formula node, and used the Sorter node to sort the resulting table in descending order with respect to the normalized cancer incidence. Finally, we used the Row Filter node to extract the first row of this sorted table, which corresponds to the US state with the highest normalized incidence of cancer. 
Note: Depending on how you clean and aggregate the data, you may obtain different results.
Description: A pharmaceutical company used to keep its sales data in a CSV file. They ask for your help to split the data into monthly CSV files and to save them into different subfolders, where each subfolder corresponds to a year. As an example, the data for January 2015 should be stored in a file named 2015/January.csv. In your solution, remember to save the files and subfolders in the workflow data folder.
Author: Emilio Silvestri
Datasets: Sales Data in the KNIME Hub
Solution Summary: This challenge required you to read the CSV file, extract the year and month of the entries, and then implement a loop that, by iterating through years and months, creates yearly subfolders and monthly sales files inside them.
Solution Details: We started by reading the CSV file using the CSV Reader node. Next, we used nodes String to Date&Time to convert the dates in the original file to the correct type, and Extract Date&Time Fields to extract the dates’ corresponding years and months. We then used the Group Loop Start node to iterate over groups of rows that had the same year and month. For each batch of rows, we (1) used nodes Number to String, Table Row to Variable, and String to Path (Variable) to convert their year into the string that is the name of their subfolder, (2) used node Create File/Folder Variables to create the file path corresponding to their month, inside the correct subfolder, and (3) wrote the batch of rows in a correctly located CSV file. We added the node Variable Loop End to collect the variables created in the loop. Note: remember to save the files and subfolders in your workflow data area.
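The grouping and folder layout can be sketched in Python; the rows and the `data` output folder below are invented stand-ins for the real sales file:

```python
import csv
from collections import defaultdict
from pathlib import Path

# Hypothetical sales rows: (date, amount).
rows = [("2015-01-05", 100), ("2015-01-20", 150), ("2016-03-02", 90)]
months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Group the rows by (year, month name).
groups = defaultdict(list)
for d, amount in rows:
    year, month = d.split("-")[:2]
    groups[(year, months[int(month) - 1])].append((d, amount))

# One subfolder per year, one CSV file per month, e.g. data/2015/January.csv.
for (year, month), batch in groups.items():
    folder = Path("data") / year
    folder.mkdir(parents=True, exist_ok=True)
    with open(folder / f"{month}.csv", "w", newline="") as f:
        csv.writer(f).writerows(batch)
```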
Description: You are asked to build a framework that allows users to interactively visualize datasets of images and manually exclude those pictures that are not of their interest. To test your implementation, you use a dataset containing images from The Simpsons and assume that only Marge Simpson’s pictures are of interest. How would you implement this framework and filter out every image that is not Marge’s?
Author: Emilio Silvestri
Solution Summary: This challenge required you to read the images in the dataset folder and to display them on a composite view to manually select and then filter out images that were not Marge's.
Solution Details: We started by reading the images using the Folder List node and the Image Reader node. Next, we rendered them to SVG format with the Renderer to Image node and displayed the images with the Tile View node. We then opened the interactive, composite view of the Tile View node and manually selected all images that were not Marge’s by clicking on their corresponding tile views. The final node is a Row Filter that excludes all data rows that were selected in the previous composite view, leading to a clean data table. Note: Remember to accept the changes when closing the composite tile view, otherwise the data selection will not be carried on to the next node. Tip: If you have a KNIME server, you can call this workflow from a web browser through the KNIME Server WebPortal. In the web browser, the composite view will show up as a web page and the data selection will be automatically exported into the underlying workflow.