Just KNIME It!
Prove your KNIME knowledge and practice your workflow building skills by solving our weekly challenges.
Challenge 18: Categorizing Notes
Description: A common problem in mechanical and medical data is associating notes with a category. In this challenge, you will automate the process of sorting mechanical notes into their correct category. Here’s an example:
--List of Categories--
1. The product was defective.
2. A crack was caused by client.
3. Many scratches noted.
1. The product was defective. Defect
2. A crack was caused by client. Crack
3. Many scratches noted. Scratch
Don't worry about using fancy machine learning or natural language processing models. This problem can be handled reasonably well using a total of 5 nodes (simple solution), and a more refined solution involves just 8 nodes (complex solution). Also don't worry about getting 100% accuracy.
Author: Victor Palacios
Solution Summary: After reading the inspection notes and the categories, a simple solution consisted of running a similarity search between categories and inspection notes to find which of the former best corresponded to each of the latter. A more complex solution involved lowercasing both categories and notes to improve matching, and then running a regular expression matcher to find all categories that correspond to each note (instead of just one category). Note: We did not implement any spellchecking in our solution, which would further increase matching quality.
Solution Details: (Simple solution) After reading the inspection notes and categories with two Excel Reader nodes, we used the Similarity Search node to find the category that best matched each note. Next, we used the Joiner node to actually perform the match between most similar categories and notes, and then used the Column Filter node to remove temporary columns. (Complex Solution) We started by reading and lowercasing the inspection notes and categories with two Excel Reader and two String Manipulation nodes. Note that lowercasing both inputs was a cheap way of finding more matches between categories and notes. Next, we used the GroupBy node to create a concatenated regular expression containing all categories, and used the Table Row to Variable node to convert this regular expression into a variable. We then used shared component Regex Find All to find all categories that corresponded to each inspection note, and finally used Split Collection Column node to put each matched category into a separate column.See our solution in the KNIME Hub
10 Challenge Club
Congratulations to the KNinjas who have aced 10 “Just KNIME It” challenges!
The 10 Challenge Club celebrates "Just KNIME It!" participants who have completed at least 10 challenges. How many challenges have you solved?
Description: Housing prices in America are a bit out of control. In this challenge, you will see how housing prices have changed through time. Here are the tasks for this challenge:
Read the dataset for this challenge directly from its link, without downloading any files to your local machine with the CSV Reader node.
Using monthly data, calculate the yearly data and visualize the 3 RegionNames with the most drastic changes. Note: Consider the year the data was collected until the most recent year to calculate change; feel free to ignore missing values.
Find the state (regardless of region) with the lowest prices for homes on average currently.
According to Wikipedia, "The bankruptcy of Lehman Brothers on September 15, 2008 was the climax of the subprime mortgage crisis.". What visualizations can show the effect this had on the housing market? We used our answer to question 2 to keep our visualizations clutter-free.
Author: Victor Palacios
Solution Summary: After reading the housing prices dataset, and reshaping it such that each row had a region, price value, and date, we created three components to answer questions 2 to 4. To answer question 2, we grouped the dataset by region and calculated the mean housing price per year. Next, we looped over the regions getting the housing price differences between the first and the last years in the dataset. We then selected the 3 regions with the most drastic changes and plotted them with a bar chart. To answer question 3, we got the mean housing price per state in the last year and then selected the state that currently has the lowest average housing price. Finally, to answer question 4 we got the average housing price per year for the 3 regions that changed the most (output of question 2), and then plotted the prices as line plots.
Description: You were asked to split a single sales CSV file into smaller ones based on groups, named according to the group names. As an example, if the original file had the following data:
You would generate two files: one named Group_a, and one named Group_b. They would have the following structure:
Your solution to this task should be generic — that is, it should work for any number of groups, and the names of the groups should not matter.
Author: Victor Palacios
Solution Summary: We addressed this challenge by looping over the sales groups, and by writing smaller CSV files with all the sales information associated to each group.
Solution Details: After reading the sales file with the CSV Reader node, we used the Group Loop Start node to start iterating over the groups information. Next, we used the Table Row to Variable node to get each group at a time as a flow variable. We then used the String Manipulation (Variable) node to create filenames using each group as part of the name. Next, we used the String to Path (Variable) node to create relative path locations for the filenames, and finally wrote the smaller CSV files into the path locations. We closed our loop with the Variable Loop End node.
Description: Given a text-based PDF document with a table, can you partially extract the table into a KNIME data table for further analysis? For this challenge we will extract the table from this PDF document and attempt to partially reconstruct it within KNIME. The corresponding KNIME table should contain the following columns: Day, Max, Min, Norm, Depart, Heat, and Cool. Note 1: Your final output should be a table, not a single row with all the relevant data. Note 2: The Tika Parser node is better suited for this task than the PDF Parser node. We completed this task without components, regular expressions, or code-snippet nodes. In fact, our solution has a total of 10 nodes, but labeling the columns required a bit of manual effort.
Author: Victor Palacios
Solution Summary: After reading the content of the PDF with one of KNIME’s OCR functionalities, we processed it in order to get the table data in the right format (cell by cell, and with the right number of columns and rows), and also to extract the table’s columns’ names. We then re-formatted the columns’ names, merged it with the table’s body, and removed those columns that were not specified in the challenge.
Solution Details: We started our solution by using the Tika Parser node to read the given PDF file. Next, we used the Cell Splitter node to break the content into newlines, which already helps in the separation of the table’s rows. Since the output of the Cell Splitter node is a table with a single, very long row (with each cell corresponding to a line), we used the Transpose node to turn each column into a row, facilitating the postprocessing. Next, we used another instance of the Cell Splitter node to further split the data by space, further separating the cells that are present in the PDF table. This processing led to a table with messy Row IDs, so we used the RowID node to reset them. Next, we used the Row Filter node to isolate those rows that corresponded to the table in the PDF, and then used the Extract Table Spec node to get the table’s columns’ names. Finally, we used the Insert Column Header node to combine the output of the Row Filter node with the fixed columns’ names (copied and pasted within an instance of the Table Creator node), and then used the Column Filter node to remove those columns that were not specified in the challenge.
Description: You are working with audio recognition data tables such that each table’s initial and final rows contain zeroes. These beginning and ending zeroes are irrelevant noise that must be removed. Here is an example of such a data table:
Your goal is to only remove these unnecessary starting and trailing zeroes. You must keep the zeroes in the middle of the data, which do not constitute noise. As well, the position of the starting and trailing zeroes differs per data table, so a good solution probably requires flow variables for flexibility (although there may be other solutions which do not involve flow variables).
Author: Victor Palacios
Solution Summary: Our solution to this challenge was split into two parallel steps: first, we identified the RowID number of the first non-zero entry in the data; in a parallel branch of the workflow, we reversed the RowID numbers to find the one that corresponded to the last non-zero entry in the data. These two special RowID numbers, which were captured by two flow variables, were used to remove the unnecessary starting and trailing zeroes in the data.
Solution Details: After reading the example data for the challenge with the CSV Reader node, we used the String Manipulation node to create a temporary column named row_number, which contained the RowID numbers, and the String to Number node to convert the values in row_number into integers. This column came in handy later on in the solution, making the detection of rows with starting and trailing zeroes a bit more convenient. After this preprocessing, we used the Row Filter node to remove all rows whose data entries corresponded to zero, just to facilitate the detection of the first RowID number that was linked to a non-zero data entry. To perform this detection we used the Table Row to Variable node, which stored this RowID number into a flow variable. This flow variable and the output of the String to Number node were then passed to another instance of the Row Filter node, which combined the information to remove all data entries with starting zeroes. In parallel, the data with no zero entries (output of the first instance of the Row Filter node) was sorted to reverse the RowID numbers. Next, a second combo of Table Row to Variable and Row Filter nodes was used to identify the last RowID number tied to a non-zero data entry, and to remove all trailing zeroes from the data. Finally, we used the Column Filter node to remove the temporary row_number column.
Description: Your company keeps data related to online and onsite transactions in a tabular dataset with the following format:
Index Online Onsite
7 F5454 7736-01
In this challenge, you are asked to extract digits from the transactions (which are related to the bought products) given the following guidelines: (1) if the onsite transaction starts with “L”, then take its first 12 digits; otherwise, take its first 6 digits; and (2) if the onsite transaction has a missing value, then take the string from the online transaction.
What is the most efficient way to perform this task? For the example above, you should produce the following output column:
Author: Victor Palacios
Dataset: For this challenge, create the dataset illustrated above as input with the Table Creator node.
Solution Summary: To tackle this challenge we started by handling missing values, which help determine if column Online or Onsite should be used. Next, we implemented rules to extract the right number of digits from each onsite transaction. We generated a few temporary columns in the process, which we then removed at the end of our solution.
Description: You have been working for a Life Sciences company for a month as a a data wrangler. Several coworkers from the Biology department would like to obtain a list of human genes related to specific hormones, but they do not know how to use REST services, GET requests, etc. Your task is to use the REST service provided by MyGene.info to obtain a list of human genes related to a list of hormones provided to you by your coworkers. Next, you should parse the JSON response into a table that is easy to read.
For example, if you use "http://mygene.info/v3/query?q=summary:" and append "insulin", then your request would return a JSON structure with 10 hits -- each one of them with the following fields: "_id", "_score", "entrezgene", "name", "symbol", and "taxid".
You should then parse this JSON into a table with columns "_id", "_score", "entrezgene", "name", "symbol", and "taxid". If the list provided by your coworkers contains more than one hormone, all the parsed information should be aggregated into a single table. Also, sometimes your request may return a response in XML instead of JSON. How could you include a way to also parse XML responses?
Need a tip or two? See our youtube video on REST API.
Author: Victor Palacios
Solution Summary: To tackle this challenge, we started by reading a list of hormones and then, for each hormone, we formatted and executed a GET request. The user should then inspect the result, determine whether the responses were in XML or JSON, and then execute a component we created to control the execution of the rest of the workflow. We then implemented solutions to parse responses of both types (XML and JSON).
Solution Details: We started our solution by using the CSV Reader node to read the list of hormones. Next, we used the String Manipulation node to create a GET request URL per hormone, and then the GET Request node to execute them. We then created a component named Pick Body Type (JSON or XML) so that the user could control what parsing solution should be used over the GET response. If the user types 0, the JSON parsing is invoked; if 1, the XML parsing. For the JSON parsing, we used the JSON Path node followed by the Ungroup node to get row-wise records. For the XML parsing, we used the JSON to XML node to simulate an XML response (none of the hormones in the dataset returned XML) followed by the XPath node to directly obtain row-wise records.
Description: Your support team would like to get some insight into the quality of their service and they ask you to analyze their support data. As a first step, they would like to find out how the time to answer a support ticket is distributed, and what the mean of this distribution is. Help your support team by using the dataset included in this challenge to (1) plot this time distribution as a histogram, and to (2) calculate its mean.
Author: Kathrin Melcher
Solution Summary: Our solution to this challenge was pretty straightforward. After reading the ticket data, we calculated the difference between the creation and the closing timestamps for each ticket (in minutes). Next, we computed the mean time and the time distribution to close a ticket focusing on those tickets that were already closed.
Solution Details: We started by reading the data with the Table Reader node. Next, in order to compute time differences, we used the String to Date&Time node to convert creation and closing timestamps into Date&Time cells. We then used the Date&Time Difference node to calculate the time that it took to close each ticket — naturally, tickets that were still open have this value undefined. Since our focus was on tickets that have this value defined (closed tickets), we used the Row Splitter node to separate them. Next, we sent the closed tickets to the Math Formula and the Histogram nodes to calculate the mean time and the time distribution to close tickets respectively.
Description: Your coworker turns to you and says that she is going to retire. You laugh because she is 30 years old. She is serious. To understand how she got to this decision, you will create a KNIME component named “Financial Tracker_YOURNAME” (replace YOURNAME with your name). The component should use widgets to get the following input:
a person's monthly expenditure amount
their target age to retire
The output of the component should be how much money they need to have in order to retire at the target age. For simplicity, use this formula in your component:
amount_to_retire = (100 - target_age) * monthly_expenditure_amount * 12
Use your component to figure out if 2,000,000 dollars is enough for your coworker to retire, given that she spends 4,000 dollars per month. To keep this challenge simple, do not consider inflation, compounding interest, or part-time work in retirement.
Are you interested in how we came up with this formula? Check the Trinity study out. In this study, participants needed roughly 25 times whatever they spent yearly to survive for 30 years with a 95% success rate.
Author: Victor Palacios
Solution Summary: To solve this challenge, we created a component in which we first use widgets to capture monthly expense and target age for retirement. Next, we calculated the amount of money required to retire at the target age, given the monthly expense, following the formula given in the challenge. We used another widget to show the calculated value, and also added a “Recalculate” option for users to explore different retirement setups.
Solution Details: We started our solution (component Financial Tracker_KNIME) by using two Integer Widget nodes that capture monthly expense and target age for retirement. Through an interactive interface, users can input values for these two variables. These values are then sent to the Math Formula (Variable) node to calculate the amount of money required for retirement. The output of this node is then sent to a Text Output Widget node, which shows the calculated amount of money in an interactive view. To allow for recalculations, we also used the Refresh Button Widget node in this component.
Description: You would like to post a question on the KNIME forum, but you have confidential data that you cannot share. In this challenge you will create a workflow which removes (or transforms) any columns that reveal anything confidential in your data (such as location, name, gender, etc.). After that, you should shuffle the remaining columns' rows such that each numeric column maintains its original statistical distribution but does not have a relationship with any other column. Rename these columns as well, such that in the end of your workflow they do not have any specific meaning. Let's see an example:
Row Name Fav_Num Muscle_Mass
0 Victor 7 10
1 Aline 3 20
2 Scott 42 30
Row column column (#1)
0 3 30
1 42 10
2 7 20
Feel free to see our resources on data anonymization for inspiration, but note that the task here is much simpler! For reference, our solution only uses 7 nodes to anonymize and 3 additional nodes to do make sure the data truly was anonymized.
Author: Victor Palacios
Solution Summary: After reading the FIFA dataset, we removed all columns of type string because they may contain personally identifying information. Next, we looped over the datasets’ columns shuffling them one by one in a pseudo-randomized fashion. We then gathered the shuffled columns into a single table and anonymized its column identifiers by replacing them with meaningless names. We also compared the data before and after shuffling/anonymizing to make sure that the anonymization worked.
Solution Details: We started our solution by reading the FIFA dataset with the CSV Reader node, followed by the removal of all string columns with the Column Filter node. The goal here was to remove any information that could help users identify the players. Next, we used the Column List Loop Start node to begin looping over the datasets’ columns. Within the loop, we used the Table Row to Variable node to get the name of the column that was currently being processed, and then we sent it to the Target Shuffling node. At this point of the loop, the column’s values were randomly shuffled. We finished the loop with the Loop End (Column Append) node, which aggregates all the shuffled columns into a single table. To finish the anonymization, we used the Column Rename (Regex) node to replace the columns’ identifiers with meaningless names. To make sure that the anonymization worked — that is, that the order of values in columns was completely and randomly changed in the process —, we sent the data before and after anonymization to two Row Filter nodes. Here, we extracted their first columns and compared their values with the Table Difference Finder node. The output of this node shows how the original value order was destroyed in the data processing.
See our solution in the KNIME Hub
Description: The challenge requires you to recreate the Wordle bar chart for three players. If you have never played Wordle, check this brief introduction to the game. In this challenge, you are expected to do the following:
For the bar chart, each possible number of guesses (1 through 6) must be represented, as well as the number of victories by each player. Check this example with just a single player's data.
Transform the number of guesses [1, 6] into a score [1,10], but remember that a higher number of guesses should result in a lower score. A missing guess means that the player did not manage to guess the word and therefore should receive a 0 score.
Next, calculate the average of the three players. Who has the best average?
Now consider the difficulty of the words. Assign a weight W to the words in the list of the 306 most difficult words in the English language (file “difficult words.txt”), and recalculate the average of the three players for W=2 and for W=0. Has the player with the best score changed?
Author: Rosaria Silipo
Solution Summary: After reading the players’ data and the list of difficult words, we performed a left outer join between these two datasets using the words as keys. We used this joined data to plot the performance bar chart for the three users, unpivoting and pivoting the data on different columns first. We also used the joined data to compute the simple and the weighted average performance per player. In all cases, player 2 was the winner.
Solution Details: To tackle this challenge, we started by reading the players’ data with the Excel Reader node and the word difficulty data with the CSV Reader node. In both cases, we used the Constant Value Column node to set a standard weight (weight = 1) for words in the players’ data, and to set a specific weight for the difficult words (which can be weight = 0 or weight = 2). After lowercasing the words in the players’ data with the String Manipulation node, we left-outer-joined both datasets (Joiner node) using the words as keys. We started the plotting part of our solution by unpivoting the joined dataset on columns player 1, player 2, and player 3 (Unpivoting node). This gave us the numbers of word attempts per player, and we removed the rows with attempts missing with the Missing Value node. We then re-pivoted the data per player and per number of attempts, ending up with how many words each player guessed right per number of attempts (Pivoting Node). After some post-processing, which included replacing missing values in the pivoted table with 0, and setting different colors for each player, we plotted their performances in a bar chart (component Bar Chart in green). In parallel, we also used the joined data to compute the simple and the weighted average performance per player. First, we rescaled their numbers of word attempts to interval [0, 10], such that 0 and 10 were the worst and best scores respectively (component Rescale). Next, we executed the GroupBy node over the rescaled data to calculate the average player performance. Inside component Weighted Average, we processed the rescaled data a bit to use the correct weights depending on the difficulty of the words, and also used the GroupBy node to obtain the weighted average performance. In all cases, player 2 was the winner. Note: Execute the workflow with weight = 0 and weight = 2 for difficult words to obtain both required weighted averages. We did not add a configuration widget to parametrize weights to keep the solution simple.
In this challenge you will create one or more visualizations to inspect the percentual share of seats held by women in local government, as reported by the United Nations Development Programme: Human Development Reports. The purpose of this report is to “ensure women's full and effective participation and equal opportunities for leadership at all levels of decision-making in political, economic and public life.” Since this is a real dataset there will be missing values, and we expect you to augment the data via your own creativity. In particular, we ask that you not only report country specific data but also continent data. The true challenge here will be how you transform this incomplete dataset into one or more visualizations, and which visualizations you deem appropriate for this challenge. Challenge outline is as follows:
Download the dataset and decide the most efficient way to bring it into KNIME
Add a continent column using whatever method you think is most efficient
Visualize the data (the KNIME solution will visualize both country and continent related data)
Author: Victor Palacios
Solution Summary: To tackle this challenge, we gathered data on continents and their countries and joined it with the dataset of the challenge, ending up with a dataset that had (1) a country, (2) its continent, and (3) the share of seats held by women per row. Next, we aggregated this dataset to visualize the participation of women in local governments per continent, with a bar chart. Finally, we used the original dataset of the challenge to visualize the participation of women in local governments per country with a choropleth map.
Solution Details: We started our solution by copying and pasting the dataset of the challenge into KNIME with the Table Creator node. The dataset was then immediately fed into the Enhanced Choropleth Map component, exposing how much women participate in local governments country-by-country. Next, we used the CSV Reader node to read a dataset that listed every country in each continent, and joined this data with the initial one with the Joiner node (the join keys were the countries). After that, we used the GroupBy node to aggregate the resulting dataset per continent, extracting the corresponding median share of seats held by women in government. We decided to use the median instead of the mean because the former is more robust against outliers, leading to a more faithful representation of how women participate in governments in each continent. We then post-processed the aggregated data by sorting it in descending order with the Sorter node, and by removing groups with missing values with the Nominal Value Row Filter node. These steps were necessary to generate an adequate bar chart visualization with the Bar Chart node, indicating the share of women participation in government per continent.
Description: A research group is running a study to understand how news consumption has changed lately. They run a survey on 3,000 people, asking where they usually read their news. Participants can indicate up to three options among “Social Media”, “Online Newspapers”, “Printed Newspapers”, and “Radio or Television”, recorded with a numeric code in a CSV file. The respective descriptive names are provided in a small table. Your task is (1) to replace the codes in the survey data with their corresponding descriptive names, and (2) to create an interactive dashboard that shows how many times each option has been selected. The dashboard should allow users to filter for the options they are interested in. Note: Feel free to use this challenge as an opportunity to explore different visualization nodes!
Author: Emilio Silvestri
Solution Summary: We started by reading the survey data and the code dictionary. Next, we processed and grouped the survey data to get how many times each news source was mentioned. We then replaced the codes in this grouped data with their corresponding descriptive names, sorted it based on news source popularity, and moved on to the visualizations. To this end, we used a component to create a dashboard that allows users to select which news sources they want to visualize, and then plotted the sources' frequencies as a donut chart and as a bar chart. Note: Both charts basically convey the same type of information and are redundant. We used them here for didactic purposes, making the dashboard a bit more complex from an implementation viewpoint but still simple as a baseline solution.
Solution Details: We started by reading the survey data with the CSV Reader node, and the code dictionary with the Table Reader node. In order to get how many times each news source is cited in the survey, we started by transforming string column Source into a List with the Cell Splitter node. Next, we used the Ungroup node to create one row per entry in the list, and then grouped the rows by source type with the GroupBy node. We then used the Cell Replacer node to replace the codes in the grouped data with their corresponding descriptive names, and finally sorted this data based on the news source popularity (ascending order). After these steps, we created a dashboard to filter and visualize the news sources' frequencies. To filter the news source options in the dashboard, we started by creating a list of all the different sources with the GroupBy node. Next, we transformed this list into a flow variable with the Table Row to Variable node and fed it into the Multiple Selection Widget node, which used the list as options for visualization. The selection list output by this node was then joined with the news sources' frequencies (input of the dashboard) by using the Joiner node. We then plotted the frequencies for the selected news sources using the Plot/Donut chart, after making it colorblind-friendly with the Color Manager node. Finally, we also plotted these frequencies with the Bar Chart node for didactic purposes.
Description: In the accounting firm you work for, you are given contracts which were executed on different dates. Your goal is to create a method that can label each of these contracts with their corresponding fiscal year. Note: A fiscal year is a period of 12 months that is used in government accounting, usually for budget purposes and financial reporting. Its definition varies from country to country, so your solution should be flexible and include flow variables for both the start and the end dates of the fiscal year. As an example, the federal fiscal year in the United States is the 12-month period that begins on October 1st and ends on September 30th of the following year.
Author: Victor Palacios
Solution Summary: We started by reading the contract dates and the lookup tables with the defined fiscal years. Next, for each fiscal year, we selected those contracts whose dates fell within it and labeled them accordingly. The process of identifying which contracts fall within a given fiscal year was made flexible with the use of flow variables and the Date&Time-based Row Filter node. After all contract dates were labeled with their fiscal years, we sorted them by date to facilitate the understanding of the data.
Solution Details: We started by reading the contract dates and the lookup tables with the defined fiscal years with two Excel Reader nodes. Next, we converted the contract dates to Date&Time with the String to Date&Time node to facilitate the identification of contracts tied to a given fiscal year downstream. We then started looping over the fiscal years, generating flow variables for their start and end dates, with the Table Row to Variable Loop Start node. Inside the loop, these flow variables were used to select the contracts whose dates fell within their range. To this end, we used the Date&Time-based Row Filter node. Next, the Constant Value Column node was used inside the loop to label all selected contract dates with the current fiscal year label. We closed the loop with the Loop End node, which collected and concatenated the selected contracts at every iteration. We then finished the workflow by sorting the labeled data by contract date, facilitating interpretability. Note: In this solution, contracts that do not fall within the range of any fiscal year get filtered out.
Description: You are using KNIME to monitor the daily price of a product online. After using the Line Plot node to visualize the daily prices you have already gathered, you notice that they are often constant for a certain number of days before changing again. You want to create a new column in the price data you have at hand, named "Change", such that its value is 1 if a daily price changed with respect to the previous day, or 0 if it remained unchanged. For the first daily price in the data, the "Change" value should be 1. As an example, if the initial daily prices look like:
You should end up with data in the following format:
Date Price Change
2015-01-01 10 1
2015-01-02 10 0
2015-01-03 11 1
Author: Emilio Silvestri
Dataset: Daily Prices in the KNIME Hub
Solution Summary: To create column "Change" according to the challenge's description, we first created a temporary column, named "Price(-1)", holding the price value for the previous day (that is, a column just like "Price", but with a 1-day lag). Next, we compared the lagged and current prices to determine whether there was a change in value or not, leading to the creation of column "Change". Finally, we remove temporary column "Price(-1)" from the final dataset.
Solution Details: We started by reading the dataset with the CSV Reader node, and then plotted the values against the dates with the Line Plot node just to have an idea of how it looks. Next, we use the Lag Column node, which is the core of the solution, to create column "Price(-1)" based on column "Price": the former holds the values of the latter, but lagged by one day. With this new column at hand, we used the Rule Engine node to generate column "Change" by comparing the lagged values in "Price(-1)" with the current values in "Price". If the values were equal, the value in "Change" would be 0; otherwise, it would be 1. We then end the workflow by removing temporary column "Price(-1)" with the Column Filter node.
Description: You received the 2017 cancer data from the CDC for inspection, and your goal is to answer the following questions: (1) What are the top-5 most frequent cancer types occurring in females? (2) What are the top-5 most frequent cancer types occurring in males? (3) Which US state has the highest cancer incidence rate (that is, the highest number of cancer cases normalized by the size of its population)?
Author: Janina Mothes
Solution Summary: To find the top-5 most frequent cancer types occurring in (fe)males in the US in 2017, we preprocessed the data to remove aggregated cancer sites that could lead to wrong counts, grouped the data by cancer type, and pivoted by sex. Next, we sorted the data to find the top-5 most frequent cancer types. As for the highest normalized incidence of cancer, we grouped the CDC data by state, read the states population data, and then joined these two datasets on state. We then normalized the number of cancer cases per state by the corresponding population, and sorted the resulting data to find the state with the largest incidence.
Solution Details: We started by reading the CDC data with the CSV Reader node, and noticed that a few cancer sites were aggregated (e.g., "All Invasive Cancer Sites Combined"). Since this could lead to a less refined understanding of what cancer types were most prominent, we filtered them out with a regular expression in the Row Filter node. For simplicity, we did not explore cancer site codes in our solution. Next, we used the Pivoting node to group the data by cancer sites and to pivot it by sex. This generated a table with a breakdown of cancer types and frequencies per sex. To answer the first question, we fed this table to the Sorter node and sorted it in descending order with respect to column Female+Sum(Count), which was previously created by the Pivoting node. We then used the Column Filter node, followed by the Row Filter node, to get the top-5 rows of the resulting table. This corresponds to the most common cancer sites (potentially including some aggregated sites) in females. The solution to the second question is very similar: after using the Pivoting node, we sorted the resulting table in descending order with respect to column Male+Sum(Count) with the Sorter node, and used the Column Filter and Row Filter nodes to get the top-5 most common cancers in males. To answer the third question, we used an Excel Reader node to read the 2017 US states population data, cleaned the names of the states with the String Manipulation node, aggregated the CDC data by state with the GroupBy node, used the Joiner node to combine the grouped CDC data with the US states population data, normalized the number of cancer cases per state by their population with the Math Formula node, and used the Sorter node to sort the resulting table in descending order with respect to the normalized cancer incidence. Finally, we used the Row Filter node to extract the first row of this sorted table, which corresponds to the US state with the highest normalized incidence of cancer. Note: Depending on how you clean and aggregate the data, you may obtain different results.
Description: A pharmaceutical company used to keep its sales data in a CSV file. They ask for your help to split the data into monthly CSV files and to save them into different subfolders, where each subfolder corresponds to a year. As an example, the data for January 2015 should be stored in a file named 2015/January.csv. In your solution, remember to save the files and subfolders in the workflow data folder.
Author: Emilio Silvestri
Datasets: Sales Data in the KNIME Hub
Solution Summary: This challenge required you to read the CSV file, extract the year and month of the entries, and then implement a loop that, by iterating through years and months, creates yearly subfolders and monthly sales files inside them.
Solution Details: We started by reading the CSV file using the CSV Reader node. Next, we used nodes String to Date&Time to convert the dates in the original file to the correct type, and Extract Date&Time Fields to extract the dates’ corresponding years and months. We then used the Group Loop Start node to iterate over groups of rows that had the same year and month. For each batch of rows, we (1) used nodes Number to String, Table Row to Variable, and String to Path (Variable) to convert their year into the string that is the name of their subfolder, (2) used node Create File/Folder Variables to create the file path corresponding to their month, inside the correct subfolder, and (3) wrote the batch of rows in a correctly located CSV file. We added the node Variable Loop End to collect the variables created in the loop. Note: remember to save the files and subfolders in your workflow data area.
Description: You are asked to build a framework that allows users to interactively visualize datasets of images and manually exclude those pictures that are not of their interest. To test your implementation, you use a dataset containing images from The Simpsons and assume that only Marge Simpson’s pictures are of interest. How would you implement this framework and filter out every image that is not Marge’s?
Author: Emilio Silvestri
Solution Summary: This challenge required you to read the images in the dataset folder and to display them on a composite view to manually select and then filter out images that were not Marge's.
Solution Details: We started by reading the images using the Folder List node and the Image Reader node. Next, we rendered them to SVG format with the Renderer to Image node and displayed the images with the Tile View node. We then opened the interactive, composite view of the Tile View node and manually selected all images that were not Marge’s by clicking on their corresponding tile views. The final node is a Row Filter that excludes all data rows that were selected in the previous composite view, leading to a clean data table. Note: Remember to accept the changes when closing the composite tile view, otherwise the data selection will not be carried on to the next node. Tip: If you have a KNIME server, you can call this workflow from a web browser through the KNIME Server WebPortal. In the web browser, the composite view will show up as a web page and the data selection will be automatically exported into the underlying workflow.