Challenge 30: Music Data App
Level: Medium or Hard
Description: The second season of Just KNIME It! is coming to an end, and just like last season we will celebrate it with an open-ended problem. We invite you to use your creativity to build a data app relating music tracks to their artists through interactive and insightful visualizations. Take a step further and use one of the datasets we provide to make your interface available in multiple languages. Note: As this is an open-ended challenge, we will not be supplying our own solution as we usually do. Feel free to explore the datasets as you see fit!
Author: Aline Bessa
Dataset: Music Data in the KNIME Hub
Remember to upload your solution with tag JKISeason2-30 to your public space on the KNIME Community Hub. To increase the visibility of your solution, also post it to this challenge thread on the KNIME Forum.
Thank you for participating this season, and we cannot wait to see the ideas you come up with for this final challenge!
Just KNIME It! Leaderboard
KNIME community members are working hard to solve the latest "Just KNIME It!" challenge - and some of you have solved dozens of them already! Who are the KNIME KNinjas who have completed the most challenges? Click over to the leaderboard on the KNIME Forum to find out! How many challenges have you solved?
Description: You work for a Berlin startup and it's Friday night. A table tennis tournament with your colleagues is due! The following problem comes up: how do you randomly split all your co-workers into teams of two players, and then randomly define the direct elimination matches as a dendrogram? Hint: You can use the Hierarchical Cluster Assigner node in this challenge.
Author: Paolo Tamagnini
Solution Summary: After reading the participants' data, we create a data app that allows users to see all teams in a table, along with a dendrogram showing an elimination match scheme that could occur depending on which teams play against one another.
Description: In this challenge you will play the role of a data analyst trying to create a Stacked Funnel Plot from Plotly. While KNIME offers many Plotly plots through its KNIME Plotly extension, a funnel plot is not among them -- use the Python View node to create this type of visualization and show it off in KNIME! Hint: If you want, copy the code from the Plotly link above (e.g., the code for the "Stacked Funnel Plot with go.Funnel" plot) into the Python View node. Be sure to add this line after copy-pasting from the Plotly site: import knime.scripting.io as knio. Be sure to remove the line fig.show() as well. Finally, add this line to the very end of your code: knio.output_view = knio.view(fig).
If you would like to take this challenge from Medium to Hard, try creating your own data with the Table Creator node and passing it to at least one of the axis variables in the funnel plot.
Author: Victor Palacios
Solution Summary: We propose two solutions to this challenge. The first one, simpler, uses the instructions given as hints in the challenge. The second one, more complex, modifies the script such that the visualization can be customized to use different input data.
Solution Details: For the simpler solution, we copied the Plotly script mentioned in the challenge description into an instance of the Python View node. We added the line import knime.scripting.io as knio to the beginning of the script, removed the line fig.show(), and added the line knio.output_view = knio.view(fig) to the very end of the code. For the more complex solution, we created some mock data to visualize with the Table Creator node. Next, we passed this data to the Python View node, which was further modified with the line data = knio.input_tables[0].to_pandas(). This line makes the code inside the node customizable: the values for the funnel coordinates are determined by the input data.
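Putting the hinted edits together, the adapted script inside the Python View node would look roughly like this (the funnel stages and values below are illustrative placeholders, not the exact Plotly example):

```python
import knime.scripting.io as knio   # added for the KNIME Python View node
import plotly.graph_objects as go

# Illustrative stages/values; replace with the copied Plotly example code.
fig = go.Figure(go.Funnel(
    y=["Website visit", "Downloads", "Potential customers", "Requested price"],
    x=[39, 27.4, 20.6, 11],
))

# fig.show() is removed: KNIME renders the view instead.
knio.output_view = knio.view(fig)
```

This script only runs inside KNIME's Python View node, since it depends on the knime.scripting.io module. For the harder variant, data = knio.input_tables[0].to_pandas() pulls the Table Creator output into the script.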
Description: We explore World Cup datasets in this challenge, with the goal of seeing how many games under FIFA were played between different confederations between 2021 and 2022. We have two files, results.csv and shootouts.csv, from which we need the year when the games were played. We also have a third file, teams.xlsx, detailing teams and the confederation to which they belong. The output table should look something like this:
 | AFC | UEFA | CONCACAF
AFC | 34 | 67 | 89
UEFA | 67 | 65 | 12
CONCACAF | 89 | 12 | 36
This challenge is based on this blog post.
Author: Sanket Joshi
Dataset: World Cup Data in the KNIME Hub
Solution Summary: We address this challenge by combining information from the results.csv and teams.xlsx tables. After having a table with all matches between 2021 and 2022, indicating teams and their confederations, we pivot the information to get all matches per confederation pair in that period.
Description: You work for a company that sells two products: Product A and Product B. As a data analyst, you want to calculate how sales increased or decreased for both products over a period of 12 months, on a monthly basis. In the end, you want to have a table with columns indicating the increase/decrease percentages of sales per month, as in the example below.
Month | Percentage_Sales_Product A | Percentage_Sales_Product B
February | 45% | 100%
March | 78% | -87%
Author: Sanket Joshi
Dataset: Sales Data in the KNIME Hub
Solution Summary: To tackle this challenge, we loop over each sales column, computing how sales increased or decreased with respect to the previous month by lagging column values. In the end, we condense the columns with the calculated percentages into a single table.
Solution Details: After reading the sales data with the CSV Reader node, we start a loop iterating over the sales columns (Column List Loop Start node). Inside the loop, we make the name of the sales column generic using the Column Renamer node, and then lag the sales column by one month (Lag Column node). Next, we use the Math Formula node to calculate the percentage of increase or decrease in sales and store this information in a new column, created with the String Manipulation (Variable) and Column Renamer nodes. We then filter out the temporary columns created inside the loop and end up with a table in the format asked for in the challenge.
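The month-over-month calculation itself can be sketched in a few lines of plain Python (the sales figures below are made up to reproduce the example table above):

```python
# Made-up monthly sales chosen to reproduce the example table.
months = ["January", "February", "March"]
sales = {"Product A": [100, 145, 258], "Product B": [50, 100, 13]}

pct_change = {}
for product, values in sales.items():
    # Lag by one month: pair each value with its predecessor,
    # then compute (current - previous) / previous * 100.
    pct_change[product] = [
        round((curr - prev) / prev * 100)
        for prev, curr in zip(values, values[1:])
    ]

# Percentages start in February, the first month with a predecessor.
for i, month in enumerate(months[1:]):
    print(month, {p: v[i] for p, v in pct_change.items()})
```

The Lag Column node plays the role of the zip(values, values[1:]) pairing, and the Math Formula node the percentage expression.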
Description: You work as a data scientist for a healthcare company attempting to create a predictor for the presence of heart disease in patients. Currently, you are experimenting with 11 different features (potential heart disease indicators) and the XGBoost classification model, and you noticed that its performance can change quite a bit depending on how it is tuned. In this challenge, you will implement hyperparameter tuning to find the best values for XGBoost's Number of Boosting Rounds, Max Tree Depth, and learning rate hyperparameters. Use the F-measure metric as the objective function for tuning.
Author: Keerthan Shetty
Dataset: Heart Disease Data in the KNIME Hub
Solution Summary: To solve this challenge, we create a workflow segment with Integrated Deployment nodes that is responsible for training and assessing the quality of an XGBoost classifier. This workflow segment is given as input to a component that performs hyperparameter optimization, along with information on what should be optimized. The best F-measure values (0.87 on average) are obtained with Number of Boosting Rounds = 50, Max Tree Depth = 6, and learning rate = 0.1.
Description: In this challenge you will take the role of a cybersecurity analyst and see if you can identify emails that try to pass as legitimate when they are in fact malicious. You notice that bad-actor emails try to trick the receiver by mimicking major email domains. For instance, you notice that @gnail, @gmial, etc. are trying to pass as @gmail. You then decide to get a count of all the domains: those that have the lowest counts have a higher probability of being fraudulent. You must also check whether those low-count email domains are trying to pose as the major email domains or not. Your answer should not mark @unique.com as fraudulent. Note: Try not to hard-code any variables in your workflow; instead, use the mean or median, for instance. Hint: Checking for string similarity might help.
Author: Victor Palacios
Dataset: Domains Data in the KNIME Hub
Solution Summary: To tackle this challenge, we first process the domains data to separate frequent domains from rare ones (the latter are considered more suspicious). Next, we compare low-count email domains to more frequent ones (similarity search) to determine which domains are potentially trying to pose as popular ones while being fraudulent.
Solution Details: After reading the domains data with the CSV Reader node, we use the Cell Splitter node to separate the actual domains from their usernames. Next, we use the GroupBy node to count the occurrences of each domain, and the Math Formula node to calculate the average domain occurrence count. Domains that occur more frequently than the average are likely to be legitimate; those that do not are more likely to be fraudulent. We separate the domains according to this logic using the Numeric Row Splitter node. Next, we use the Similarity Search node to compare low-count email domains to more frequent ones, determining which rare domains are likely trying to pose as popular ones. We then use the Joiner node to combine information on the most similar domains, and calculate the median distance for all pairs using the Math Formula node. Finally, rare domains whose distances to frequent domains are less than or equal to the calculated median are marked as potentially fraudulent with the Numeric Row Splitter node.
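As a rough sketch of the same pipeline outside KNIME, here is a pure-Python equivalent that uses difflib for the similarity step (the email addresses are made up for illustration; the real data comes from the CSV file):

```python
from collections import Counter
from difflib import SequenceMatcher
from statistics import mean, median

# Made-up addresses; note @unique.com is rare-looking but frequent enough.
emails = ["a@gmail.com", "b@gmail.com", "c@gmail.com", "d@gmail.com",
          "e@gnail.com", "f@gmial.com",
          "g@unique.com", "h@unique.com", "i@unique.com", "j@unique.com"]

# Cell Splitter + GroupBy: split off the domain and count occurrences.
counts = Counter(e.split("@")[1] for e in emails)
avg = mean(counts.values())                      # Math Formula: average count

frequent = [d for d, n in counts.items() if n > avg]
rare = [d for d, n in counts.items() if n <= avg]

def distance(a, b):
    # String distance in [0, 1]; 0 means identical.
    return 1 - SequenceMatcher(None, a, b).ratio()

# Similarity Search: distance from each rare domain to its closest frequent one.
closest = {d: min(distance(d, f) for f in frequent) for d in rare}
med = median(closest.values())

# Rare domains at or below the median distance likely mimic a popular domain.
suspicious = sorted(d for d, dist in closest.items() if dist <= med)
print(suspicious)
```

No thresholds are hard-coded: the mean separates frequent from rare domains, and the median separates mimics from genuinely unusual domains, as the challenge note suggests.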
Description: You and your team are working on a lexicon-based sentiment analysis application, and have created a component that encapsulates the calculation of sentiment scores. This component will be re-used in many workflows, so it is important to make sure that it is behaving correctly. A principled way of accomplishing this is through automated unit testing. Given a sample of the data this component receives as input (input golden data), and a small table with the corresponding expected outputs (output golden data), your goal is to create a workflow that automatically tests this component (testflow). Note: Read more about how to perform testing in KNIME here. Note 2: Read more about lexicon-based sentiment analysis here.
Author: Aline Bessa
Solution Summary: To solve this challenge, we check whether the output golden data of the Calculate Scores component matches its actual output when it is fed with the input golden data. We also capture metadata on the workflow and on the test execution to get an idea of what failed (if anything) and in which execution.
Solution Details: We start our solution by capturing metadata on the workflow and current execution with the Workflow Metadata node. Next, we use a Try/Catch block within which we implement the unit test -- the idea is that, even if the test fails, the workflow doesn't "break" and we can still capture the execution results and metadata. For the Try/Catch block, we use the Try (Variable Ports) and Catch Errors (Var Ports) nodes. For the actual test, we use two Table Reader nodes to read the input and output golden data, feed Calculate Scores with the input golden data, and then use the Table Difference Checker node to compare its result with the output golden data. The test passes, but if this were not the case we have some alternative data for the Catch Errors (Var Ports) node, created with the Variable Creator node. This guarantees that there is always information on the unit test coming out of the Try/Catch block. Finally, we use the Variable to Table Row node to turn the test status and workflow metadata into table format. This facilitates the exporting of the information in case we want to build a report, for example.
Description: Last week you helped caddie Tom by identifying golf balls in the provided images. Tom is impressed by your work and asks you to improve your workflow so that it also counts how many golf balls were identified. Essentially, for each image that Tom gives you, he wants to receive the number of golf balls that appear in it.
Author: Daria Liakh
Dataset: Golf Ball Data on the KNIME Hub
Solution Summary: After reading the single golf field image in the given dataset, we start our solution like last week's: we perform a Gaussian Convolution over the image, distinguish between foreground and background to isolate the golf balls, fill holes in the golf balls to better segment them, and then perform a connected component analysis to label the golf balls. Next, in order to determine the number of golf balls in the image, we label segments that are touching the border (we use RGB channels to perform this border test), and then finally count the number of labels that were generated. This number corresponds to how many golf balls there are in the image.
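The final labeling-and-counting step boils down to a connected component count. A minimal pure-Python sketch on a toy binary "image" (1 = foreground pixel) illustrates the idea, using 4-connectivity:

```python
# Count connected foreground components via an iterative flood fill.
def count_components(grid):
    rows, cols = len(grid), len(grid[0])
    seen = set()
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and (r, c) not in seen:
                count += 1                      # found an unvisited "ball"
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    if ((y, x) in seen or not (0 <= y < rows)
                            or not (0 <= x < cols) or not grid[y][x]):
                        continue
                    seen.add((y, x))
                    stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return count

image = [[0, 1, 1, 0, 0],
         [0, 1, 1, 0, 1],
         [0, 0, 0, 0, 1],
         [1, 0, 0, 0, 0]]
print(count_components(image))  # 3
```

Each connected blob of foreground pixels corresponds to one label in KNIME's connected component analysis, so the label count is the ball count.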
Description: Caddie Tom, who assists one of the most popular golf players, has asked for your help. He is tired of looking for golf balls in the field -- distinguishing between white balls and green grass can be tiresome! So he wants you to make a workflow that will help him identify golf balls on images he took from the field. Your task then is to segment the images containing golf balls. Hint: The KNIME Image Processing extension is very important for this challenge.
Author: Daria Liakh
Dataset: Golf Ball Data on the KNIME Hub
Solution Summary: After reading the single golf field image in the given dataset, we create two different solutions for the challenge. First, we use a shared component that performs image segmentation. Second, we manually do what the shared component essentially does: we perform a Gaussian Convolution over the image, distinguish between foreground and background to isolate the golf balls, fill holes in the golf balls to better segment them, and then finally perform a connected component analysis to label the golf balls.
Description: You work for a travel agency and want to better understand how hotels are reviewed online. What topics are common in the reviews as a whole, and what terms are most relevant in each topic? How about when you separate the reviews per rating? A colleague has already crawled and preprocessed the reviews for you, so your job now is to identify relevant topics in the reviews, and explore their key terms. What do the reviews uncover? Hint: Topic Extraction can be very helpful in tackling this challenge. Hint 2: Coherence and perplexity are metrics that can help you pick a meaningful number of topics.
Author: Aline Bessa
Dataset: Hotel Reviews Data in the KNIME Hub
Solution Summary: The first task we address in this challenge is determining how many topics are ideal for (1) the reviews as a whole, (2) the positive reviews with rating 4 or 5, and (3) the negative or neutral reviews with rating between 1 and 3. Many metrics can be used to determine a good number of topics, but in our solution we focused on minimizing perplexity. For all three cases, the best number of topics is k = 2. We then create LDA topic models for cases (1), (2), and (3) and visualize their most relevant words in tag clouds. In good reviews, there seems to be a focus on water-related activities (resorts with pools, beach hotels, etc.); negative and neutral reviews seem to discuss facilities a bit more (bathroom, location, service, restaurant, etc.); and the overall reviews do not seem to have topics that are as immediately clear.
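Regarding the perplexity hint: perplexity follows the standard definition exp(-log-likelihood / number of tokens), where lower values indicate that the model is less "surprised" by held-out reviews. A minimal illustration:

```python
import math

# Perplexity of a topic model on held-out text.
# Lower perplexity = better fit, which is why we minimize it over k.
def perplexity(log_likelihood, n_tokens):
    return math.exp(-log_likelihood / n_tokens)

# If a model assigns each of 1,000 tokens probability 1/50 on average,
# its perplexity comes out to 50.
print(perplexity(1000 * math.log(1 / 50), 1000))
```

In practice one computes this score for several candidate values of k and picks the k with the lowest perplexity, as done above with k = 2.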
Level: Easy or Medium
Description: In this challenge you will take the role of a clinician and check if machine learning can help you predict diabetes. You should create a solution that beats a baseline accuracy of 65%, and also works very well for both classes (having diabetes vs not having diabetes). We got an accuracy of 77% with a minimal workflow. If you'd like to take this challenge from easy to medium, try implementing:
Author: Victor Palacios
Dataset: Diabetes Data in the KNIME Hub
Solution Summary: Our minimal workflow that beats the baseline accuracy trains a Random Forest classifier with 70% of the original, annotated dataset, and assesses its quality with the remaining 30% of the data. The data is split using stratified sampling due to its heavy class imbalance.
Description: Using the Geospatial Analytics extension for KNIME, visualize on a map what you think is the best place in the world (e.g., your hometown). We advise you to make it big enough so that it is also visible on the full world map.
Author: Rosaria Silipo
Solution Summary: We solve this challenge by visualizing the polygons for Firenze in a map that can be zoomed in and out. To get a better perspective of where Firenze is, we also use the North and South Pole locations to build a second map that shows Firenze in a broader geospatial context.
Solution Details: We start tackling this challenge by using the OSM Boundary Map node to get Firenze's geometry. In parallel, we use an alternative way of getting geometries: with the Table Creator node, we input the latitude and longitude of the North and South Poles; next, we use the Lat/Lon to Geometry node to get their geometries. After that, we use the Concatenate node to combine Firenze's and the Poles' geometries. Finally, we use the Geospatial View node to get a more detailed, polygon-oriented map for Firenze, and the Spatial Heatmap node to get a view of Firenze in a larger worldwide map containing the Poles' locations for reference.
Description: The goal of this challenge is to remind you of the importance of building robust solutions with error handling and logging. These are best practices in workflow development that are often forgotten, but make a big difference in the end -- especially if the application is critical or if you are working with a team. Concretely, you will pick any workflow of your choice and add the following elements to it:
Error handling: Use the Try & Catch construction;
Logging: Extract the workflow name, username, and start and end of the execution. Next, create a table containing these metadata as well as the name of a failing node (if any) and its message in one row, and write this table to a CSV file. Make sure that once the workflow is executed again, a new row is appended to the log table in the CSV file. Optionally, you can add more columns to your log file, e.g., a column with a custom message depending on the execution status (success or failure). You can also create a logging component that takes the workflow name, username, and start and end of the execution as input, allows configuring the custom message, and updates the log file.
Author: Lada Rudnitckaia
Solution Summary: In our solution, we have a component that is not executing correctly, generating errors. This component (“Example workflow”) is surrounded by a Try & Catch structure for error handling. This structure maintains information on whether the component failed, and also where. We also keep some meta information on the workflow execution (e.g., when it was executed and by whom) and log it, along with the information on the failed component. The log file is stored in the workflow’s data area.
Solution Details: We start our solution by extracting the workflow’s name and the username of whoever is executing it (Extract Context Properties node). Next, we use the Date&Time Configuration node to get a timestamp for when the execution of the failing component (“Example workflow”) approximately starts. We then surround this component with the Try (Variable Ports) and Catch Errors (Var Ports) nodes in order to capture flow variables that contain information on whether and why the component failed. We add a Variable Creator node, and connect it to the Catch Errors (Var Ports) node just to illustrate that one may provide alternative data for the execution of the latter. Note that we do not really use it in our solution, though. After we have accumulated all the meta information on the workflow and information on the failing component, we use another instance of the Date&Time Configuration node to get a timestamp for when the execution of the failing component approximately ends. All of this information, which is in the format of flow variables, is sent to the Update Log component, which logs it all in a file in the workflow’s data area.
Description: You are a senior student in a US high school, and you are going to apply for college soon. The costs of undergraduate studies in the US can be pretty high, and you want to understand what factors influence the costs -- and how. To make an informed and financially responsible decision, you decided to perform some Exploratory Data Analysis and cost modeling. What conclusions do you reach?
Author: Jinwei Sun
Solution Summary: To tackle this challenge, we used a variety of plots to check how the average undergraduate cost in the US varies depending on different factors (state, year, type of school, etc.). We also used a Simple Regression Tree model to see how the cost values generalize per institution. The idea is for students to use this information to help determine what school they should attend.
Solution Details: In our solution, we start by grouping the cost data per state with the GroupBy node, then sorting it in descending order with the Sorter node, and then visualizing this information with the Bar Chart node. The most expensive region of the US in terms of undergraduate cost is DC. Next, we group the data again per year (GroupBy node) and use the Line Plot node to visualize the cost time series. The data shows that from 2013 to 2020 the cost of an undergraduate education rose considerably, on average. We also use the Bar Chart node to visualize the average undergraduate cost per school type, which indicates that Public In-State schools are the cheapest for a student to attend. Finally, we partition the data (Partitioning node), train and test a Simple Regression Tree model (Simple Regression Tree Learner and Predictor nodes), and assess its quality with the Numeric Scorer node. The attributes in the data explain about 95% of the cost variability in the test data (R-squared of 0.952), suggesting that factors such as school type and state are strong indicators of undergraduate cost in the US, and thus very likely to help students make sound decisions as to where to go to college.
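The R-squared score reported by the Numeric Scorer node follows the standard definition: the fraction of the target's variance that the predictions explain. A minimal sketch with toy numbers (not the challenge data):

```python
# R-squared = 1 - (residual sum of squares / total sum of squares).
def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Nearly perfect toy predictions give a score close to 1.
print(r_squared([10, 20, 30, 40], [11, 19, 31, 39]))
```

A score of 0.952 therefore means roughly 95% of the cost variability in the held-out data is accounted for by the model.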
Description: You are a freelance data scientist and are asked to help an actuarial agency. To compute premiums for life insurances, actuaries need to find the age-dependent information about customers from an Actuarial Life table containing the probability of dying within 1 year and life expectancy. Your task is to make this process faster: create a dashboard in which an actuary specifies the age of the customer and this action updates probabilities and life expectancy for both genders. Feel free to use either tables or graphical representations of data. Hint: To combine values of the first 3 rows to use as column headers, refer to this workflow from the Community Hub. Hint 2: The Integer Widget does not refresh the interactive view of the component. Consider adding the Refresh Button Widget.
Author: Daria Liakh
Solution Summary: To tackle this challenge, we start by reading the Actuarial Life table, processing its header values, and then combining them. Next, we filter out columns that are not relevant to the problem, and then create a dashboard in which users can check the average number of remaining years a population has depending on their age (which they can choose), along with their individual probabilities of dying within a year.
Description: The goal of this challenge is to create an interactive dashboard to rank countries according to their covered LGBTQIA+ rights. The provided dataset contains 11 types of laws about LGBTQIA+ rights, such as CSSSA LEGAL (Consensual Same-Sex Sexual Acts between Adults), CONST (Constitutional Protection), and BROAD PROT. (Broad Protection). You can read a full description of them in the header provided here. Assign 1 point for each law fully provided by the country, 0.5 if its coverage is limited, and 0 if there is no law. Which countries observe LGBTQIA+ rights the most? Hint: Enable interactivity by adding a selection widget: only the laws selected by the user should be counted in the country score.
Author: Emilio Silvestri
Dataset: LGBTQIA+ data in the KNIME Hub
Solution Summary: To tackle this challenge, we first encoded all categorical values in the dataset, based on whether the law is fully provided in the country, its coverage is limited, or if there is no law. In parallel to that, we extracted all column names so that we end up with a list containing all available law types. We then used a choropleth map to visualize the countries' ranks and Widget nodes to enable interactivity.
Solution Details: After reading the data with the CSV Reader node, we split the table into two parts using the Column Splitter node, such that all relevant law columns are located at the second output port of the node. We then encoded all categorical values, based on whether the law is fully provided in the country (1), its coverage is limited (0.5), or there is no law (0). We did this by using the Column Expression node enclosed within a loop (Column List Loop Start node and Loop End (Column Append) node). We added back the remaining columns (COUNTRY, DATE OF DECRIM, PENALTY) using the Column Appender node. In parallel to the loop, we extracted all column names and aggregated them into a list, using the Extract Column Header node followed by a Column Aggregator node. To visualize the countries’ ranks we used the Choropleth Map component, encapsulated within the Interactive Dashboard component. To allow for interactivity, we added a Multiple Selection Widget node that allows users to select the laws of interest, and a Refresh Button Widget node to trigger re-execution. The respective score for each country based on the selected laws is calculated with the Column Aggregator node.
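The encoding-and-scoring logic can be sketched in a few lines of Python. The country names and category labels below are hypothetical placeholders, not values from the dataset:

```python
# Hypothetical category labels mapped to the challenge's point scheme.
SCORE = {"yes": 1.0, "limited": 0.5, "no": 0.0}

# Hypothetical countries and law coverage (stand-ins for the real rows).
countries = {
    "Atlantis": {"CSSSA LEGAL": "yes", "CONST": "limited", "BROAD PROT.": "no"},
    "Erewhon":  {"CSSSA LEGAL": "yes", "CONST": "yes",     "BROAD PROT.": "limited"},
}

# Laws ticked in the Multiple Selection Widget: only these count.
selected = ["CSSSA LEGAL", "CONST"]

# Score each country over the selected laws and rank descending.
ranking = sorted(
    ((sum(SCORE[laws[law]] for law in selected), country)
     for country, laws in countries.items()),
    reverse=True,
)
print(ranking)
```

Re-running this with a different selected list mirrors what the Refresh Button Widget triggers in the dashboard.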
Level: Medium
Description: You are a data scientist working for a real estate company, and heard a rumour that the "average number of rooms per dwelling" (RM) may be connected to the "per capita crime rate" (CRIM) depending on the city/town. You then decide to investigate if this is the case for Boston, the city where you live and work. To this end, you decide to experiment with a machine learning regression model and with a topic that you have recently been studying: XAI. How are RM and CRIM connected in Boston? Hint: Consider calculating the SHAP values of each independent feature using a SHAP loop. Hint 2: Consider using a dependence plot to verify visually how RM and CRIM are connected.
Author: Keerthan Shetty
Solution Summary: To tackle this challenge, we use AutoML to learn a model for the median value of owner-occupied homes in Boston (in $1000's). In parallel, we use a SHAP summarizer to cluster the data into different prototypes, and we also sample a few rows to check how they can be "explained" by the learned model. We then feed the prototypes and the rows for explanation into a SHAP loop, which performs SHAP value calculations using the learned model and the clustered prototypes (with the SHAP Summarizer's sampling weights). Next, we join the SHAP values with the separated rows for explanation and generate a dependence plot. We can see that when the average number of rooms per dwelling (RM) is greater than 7, crime rates are very low and the SHAP value is very high, suggesting that RM has a significant impact on increasing the median value of a house.
Level: Easy
Description: You are contacted by the police department and asked for some help. They need to provide the press with an image of a witness, but they must keep the witness's face a secret. They want you to blur the image, such that it is clear that it corresponds to an actual person but their identity remains undisclosed. Hint: Consider using the Gaussian Convolution node.
Author: Daria Liakh
Dataset: Witness Image in the KNIME Hub
Solution Summary: To tackle this challenge, we read the image and use a Gaussian convolution to blur it.
Solution Details: After using the List Files/Folders node to read the path for the image, we convert the path's type to string with the Path to String node and send it to the Image Reader (Table) node, which finally reads the actual image. Next, we send the image to the Gaussian Convolution node in order to blur it. Note: You can blur the image more by increasing the value for the sigma parameter in the Gaussian Convolution node.
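For intuition on the sigma parameter, here is a rough sketch of the 1-D Gaussian kernel underlying such a convolution (an illustration of the math, not KNIME's actual implementation):

```python
import math

# Build a normalized 1-D Gaussian kernel; a 2-D blur applies this
# along both image axes. Larger sigma -> wider, flatter kernel -> more blur.
def gaussian_kernel(sigma):
    radius = max(1, int(3 * sigma))          # cover ~3 standard deviations
    raw = [math.exp(-x * x / (2 * sigma * sigma))
           for x in range(-radius, radius + 1)]
    total = sum(raw)
    return [v / total for v in raw]          # weights sum to 1

small_blur = gaussian_kernel(1.0)
strong_blur = gaussian_kernel(3.0)
# The stronger blur gives less weight to the center pixel,
# spreading each pixel's value over more neighbors.
print(max(small_blur), max(strong_blur))
```

This is why raising sigma in the Gaussian Convolution node makes the witness's face harder to recognize: each output pixel averages over a wider neighborhood.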
Level: Hard
Description: Language models have become extremely popular, and can be efficiently incorporated into your projects through an API. You can interact with an API by sending HTTP requests to one of its endpoints, which should include input data (e.g., text or a prompt) that you want the language model to process. For this challenge, you are asked to use OpenAI's "text-davinci-003" GPT-3 model to build a KNIME workflow that answers natural language queries -- for example, "What is KNIME?". Optional: You can also build this as a KNIME component. Note: You will need an API Key from OpenAI to complete this challenge. To get it, follow the instructions in this blog post. Here's also OpenAI's documentation for making API requests. Finally, do not share your API Keys when uploading your solution to the KNIME Hub.
Author: Mahantesh Pattadkal
Solution Summary: Our solution to this challenge gathers information for the text-davinci-003 API and a query for it (e.g., “What is KNIME?”), and then performs a POST request to obtain the model’s answer to the query. The response is then deserialized, filtered, and visualized in a tile.
Solution Details: To tackle this challenge, we start by gathering API information, a query, and other parameters with the String Configuration, String Widget, and Variable Creator nodes. The query and parameter flow variables are merged with the Merge Variables node, and reformatted for the POST Request node with a combination of the Variable to Table Row and Table to JSON nodes. The POST Request node submits the query to the text-davinci-003 GPT model, and the result is reformatted with the JSON Path, Ungroup, and Column Filter nodes. Finally, we use the Tile View node to visualize the result for the query.
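Outside KNIME, the same POST request can be sketched with the Python standard library. The endpoint and fields follow OpenAI's completions API as documented at the time; the key placeholder is, of course, hypothetical and must never be committed or shared:

```python
import json
import urllib.request

API_KEY = "YOUR_OPENAI_API_KEY"   # placeholder; keep your real key secret

def build_payload(prompt, max_tokens=256):
    # Request body that the Table to JSON node assembles in the workflow.
    return {"model": "text-davinci-003", "prompt": prompt,
            "max_tokens": max_tokens}

def ask_gpt(prompt):
    # POST Request node equivalent: submit the prompt to the endpoint.
    request = urllib.request.Request(
        "https://api.openai.com/v1/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # JSON Path step equivalent: the answer sits under choices[0].text.
    return body["choices"][0]["text"].strip()
```

Calling ask_gpt("What is KNIME?") with a valid key returns the model's answer as plain text, mirroring what the Tile View displays.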
Level: Easy
Description: Restaurant Yummy records its sales values on a daily basis. The data contains two columns: Date and Sales. Year-to-Date (YTD) Sales and Month-to-Date (MTD) Sales are important metrics to track the revenue of the business. But what exactly are YTD and MTD values? The YTD value allows you to calculate the metric (e.g., sum of sales) for the current year, while the MTD value denotes a metric value for the current month. You are asked to build a KNIME Workflow that takes this data as input and adds YTD and MTD values across each record.
Author: Mahantesh Pattadkal
Dataset: Sales Data in the KNIME Hub
Solution Summary: To tackle this challenge, we group the data by month and, using lags based on the previous months' values, calculate the MTD. We also use moving aggregation to go over the sums of sales to calculate the YTD.
Solution Details: After reading the data with the Excel Reader node, we transform its columns and extract the date fields with the String to Date&Time and the Extract Date&Time Fields nodes respectively. Next, we group the data by month (GroupBy node), lag it to get the previous months' sales values (Lag Column node), and perform some missing value imputation with zeroes (Missing Value node) and some light column renaming for readability (Column Rename node). In parallel, we use the Moving Aggregation node to go over all sales values performing their sum (YTD), followed by some light column renaming as well. We then use the Joiner node to combine both sources of data, use the Math Formula node to finalize the calculation of the MTD values based on lagged information from previous months, remove temporary columns with the Column Filter node, and then use a Table View node to plot all information.
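The YTD/MTD logic itself can be sketched in a few lines of Python (the daily records below are made up; the challenge data is read from Excel instead):

```python
from datetime import date

# Made-up daily sales records spanning a month boundary.
records = [(date(2023, 1, 30), 100), (date(2023, 1, 31), 50),
           (date(2023, 2, 1), 70), (date(2023, 2, 2), 30)]

out, prev = [], None
ytd = mtd = 0
for day, sales in sorted(records):
    if prev is None or day.year != prev.year:
        ytd = 0                                   # new year: reset YTD
    if prev is None or (day.year, day.month) != (prev.year, prev.month):
        mtd = 0                                   # new month: reset MTD
    ytd += sales
    mtd += sales
    out.append((day.isoformat(), sales, ytd, mtd))  # one row per record
    prev = day

for row in out:
    print(row)
```

The running sums correspond to the Moving Aggregation (YTD) and lag-based (MTD) branches of the workflow: both metrics accumulate within their period and reset at its boundary.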
Level: Hard
Description: In this challenge, your goal is to see which features are the most important in predicting the quality of wine. After doing this analysis, you should create a visualization that shows the feature importances in order.
Author: Keerthan Shetty
Dataset: Wine Data in the KNIME Hub
Solution Summary: We solve this challenge by comparing the root mean squared error obtained with an AutoML regression model over a subset of the data, and over the whole data with feature value permutations -- one feature at a time. In the end, our conclusion is that alcohol content is likely the most important feature that determines wine quality.
Solution Details: To tackle this challenge, we start by using AutoML for regression to train a model and get the scores for the original input features. We also do some quick feature processing (mostly to rename a few features) in parallel, and then perform permutation feature importance calculation. In practice, for each feature, we shuffle its values and then apply the AutoML model we learned. Then we compare the root mean squared error we obtain over a subset of the data, separated to validate the AutoML model, with the root mean squared error we obtain for each permutation. In our setup, these comparisons led us to verify that alcohol content is likely the most important feature. A plot visualization shows all feature importances following this methodology.
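The permutation-importance idea itself is model-agnostic and can be sketched in plain Python. The following is a toy illustration with a hypothetical scoring function standing in for the AutoML model, not the actual workflow: shuffling an informative feature should raise the RMSE, while shuffling an irrelevant one should leave it unchanged.

```python
import math
import random

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def permutation_importance(predict, X, y, n_features, seed=0):
    """Increase in RMSE when each feature column is shuffled, one at a time."""
    rng = random.Random(seed)
    base = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(n_features):
        col = [row[j] for row in X]
        rng.shuffle(col)
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        importances.append(rmse(y, [predict(row) for row in X_perm]) - base)
    return importances

# Toy model: "quality" depends only on feature 0 (say, alcohol), not feature 1
X = [[float(i), float(i % 3)] for i in range(20)]
y = [2.0 * i for i in range(20)]
imps = permutation_importance(lambda row: 2.0 * row[0], X, y, n_features=2)
```

Since the toy model ignores feature 1, its importance comes out as zero, while feature 0 gets a clearly positive score.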
Description: The Eurovision Song Contest (ESC) is a song competition with participants mostly from European countries. In the contest finale, each country gives points to the others, and the sum of the received points determines the winning country. In this challenge, you will experiment with the Network Mining nodes in order to visualize (in a Network Viewer) the points given and received by each country. Note: The contest regulation changed over time, and it might make no sense to aggregate points from different editions.
Author: Emilio Silvestri
Dataset: Eurovision Data in the KNIME Hub
Solution Summary: To tackle this challenge, we start by filtering the data, keeping only the rows that correspond to the 2021 edition and with points given by televoters. Next, we build a component that allows users to select a focus country (e.g., Cyprus) and then visualize what points this country gave to others and which points it received. Both visualizations follow a graph/network design.
Solution Details: After reading the data with the CSV Reader node, we use the Row Filter (Labs) node to filter the data by edition and points type. Next, we create a component named Network Visualizations that (1) allows users to select a focus country (first, the GroupBy node is used to extract them; the countries are then sorted and fed into the Value Selection Widget node); and (2) plots the points given and received by the selected focus country — Row Filter nodes receive the selected focus country and fetch the points given and received by it; next, they send this information to Object Inserter nodes, which turn it into nodes and edges; finally, this network information is sent to two instances of Network Viewer nodes. These nodes create the network/graph visualizations for the points given and the points received by the selected focus country.
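The edge-splitting step at the heart of this component can be sketched in plain Python. This is a minimal illustration with hypothetical televote rows, not the workflow itself: the same vote table is filtered twice, once on the source country and once on the destination country.

```python
# Hypothetical televote rows: (from_country, to_country, points)
votes = [
    ("Greece", "Cyprus", 12),
    ("Cyprus", "Greece", 12),
    ("Cyprus", "Italy", 10),
    ("France", "Italy", 12),
]

def edges_for_focus(rows, focus):
    """Split the vote table into points given by and received by a focus country."""
    given = [(src, dst, pts) for src, dst, pts in rows if src == focus]
    received = [(src, dst, pts) for src, dst, pts in rows if dst == focus]
    return given, received

given, received = edges_for_focus(votes, "Cyprus")
```

Each of the two edge lists then feeds one network visualization, which is why the workflow uses two instances of the Network Viewer node.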
Description: By visually analyzing successfully completed passes from football match data, it is possible to obtain valuable information about a team's performance. This can in turn be used by coaches and team managers to make informed decisions about how to improve their game. In this challenge, your task is to generate a visualization that adequately showcases the starting positions of successful passes made by the players of the Manchester City team. To this end, you will use data that includes both starting (x and y) and ending (endX and endY) positions of various types of passes made during a football match between Manchester City and Manchester United.
Author: Keerthan Shetty
Solution Summary: To tackle this challenge, we filter the data and end up only with rows that correspond to successful passes by Manchester City. We then generate a heat map with the starting positions of these passes. We can see that most of them are initiated from the center of the field.
Solution Details: After reading the data with the CSV Reader node, we use the Missing Value node to remove rows that do not have enough position values. Next, we use the Rule Engine node to create an indicator for rows that correspond to successful passes by Manchester City. With the Row Filter node, we only keep the rows that correspond to this indicator. Finally, we use the 2D Density Plot (Plotly) node to generate a heat map with the starting positions of these passes.
Description: You work for a Marketing agency that monitors the online presence of a few airline companies to understand how they are being reviewed. You were asked to identify whether a tweet mentioning an airline is positive, neutral, or negative, and decided to implement a simple sentiment analysis classifier for this task. What accuracy can you get when automating this process? Is the classifier likely to help company reviewers save their time? Note: Given the size of the dataset, training the classifier may take a little while to execute on your machine (especially if you use more sophisticated methods). Feel free to use only a part of the dataset in this challenge if you want to speed up your solution. Hint 1: Check our Textprocessing extension to learn more about how you can turn tweets' words into features that a classifier can explore. Hint 2: Study, use, and/or adapt shared components Enrichment and Preprocessing and Document Vectorization (in this order!) if you want to get a part of the work done more quickly. They were created especially for this challenge. Hint 3: Remember to partition the dataset into training and test set in order to create the decision tree model and then evaluate it. Feel free to use the partitioning strategy you prefer.
Author: Aline Bessa
Solution Summary: We tackle this challenge by preprocessing the tweets in order to turn them into term frequency bags of words, such that words in tweets can be explored as features by classifiers. We then partition the data into training and testing and use an AutoML component to get the best model (and parameters) for the given training data. In our case, this model was a Logistic Regressor. After applying this model over the testing data, we got an accuracy of 77%.
Solution Details: After reading the tweets from airline consumers (Table Reader node), we enrich the text with tags for positive and negative words, remove punctuation and stop words, perform stemming, and execute other text preprocessing tasks (Enrichment and Preprocessing component). Document vectors are then created to encode each tweet into a vector space. Next, the processed tweets are turned into term frequency bags of words, and each one of them is associated with its sentiment label (class) (Document Vectorization component). We then partition the data using stratified sampling on sentiment (class), keeping 80% for training and 20% for testing (Partitioning node). The training data is sent to an AutoML component that compares different classifiers and different parametrization/processing options. In the end, for that training data, the best model was a Logistic Regressor. We apply the model over the testing data (Workflow Executor node) and get an accuracy of 77% (Scorer node).
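The term-frequency vectorization step can be sketched outside KNIME in a few lines of Python. This is a simplified illustration with a tiny hypothetical stop-word list and vocabulary, not the shared components themselves: each tweet becomes a fixed-length count vector that a classifier can consume.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration
STOPWORDS = {"the", "a", "is", "to", "and", "was", "not"}

def tf_vector(text, vocabulary):
    """Term-frequency vector over a fixed vocabulary, after simple cleanup."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vocab = ["flight", "delayed", "great", "service"]
vec = tf_vector("The flight was delayed and the service was not great", vocab)
```

Real text preprocessing would also involve stemming and tagging, as the Enrichment and Preprocessing component does, but the resulting representation is the same kind of bag-of-words vector.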
Description: Earth Day just passed and we need to raise awareness for air quality and water pollution. In this context, you decide to explore a 2020 dataset in which users voted about air quality and water pollution for a number of different cities. The scale for air quality goes from 0 (bad quality) to 100 (top quality), while the scale for water pollution goes from 0 (no pollution) to 100 (extreme pollution). Your task is to rank and visualize the top 10 countries twice with different criteria: first, based on how bad the air quality is; and second, based on how polluted the water is.
Author: Daria Liakh
Solution Summary: To tackle this challenge, we group the data by country, getting the mean air quality and water pollution levels to represent each one of them. Next, we sort the resulting data in two different ways: first, by air quality in ascending order; second, by water pollution level in descending order. Finally, we select the top 10 countries with the worst (mean) air quality and the worst (mean) water pollution levels, and visualize the corresponding information with two bar plots. In our solution, Benin turned out to be one of the countries with the highest water pollution levels, and the Central African Republic turned out to be the country with the worst air quality.
Solution Details: After reading the dataset with the CSV Reader node, we remove unnecessary region information with the Column Filter node and then group the city-level information by country with the GroupBy node, such that in the end we have the mean levels of air quality and water pollution for each one of the countries in the dataset. Next, we use two instances of the Top k Selector node to sort the data by air quality (ascending order) and by water pollution level (descending order). We send the resulting data to a component in order to get a composite view with two bar plots: one for the countries with the worst air quality, and one for the countries with the worst water pollution levels. We use two instances of the Bar Chart (Labs) node for this composite view.
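The group-then-rank logic can be sketched in plain Python. This is a minimal illustration with hypothetical city-level votes, not the workflow itself: city rows are averaged per country, then sorted twice with opposite criteria.

```python
# Hypothetical per-city votes: (country, air_quality, water_pollution)
rows = [
    ("A", 30.0, 80.0), ("A", 50.0, 60.0),
    ("B", 20.0, 90.0),
    ("C", 90.0, 95.0),
]

def country_means(data):
    """Average the city-level scores per country."""
    sums = {}
    for country, air, water in data:
        s = sums.setdefault(country, [0.0, 0.0, 0])
        s[0] += air; s[1] += water; s[2] += 1
    return {c: (a / n, w / n) for c, (a, w, n) in sums.items()}

def top_k(means, k, key, reverse):
    """Countries sorted by the given criterion, truncated to the top k."""
    return [c for c, _ in sorted(means.items(), key=key, reverse=reverse)[:k]]

means = country_means(rows)
worst_air = top_k(means, 2, key=lambda kv: kv[1][0], reverse=False)   # lowest air quality first
worst_water = top_k(means, 2, key=lambda kv: kv[1][1], reverse=True)  # highest pollution first
```

The two sorted lists correspond to the two Top k Selector branches, each of which feeds one bar plot.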
Description: Your team created a summary of the total amount spent last year on each project. The data looks like the following:
Year Area Project exp
2022 area A proj A 120
2022 area B proj X 160
The finance department of your company, however, would like to have a more granular overview of the expenditures. In particular, the expenditures for projects in area A should be spread equally across 12 months, while those for projects in area B should be grouped by quarters. In the end, the above table should look like the following:
Year Area Project Period exp
2022 area A proj A jan 10
2022 area A proj A feb 10
2022 area B proj X Q1 40
2022 area B proj X Q2 40
Solution Summary: To solve this challenge, we started by separating the data into Area A and Area B. Next, we created auxiliary tables for the year’s months and quarters, and combined them with one twelfth of each project’s expenditures in the case of Area A, and one quarter of each project’s expenditures in the case of Area B. Finally, we concatenated the combined information into a single table.
Solution Details: After reading the dataset for this challenge with the Table Reader node, we use the Row Splitter node to split it based on Area (Area A or Area B). We then forward the rows that correspond to Area A to the Math Formula node, which divides the expenditures of each project in Area A by 12. In parallel, we use the Table Creator node to create labels for the 12 months of the year, and then combine the expenditures’ fractions for the projects in Area A with these labels by using the Cross Joiner node. As a result, we end up with a table that has the expenditures for each project in Area A equally spread across 12 months. Similarly, we forward the rows for projects in Area B to another instance of the Math Formula node, which divides their corresponding expenditures by 4. Another instance of the Table Creator node creates labels for each quarter of the year, and the Cross Joiner node is again used to combine expenditures’ fractions and quarter labels. In the end, the expenditures for projects in Area B are equally spread across the quarters. These two more granular tables are then finally combined with the Concatenate node. Note: The Cross Joiner node is very expensive as the number of rows in the output is the product of both input table row counts. If executed in streaming mode, only the top input will be processed in a streamable fashion.
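The split-divide-cross-join pattern can be sketched in plain Python. This is a minimal illustration of the same logic, not the workflow itself: each project row is crossed with month or quarter labels depending on its area, and the expenditure is divided evenly across the labels.

```python
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
QUARTERS = ["Q1", "Q2", "Q3", "Q4"]

def spread(rows):
    """Cross-join each project row with month or quarter labels, splitting exp evenly."""
    out = []
    for year, area, project, exp in rows:
        labels = MONTHS if area == "area A" else QUARTERS
        out.extend((year, area, project, lab, exp / len(labels)) for lab in labels)
    return out

granular = spread([
    (2022, "area A", "proj A", 120),
    (2022, "area B", "proj X", 160),
])
```

The `extend` over labels plays the role of the Cross Joiner node, and appending both areas' rows to one list plays the role of the Concatenate node.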
Description: The sales manager at ABC Grocery Stores is looking for a way to compare the number of sales of a selected month with those of a prior month. You are tasked to create a component in KNIME which carries out this task, with a visualization showing the comparison. If there are no prior months then only data for the selected month should be shown. If a prior month can be picked for comparison, it should not be more than 12 months before the selected month.
Author: Ali Marvi
Solution Summary: To solve this challenge, we built a component that allows users to (1) select the month against which they want to compare sales numbers, and (2) pick an earlier month for comparison based on a gap. As an example, users can pick June 2020 for (1) and, by entering a gap of -3, also pick March 2020 as (2) for comparisons. The component also contains a bar chart to compare sales numbers for both selected months, and a refresh button that allows users to vary their selections and resulting plots.
Solution Details: After the data is read with the Excel Reader node, it is fed to the component we built (Comparing Monthly Sales). Inside this component, we have three widget nodes to capture users’ inputs: the Date&Time Widget node for a month selection, the Integer Widget node for the gap selection, and the Refresh Button Widget node at the very beginning of the flow to allow users to vary their comparisons. Next, all inputs given by a user are merged with the Merge Variables node, and their values are captured with the Variable to Table Row node. These values are preprocessed with the String to Date&Time and Date&Time Shift nodes in order to properly filter the data based on the given input. After a bit more preprocessing, the filtering is finally performed with the Reference Row Filter node, which takes the original data and the two selected months as input and outputs the corresponding data rows (months and sales numbers). The dates are converted back into strings with the Date&Time to String node to be compatible with what the Bar Chart node takes as input. This node finally plots the sales numbers for the two currently selected months.
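The month-shift arithmetic performed by the Date&Time Shift node can be sketched in plain Python. This is a minimal illustration of the date logic only, with the (year, month) representation as an assumption:

```python
def shift_month(year, month, gap):
    """Shift a (year, month) pair by a signed number of months."""
    idx = year * 12 + (month - 1) + gap
    return idx // 12, idx % 12 + 1

selected = (2020, 6)            # June 2020
prior = shift_month(*selected, gap=-3)   # three months earlier
```

Encoding the date as a flat month index makes negative gaps that cross a year boundary work without special cases.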
Description: Credit Card company ABC maintains information about customer purchases and payments. The information is available for individual customers as Payments Info and Purchase Info. The company wants to segment the customers into three (3) clusters, so that marketing campaigns can be designed according to each cluster. You are asked to use both types of information together to build a clustering model that adequately segments the customers. What patterns do customers in the same cluster have in common? Also, information for newly registered customers is available. You are asked to assign cluster labels to newly registered customers using the trained clustering model, and then export the results into a CSV file. Do the assignments make sense? How do you assess their quality?
Author: Mahantesh Pattadkal
Dataset: Customer Data on Kaggle
Solution Summary: To tackle this challenge, we start by partitioning the data into existing and new customers. Next, we preprocess the data for existing customers to address missing values and also to normalize it, facilitating its posterior clustering. We then use k-Means to identify three (k = 3) clusters in this data, whose quality we evaluate by calculating the Silhouette coefficient (the value was close enough to 1, so the clusters had acceptable quality). We also visualize the identified clusters with a scatter plot. Finally, we assign the identified cluster IDs to the new customers, grouping them as well.
The clustering was helpful to segment customers into categories low profile (Cluster_1, with users who have low order payments and few purchases); potential (Cluster_0, with users who have made medium to high order payments but a relatively low number of purchases); and premium (Cluster_2, with users who made high order payments and a high number of purchases). Marketing campaigns could focus on these different categories in order to boost revenue.
Solution Details: We start our solution by using the Partitioning node in order to separate the data into existing and new customers. Next, we use the Missing Value and Normalizer nodes in order to not only preprocess the data for existing customers, but also to create models to clean the new customer data (Missing Value (Apply) and Normalizer (Apply) nodes). After clustering the preprocessed data for existing customers with the k-Means node, we denormalize it (Denormalizer node) and visualize it (Color Manager and Scatter Plot nodes). Additionally, we also assess the quality of the clusters identified with k-Means with the Silhouette Coefficient node. Since the quality is good, we assign the identified cluster IDs to the preprocessed new customer data (Cluster Assigner node). We then denormalize this data (Denormalizer node) and write it to a CSV file with the assigned cluster IDs (CSV Writer node).
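The silhouette coefficient used to assess cluster quality can be sketched in plain Python. This is a naive O(n²) illustration on hypothetical 2D points, not the Silhouette Coefficient node itself: for each point, a is the mean distance to its own cluster and b the mean distance to the nearest other cluster.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over all points (naive O(n^2) version)."""
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        a = sum(same) / len(same) if same else 0.0
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == lab)
            / labels.count(lab)
            for lab in set(labels) if lab != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Two tight, well-separated hypothetical clusters: score should be close to 1
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
score = silhouette(pts, [0, 0, 1, 1])
```

Values near 1 indicate compact, well-separated clusters, which is the criterion the workflow uses before trusting the cluster assignments for new customers.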
Description: You are doing research on trading and different currencies and one of your tasks is to compare the daily exchange rates of the US dollar vs 8 other currencies from 1980 to 1998. The data contains some missing values due to public holidays on weekdays, which you should handle first using a strategy of your choice. After that, you should calculate the correlations between the exchange rates of the currencies and visualize them. Which currencies had the highest positive/negative correlation? Which currencies were not correlated, or had a low correlation?
Author: Aline Bessa
Solution Summary: After reading the dataset with daily exchange rates, we used the simple strategy of replacing missing values with those of the previous dates. Next, we computed the linear correlations across all currency pairs using Pearson's coefficient, and then visualized them with a heatmap. The highest positive correlations involved the Swiss, German, and Dutch currencies. The highest negative correlation was between the Australian and Japanese currencies. The correlations closest to zero were between the Canadian currency and the Dutch/German currencies.
Solution Details: We started to tackle this challenge by using the Table Reader node to read the dataset with daily exchange rates. Next, we used the Missing Value node to replace each missing value with its previous non-missing one, and then computed Pearson correlations across currency pairs using the Linear Correlation node. Finally, we used the Heatmap node to visualize these correlations in an interactive fashion.
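The two core steps here, forward-filling missing values and computing Pearson's coefficient, can be sketched in plain Python. This is a minimal illustration on short hypothetical rate series, not the workflow itself:

```python
import math

def forward_fill(values):
    """Replace each missing value (None) with the previous non-missing one."""
    out, last = [], None
    for v in values:
        last = v if v is not None else last
        out.append(last)
    return out

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical daily rates; None marks a public-holiday gap
chf = forward_fill([1.60, None, 1.62, 1.65])
dem = [1.55, 1.56, 1.58, 1.60]
r = pearson(chf, dem)
```

Two currencies that move together, like the Swiss and German rates in the solution, yield a coefficient near 1; the heatmap simply renders one such coefficient per currency pair.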