Just KNIME It!
Prove your KNIME knowledge and practice your workflow building skills by solving our weekly challenges.
Challenge 29: Comparing Distributions between Groups
Description: Imagine that you want to compare student test scores to find out whether there are any differences between the students' performances in 2020 (group 1) compared to 2019 (group 0). For example, you might want to find out whether there is an unusually high number of very good scores compared to the other year, which could be a sign of cheating. Each student participated in the same three tests and received three test scores (Score 1, Score 2, and Score 3). How similar are the distributions of the three scores between the two groups? Which score distribution differs the most? The output should contain a visualization of the conditional distributions and a statistical test for the equality of mean and variance between the groups.
Hint: Check out the verified components for visualization on the KNIME Hub.
Author: Maarit Widmann
Our solution will appear here next Tuesday. In the meantime, feel free to discuss your work on the KNIME forum or on social media using the hashtag #justknimeit.
Remember to upload your solution with the tag justknimeit-29 to your public space on the KNIME Hub. To increase the visibility of your solution, also post it to this challenge thread on the KNIME Forum.
Just KNIME It! Leaderboard
KNIME community members are working hard to solve the latest "Just KNIME It!" challenge - and some of you have solved dozens of them already! Who are the KNIME KNinjas who have completed the most challenges? Click over to the leaderboard on the KNIME Forum to find out! How many challenges have you solved?
Description: In this challenge, you will do a little bit of logic building and produce the star triangle below. Each row of the pattern should appear in the corresponding row of the output table.
P.S.: Do not hardcode the pattern.
Bonus Challenge: Do not use scripting nodes or components.
Author: Ali Marvi
Solution Summary: We started by creating a table containing one white space character. We then looped over the table, iteratively adding star characters to the left of the string so that its length equals the current iteration number, and replaced the white space character with a star character. We collected all rows and lastly removed the first row from the table, as it is a duplicate.
Solution Details: We started by using the Table Creator node to create a table containing one white space character. We then looped over the table using a Counting Loop Start node and a Loop End node that collects the intermediate results row-wise. In the Counting Loop Start node, we set the number of loops to 7. Within the loop body, we used the String Manipulation node to build the star pattern: in each iteration, we added star characters to the left of the string so that its length equals the current iteration number (padLeft function), and replaced the white space character with a star character (replace function). Lastly, we used the Row Filter node to remove the first row from the table, as it is a duplicate.
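For readers who prefer code, the same padLeft-and-replace logic can be sketched in a few lines of Python (an illustration of the loop's logic, not of the KNIME workflow itself):

```python
# Build the 7-row star triangle the way the loop body does: start from a
# single character and left-pad with stars until the string length equals
# the current iteration number.
def star_triangle(rows=7):
    pattern = []
    for i in range(1, rows + 1):
        # rjust plays the role of the padLeft function; the initial white
        # space character has already been replaced by a star here.
        pattern.append("*".rjust(i, "*"))
    return pattern

for row in star_triangle():
    print(row)
```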
Description: You have a dataset containing information on US citizens who donated blood in the last year, including addresses and blood types. The O- blood type, also known as "universal donor", is perhaps the most valuable blood in the world because it can be transfused to nearly any blood type holder. Your goal here is to help a group of researchers find the number of citizens with O- blood type per US state. Unfortunately, the address column comes in a single line, so to extract the state information you will have to perform some data wrangling. They also asked you to create a choropleth map of the US to visualize the results.
Author: Ahmad Varasteh
Solution Summary: After reading the dataset on rare blood types, we first extracted the state from the address line. We then removed the unnecessary address column and renamed the name and state columns for better readability. We then filtered the data to only keep samples where the blood type equals "O-" and grouped the data by state. Lastly, we used the Choropleth Map component to visualize the results on a map.
Solution Details: After using the CSV Reader node to read the dataset on rare blood types, we used the Regex Split node to extract the state from the address line into a separate column called "split_0". With the help of the Table Manipulator node, we removed the address column and renamed the columns “name” and “split_0”. Then, we used the Row Filter node to only keep samples where the blood type equals "O-". To calculate the number of citizens per state we used the GroupBy node, grouped by state, and used Count as an aggregation method. Lastly, we used the Choropleth Map component to visualize the results, which produced an interactive map of the United States.
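The same wrangling can be sketched in Python with pandas; the rows and column names below are made up for illustration, and the real dataset's schema may differ:

```python
import pandas as pd

# Hypothetical sample of the dataset; real column names may differ.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cay"],
    "address": ["12 Oak St, Austin, TX 73301",
                "9 Elm Ave, Reno, NV 89501",
                "3 Pine Rd, Austin, TX 73344"],
    "blood_type": ["O-", "A+", "O-"],
})

# Extract the two-letter state code from the single-line address
# (the Regex Split step of the workflow).
df["state"] = df["address"].str.extract(r",\s*([A-Z]{2})\s+\d{5}$", expand=False)

# Keep only universal donors and count them per state (Row Filter + GroupBy).
counts = df[df["blood_type"] == "O-"].groupby("state").size()
print(counts)
```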
Description: To wrap up our series of data classification challenges, consider again the following churning problem: a telecom company wants you to predict which customers are going to churn (that is, going to cancel their contracts) based on attributes of their accounts. The target class to be predicted in the test data is Churn (value 0 corresponds to customers that do not churn, and 1 corresponds to those who do). You have already found a good model for the problem and have already engineered the training data to increase the performance a bit. Now, your task is to communicate the results you found visually. Concretely, build a dashboard that:
shows performance for both classes (you can focus on any metrics here, e.g., precision and recall)
ranks features based on how important they were for the model
Author: Aline Bessa
Solution Summary: To tackle this challenge, we created (1) an interactive visualization to compare different performance metrics for both classes; (2) used the Global Feature Importance verified component to check which features were most indicative of churning for the test set; and (3) created a component based on the Local Explanation View verified component to better understand why our model generated a given false positive (the user did not churn but the model predicted churning) and a given false negative (the user churned but the model did not predict it).
Solution Details: After oversampling the minority class in the training data, and using AutoML to find a suitable ML model for our problem, we created a component to interactively compare different metrics (e.g., recall and precision) for classes Churn = 0 and Churn = 1. This comparison is visual and is based on the Bar Chart node. Next, we used the Global Feature Importance verified component to check which features were most indicative of churning — the top 2 were DayMins (total number of calling minutes during the day) and CustServ Calls (number of calls placed to customer service). Finally, we created a component to visualize local explanations for one false negative and one false positive example. These explanations were generated with the Local Explanation View verified component. In the case of the false negative example, the desired class was Churn = 1, and in our execution we noticed that the instance’s value for Night Calls (billed cost for nighttime calls) was likely responsible for the misclassification. As for the false positive, the desired class was Churn = 0 — here, the value for CustServ Calls probably contributed substantially to the misclassification.
Description: In this challenge series, the goal is to predict which customers of a certain telecom company are going to churn (that is, going to cancel their contracts) based on attributes of their accounts. Here, the target class to be predicted is Churn (value 0 corresponds to customers that do not churn, and 1 corresponds to those who do).
After automatically picking a classification model for the task, you achieved an accuracy of about 95% for the test data, but the model does not perform uniformly for both classes. In fact, it is better at predicting when a customer will not churn (Churn = 0) than when they will (Churn = 1). This imbalance can be verified by looking at how precision and recall differ for these two classes, or by checking how the Cohen’s kappa metric is a bit lower than 80% despite a very high accuracy. How can you preprocess and re-sample the training data in order to make the classification a bit more powerful for class Churn = 1? Note 1: Need more help to understand the problem? Check this blog post out. Note 2: This problem is hard: do not expect to see a major performance increase for class Churn = 1. Also, verifying if the performance increase is statistically significant will not be trivial. Still... give this challenge your best try!
Author: Aline Bessa
Solution Summary: To tackle this challenge, we again relied on our AutoML component to pick a suitable model, and also played with different strategies to oversample the minority class using our SMOTE node. We were able to increase Cohen's kappa to 81.1% while maintaining basically the same accuracy we had before. The best solution we found again involved Gradient Boosted Trees -- this time, with 90 trees -- and using SMOTE with the 10 nearest neighbors to oversample the minority class. Note: We did not assess whether this improvement in Cohen's kappa is statistically significant.
Solution Details: After using two instances of the CSV Reader node to read both training and test datasets, we decided to use the SMOTE node to oversample the minority class in the training data. We then fed the oversampled data to the AutoML component and, just like last week, sent the best found model and the test data to the Workflow Executor node, which generated churn predictions. We then used the Scorer node to check how well the model performed over the test data. Since the SMOTE node has a parameter to control the number of nearest neighbors used to generate new samples, we decided to experiment with it. To this end, we used the Table Creator node to come up with a range of numbers of nearest neighbors, and then placed the body of this workflow (SMOTE/learning/predicting/scoring) in a loop that would generate solutions for these different numbers of nearest neighbors. The loop used the Table Row to Variable Loop Start and Loop End nodes. After executing this loop, we used the Top k Selector node to get the solution with the best Cohen's Kappa value, and then only kept informative columns associated with this solution by using the Column Filter node.
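For intuition, here is a toy, from-scratch stand-in for what SMOTE does — interpolating between a minority sample and one of its k nearest neighbours. It is not the SMOTE node's implementation, just a sketch of the idea behind the nearest-neighbours parameter we looped over:

```python
import numpy as np

def smote_like(X_min, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    a random minority point and one of its k nearest neighbours."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to all points
        neighbours = np.argsort(d)[1:k + 1]           # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                            # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Four minority samples in the unit square; create six synthetic ones.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like(X_min, n_new=6, k=2, rng=42)
print(X_new.shape)  # (6, 2)
```

Because each synthetic point is a convex combination of two existing minority points, it always lies between them — the same property that makes SMOTE's oversampled data plausible.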
Description: Just like in last week’s challenge, a telecom company wants you to predict which customers are going to churn (that is, going to cancel their contracts) based on attributes of their accounts. One of your colleagues said that she was able to achieve a bit over 95% accuracy for the test data without modifying the training data at all, and using all given attributes exactly as they are. Again, the target class to be predicted is Churn (value 0 corresponds to customers that do not churn, and 1 corresponds to those who do). What model should you train over the training dataset to obtain this accuracy over the test dataset? Can this decision be automated? Note 1: A simple, automated solution to this challenge consists of a mix of 1 component and 4 nodes. Note 2: In this challenge, do not change the statistical distribution of any attribute or class in the datasets, and use all available attributes. Note 3: Need more help to understand the problem? Check this blog post out.
Author: Aline Bessa
Solution Summary: We used the AutoML verified component to experiment with a variety of classifiers (and hyperparameter values) in order to find the best suited model for this problem. The chosen model was Gradient Boosted Trees, in a technical tie with XGBoost Trees and H2O’s GBM. This model achieved an accuracy of 95.1% over the test data. Note: We would need more data than what we currently have to robustly verify if Gradient Boosted Trees is statistically better than the other models. Small variations on the training dataset, for example, could lead to a different model choice.
Solution Details: After reading the training and test datasets with two instances of the CSV Reader node, we used the AutoML verified component to detect the best suited model based on the training dataset at hand. Next, we used the Workflow Executor node to apply the chosen model over the test data. Finally, we used the Scorer node to check how well the model classified the test instances.
Description: A telecom company wants you to predict which customers are going to churn (that is, are going to cancel their contracts) based on attributes of their accounts. To this end, you are expected to use a decision tree classifier. The company gives you two datasets (training and test), both with many attributes and the class ‘Churn’ to be predicted (value 0 corresponds to customers that do not churn, and 1 corresponds to those who do). You should train the decision tree classifier with the training data, and assess its quality over the test data (calculate the accuracy, precision, recall, and confusion matrix for example). Note 1: This challenge is a simple introduction to predictive problems, focusing on classification. You are expected to just apply a decision tree classifier (and get an accuracy of about 92%). A simple solution should consist of 5 nodes. Note 2: In this challenge, do not change the statistical distribution of any attribute or class in the datasets, and use all available attributes. Note 3: Need more help to understand the problem? Check this blog post out.
Author: Aline Bessa
Solution Summary: Using the learner-predictor paradigm, we trained a decision tree classifier over the training data and assessed its performance over the test data. When training the decision tree, we used Gini index as a metric for the quality of the decision tree, pruned it using the MDL method, and kept at least 6 records per node. By doing this, we achieved an accuracy of about 94%.
Solution Details: After reading the training and test datasets with two instances of the CSV Reader node, we used the Decision Tree Learner node to train a decision tree classifier, and the Decision Tree Predictor node to apply it over the test data in order to assess its performance. Finally, we used the Scorer node to check how well the model classified the test instances. Note: Decision tree models have a number of parameters that can be tuned in order to generate better models. We'll be discussing parameter tuning and model selection later in this series of challenges.
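A rough scikit-learn analogue of the Learner/Predictor pair is sketched below. The churn dataset is not bundled here, so toy separable data stands in, and MDL pruning has no direct scikit-learn equivalent; min_samples_leaf=6 loosely mirrors the "at least 6 records per node" setting:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy stand-in for the churn data: class 1 when x + y > 0.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(100, 2))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

# Gini index as the split criterion, at least 6 records per leaf.
model = DecisionTreeClassifier(criterion="gini", min_samples_leaf=6,
                               random_state=0)
model.fit(X_train, y_train)                           # the Learner node
acc = accuracy_score(y_test, model.predict(X_test))   # Predictor + Scorer
print(round(acc, 3))
```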
Description: In this challenge you will create visualizations to inspect different sexual orientation laws around the world, as reported by the State-Sponsored Homophobia report. The purpose of this report is to work as a useful tool for LGBTQIA+ human rights defenders. Here are a few questions that your visualizations should help answer:
1. In what decade did most countries decriminalize homosexuality?
2. Is the decriminalization of homosexuality becoming more common over time?
3. In what countries is same-sex marriage legal?
4. Is there a continent in which the legality of same-sex marriage is more common? And how about civil unions?
Author: Aline Bessa
Solution Summary: After reading the dataset on sexual orientation laws, we binned the temporal information associated with the decriminalization of homosexuality, and then grouped countries by period. This information was then used to analyze whether the decriminalization of homosexuality is becoming more common over time. The data indicates that the 20th and 21st centuries concentrate most of the decriminalizations -- in particular, the period from 1990 to 1999. We also used maps to better understand the current legality of same-sex marriages and civil unions. Both are still illegal in most of the world, and legality seems to be more common in Oceania and in the Americas.
Solution Details: After using the CSV Reader node to read the dataset for the challenge, we used the Rule Engine node to bin the decriminalization information per decade of the 20th century, and then used the GroupBy node to group countries by decriminalization period. Next, we used the Row Filter node to remove periods associated with no decriminalization, and then used the Top k Selector node to sort the grouped data: periods with more decriminalization ranked higher in this sorted dataset. We also used a sequence of Rule Engine and GroupBy nodes to group countries by larger periods of time -- the goal was to visualize less granular trends in decriminalization. Here, we needed to use the String to Number node in the beginning of the pipeline for formatting. Finally, we used the Column Expressions node to create different codes for countries, depending on the legality of same-sex marriages or civil unions in them. In the end, we sent these three different sources of information to the Visualize LGBTQIA+ Rights component, which implements bar plots for decriminalization periods and choropleth maps for the legality of same-sex marriages and civil unions across the world.
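The bin-then-group step translates to a few lines of pandas; the country rows and column names below are invented for illustration, not taken from the report:

```python
import pandas as pd

# Hypothetical excerpt; the report's real column names differ.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "decriminalized": [1971, 1994, 1993, 2003, 1969],
})

# Bin the decriminalization year into decades (the Rule Engine step) ...
df["decade"] = (df["decriminalized"] // 10) * 10

# ... then count countries per decade and rank (GroupBy + Top k Selector).
per_decade = df.groupby("decade")["country"].count().sort_values(ascending=False)
print(per_decade.index[0])  # decade with the most decriminalizations
```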
Description: You are interning for a travel agency that wants to know about their top spenders' eating habits. As a trial run, you are given a tiny dataset with 6647 rows containing information on 1011 unique citizens traveling with different purposes. In this challenge you will:
Find the top 10 participants spending the highest amount of money on eating.
Find out whether the people who spend the most money on eating are the same people who spend the most time eating.
Note: Sometimes the ending balance is more than the starting balance. Assume this was a mistake when calculating money spent.
Author: Ahmad Varasteh
Dataset: Eating Out Data in the KNIME Hub
Solution Summary: Using date and time manipulation, we were able to calculate the time spent per activity as well as the amount of money spent. We then aggregated our data and kept only the entries related to eating. We then took the top 10 money spenders and the top 10 time spenders and compared their ids with a join. The results of the join showed that spending a lot of time eating does not necessarily mean spending a lot of money on it.
Solution Details: Critical to this analysis was to change the checkInTime and checkOutTime columns to the datetime type. To this end, we used the String to Date&Time node -- but note that we could also have changed the type of the columns within the CSV Reader node by clicking on the Transformation tab and then selecting the appropriate type. Once we had our data in a date format, we were able to use the Date&Time Difference node, which made it possible to calculate the difference between check-in and check-out times. With the Math Formula node, we were able to subtract the balance columns to see how much money was spent on each visit. We noticed that some of these events led to monetary gains (which was probably a mistake, so we used the "abs" function to derive the absolute value of the spent amount). Next, we used the GroupBy node to get the sum of money and time spent per customer. Then we used the Nominal Value Row Filter node to limit our data to eating events only. The Top k Selector node then allowed us to get a list of the top 10 money/time spenders. Finally, using the Joiner node we compared whether the participant ids matched. We noted there was little overlap, suggesting that spending more time eating does not necessarily mean that more money will be spent.
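In pandas, the core of this analysis — time difference, absolute amount spent, per-person totals, and the top-k comparison — can be sketched like this; the mini-table below is invented for illustration:

```python
import pandas as pd

# Hypothetical mini-dataset; the real one has 6647 rows and more columns.
df = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "activity": ["eating", "eating", "eating", "shopping"],
    "checkInTime": pd.to_datetime(["2022-01-01 12:00", "2022-01-02 12:00",
                                   "2022-01-01 12:00", "2022-01-01 09:00"]),
    "checkOutTime": pd.to_datetime(["2022-01-01 13:30", "2022-01-02 12:20",
                                    "2022-01-01 12:40", "2022-01-01 10:00"]),
    "startBalance": [50.0, 30.0, 40.0, 90.0],
    "endBalance": [40.0, 35.0, 10.0, 60.0],
})

# Time spent per visit, and amount spent (abs handles the balance mistakes).
df["minutes"] = (df["checkOutTime"] - df["checkInTime"]).dt.total_seconds() / 60
df["spent"] = (df["startBalance"] - df["endBalance"]).abs()

# Keep eating events only, total per person, then compare the top entries.
eating = df[df["activity"] == "eating"]
per_person = eating.groupby("id")[["minutes", "spent"]].sum()
top_time = per_person.nlargest(1, "minutes").index   # top time taker(s)
top_money = per_person.nlargest(1, "spent").index    # top money spender(s)
print(top_time[0], top_money[0])
```

In this toy example the top time taker and the top money spender are different people, echoing the workflow's finding.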
Description: You work for a hospital and they have data for each time a patient was seen. In this challenge, you will calculate the difference between each time a patient was seen excluding weekends (called "network days"). Once you calculate the network days, calculate the average network days per patient. For the challenge, experiment with the input and output below.
Patient  Date        Network Days  Mean
Aline    11/01/2022  ?             23.333
Aline    12/02/2022  24            23.333
Aline    25/02/2022  10            23.333
Aline    15/04/2022  36            23.333
Victor   05/02/2022  ?             13.333
Victor   25/02/2022  15            13.333
Victor   15/03/2022  13            13.333
Victor   30/03/2022  12            13.333
Note: if you simply use the Date&Time Difference node, you will mix dates across patients and will also end up counting weekends.
Bonus Challenge: Create a solution without using loops.
Author: Victor Palacios
Solution Summary: First, we changed our string column to the datetime format so that we could calculate temporal differences. To avoid loops, we then used lag columns to tag where the patient changes. Afterwards, we used another lag column to calculate the differences between days. Next, we used a component from the KNIME Forum that allowed us to calculate network days using math formulas. Finally, we calculated averages with a group-by and joined the result with our data.
Solution Details: Transforming our data with the String to Date&Time node allowed us to use operational nodes like Date&Time Difference. The key to this challenge was to make use of the Lag Column node to avoid loops. The Lag Column node generates a duplicate column that is off by one or more rows, which allows for direct comparison between a column's data and its previous values. That is, we were able to calculate the difference between visits by comparing each date seen with the previous date seen. Not only that, but the lagged column can also pinpoint where the patient changes, using a Rule Engine node that simply compares the current row with the lagged row. While some participants used another programming language for the network-day calculation, we found a component on the KNIME Forum with all the mathematical rules for calculating network days. This component used several nodes, such as Date&Time Difference, Math Formula, and Rule Engine. To keep patient data from being mixed, we then used another Rule Engine node to remove differences calculated across patients. With the GroupBy node, we were then able to find the mean network days per patient. Finally, we joined the grouped data with our original data and used the Column Filter node to show only the relevant columns in the final output.
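NumPy ships a business-day counter that compactly reproduces the numbers in the example table; the sketch below follows the Excel NETWORKDAYS convention of counting both endpoints (adding one day to the end date), which is what matches those values, and uses a per-patient shift as the lag:

```python
import numpy as np
import pandas as pd

# Visit dates for one patient, dd/mm/yyyy as in the example table.
df = pd.DataFrame({
    "patient": ["Aline"] * 4,
    "date": pd.to_datetime(
        ["11/01/2022", "12/02/2022", "25/02/2022", "15/04/2022"],
        dayfirst=True),
})

# Lag the date within each patient (the Lag Column step) ...
df["prev"] = df.groupby("patient")["date"].shift(1)
mask = df["prev"].notna()

# ... then count weekdays between consecutive visits. busday_count excludes
# the end date, so adding one day makes both endpoints inclusive.
df.loc[mask, "network_days"] = np.busday_count(
    df.loc[mask, "prev"].values.astype("datetime64[D]"),
    (df.loc[mask, "date"] + pd.Timedelta(days=1)).values.astype("datetime64[D]"),
)
print(df["network_days"].dropna().tolist())  # [24.0, 10.0, 36.0]
```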
Description: A survey was done to ask employees to rate themselves on a scale of 1-5 on different technologies and skills. We would like you to analyze this data and present it in a list where we are providing primary, secondary, and tertiary skills. Primary skills can be defined as skills rated 5; secondary skills rated 3 and 4; and tertiary skills rated 2. The final list should appear as it does on the KNIME forum where this challenge was suggested. Note: There may be cases in which employees have not rated themselves more than 3 on any of the skills (i.e., they have no primary skills).
Dataset: Skill Survey Data in the KNIME Hub
Solution Summary: We started our solution by restructuring the input dataset, turning the skill columns into values of a single column. Next, we filtered the data to only keep the primary skills (rated 5) for each employee. We did something similar for secondary (rated 3 or 4) and tertiary skills, and then joined this information at the end. This led to a dataset in which it is possible to quickly see all primary, secondary, and tertiary skills of each employee.
Solution Details: After reading the challenge’s dataset with the Excel Reader node, we used the Unpivoting node to turn each skill column into a value of a single column. This facilitated the grouping of skills per employee. To perform this grouping for primary skills, for example, we used (1) the Range Filter node to only keep the employees’ skills that were rated 5, (2) the GroupBy node to concatenate all employees’ primary skills into a single row, and then (3) the Column Rename node to rename the concatenated column as “Primary Skills”. We used the same sequence of nodes for secondary and tertiary skills, ending up with columns “Secondary Skills” and “Tertiary Skills” respectively. We then used two Joiner nodes to join all skill tables into a single dataset, such that all primary, secondary, and tertiary skills of each employee are listed in a single row.
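The unpivot-filter-concatenate pattern translates directly to pandas; the survey rows below are invented, and the real file's columns differ:

```python
import pandas as pd

# Hypothetical survey answers; real column names differ.
wide = pd.DataFrame({
    "employee": ["Ana", "Ben"],
    "Python": [5, 3],
    "SQL": [4, 2],
    "Excel": [2, 5],
})

# Unpivot skill columns into rows (the Unpivoting step).
long = wide.melt(id_vars="employee", var_name="skill", value_name="rating")

def skills_for(ratings, label):
    """Filter by rating tier and concatenate skills per employee."""
    sub = long[long["rating"].isin(ratings)]
    return sub.groupby("employee")["skill"].agg(", ".join).rename(label)

primary = skills_for({5}, "Primary Skills")
secondary = skills_for({3, 4}, "Secondary Skills")
tertiary = skills_for({2}, "Tertiary Skills")

# Join the three tables; an employee may lack a tier entirely (NaN).
result = pd.concat([primary, secondary, tertiary], axis=1)
print(result)
```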
Description: A common problem in mechanical and medical data is associating notes with a category. In this challenge, you will automate the process of sorting mechanical notes into their correct category. Here’s an example:
--List of Categories--
1. The product was defective.
2. A crack was caused by client.
3. Many scratches noted.
--Categorized Notes (Expected Output)--
1. The product was defective. Defect
2. A crack was caused by client. Crack
3. Many scratches noted. Scratch
Don't worry about using fancy machine learning or natural language processing models. This problem can be handled reasonably well using a total of 5 nodes (simple solution), and a more refined solution involves just 8 nodes (complex solution). Also don't worry about getting 100% accuracy.
Author: Victor Palacios
Solution Summary: After reading the inspection notes and the categories, a simple solution consisted of running a similarity search between categories and inspection notes to find which of the former best corresponded to each of the latter. A more complex solution involved lowercasing both categories and notes to improve matching, and then running a regular expression matcher to find all categories that correspond to each note (instead of just one category). Note: We did not implement any spellchecking in our solution, which would further increase matching quality.
Solution Details: (Simple solution) After reading the inspection notes and categories with two Excel Reader nodes, we used the Similarity Search node to find the category that best matched each note. Next, we used the Joiner node to actually perform the match between most similar categories and notes, and then used the Column Filter node to remove temporary columns. (Complex Solution) We started by reading and lowercasing the inspection notes and categories with two Excel Reader and two String Manipulation nodes. Note that lowercasing both inputs was a cheap way of finding more matches between categories and notes. Next, we used the GroupBy node to create a concatenated regular expression containing all categories, and used the Table Row to Variable node to convert this regular expression into a variable. We then used the Regex Find All shared component to find all categories that corresponded to each inspection note, and finally used the Split Collection Column node to put each matched category into a separate column.
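The complex solution's regex idea can be sketched in Python; the category stems and notes below are illustrative, not the challenge's actual data:

```python
import re

categories = ["defect", "crack", "scratch"]          # already lowercased stems
notes = ["The product was defective.",
         "A crack was caused by client.",
         "Many scratches noted."]

# Concatenate all categories into one alternation regex (the GroupBy +
# Table Row to Variable steps), then find every matching category per note.
pattern = re.compile("|".join(categories), flags=re.IGNORECASE)

for note in notes:
    print(note, "->", pattern.findall(note))
```

Matching stems rather than full sentences is what lets "defective" and "scratches" hit their categories without any machine learning.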
Description: Housing prices in America are a bit out of control. In this challenge, you will see how housing prices have changed through time. Here are the tasks for this challenge:
1. Read the dataset for this challenge directly from its link with the CSV Reader node, without downloading any files to your local machine.
2. Using the monthly data, calculate yearly data and visualize the 3 RegionNames with the most drastic changes. Note: Consider the year the data was collected until the most recent year to calculate change; feel free to ignore missing values.
3. Find the state (regardless of region) with the lowest average home prices currently.
4. According to Wikipedia, "The bankruptcy of Lehman Brothers on September 15, 2008 was the climax of the subprime mortgage crisis." What visualizations can show the effect this had on the housing market? We used our answer to question 2 to keep our visualizations clutter-free.
Author: Victor Palacios
Solution Summary: After reading the housing prices dataset, and reshaping it such that each row had a region, price value, and date, we created three components to answer questions 2 to 4. To answer question 2, we grouped the dataset by region and calculated the mean housing price per year. Next, we looped over the regions getting the housing price differences between the first and the last years in the dataset. We then selected the 3 regions with the most drastic changes and plotted them with a bar chart. To answer question 3, we got the mean housing price per state in the last year and then selected the state that currently has the lowest average housing price. Finally, to answer question 4 we got the average housing price per year for the 3 regions that changed the most (output of question 2), and then plotted the prices as line plots.
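A pandas sketch of the question-2 logic — yearly means, then the change between the first and last year per region; region names, dates, and prices are made up:

```python
import pandas as pd

# Hypothetical long-format data, as after reshaping: one row per
# region, date, and price.
df = pd.DataFrame({
    "region": ["A"] * 4 + ["B"] * 4,
    "date": pd.to_datetime(["2000-01-31", "2000-02-29",
                            "2020-01-31", "2020-02-29"] * 2),
    "price": [100, 110, 400, 420, 100, 100, 150, 160],
})

# Mean housing price per region and year.
df["year"] = df["date"].dt.year
yearly = (df.groupby(["region", "year"])["price"].mean()
            .reset_index().sort_values("year"))

# Change between the first and the last year in the data, per region.
deltas = (yearly.groupby("region")["price"].last()
          - yearly.groupby("region")["price"].first()).sort_values(ascending=False)
print(deltas.head(3))  # the 3 regions with the most drastic changes
```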
Description: You were asked to split a single sales CSV file into smaller ones based on groups, named according to the group names. As an example, if the original file had the following data:
You would generate two files: one named Group_a, and one named Group_b. They would have the following structure:
Your solution to this task should be generic — that is, it should work for any number of groups, and the names of the groups should not matter.
Author: Victor Palacios
Solution Summary: We addressed this challenge by looping over the sales groups and writing a smaller CSV file with all the sales information associated with each group.
Solution Details: After reading the sales file with the CSV Reader node, we used the Group Loop Start node to start iterating over the groups. Next, we used the Table Row to Variable node to get each group's name, one at a time, as a flow variable. We then used the String Manipulation (Variable) node to create a filename containing the group name. Next, we used the String to Path (Variable) node to create a relative path location for each filename, and finally wrote the smaller CSV files to those locations. We closed our loop with the Variable Loop End node.
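The same split-by-group logic, sketched with pandas (writing into a temporary directory here instead of the workflow's relative paths; the sales table is invented):

```python
import tempfile
from pathlib import Path
import pandas as pd

# Hypothetical sales table; the real file would be read from CSV.
sales = pd.DataFrame({"group": ["a", "a", "b"],
                      "amount": [10, 20, 5]})

out_dir = Path(tempfile.mkdtemp())

# One CSV per group, named after the group (the Group Loop Start idea):
# groupby yields each group name together with its rows.
for name, chunk in sales.groupby("group"):
    chunk.to_csv(out_dir / f"Group_{name}.csv", index=False)

print(sorted(p.name for p in out_dir.iterdir()))  # ['Group_a.csv', 'Group_b.csv']
```

Because the filenames come from the group values themselves, this works for any number of groups with any names.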
Description: Given a text-based PDF document with a table, can you partially extract the table into a KNIME data table for further analysis? For this challenge we will extract the table from this PDF document and attempt to partially reconstruct it within KNIME. The corresponding KNIME table should contain the following columns: Day, Max, Min, Norm, Depart, Heat, and Cool. Note 1: Your final output should be a table, not a single row with all the relevant data. Note 2: The Tika Parser node is better suited for this task than the PDF Parser node. We completed this task without components, regular expressions, or code-snippet nodes. In fact, our solution has a total of 10 nodes, but labeling the columns required a bit of manual effort.
Author: Victor Palacios
Solution Summary: After extracting the content of the PDF with one of KNIME’s text-parsing functionalities, we processed it in order to get the table data in the right format (cell by cell, with the right number of columns and rows) and to extract the table’s column names. We then re-formatted the column names, merged them with the table’s body, and removed the columns that were not specified in the challenge.
Solution Details: We started our solution by using the Tika Parser node to read the given PDF file. Next, we used the Cell Splitter node to split the content at newlines, which already helped separate the table’s rows. Since the output of the Cell Splitter node is a table with a single, very long row (with each cell corresponding to a line), we used the Transpose node to turn each column into a row, facilitating the postprocessing. Next, we used another instance of the Cell Splitter node to further split the data by space, further separating the cells that are present in the PDF table. This processing led to a table with messy Row IDs, so we used the RowID node to reset them. Next, we used the Row Filter node to isolate those rows that corresponded to the table in the PDF, and then used the Extract Table Spec node to get the table’s column names. Finally, we used the Insert Column Header node to combine the output of the Row Filter node with the fixed column names (copied and pasted within an instance of the Table Creator node), and then used the Column Filter node to remove the columns that were not specified in the challenge.
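The split-at-newlines-then-split-at-spaces idea looks like this in plain Python; the text snippet stands in for the Tika Parser output, and real climate-report lines are messier:

```python
# Stand-in for the text extracted from the PDF; the column names are the
# ones requested by the challenge, assigned from the header line.
text = """Day Max Min Norm Depart Heat Cool
1 40 26 33 0 32 0
2 38 25 31 -1 34 0"""

lines = text.splitlines()                 # the first Cell Splitter (newlines)
header = lines[0].split()                 # column names
rows = [dict(zip(header, line.split()))   # the second Cell Splitter (spaces)
        for line in lines[1:]]
print(rows[0])
```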
Description: You are working with audio recognition data tables such that each table’s initial and final rows contain zeroes. These beginning and ending zeroes are irrelevant noise that must be removed. Here is an example of such a data table:
Your goal is to remove only these unnecessary starting and trailing zeroes. You must keep the zeroes in the middle of the data, which do not constitute noise. Also, the position of the starting and trailing zeroes differs per data table, so a good solution probably requires flow variables for flexibility (although there may be other solutions that do not involve flow variables).
Author: Victor Palacios
Solution Summary: Our solution to this challenge was split into two parallel steps: first, we identified the RowID number of the first non-zero entry in the data; in a parallel branch of the workflow, we reversed the RowID numbers to find the one that corresponded to the last non-zero entry in the data. These two special RowID numbers, which were captured by two flow variables, were used to remove the unnecessary starting and trailing zeroes in the data.
Solution Details: After reading the example data for the challenge with the CSV Reader node, we used the String Manipulation node to create a temporary column named row_number, which contained the RowID numbers, and the String to Number node to convert the values in row_number into integers. This column came in handy later on in the solution, making the detection of rows with starting and trailing zeroes a bit more convenient. After this preprocessing, we used the Row Filter node to remove all rows whose data entries corresponded to zero, just to facilitate the detection of the first RowID number that was linked to a non-zero data entry. To perform this detection we used the Table Row to Variable node, which stored this RowID number into a flow variable. This flow variable and the output of the String to Number node were then passed to another instance of the Row Filter node, which combined the information to remove all data entries with starting zeroes. In parallel, the data with no zero entries (output of the first instance of the Row Filter node) was sorted to reverse the RowID numbers. Next, a second combo of Table Row to Variable and Row Filter nodes was used to identify the last RowID number tied to a non-zero data entry, and to remove all trailing zeroes from the data. Finally, we used the Column Filter node to remove the temporary row_number column.
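Outside KNIME, the same trimming idea (drop leading and trailing zeroes, keep interior ones) can be sketched in Python:

```python
def trim_zeros(values):
    """Remove leading and trailing zeroes, keeping interior zeroes intact."""
    nonzero = [i for i, v in enumerate(values) if v != 0]
    if not nonzero:          # all-zero input: nothing but noise
        return []
    # Slice from the first to the last non-zero entry, inclusive.
    return values[nonzero[0]:nonzero[-1] + 1]

print(trim_zeros([0, 0, 5, 0, 3, 0, 0]))  # → [5, 0, 3]
```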
Description: Your company keeps data related to online and onsite transactions in a tabular dataset with the following format:
Index  Online  Onsite
7      F5454   7736-01
In this challenge, you are asked to extract digits from the transactions (which are related to the bought products) given the following guidelines: (1) if the onsite transaction starts with “L”, then take its first 12 digits; otherwise, take its first 6 digits; and (2) if the onsite transaction has a missing value, then take the string from the online transaction.
What is the most efficient way to perform this task? For the example above, you should produce the following output column:
Author: Victor Palacios
Dataset: For this challenge, create the dataset illustrated above as input with the Table Creator node.
Solution Summary: To tackle this challenge we started by handling missing values, which helped determine whether column Online or Onsite should be used. Next, we implemented rules to extract the right number of digits from each onsite transaction. We generated a few temporary columns in the process, which we then removed at the end of our solution.
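For reference, the extraction rules can be sketched in Python. The helper below is illustrative only, and assumes that "digits" means the digit characters of the string:

```python
def extract_code(online, onsite):
    """Apply the challenge's rules: missing onsite → use the online string;
    onsite starting with 'L' → first 12 digits; otherwise → first 6 digits."""
    if onsite is None:                       # rule (2)
        return online
    digits = "".join(ch for ch in onsite if ch.isdigit())
    n = 12 if onsite.startswith("L") else 6  # rule (1)
    return digits[:n]

print(extract_code("F5454", "7736-01"))  # → 773601
```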
Description: You have been working for a Life Sciences company for a month as a data wrangler. Several coworkers from the Biology department would like to obtain a list of human genes related to specific hormones, but they do not know how to use REST services, GET requests, etc. Your task is to use the REST service provided by MyGene.info to obtain a list of human genes related to a list of hormones provided to you by your coworkers. Next, you should parse the JSON response into a table that is easy to read.
For example, if you use "http://mygene.info/v3/query?q=summary:" and append "insulin", then your request would return a JSON structure with 10 hits -- each one of them with the following fields: "_id", "_score", "entrezgene", "name", "symbol", and "taxid".
You should then parse this JSON into a table with columns "_id", "_score", "entrezgene", "name", "symbol", and "taxid". If the list provided by your coworkers contains more than one hormone, all the parsed information should be aggregated into a single table. Also, sometimes your request may return a response in XML instead of JSON. How could you include a way to also parse XML responses?
Need a tip or two? See our YouTube video on REST APIs.
Author: Victor Palacios
Solution Summary: To tackle this challenge, we started by reading a list of hormones and then, for each hormone, we formatted and executed a GET request. The user should then inspect the result, determine whether the responses were in XML or JSON, and then execute a component we created to control the execution of the rest of the workflow. We then implemented solutions to parse responses of both types (XML and JSON).
Solution Details: We started our solution by using the CSV Reader node to read the list of hormones. Next, we used the String Manipulation node to create a GET request URL per hormone, and then the GET Request node to execute them. We then created a component named Pick Body Type (JSON or XML) so that the user could control what parsing solution should be used over the GET response. If the user types 0, the JSON parsing is invoked; if 1, the XML parsing. For the JSON parsing, we used the JSON Path node followed by the Ungroup node to get row-wise records. For the XML parsing, we used the JSON to XML node to simulate an XML response (none of the hormones in the dataset returned XML) followed by the XPath node to directly obtain row-wise records.
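The request-and-parse logic can also be sketched in Python. The snippet only assumes the query endpoint and fields mentioned in the challenge; `query_gene` needs network access, while `parse_hits` works on any response text:

```python
import json
import urllib.request

BASE = "http://mygene.info/v3/query?q=summary:"
FIELDS = ["_id", "_score", "entrezgene", "name", "symbol", "taxid"]

def parse_hits(response_text):
    """Flatten the 'hits' array of a MyGene.info JSON response into rows."""
    hits = json.loads(response_text).get("hits", [])
    return [{f: hit.get(f) for f in FIELDS} for hit in hits]

def query_gene(hormone):
    """Execute the GET request for one hormone (requires network access)."""
    with urllib.request.urlopen(BASE + hormone) as resp:
        return parse_hits(resp.read().decode())
```

Aggregating several hormones is then just a matter of concatenating the row lists returned by `query_gene`.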
Description: Your support team would like to get some insight into the quality of their service and they ask you to analyze their support data. As a first step, they would like to find out how the time to answer a support ticket is distributed, and what the mean of this distribution is. Help your support team by using the dataset included in this challenge to (1) plot this time distribution as a histogram, and to (2) calculate its mean.
Author: Kathrin Melcher
Solution Summary: Our solution to this challenge was pretty straightforward. After reading the ticket data, we calculated the difference between the creation and the closing timestamps for each ticket (in minutes). Next, we computed the mean time and the time distribution to close a ticket focusing on those tickets that were already closed.
Solution Details: We started by reading the data with the Table Reader node. Next, in order to compute time differences, we used the String to Date&Time node to convert creation and closing timestamps into Date&Time cells. We then used the Date&Time Difference node to calculate the time that it took to close each ticket — naturally, tickets that were still open have this value undefined. Since our focus was on tickets that have this value defined (closed tickets), we used the Row Splitter node to separate them. Next, we sent the closed tickets to the Math Formula and the Histogram nodes to calculate the mean time and the time distribution to close tickets respectively.
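The underlying computation (time difference per closed ticket, then the mean) can be sketched in Python; the timestamps below are invented:

```python
from datetime import datetime
from statistics import mean

# Hypothetical ticket data: (created, closed) timestamps; None = still open.
tickets = [
    ("2022-01-01 09:00", "2022-01-01 09:45"),
    ("2022-01-01 10:00", "2022-01-01 10:30"),
    ("2022-01-02 08:00", None),               # open ticket, excluded
]

fmt = "%Y-%m-%d %H:%M"
# Minutes to close, computed only for tickets that are already closed.
minutes = [
    (datetime.strptime(closed, fmt) - datetime.strptime(created, fmt)).total_seconds() / 60
    for created, closed in tickets if closed is not None
]
print(mean(minutes))  # → 37.5
```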
Description: Your coworker turns to you and says that she is going to retire. You laugh because she is 30 years old. She is serious. To understand how she got to this decision, you will create a KNIME component named “Financial Tracker_YOURNAME” (replace YOURNAME with your name). The component should use widgets to get the following input:
a person's monthly expenditure amount
their target age to retire
The output of the component should be how much money they need to have in order to retire at the target age. For simplicity, use this formula in your component:
amount_to_retire = (100 - target_age) * monthly_expenditure_amount * 12
Use your component to figure out if 2,000,000 dollars is enough for your coworker to retire, given that she spends 4,000 dollars per month. To keep this challenge simple, do not consider inflation, compounding interest, or part-time work in retirement.
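As a quick sanity check, the formula is easy to evaluate in Python (the target age of 65 below is purely illustrative; the challenge leaves it up to the user):

```python
def amount_to_retire(target_age, monthly_expenditure):
    """Money needed to retire at target_age, per the challenge's formula."""
    return (100 - target_age) * monthly_expenditure * 12

# Spending 4,000/month and retiring at 65:
print(amount_to_retire(65, 4000))  # → 1680000
```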
Are you interested in how we came up with this formula? Check the Trinity study out. In this study, participants needed roughly 25 times whatever they spent yearly to survive for 30 years with a 95% success rate.
Author: Victor Palacios
Solution Summary: To solve this challenge, we created a component in which we first use widgets to capture monthly expense and target age for retirement. Next, we calculated the amount of money required to retire at the target age, given the monthly expense, following the formula given in the challenge. We used another widget to show the calculated value, and also added a “Recalculate” option for users to explore different retirement setups.
Solution Details: We started our solution (component Financial Tracker_KNIME) by using two Integer Widget nodes that capture monthly expense and target age for retirement. Through an interactive interface, users can input values for these two variables. These values are then sent to the Math Formula (Variable) node to calculate the amount of money required for retirement. The output of this node is then sent to a Text Output Widget node, which shows the calculated amount of money in an interactive view. To allow for recalculations, we also used the Refresh Button Widget node in this component.
Description: You would like to post a question on the KNIME forum, but you have confidential data that you cannot share. In this challenge you will create a workflow which removes (or transforms) any columns that reveal anything confidential in your data (such as location, name, gender, etc.). After that, you should shuffle the remaining columns' rows such that each numeric column maintains its original statistical distribution but does not have a relationship with any other column. Rename these columns as well, such that in the end of your workflow they do not have any specific meaning. Let's see an example:
Row  Name    Fav_Num  Muscle_Mass
0    Victor  7        10
1    Aline   3        20
2    Scott   42       30
After anonymization, the table could look like this:
Row  column  column (#1)
0    3       30
1    42      10
2    7       20
Feel free to see our resources on data anonymization for inspiration, but note that the task here is much simpler! For reference, our solution only uses 7 nodes to anonymize the data and 3 additional nodes to make sure the data truly was anonymized.
Author: Victor Palacios
Solution Summary: After reading the FIFA dataset, we removed all columns of type string because they may contain personally identifying information. Next, we looped over the dataset’s columns, shuffling them one by one in a pseudo-randomized fashion. We then gathered the shuffled columns into a single table and anonymized its column identifiers by replacing them with meaningless names. We also compared the data before and after shuffling/anonymizing to make sure that the anonymization worked.
Solution Details: We started our solution by reading the FIFA dataset with the CSV Reader node, followed by the removal of all string columns with the Column Filter node. The goal here was to remove any information that could help users identify the players. Next, we used the Column List Loop Start node to begin looping over the dataset’s columns. Within the loop, we used the Table Row to Variable node to get the name of the column that was currently being processed, and then we sent it to the Target Shuffling node. At this point of the loop, the column’s values were randomly shuffled. We finished the loop with the Loop End (Column Append) node, which aggregates all the shuffled columns into a single table. To finish the anonymization, we used the Column Rename (Regex) node to replace the columns’ identifiers with meaningless names. To make sure that the anonymization worked (that is, that the order of values in the columns was completely and randomly changed in the process), we sent the data before and after anonymization to two Row Filter nodes. Here, we extracted their first columns and compared their values with the Table Difference Finder node. The output of this node shows how the original value order was destroyed in the data processing.
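The shuffling idea itself is simple; here is a hedged Python sketch of the same per-column shuffle with meaningless column names (the data is the toy example from the description, and the naming scheme mimics the one shown there):

```python
import random

def anonymize(table):
    """Shuffle each column independently and strip the column names."""
    shuffled = {}
    for idx, (name, values) in enumerate(table.items()):
        col = values[:]
        random.shuffle(col)  # destroys row-wise relationships between columns
        new_name = "column" if idx == 0 else f"column (#{idx})"
        shuffled[new_name] = col
    return shuffled

data = {"Fav_Num": [7, 3, 42], "Muscle_Mass": [10, 20, 30]}
out = anonymize(data)
print(sorted(out["column"]))  # → [3, 7, 42]  (same distribution, new order)
```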
See our solution in the KNIME Hub
Description: The challenge requires you to recreate the Wordle bar chart for three players. If you have never played Wordle, check this brief introduction to the game. In this challenge, you are expected to do the following:
For the bar chart, each possible number of guesses (1 through 6) must be represented, as well as the number of victories by each player. Check this example with just a single player's data.
Transform the number of guesses [1, 6] into a score [1, 10], but remember that a higher number of guesses should result in a lower score. A missing guess means that the player did not manage to guess the word and should therefore receive a score of 0.
Next, calculate the average of the three players. Who has the best average?
Now consider the difficulty of the words. Assign a weight W to the words in the list of the 306 most difficult words in the English language (file “difficult words.txt”), and recalculate the average of the three players for W=2 and for W=0. Has the player with the best score changed?
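One possible reading of the scoring scheme, a linear rescaling of guesses onto [1, 10], can be sketched in Python; both the rescaling choice and the weighting helper are assumptions for illustration, not the official solution:

```python
def score(guesses):
    """Map a guess count in [1, 6] to a score in [1, 10]; missing → 0.
    A linear rescaling (1 → 10, 6 → 1) is assumed here."""
    if guesses is None:
        return 0.0
    return 10 - (guesses - 1) * 9 / 5

def weighted_average(results, difficult, w):
    """results: list of (word, guesses); difficult: set of hard words;
    hard words get weight w, all others weight 1."""
    weights = [w if word in difficult else 1 for word, _ in results]
    scores = [score(g) for _, g in results]
    return sum(s * wt for s, wt in zip(scores, weights)) / sum(weights)

print(score(1), score(6))  # → 10.0 1.0
```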
Author: Rosaria Silipo
Solution Summary: After reading the players’ data and the list of difficult words, we performed a left outer join between these two datasets using the words as keys. We used this joined data to plot the performance bar chart for the three users, unpivoting and pivoting the data on different columns first. We also used the joined data to compute the simple and the weighted average performance per player. In all cases, player 2 was the winner.
Solution Details: To tackle this challenge, we started by reading the players’ data with the Excel Reader node and the word difficulty data with the CSV Reader node. In both cases, we used the Constant Value Column node to set a standard weight (weight = 1) for words in the players’ data, and to set a specific weight for the difficult words (which can be weight = 0 or weight = 2). After lowercasing the words in the players’ data with the String Manipulation node, we left-outer-joined both datasets (Joiner node) using the words as keys. We started the plotting part of our solution by unpivoting the joined dataset on columns player 1, player 2, and player 3 (Unpivoting node). This gave us the numbers of word attempts per player, and we removed the rows with attempts missing with the Missing Value node. We then re-pivoted the data per player and per number of attempts, ending up with how many words each player guessed right per number of attempts (Pivoting node). After some post-processing, which included replacing missing values in the pivoted table with 0, and setting different colors for each player, we plotted their performances in a bar chart (component Bar Chart in green). In parallel, we also used the joined data to compute the simple and the weighted average performance per player. First, we rescaled their numbers of word attempts to interval [0, 10], such that 0 and 10 were the worst and best scores respectively (component Rescale). Next, we executed the GroupBy node over the rescaled data to calculate the average player performance. Inside component Weighted Average, we processed the rescaled data a bit to use the correct weights depending on the difficulty of the words, and also used the GroupBy node to obtain the weighted average performance. In all cases, player 2 was the winner. Note: Execute the workflow with weight = 0 and weight = 2 for difficult words to obtain both required weighted averages.
We did not add a configuration widget to parametrize weights to keep the solution simple.
Description: In this challenge you will create one or more visualizations to inspect the percentage share of seats held by women in local government, as reported by the United Nations Development Programme: Human Development Reports. The purpose of this report is to “ensure women's full and effective participation and equal opportunities for leadership at all levels of decision-making in political, economic and public life.” Since this is a real dataset there will be missing values, and we expect you to augment the data via your own creativity. In particular, we ask that you not only report country-specific data but also continent data. The true challenge here will be how you transform this incomplete dataset into one or more visualizations, and which visualizations you deem appropriate for this challenge. The challenge outline is as follows:
Download the dataset and decide the most efficient way to bring it into KNIME
Add a continent column using whatever method you think is most efficient
Visualize the data (the KNIME solution will visualize both country and continent related data)
Author: Victor Palacios
Solution Summary: To tackle this challenge, we gathered data on continents and their countries and joined it with the dataset of the challenge, ending up with a dataset that had (1) a country, (2) its continent, and (3) the share of seats held by women per row. Next, we aggregated this dataset to visualize the participation of women in local governments per continent, with a bar chart. Finally, we used the original dataset of the challenge to visualize the participation of women in local governments per country with a choropleth map.
Solution Details: We started our solution by copying and pasting the dataset of the challenge into KNIME with the Table Creator node. The dataset was then immediately fed into the Enhanced Choropleth Map component, exposing how much women participate in local governments country-by-country. Next, we used the CSV Reader node to read a dataset that listed every country in each continent, and joined this data with the initial one with the Joiner node (the join keys were the countries). After that, we used the GroupBy node to aggregate the resulting dataset per continent, extracting the corresponding median share of seats held by women in government. We decided to use the median instead of the mean because the former is more robust against outliers, leading to a more faithful representation of how women participate in governments in each continent. We then post-processed the aggregated data by sorting it in descending order with the Sorter node, and by removing groups with missing values with the Nominal Value Row Filter node. These steps were necessary to generate an adequate bar chart visualization with the Bar Chart node, indicating the share of women participation in government per continent.
Description: A research group is running a study to understand how news consumption has changed lately. They run a survey on 3,000 people, asking where they usually read their news. Participants can indicate up to three options among “Social Media”, “Online Newspapers”, “Printed Newspapers”, and “Radio or Television”, recorded with a numeric code in a CSV file. The respective descriptive names are provided in a small table. Your task is (1) to replace the codes in the survey data with their corresponding descriptive names, and (2) to create an interactive dashboard that shows how many times each option has been selected. The dashboard should allow users to filter for the options they are interested in. Note: Feel free to use this challenge as an opportunity to explore different visualization nodes!
Author: Emilio Silvestri
Solution Summary: We started by reading the survey data and the code dictionary. Next, we processed and grouped the survey data to get how many times each news source was mentioned. We then replaced the codes in this grouped data with their corresponding descriptive names, sorted it based on news source popularity, and moved on to the visualizations. To this end, we used a component to create a dashboard that allows users to select which news sources they want to visualize, and then plotted the sources' frequencies as a donut chart and as a bar chart. Note: Both charts basically convey the same type of information and are redundant. We used them here for didactic purposes, making the dashboard a bit more complex from an implementation viewpoint but still simple as a baseline solution.
Solution Details: We started by reading the survey data with the CSV Reader node, and the code dictionary with the Table Reader node. In order to get how many times each news source was cited in the survey, we started by transforming the string column Source into a list with the Cell Splitter node. Next, we used the Ungroup node to create one row per entry in the list, and then grouped the rows by source type with the GroupBy node. We then used the Cell Replacer node to replace the codes in the grouped data with their corresponding descriptive names, and finally sorted this data based on news source popularity (ascending order). After these steps, we created a dashboard to filter and visualize the news sources' frequencies. To filter the news source options in the dashboard, we started by creating a list of all the different sources with the GroupBy node. Next, we transformed this list into a flow variable with the Table Row to Variable node and fed it into the Multiple Selection Widget node, which used the list as options for visualization. The selection list output by this node was then joined with the news sources' frequencies (input of the dashboard) using the Joiner node. We then plotted the frequencies for the selected news sources using the Pie/Donut Chart node, after making the chart colorblind-friendly with the Color Manager node. Finally, we also plotted these frequencies with the Bar Chart node for didactic purposes.
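The counting step (split the codes, count them, map codes to names) can be sketched in Python with invented survey codes:

```python
from collections import Counter

# Hypothetical survey rows: each respondent lists up to three source codes.
responses = ["1,2", "2", "1,3,4", "2,4"]
names = {"1": "Social Media", "2": "Online Newspapers",
         "3": "Printed Newspapers", "4": "Radio or Television"}

# Split each row into individual codes, then count by descriptive name.
codes = [c for row in responses for c in row.split(",")]
counts = Counter(names[c] for c in codes)
print(counts["Online Newspapers"])  # → 3
```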
Description: In the accounting firm you work for, you are given contracts which were executed on different dates. Your goal is to create a method that can label each of these contracts with their corresponding fiscal year. Note: A fiscal year is a period of 12 months that is used in government accounting, usually for budget purposes and financial reporting. Its definition varies from country to country, so your solution should be flexible and include flow variables for both the start and the end dates of the fiscal year. As an example, the federal fiscal year in the United States is the 12-month period that begins on October 1st and ends on September 30th of the following year.
Author: Victor Palacios
Solution Summary: We started by reading the contract dates and the lookup tables with the defined fiscal years. Next, for each fiscal year, we selected those contracts whose dates fell within it and labeled them accordingly. The process of identifying which contracts fall within a given fiscal year was made flexible with the use of flow variables and the Date&Time-based Row Filter node. After all contract dates were labeled with their fiscal years, we sorted them by date to facilitate the understanding of the data.
Solution Details: We started by reading the contract dates and the lookup tables with the defined fiscal years with two Excel Reader nodes. Next, we converted the contract dates to Date&Time with the String to Date&Time node to facilitate the identification of contracts tied to a given fiscal year downstream. We then started looping over the fiscal years, generating flow variables for their start and end dates, with the Table Row to Variable Loop Start node. Inside the loop, these flow variables were used to select the contracts whose dates fell within their range. To this end, we used the Date&Time-based Row Filter node. Next, the Constant Value Column node was used inside the loop to label all selected contract dates with the current fiscal year label. We closed the loop with the Loop End node, which collected and concatenated the selected contracts at every iteration. We then finished the workflow by sorting the labeled data by contract date, facilitating interpretability. Note: In this solution, contracts that do not fall within the range of any fiscal year get filtered out.
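The date-range check behind the labeling can be sketched in Python. The convention below, that a fiscal year is named after the calendar year in which it ends, is an assumption for illustration:

```python
from datetime import date

def fiscal_year(d, start_month=10, start_day=1):
    """Label a date with its fiscal year; the US federal FY starts Oct 1.
    Dates on or after the start date belong to the FY ending next year."""
    if (d.month, d.day) >= (start_month, start_day):
        return d.year + 1
    return d.year

print(fiscal_year(date(2021, 11, 15)))  # → 2022
print(fiscal_year(date(2022, 3, 1)))    # → 2022
```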
Description: You are using KNIME to monitor the daily price of a product online. After using the Line Plot node to visualize the daily prices you have already gathered, you notice that they are often constant for a certain number of days before changing again. You want to create a new column in the price data you have at hand, named "Change", such that its value is 1 if a daily price changed with respect to the previous day, or 0 if it remained unchanged. For the first daily price in the data, the "Change" value should be 1. As an example, if the initial daily prices look like:
You should end up with data in the following format:
Date        Price  Change
2015-01-01  10     1
2015-01-02  10     0
2015-01-03  11     1
Author: Emilio Silvestri
Dataset: Daily Prices in the KNIME Hub
Solution Summary: To create column "Change" according to the challenge's description, we first created a temporary column, named "Price(-1)", holding the price value for the previous day (that is, a column just like "Price", but with a 1-day lag). Next, we compared the lagged and current prices to determine whether there was a change in value or not, leading to the creation of column "Change". Finally, we removed the temporary column "Price(-1)" from the final dataset.
Solution Details: We started by reading the dataset with the CSV Reader node, and then plotted the values against the dates with the Line Plot node just to get an idea of what the data looks like. Next, we used the Lag Column node, which is the core of the solution, to create column "Price(-1)" based on column "Price": the former holds the values of the latter, but lagged by one day. With this new column at hand, we used the Rule Engine node to generate column "Change" by comparing the lagged values in "Price(-1)" with the current values in "Price". If the values were equal, the value in "Change" would be 0; otherwise, it would be 1. We then ended the workflow by removing the temporary column "Price(-1)" with the Column Filter node.
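The lag-and-compare logic is easy to sketch in plain Python:

```python
prices = [10, 10, 11, 11, 9]

# Compare each price with a 1-step lag; the first row has no predecessor
# and is flagged as a change by definition.
change = [1] + [int(p != prev) for prev, p in zip(prices, prices[1:])]
print(change)  # → [1, 0, 1, 0, 1]
```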
Description: You received the 2017 cancer data from the CDC for inspection, and your goal is to answer the following questions: (1) What are the top-5 most frequent cancer types occurring in females? (2) What are the top-5 most frequent cancer types occurring in males? (3) Which US state has the highest cancer incidence rate (that is, the highest number of cancer cases normalized by the size of its population)?
Author: Janina Mothes
Solution Summary: To find the top-5 most frequent cancer types occurring in (fe)males in the US in 2017, we preprocessed the data to remove aggregated cancer sites that could lead to wrong counts, grouped the data by cancer type, and pivoted by sex. Next, we sorted the data to find the top-5 most frequent cancer types. As for the highest normalized incidence of cancer, we grouped the CDC data by state, read the states population data, and then joined these two datasets on state. We then normalized the number of cancer cases per state by the corresponding population, and sorted the resulting data to find the state with the largest incidence.
Solution Details: We started by reading the CDC data with the CSV Reader node, and noticed that a few cancer sites were aggregated (e.g., "All Invasive Cancer Sites Combined"). Since this could lead to a less refined understanding of what cancer types were most prominent, we filtered them out with a regular expression in the Row Filter node. For simplicity, we did not explore cancer site codes in our solution. Next, we used the Pivoting node to group the data by cancer sites and to pivot it by sex. This generated a table with a breakdown of cancer types and frequencies per sex. To answer the first question, we fed this table to the Sorter node and sorted it in descending order with respect to column Female+Sum(Count), which was previously created by the Pivoting node. We then used the Column Filter node, followed by the Row Filter node, to get the top-5 rows of the resulting table. This corresponds to the most common cancer sites (potentially including some aggregated sites) in females. The solution to the second question is very similar: after using the Pivoting node, we sorted the resulting table in descending order with respect to column Male+Sum(Count) with the Sorter node, and used the Column Filter and Row Filter nodes to get the top-5 most common cancers in males. To answer the third question, we used an Excel Reader node to read the 2017 US states population data, cleaned the names of the states with the String Manipulation node, aggregated the CDC data by state with the GroupBy node, used the Joiner node to combine the grouped CDC data with the US states population data, normalized the number of cancer cases per state by their population with the Math Formula node, and used the Sorter node to sort the resulting table in descending order with respect to the normalized cancer incidence. Finally, we used the Row Filter node to extract the first row of this sorted table, which corresponds to the US state with the highest normalized incidence of cancer. 
Note: Depending on how you clean and aggregate the data, you may obtain different results.
Description: A pharmaceutical company used to keep its sales data in a CSV file. They ask for your help to split the data into monthly CSV files and to save them into different subfolders, where each subfolder corresponds to a year. As an example, the data for January 2015 should be stored in a file named 2015/January.csv. In your solution, remember to save the files and subfolders in the workflow data folder.
Author: Emilio Silvestri
Datasets: Sales Data in the KNIME Hub
Solution Summary: This challenge required you to read the CSV file, extract the year and month of the entries, and then implement a loop that, by iterating through years and months, creates yearly subfolders and monthly sales files inside them.
Solution Details: We started by reading the CSV file using the CSV Reader node. Next, we used nodes String to Date&Time to convert the dates in the original file to the correct type, and Extract Date&Time Fields to extract the dates’ corresponding years and months. We then used the Group Loop Start node to iterate over groups of rows that had the same year and month. For each batch of rows, we (1) used nodes Number to String, Table Row to Variable, and String to Path (Variable) to convert their year into the string that is the name of their subfolder, (2) used node Create File/Folder Variables to create the file path corresponding to their month, inside the correct subfolder, and (3) wrote the batch of rows in a correctly located CSV file. We added the node Variable Loop End to collect the variables created in the loop. Note: remember to save the files and subfolders in your workflow data area.
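The grouping and folder layout can be sketched in Python; the rows and the `data` output folder below are invented stand-ins for the real sales file:

```python
import csv
from collections import defaultdict
from pathlib import Path

# Hypothetical sales rows: (date, amount).
rows = [("2015-01-05", 100), ("2015-01-20", 150), ("2016-03-02", 90)]
months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Group the rows by (year, month name).
groups = defaultdict(list)
for d, amount in rows:
    year, month = d.split("-")[:2]
    groups[(year, months[int(month) - 1])].append((d, amount))

# One subfolder per year, one CSV file per month, e.g. data/2015/January.csv.
for (year, month), batch in groups.items():
    folder = Path("data") / year
    folder.mkdir(parents=True, exist_ok=True)
    with open(folder / f"{month}.csv", "w", newline="") as f:
        csv.writer(f).writerows(batch)
```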
Description: You are asked to build a framework that allows users to interactively visualize datasets of images and manually exclude those pictures that are not of their interest. To test your implementation, you use a dataset containing images from The Simpsons and assume that only Marge Simpson’s pictures are of interest. How would you implement this framework and filter out every image that is not Marge’s?
Author: Emilio Silvestri
Solution Summary: This challenge required you to read the images in the dataset folder and to display them on a composite view to manually select and then filter out images that were not Marge's.
Solution Details: We started by reading the images using the Folder List node and the Image Reader node. Next, we rendered them to SVG format with the Renderer to Image node and displayed the images with the Tile View node. We then opened the interactive, composite view of the Tile View node and manually selected all images that were not Marge’s by clicking on their corresponding tile views. The final node is a Row Filter that excludes all data rows that were selected in the previous composite view, leading to a clean data table. Note: Remember to accept the changes when closing the composite tile view, otherwise the data selection will not be carried on to the next node. Tip: If you have a KNIME server, you can call this workflow from a web browser through the KNIME Server WebPortal. In the web browser, the composite view will show up as a web page and the data selection will be automatically exported into the underlying workflow.