
Just KNIME It!


Correcting Postal Addresses

Challenge 10

Level: Medium

Description: You work as a data analyst for a delivery company, and some packages were not delivered last week due to address typos. Thanks to the postal carriers, addresses that were not found due to typos were marked as such. Given a dataset with successful deliveries (due to no typos) and unsuccessful ones (due to typos), your goal is to automatically fix the incorrect addresses by leveraging the correct ones.

Author: Aline Bessa

Dataset: Postal Data on KNIME Community Hub

Solution Summary:
To tackle this challenge, we first separate the addresses that have typos from those that do not. Next, for each address with a typo, we find the correct address that is the most similar to it, and then replace it.

Solution Details: After reading the dataset with postal addresses with the CSV Reader node, we use the Row Splitter node to separate those with typos from those that are correct. Next, we remove duplicate addresses with the Duplicate Row Filter node and use the String Matcher node to identify, for each incorrect address, the most similar correct one. This information is used to fix each incorrect address with the String Replacer (Dictionary) node.
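The core of the correction step — matching each misspelled address to its most similar correct counterpart — can be sketched outside KNIME with Python's standard-library `difflib` (a minimal sketch with made-up addresses; the String Matcher node uses its own distance settings, and the 0.6 cutoff here is an assumption):

```python
import difflib

def fix_addresses(typo_addresses, correct_addresses):
    """For each misspelled address, substitute the most similar correct one."""
    fixed = {}
    for addr in typo_addresses:
        # get_close_matches returns the best candidates above a similarity cutoff
        matches = difflib.get_close_matches(addr, correct_addresses, n=1, cutoff=0.6)
        fixed[addr] = matches[0] if matches else addr  # keep as-is if nothing is close
    return fixed

corrections = fix_addresses(
    ["123 Mian Street", "45 Oka Avenue"],
    ["123 Main Street", "45 Oak Avenue", "7 Elm Road"],
)
```

The dictionary returned here plays the role of the String Replacer (Dictionary) node's lookup table: keys are the typo'd addresses, values are their fixes.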
See our Solution in KNIME Community Hub

Previous Challenges


Level: Medium

Description: Recently you became more interested in finance, and since you want to learn more about web scraping for work, you decided to unite both interests. Using the KNIME Web Interaction extension, can you navigate to the Economic News section on Yahoo Finance, extract the headers of only the most recent topics that pop up on the webpage, and then make sense of the results visually? Remember to filter out any ads or unrelated banners/headers/content. Hint: Find class tags in the news' XML that are unique to the content you are scraping.

Author: Thor Landstrom

Solution Summary:
To tackle this challenge, we connect to a browser through KNIME Analytics Platform, fetch the most recent content from the Economic News section on Yahoo Finance, extract its headers, and then visualize their corresponding topics as a table.

Solution Details:
We start our solution by connecting to a browser with the Web Interaction Start node. We then navigate to the Economic News page on Yahoo Finance using the Navigator node. We retrieve the XML content of this page, including heading tags and text, with the Content Retriever node. In parallel, since we do not need a browser connection anymore, we close it with the Web Interaction End node. As for the retrieved content, we use the Row Filter node to remove headings that are not tagged as "h3", and then a combination of the XPath node and a second instance of the Row Filter node to identify and remove rows that contain ads or unrelated pages. Finally, we isolate the heading texts with the Column Filter node and visualize them as a table with the Table View node.
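The filter-by-tag-and-class idea behind the XPath and Row Filter steps can be sketched in Python on a toy fragment (a minimal sketch: the real Yahoo Finance markup, and the class name `story-title`, are assumptions for illustration):

```python
import xml.etree.ElementTree as ET

# Toy snippet standing in for the retrieved page content; the real
# Yahoo Finance markup and its class names will differ.
snippet = """
<div>
  <h3 class="story-title">Inflation cools in the eurozone</h3>
  <h3 class="ad-banner">Sponsored: open an account today</h3>
  <h2 class="story-title">Not a headline (wrong tag)</h2>
  <h3 class="story-title">Central bank holds rates steady</h3>
</div>
"""

root = ET.fromstring(snippet)
# Keep only <h3> elements whose class marks them as story content,
# mirroring the Row Filter + XPath combination in the workflow.
headlines = [el.text for el in root.iter("h3") if el.get("class") == "story-title"]
```

Filtering on both the tag (`h3`) and a content-specific class is what drops the ads and unrelated banners, exactly as the challenge's hint suggests.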

See our Solution in KNIME Community Hub


Level: Medium

Description: You are reorganizing a data warehouse in your company, working with a filesystem that creates parent folders if you give it a reference for a child folder. For example, if you ask the filesystem to create “folder1/folder2” and neither folder1 nor folder2 exists, it will create both, with folder2 inside folder1, without raising an error. Given a list of folders, you want to keep only the longest unique child folders, filtering out references to parent folders that will be generated anyway, for efficiency.

Here's an example of an initial list of folders:

- folder1/folder3
- folder1/folder3/folder22
- folder1/folder3/folder22/folder47

After executing your workflow, the list above should only contain a reference for folder1/folder3/folder22/folder47.

Author: Emilio Silvestri

Datasets: Folder Data in the KNIME Community Hub

Solution Summary:
After reading the list of folders, we calculate the depth (number of levels) of each one. Next, we iterate over all folders, searching for parent folders and keeping only the deepest reference among those that share a path. We also keep track of the excluded paths, indicating which deeper child folder made each of them redundant.

Solution Details:
After reading the list of folders with the Table Reader node, we split them into columns with the Cell Splitter node and, with a combination of nodes (Column Aggregator, Math Formula, Column Filter), end up with the number of levels (depth, or cardinality) present in each folder reference. Next, we use the Table Row to Variable Loop Start and Variable Loop End nodes to iterate over all folders. At each iteration, we use the String Manipulation and Row Filter nodes to find and isolate all parent folders of the folder in question. By combining the Table Row to Variable and Rule Engine Variable nodes, we only keep the folder in question if it is the deepest, using the identified parent folders for comparison.
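The overall effect of the loop — dropping every path that some other path extends — can be sketched in a few lines of Python (a minimal sketch; the workflow's loop-based approach additionally records which child made each parent redundant):

```python
def keep_deepest(paths):
    """Drop any path that is a parent of another path in the list."""
    unique = set(paths)
    return sorted(
        p for p in unique
        # p is redundant if some other path extends it by at least one level
        if not any(q != p and q.startswith(p + "/") for q in unique)
    )

folders = [
    "folder1/folder3",
    "folder1/folder3/folder22",
    "folder1/folder3/folder22/folder47",
    "folder2/data",
]
result = keep_deepest(folders)
```

Note the `p + "/"` in the prefix test: it ensures `folder1/folder3` is treated as a parent of `folder1/folder3/folder22` but not of a sibling like `folder1/folder30`.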

See our Solution in KNIME Community Hub


Level: Medium

Description: You work as a freelance photo reporter for wildlife magazines. In your daily work you take a lot of pictures, usually in .JPG format and in different sizes. To be able to sell your photographs to magazines, you need to accommodate their different sizing and formatting requests. To streamline this process, you decide to build a workflow that automates the following steps, sequentially: (1) image resizing -- create a configurable component with three options: do nothing, reduce to a fixed size (150x150), or reduce size keeping the ratio; (2) image format conversion -- create a configurable component with two options: .PNG or .SVG; (3) saving the edited images on your machine.

Author: Roberto Cadili

Datasets: Image Data in the KNIME Community Hub

Solution Summary:
Our solution to this challenge contains two configurable components that let users (1) resize their images in different ways (do nothing, keep the ratio, or use a fixed 150x150 format), and then (2) convert them into .PNG or .SVG format. The images are then saved locally.

Solution Details:
We start our solution by using the List Files/Folders node to get a list of local images in .JPG format. We then use the Path to String node to facilitate the reading of these images, and import them into KNIME Analytics Platform with the Image Reader (Table) node. The images are sent to our first component, Image Resizer. With the Single Selection Configuration node, we allow users to configure this component by choosing a resizing option (do nothing, keep the ratio, or use a fixed 150x150 format). A CASE Switch Start node gets the chosen option and either activates no branch (going straight to the CASE Switch End node), resizes the images to a fixed size with the Image Resizer node, or resizes them using a ratio of 0.3 with another instance of the Image Resizer node. The resized images are then passed to a second component, named Image Converter. This component also uses an instance of the Single Selection Configuration node to let users pick the format they want the images to be in (.PNG or .SVG). The images inside the component are initially converted to .PNG and then passed as input to an instance of the CASE Switch Start node. If the chosen option is .PNG, this node activates a branch that simply ungroups the images with the Ungroup node. If the chosen option is .SVG, the images are also ungrouped with another instance of the Ungroup node but are turned into .SVG with the Renderer to Image node. Both branches meet as inputs for the CASE Switch End node. Outside this second component, we use the String Manipulation node to create a filename column, and then save the edited images locally with the Image Writer (Table) node.
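The CASE Switch dispatch on the resize option boils down to choosing output dimensions per option, which can be sketched in plain Python (a minimal sketch; the option labels, the 0.3 ratio, and the 150x150 fixed size follow the workflow, while the function name and rounding are illustrative choices):

```python
def target_size(width, height, option, fixed=(150, 150), ratio=0.3):
    """Mimic the CASE-switch logic: pick output dimensions per resize option."""
    if option == "do nothing":
        return (width, height)
    if option == "fixed size":
        return fixed                      # 150x150, ignoring the aspect ratio
    if option == "keep ratio":
        return (round(width * ratio), round(height * ratio))
    raise ValueError(f"unknown option: {option}")

sizes = {opt: target_size(1200, 800, opt)
         for opt in ("do nothing", "fixed size", "keep ratio")}
```

In the workflow, each `if` branch corresponds to one output port of the CASE Switch Start node, and the three branches reunite at the CASE Switch End node.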

See our Solution in KNIME Community Hub

Level: Medium

Description: As the 2024 European Football Championship (UEFA) unfolds, let's dive into football history with a data challenge. Today you are asked to create a data app that allows users to check, for any timeframe, which three teams had the most football victories. Who are the top three teams of all time? And who were the top three teams in the 1980s?

Author: Michele Bassa

Datasets: Football Data in the KNIME Community Hub

Solution Summary:
After reading the football data and determining wins, losses, and ties, we create a data app that allows users to pick a temporal interval and then check which three teams had the most victories.

Solution Details:
We start our solution by reading the football data with the CSV Reader node, transforming dates into Date format in the node's Transformation tab. Next, we use the Rule Engine node to determine wins, losses, and ties for home teams. This data is then sent to a component (data app) that allows for the temporal filtering of the data. Two instances of the Date&Time Widget node let users select the start and end dates of a temporal period, for which a team ranking will be calculated. The selected dates are passed to two instances of the Date&Time-based Row Filter node, reducing the data to a specific period. After that, two parallel branches use the Row Filter, Column Filter, and GroupBy nodes to select those matches in which the home team (top branch) or away team (bottom branch) wins. Both victory numbers are combined with the Joiner node, and then the Top k Row Filter node selects the top three teams for the selected period. This information is then plotted with the Bar Chart node.
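The date filtering and win counting at the heart of the data app can be sketched in Python (a minimal sketch over a few made-up matches; the record layout and team names are illustrative, not the challenge dataset):

```python
from collections import Counter
from datetime import date

# Hypothetical match records: (date, home_team, away_team, home_goals, away_goals)
matches = [
    (date(1984, 6, 12), "France", "Denmark", 1, 0),
    (date(1984, 6, 27), "France", "Spain", 2, 0),
    (date(1988, 6, 25), "Netherlands", "USSR", 2, 0),
    (date(2000, 7, 2), "France", "Italy", 2, 1),
]

def top_teams(matches, start, end, k=3):
    """Count wins (home or away) inside [start, end] and return the top k."""
    wins = Counter()
    for played, home, away, hg, ag in matches:
        if not (start <= played <= end):
            continue
        if hg > ag:
            wins[home] += 1
        elif ag > hg:
            wins[away] += 1   # ties award no win to either side
    return wins.most_common(k)

eighties = top_teams(matches, date(1980, 1, 1), date(1989, 12, 31))
```

The two `if` branches correspond to the workflow's two parallel branches (home-team wins on top, away-team wins on the bottom), and `most_common(k)` plays the role of the Top k Row Filter node.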

See our Solution in KNIME Community Hub

Level: Medium

Description: As a member of a think tank, your task is to craft a report on LGBTQIA+ representation in political discourse. Given an EU dataset gathering responses from LGBTQIA+ individuals across all member states, you decide to start your work by investigating the answers to the following question: "In your opinion, how widespread is offensive language about lesbian, gay, bisexual, and/or transgender people by politicians in the country where you live?”.

Use a map to present the results effectively.

Author: Michele Bassa

Datasets: LGBTQIA+ Survey Data in the KNIME Community Hub

Solution Summary:
To tackle this challenge, we reduce the scope of the data to question "In your opinion, how widespread is offensive language about lesbian, gay, bisexual and/or transgender people by politicians in the country where you live?". We then filter the answers and only keep the most common ones: "rare" and "widespread". This facilitates the understanding of trends and patterns across countries. We compute the percentages of answer "widespread" for every country and also compute their map coordinates. Finally, we join the geospatial information and the computed percentages and plot them in a map.

Solution Details:
After reading the survey dataset with the CSV Reader node, we prepare the data by reducing it to question "In your opinion, how widespread is offensive language about lesbian, gay, bisexual and/or transgender people by politicians in the country where you live?", and to its two most common answers, "rare" and "widespread". We also group the data by country, keeping the totals for both answers. We loop over this data (Group Loop Start and Loop End nodes) to compute the percentages of answer "widespread" for every country, using the Math Formula node (we compute the denominator for these percentages with the Moving Aggregator node). Next, we run another loop (Table Row to Variable Loop Start and Loop End nodes) to find the map coordinates of each country with the OSM Boundary Map node. We join the previously calculated percentages to the data with the map coordinates (Joiner node), use the Projection node to improve formatting for visualization, filter irrelevant data with the Row Filter node, and then finally plot the computed information with the Geospatial View node.

See our Solution in KNIME Community Hub

Level: Medium

Description: You work for the United Nations and want to discuss how the causes of death vary across the European Union (EU). You know how to analyze data and generate insightful visualizations, but the data you have at hand is a bit challenging: the meaning of its different columns and codes is not clear. To conclude your work well, you will have to integrate this data with some metadata in XML format, making sense of the different death causes and data attributes. What patterns can you find in the different countries?

Author: Emilio Silvestri  

Datasets: Demographic Data from the EU in the KNIME Community Hub

Solution Summary:
Our solution to this challenge can be split into two steps. First, we identify the code for the top cause of death in each country, regardless of sex or age; next, we match these codes with metadata describing what they are and sort the countries based on these descriptions. For 27 (out of 35) countries, "diseases of the circulatory system" is the main cause of death; for 8 (out of 35) countries, the top cause of death is "neoplasms".

Solution Details: With the CSV Reader node, we ingest the dataset on EU death causes in 2021. Next, with a series of Column Filter and Row Filter nodes, we reduce the dataset to what is pertinent to the analysis: codes for causes of death per country, regardless of sex and age. We then use a loop (Group Loop Start and Loop End nodes) with the Top k Row Filter node to identify the top cause-of-death code per country. At the end of this branch, we have the codes that correspond to the top death causes all over the EU, but cannot make sense of them yet. To this end, in parallel, we ingest metadata on the death causes with the XML Reader node. Using a series of XPath nodes, we extract column names, descriptions, and other values from the metadata. The descriptions and values come in lists, and to facilitate matching them later with the death cause codes from the original dataset, we use the Ungroup node to break the lists into single tokens. We filter the resulting data to only keep rows that correspond to causes of death (Row Filter node), and then use the Value Lookup node to match these causes with their codes in the original dataset. Finally, we sort the data with the Sorter node and conclude that the top cause of death in most EU countries has to do with diseases of the circulatory system.
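The two-step logic — a group-wise maximum per country, then a lookup of each code's description — can be sketched in Python (a minimal sketch; the row layout, the death counts, and the two-entry metadata table are invented for illustration, though the two descriptions match those named in the summary):

```python
# Hypothetical rows: (country, cause_code, deaths), plus a metadata lookup
# table mapping codes to descriptions (the role of the XML metadata).
rows = [
    ("DE", "I00-I99", 340_000), ("DE", "C00-D48", 230_000),
    ("FR", "C00-D48", 160_000), ("FR", "I00-I99", 140_000),
]
code_meta = {"I00-I99": "diseases of the circulatory system",
             "C00-D48": "neoplasms"}

def top_cause_per_country(rows, meta):
    """Pick each country's deadliest cause code and attach its description."""
    best = {}
    for country, code, deaths in rows:
        # keep the code with the highest death count seen so far per country
        if country not in best or deaths > best[country][1]:
            best[country] = (code, deaths)
    return {c: meta[code] for c, (code, _) in best.items()}

top_causes = top_cause_per_country(rows, code_meta)
```

The per-country maximum mirrors the Group Loop + Top k Row Filter combination, and the dictionary lookup mirrors the Value Lookup node.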

See our Solution in KNIME Community Hub

Level: Easy

Description: You are a real estate agent working in a new city, and to perform well, your first task is to better understand the houses in the region. A colleague shares a dataset with you, and now it’s time for you to explore it. What has been the average housing price, lot size (in acres), and living space (in sqft) in this city, according to her dataset? How are prices distributed and correlated with housing features? What other insights can you gather from this dataset?

Author: Thor Landstrom 

Dataset: Real Estate Data in the KNIME Community Hub

Solution Summary: To tackle this challenge, we compute some general statistics of the dataset, such as average price, lot size, and living space. We also calculate the linear correlation for all pairs of numerical features, uncovering which housing attributes correlate most strongly with price. On average, central Seattle is the priciest area in the region, but there are a few other relevant clusters to the south and to the east.

Solution Details: After ingesting the housing data with the CSV Reader node, we compute Pearson's linear correlation for all pairs of numerical attributes with the Linear Correlation node. The results are plotted with the Heatmap (JavaScript) node, revealing which housing attributes relate the most to price. In parallel, we use the Column Filter node to remove unnecessary columns, and convert the lot size information into acres with the Math Formula node. We use the Statistics View node to get important housing summaries, including average lot size and price, and group the data by zipcode with the GroupBy node. In the aggregation, we calculate the average housing price per zipcode and the median latitude and longitude values. The Lat/Lon to Geometry node uses the median values per zipcode to generate geometries, which are then visualized with the Spatial Heatmap node.
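Pearson's linear correlation, which the Linear Correlation node computes for every pair of numerical attributes, can be written out directly (a minimal sketch with invented listing data; the real dataset has many more features and rows):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation: covariance divided by the product of std deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical listings: living space (sqft) vs. price (USD)
sqft = [900, 1200, 1500, 2000, 2600]
price = [250_000, 310_000, 400_000, 520_000, 700_000]
r = pearson(sqft, price)
```

A value of `r` close to 1 indicates a strong positive linear relationship, which is what the heatmap of pairwise correlations makes visible at a glance.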

See our Solution in KNIME Community Hub

Level: Easy

Description: You are a climate scientist studying CO₂ emissions. To make your research insights more accessible to your colleagues, and then write a paper about it, you decide to build a report-enabled component in KNIME that allows users to check how emissions vary for different regions and sources. What are the most alarming insights illustrated in such report?

Authors: Armin Ghassemi Rudd and Marina Kobzeva  

Dataset: CO₂ Emissions Data in the KNIME Community Hub

Solution Summary: To tackle this challenge, we manually select the country that ranks highest in terms of CO₂ emissions and create a PDF report showing its historical emissions, how they vary per capita throughout the years, and what sources they are mostly tied to. Different countries can be selected based on their ranking, leading to different visualizations and reports.

Solution Details: After reading the dataset with the Table Reader node, we use the Row Filter node to select a country based on its CO₂ emissions' ranking. Next, we finish our preprocessing by using the Number Format Manager node, selecting how many decimals we want to use in the CO₂ and CO₂ per capita numbers of our report. We create a component named "Report" that contains a few visualizations for our data: two line plots (Line Plot node) for the historical emissions of CO₂ and CO₂ per capita, and a bar chart (Bar Chart node) showing a breakdown of these emissions for different sources. To turn these visualizations into a PDF report, we feed this component with a report template (A4 Landscape) that is specified with the Report Template Creator node. After the component executes, its visualizations are saved as a PDF report with the Report PDF Writer node.

See our Solution in KNIME Community Hub

Level: Easy

Description: You work in finance and one of your clients wants to understand the value of different company stocks over time. Given a dataset of stock prices, you decide to use simple moving averages (window length = 20) to tackle this task. What companies have an upward trend for the most recent data? And what companies have a downward trend?

Author: Thor Landstrom

Dataset: Stock Data in the KNIME Community Hub

Solution Summary: We propose two different solutions to this challenge. The simplest one involves manually filtering the data for a specific company, calculating its moving average, and then visualizing it with a line plot. The second one relies on a simple data app: a company is picked from a dropdown box, its stock prices are filtered, a moving average is computed, and the resulting points are plotted as a line plot.

Solution Details: Both solutions have a core part in common. After the rows for a company are selected, we use the Column Filter node to isolate dates and close prices, do some typecasting with the String to Date&Time node, sort the data from oldest to most recent with the Sorter node, and then use the Moving Average node to compute simple moving averages (window length = 20). Next, we visualize the results with the Line Plot node. In the simplest solution, we use the configuration of the Row Filter node to select the data for a company. In the more complex solution, we get all company names with the "Get company names" metanode, and then pass them, along with the original data, to the "Visualize company stock prices" component. Inside this component, a Single Selection Widget node allows the selection of one of the company names, which in turn is used to control an instance of the Row Filter node. After that, this solution is basically equivalent to the simplest one.
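The simple moving average computed by the Moving Average node can be sketched in a few lines of Python (a minimal sketch; the toy prices and the short window of 3 are for illustration, while the challenge uses a window length of 20):

```python
def simple_moving_average(values, window):
    """Mean over a sliding window; the first full average appears at index window-1."""
    return [sum(values[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(values))]

# Toy close prices, sorted oldest to most recent as in the workflow
closes = [10, 11, 12, 13, 14, 15]
sma = simple_moving_average(closes, window=3)
```

Comparing the last few averages against the earlier ones is what reveals whether a stock is trending upward or downward in the most recent data.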

See our Solution in KNIME Community Hub

Here is how the challenges work:

1. We post a challenge on Wednesday.
2. You create a solution with KNIME Analytics Platform.
3. You upload it to your public KNIME Community Hub Space.
4. You check your rank on the Just KNIME It! Leaderboard.

Our solution to the challenge comes out on the following Tuesday.

Enjoying our challenges?

They are a great way of preparing for our certifications.

Explore Certification Program

Just KNIME It! Leaderboard

KNIME community members are working hard to solve the latest "Just KNIME It!" challenge - and some of you have solved dozens of them already! Who are the KNIME KNinjas who have completed the most challenges? Click over to the leaderboard on the KNIME Forum to find out! How many challenges have you solved?

Sign up for reminder emails

 

*KNIME uses the information you provide to share relevant content and product updates and to better understand our community. You may unsubscribe from these emails at any time.

Previous Just KNIME It! Challenges

Check out previous seasons of Just KNIME It!