KNIME news, usage, and development

17 Jul 2017 | rs

Continental, a leading automotive supplier, recently won the Digital Leader Award 2017 in the category “Empower People” for bringing big data and analytics closer to its employees with KNIME. Arne Beckhaus is the man behind this project, and today we are lucky enough to welcome him for an interview on the KNIME blog.

Read more


03 Jul 2017 | Dario Cannone

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: OCR on Xerox Copies meets Semantic Web. Have Evolutionary Theories changed?

Author: Dario Cannone, Data Analyst at Miriade S.p.A., Italy

The Challenge

Scientific theories are not static over time. As more research studies are completed, new concepts are introduced, new terms are created and new techniques are invented. This is of course also true for evolutionary theories. That is, evolutionary theories themselves have evolved over time!

In today’s challenge we are going to show how the theory of evolution has evolved from Darwin’s first formulation to the most recent discoveries.

The foundation stone of evolutionary biology is considered to be Charles Darwin’s book “On the Origin of Species” (1859), which contains the first revolutionary formulation of the theory of evolutionary biology. Even though the book caused a revolution in the approach to species evolution at the time, many of the concepts illustrated there may now seem incomplete or even obsolete. Notice that it was published in 1859, when nothing was known about DNA and very little about genetics.

Indeed, in the early 20th century, the Modern Synthesis reconciled some aspects of Darwin’s theory with more recent research findings on evolution.

The goal of this blog post is to represent the original theory of evolution as well as the Modern Synthesis by means of their main keywords. Changes in the keywords used will reflect changes in the theory presented.

Scanned Xerox copies of Darwin’s book abound on the web, for example at http://darwin-online.org.uk/converted/pdf/1861_OriginNY_F382.pdf. How can we make the contents of such copies available to KNIME? This is where Optical Character Recognition (OCR) comes into play.
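For readers who want to try the idea outside of KNIME (the post itself uses KNIME's OCR nodes), here is a minimal Python sketch, assuming the Tesseract engine and poppler are installed locally:

```python
# Minimal OCR sketch: pdf2image renders the scanned pages to images,
# pytesseract runs the Tesseract OCR engine on each of them.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("1861_OriginNY_F382.pdf", dpi=300)
text = "\n".join(pytesseract.image_to_string(page) for page in pages)

print(text[:500])  # a first peek at the recognized text
```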

On the other side, to find a summary of the current evolutionary concepts, we can just query Wikipedia, or better yet DBpedia, using Semantic Web SPARQL queries.
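If you are curious what such a query looks like, here is a small hedged example in Python using the SPARQLWrapper library; the DBpedia resource IRI below is an assumption chosen for illustration:

```python
# Hedged sketch of a SPARQL query against the public DBpedia endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?abstract WHERE {
      <http://dbpedia.org/resource/Modern_synthesis_(20th_century)>
          dbo:abstract ?abstract .
      FILTER (lang(?abstract) = "en")
    }
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["abstract"]["value"])
```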

Xerox copies on one side, read via OCR, and semantic web queries on the other side. Will they blend?

Topic. Changes in the theory of evolution.

Challenge. Blend a Xerox copy of a book with semantic web queries.

Access Mode. Image reading, OCR library, SPARQL queries.

Read more


19 Jun 2017 | knime_admin

Authors: Andisa Dewi and Kilian Thiel

In a social networking era, where a massive amount of unstructured data is generated every day, unsupervised topic modeling has become a very important task in the field of text mining. Topic modeling allows you to quickly summarize a set of documents and see which topics appear often; at that point, human input can be helpful to make sense of the topic content. As in any other unsupervised learning approach, determining the optimal number of topics in a dataset is a frequent problem in topic modeling.

In this blog post we will show a step-by-step example of how to determine the optimal number of topics using clustering and how to extract the topics from a collection of text documents, using the KNIME Text Processing extension.

You might have read one or more posts from the Will They Blend series, which experiments with blending data from varied data sources. In this article, we are going to turn that idea on its head. We collected 190 documents from the RSS feeds of news websites and blogs over one day (06.01.2017). We know that the documents fall largely into two categories: sports and barbecue. In this blog post we want to separate the sports documents and the barbecue documents by topic and determine which topics were most popular on that particular day. So, the question is: will they unblend?
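As a rough illustration of the approach (the post itself uses the KNIME Text Processing extension), here is a hedged scikit-learn sketch: cluster the documents to estimate a plausible topic count, then extract that many topics with LDA. The toy documents below stand in for the 190 RSS articles:

```python
# Hedged sketch: pick a topic count via clustering, then extract topics with LDA.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the football match last night",
    "the coach praised the goalkeeper after the game",
    "fans celebrated the championship title in the stadium",
    "the striker scored twice in the derby",
    "slow smoking gives barbecue ribs a deep flavor",
    "marinate the steak before putting it on the grill",
    "charcoal versus gas is the eternal barbecue debate",
    "brush the sauce on the pork during the last minutes of grilling",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Proxy for the optimal number of topics: the cluster count with the best silhouette.
scores = {k: silhouette_score(tfidf, KMeans(n_clusters=k, n_init=10,
                                            random_state=0).fit_predict(tfidf))
          for k in range(2, 5)}
best_k = max(scores, key=scores.get)

# Extract best_k topics with LDA and print the top words per topic.
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=best_k, random_state=0).fit(counts)
terms = vec.get_feature_names_out()
for i, comp in enumerate(lda.components_):
    print(f"Topic {i}:", ", ".join(terms[j] for j in comp.argsort()[-5:][::-1]))
```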

Read more


07 Jun 2017 | phil

Everyone who has heard of KNIME Analytics Platform knows that KNIME has nodes. Thousands of them! The resources on the Learning Hub, as well as the hundreds of public examples within KNIME Analytics Platform, are all designed to get you up to speed with KNIME and its nodes. But those who know best how to use KNIME nodes are KNIME users themselves. What if we could capture all their insight and experience about which nodes to use, when, and in what order, and give you a recommendation? Well, that is exactly what the KNIME Workflow Coach does.

It gathers the usage data of all KNIME users who have registered to have their data collected anonymously and makes recommendations to you, the user, based on those data. Since a picture is worth a thousand words, let’s take a brief look at the Workflow Coach in the following short video:
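For the curious, the core idea behind such recommendations can be sketched in a few lines: count which node most often follows which across a corpus of workflows and suggest the most frequent successors. This is only an illustration with made-up usage data, not the actual Workflow Coach implementation:

```python
# Hedged sketch of a successor-frequency node recommender.
from collections import Counter, defaultdict

# Hypothetical anonymized usage data: each workflow as an ordered list of nodes.
workflows = [
    ["CSV Reader", "Column Filter", "GroupBy", "Bar Chart"],
    ["CSV Reader", "GroupBy", "Bar Chart"],
    ["File Reader", "Column Filter", "Row Filter", "GroupBy"],
]

successors = defaultdict(Counter)
for wf in workflows:
    for current, nxt in zip(wf, wf[1:]):
        successors[current][nxt] += 1

def recommend(node, n=3):
    """Return the n most frequently observed successor nodes."""
    return [name for name, _ in successors[node].most_common(n)]

print(recommend("CSV Reader"))  # e.g. ['Column Filter', 'GroupBy']
```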

Read more


22 May 2017 | rs

Do you remember the Iron Chef battles?

It was a televised series of cook-offs in which famous chefs rolled up their sleeves to compete in making the perfect dish. Given a set theme, they used all their experience, creativity, and imagination to transform sometimes questionable ingredients into the ultimate meal.

Hey, isn’t that just like data transformation? Or data blending, or data manipulation, or ETL, or whatever new name is trending now? In this new blog series requested by popular vote, we will ask two data chefs to use all their knowledge and creativity to compete in extracting a given data set's most useful “flavors” via reductions, aggregations, measures, KPIs, and coordinate transformations. Delicious!

Want to find out how to prepare the ingredients for a delicious data dish by aggregating financial transactions, filtering out uninformative features or extracting the essence of the customer journey? Follow us here and send us your own ideas for the “Data Chef Battles” at datachef@knime.com.

Ingredient Theme: Customer Transactions. Money vs. Loyalty.

Author: Rosaria Silipo
Data Chefs: Haruto and Momoka

Ingredient Theme: Customer Transactions

Today’s dataset is a classic customer transactions dataset. It is a small subset of a bigger dataset that contains all of the contracts concluded with 9 customers between 2008 and now.

The business we are analyzing is a subscription-based business. The term “contracts” refers to 1-year subscriptions for 4 different company products.

Customers are identified by a unique customer key (“Cust_ID”), products by a unique product key (“product”), and transactions by a unique transaction key (“Contract ID”). Each row in the dataset represents a 1-year subscription contract, with the buying customer, the bought product, the number of product items, the amount paid, the payment means (card or not card), the subscription start and end date, and the customer’s country of residence.

Subscription start and end dates usually span one year, the standard duration of a subscription. However, a customer can hold multiple subscriptions for different products at the same time, with license coverage overlapping in time.

What could we extract from these data? Finding out more about customer habits would be useful. What kind of information can we collect from the contracts that would describe the customer? Let’s see what today’s data chefs are able to prepare!

Topic. Customer Intelligence.

Challenge. From raw transactions, calculate each customer’s total payment amount and loyalty index.

Methods. Aggregations and Time Intervals.

Data Manipulation Nodes. GroupBy, Pivoting, Time Difference nodes.
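As a taste of what such a recipe might look like outside of KNIME, here is a hedged pandas sketch of the two aggregations; the file and column names are assumptions based on the dataset description above:

```python
# Hedged pandas sketch: total payment via group-by aggregation, plus a simple
# loyalty index as the fraction of days under contract since 2008.
import pandas as pd

contracts = pd.read_csv("transactions.csv",
                        parse_dates=["start_date", "end_date"])

# Total amount paid per customer (GroupBy node analogue).
total_paid = contracts.groupby("Cust_ID")["amount"].sum()

# Loyalty: days covered by subscriptions relative to the observation window
# (Time Difference node analogue; overlapping licenses are ignored here).
window_days = (pd.Timestamp("today") - pd.Timestamp("2008-01-01")).days
contracts["duration"] = (contracts["end_date"] - contracts["start_date"]).dt.days
loyalty = (contracts.groupby("Cust_ID")["duration"].sum()
           / window_days).clip(upper=1.0)

summary = pd.concat([total_paid.rename("total_paid"),
                     loyalty.rename("loyalty_index")], axis=1)
print(summary)
```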

Read more


08 May 2017 | knime_admin

Authors: Iris Adä & Phil Winters

The benefits of using predictive analytics are now a given, and the data scientist who delivers them is highly regarded. But our daily work is full of contrasts. On the one hand, you can work with data, tools, and techniques to really dive in and understand data and what it can do for you. On the other hand, there is usually quite a bit of administrative work around accessing data, massaging data, and then putting that new insight into production, and keeping it there.

In fact, many surveys say that at least 80% of any data science project is spent on those administrative tasks. One popular urban legend says that, within a commercial organization trying to leverage analytics, the full-time job of one data scientist can be described as building and maintaining a maximum of four (yes, 4) models in production, regardless of the brilliance of the toolset used. There is a desperate need to automate and scale the modeling process, not just because it would be good for business (after all, if you could use 29,000 models instead of just 4, you would want to!) but also because otherwise we data scientists are in for a tedious life.

At the recent KNIME Spring Summit in Berlin, one of the best-received presentations was that of the KNIME Model Process Factory, designed to provide you with a flexible, extensible, and scalable application for running and monitoring very large numbers of model processes in an efficient way.

The KNIME Model Factory is composed of a white paper, an overall workflow, tables that manage all activities, and a series of workflows, examples, and data for learning to use the Factory.

Read more


24 Apr 2017 | knime_admin

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: Teradata Aster meets KNIME Table. What is that chest pain?

Author: Kate Phillips, Data Scientist, Analytics Business Consulting Organization, Teradata

The Challenge

Today’s challenge is related to the healthcare industry. You know that little pain in the chest you sometimes feel, when you do not know whether to run to the hospital or just wait until it goes away? Would it be possible to recognize, as early as possible, just how serious an indication of heart disease that little pain is?

The goal of this experiment is to build a model to predict whether or not a particular patient with that chest pain indeed has heart disease.

To investigate this topic, we will use open data obtained from the University of California Irvine Machine Learning Repository, which can be downloaded from http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/. Of all the datasets contained in this repository, we will use the processed Switzerland, Cleveland, and VA data sets and the reprocessed Hungarian data set.

These data were collected from 920 cardiac patients: 725 men and 193 women aged between 28 and 77 years old; 294 from the Hungarian Institute of Cardiology, 123 from the University Hospitals in Zurich and Basel, Switzerland, 200 from the V.A. Medical Center in Long Beach, California, and 303 from the Cleveland Clinic in Ohio.

Each patient is represented through a number of demographic and anamnestic values, angina descriptive fields, and electrocardiographic measures (http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names).

In the dataset, each patient’s condition is classified into 5 levels according to the severity of his/her heart disease. We simplified this classification system by transforming it into a binary class system: 1 means heart disease was diagnosed, 0 means no heart disease was found.

This is not the first time we have run this experiment. Not so long ago, we built a Naïve Bayes model in KNIME on the same data to solve the same problem. Today we want to build a logistic regression model and see whether it improves on the Naïve Bayes model’s performance.
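As a rough illustration of that modeling step (the post itself builds the model with KNIME nodes), here is a hedged scikit-learn sketch on the Cleveland subset alone; the column names follow the repository’s documentation, with missing values coded as “?”:

```python
# Hedged sketch: logistic regression vs. Naive Bayes on the UCI heart data,
# with the 5-level target collapsed to binary as described above.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
        "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
df = pd.read_csv("processed.cleveland.data", names=cols,
                 na_values="?").dropna()

X = df.drop(columns="num")
y = (df["num"] > 0).astype(int)  # 1 = heart disease, 0 = none

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for model in (GaussianNB(), LogisticRegression(max_iter=1000)):
    acc = accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
    print(type(model).__name__, round(acc, 3))
```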

Original patient data are stored in a Teradata database. The predictions from the old Naïve Bayes model are stored in a KNIME Table.

Teradata Aster is a proprietary database system that may be in use at your company/organization. It is designed to enable multi-genre advanced data transformation on massive amounts of data. If your company/organization is a Teradata Aster customer, you can obtain the JDBC driver that interfaces with KNIME by contacting your company’s/organization’s Teradata Aster account executive.

Table format is a KNIME proprietary format that stores data efficiently, in terms of size and retrieval speed, and completely, i.e. including their structure metadata. This leads to smaller local files, faster reading, and minimal configuration settings. In fact, the Table Reader node, which reads such Table files, only needs the file path and retrieves all other necessary information from the metadata saved in the file itself. Files saved in KNIME Table format carry the extension “.table”.

Teradata Aster on one side, KNIME Table formatted file on the other side. The question, as usual, is: Will they blend? Let’s find out.

Topic. Predicting heart disease. Is this chest pain innocuous or serious?

Challenge. Blend data from Teradata Aster system with data from a KNIME .table file. Build a predictive model to establish presence or absence of heart disease.

Access Mode. Database Connector node with Teradata JDBC driver to retrieve data from Teradata database. Table Reader node to read KNIME Table formatted files.

Read more


10 Apr 2017 | rs

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: Blending Databases. A Database Jam Session

The Challenge

Today we will push the limits by attempting to blend data from not just 2 or 3, but 6 databases!

These 6 SQL and noSQL databases are among the top 10 most used databases, as listed on most database comparison websites (see DB-Engines Ranking, The 10 most popular DB Engines …, Top 5 best databases). Whatever database you are using in your current data science project, there is a very high probability that it will be on our list today. So, keep reading!

What kind of use case is going to need so many databases? Well, actually it’s not an uncommon situation. For this experiment, we borrowed the use case and data sets used in the Basic and Advanced Course on KNIME Analytics Platform. In this use case, a company wants to use past customer data (behavioral, contractual, etc.) to find out which customers are more likely to buy a second product. This use case includes 6 datasets, all related to the same pool of customers.

  1. Customer Demographics (Oracle). This dataset includes age, gender, and all the other classic demographic information about customers, straight from your CRM system. Each customer is identified by a unique customer key. One of the features in this dataset is named “Target” and describes whether the customer, when invited, bought an additional product: 1 = he/she bought the product; 0 = he/she did not buy the product. This dataset is stored in an Oracle database.
  2. Customer Sentiment (MS SQL Server). Customer sentiment about the company has been evaluated with customer experience software and reported in this dataset. Each customer key is paired with a customer appreciation score, which ranges on a scale from 1 to 5. This dataset is stored in a Microsoft SQL Server database.
  3. Sentiment Mapping (MariaDB). This dataset contains the full mapping between the appreciation scores in dataset #2 and their word descriptions: 1 means “very negative”, 5 means “very positive”, and 2, 3, and 4 cover all the nuances in between. For this dataset we have chosen a relatively new and very popular storage software: a MariaDB database.
  4. Web Activity from the company’s previous web tracking system (MySQL). A summary index of customer activity on the company web site used to be stored in this dataset. The web tracking system associated with this dataset was declared obsolete and phased out a few weeks ago. This dataset still exists but is no longer being updated. A MySQL database was used to store these data.
  5. Web Activity from the company’s new web tracking system (MongoDB). A few weeks ago the original web tracking system was replaced by a newer system. This new system still tracks customers’ web activity on the company web site and still produces a web activity index for each customer. To store the results, this system relies on a new noSQL database: MongoDB. No migration of the old web activity indices has been attempted, because migrations are costly in terms of money, time, and resources. The idea is that eventually the new system will cover all customers and the old system will be completely abandoned. Till then, though, indices from the new system and indices from the old system will have to be merged together at execution time.
  6. Customer Products (PostgreSQL). For this experiment, only customers who have already bought one product are considered. This dataset contains the one product owned by each customer, and it is stored in a PostgreSQL database.

The goal of this experiment is to retrieve the data from all of these data sources, blend them together, and train a model to predict the likelihood of a customer buying a second product.

The blending challenge of this experiment is indeed an extensive one. We want to collect data from all of the following databases: MySQL, MongoDB, Oracle, MariaDB, MS SQL Server, and PostgreSQL. Six databases in total: five relational databases and one noSQL database.

Will they all blend?

Topic. Next Best Offer (NBO). Predict likelihood of customer to buy a second product.

Challenge. Blend together data from six commonly used, SQL and noSQL databases.

Access Mode. Dedicated connector nodes or the generic connector node with a JDBC driver.
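For a feeling of what this blend amounts to in code (the post itself uses the KNIME connector nodes listed above), here is a hedged Python sketch; all connection strings, table names, and the customer_key column are placeholders:

```python
# Hedged sketch: pull the relational tables with SQLAlchemy, the new web
# activity from MongoDB with pymongo, then join everything on the customer key.
import pandas as pd
from sqlalchemy import create_engine
from pymongo import MongoClient

sources = {
    # MariaDB speaks the MySQL protocol, hence the mysql dialect below.
    "demographics": ("oracle+cx_oracle://user:pw@host/db", "customer_demographics"),
    "sentiment":    ("mssql+pyodbc://user:pw@dsn",         "customer_sentiment"),
    "sent_map":     ("mysql+pymysql://user:pw@host/db",    "sentiment_mapping"),
    "web_old":      ("mysql+pymysql://user:pw@host/db2",   "web_activity_old"),
    "products":     ("postgresql://user:pw@host/db",       "customer_products"),
}
frames = {name: pd.read_sql_table(table, create_engine(url))
          for name, (url, table) in sources.items()}

# The new web activity indices live in MongoDB.
mongo = MongoClient("mongodb://host:27017")
web_new = pd.DataFrame(mongo["crm"]["web_activity"].find({}, {"_id": 0}))

# Merge old and new web activity at execution time, then blend all six sources.
web = pd.concat([frames.pop("web_old"), web_new], ignore_index=True)
blended = frames.pop("demographics")
for df in list(frames.values()) + [web]:
    blended = blended.merge(df, on="customer_key", how="left")
print(blended.head())
```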

Read more


27 Mar 2017 | knime_admin

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: YouTube Metadata meet WebLog Files. What will it be tonight – a movie or a book?

Authors: Rosaria Silipo and Iris Adä

The Challenge

Thank God it’s Friday! And with Friday, some free time! What shall we do? Watch a movie or read a book? What do the other KNIME users do? Let’s check!

When it comes to KNIME users, the major video source is YouTube; the major reading source is the KNIME blog. So, do KNIME users prefer to watch videos or read blog posts? In this experiment we extract the number of views for both sources and compare them.

YouTube offers a REST API access service as part of the Google API. As for all Google APIs, you do not need a full account if all you want to do is search; an API key is enough. You can request your own API key directly on the Google API Console. Remember to enable the key for the YouTube API services. The available services and the procedure to get an API key are described in these 2 introductory links:
https://developers.google.com/apis-explorer/?hl=en_US#p/youtube/v3/
https://developers.google.com/youtube/v3/getting-started#before-you-start

On YouTube, the KNIME TV channel hosts more than 100 tutorial videos. However, you can also find a number of other videos about KNIME Analytics Platform posted by community members. For the KNIME users who prefer to watch videos, it could be interesting to know which videos are the most popular, in terms of number of views of course.
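As an illustration of this access mode, here is a hedged Python sketch against the YouTube Data API v3: search for KNIME videos, then fetch their view counts from the statistics part. YOUR_API_KEY stands for the key obtained from the Google API Console:

```python
# Hedged sketch: rank KNIME videos by view count via the YouTube Data API v3.
import requests

API = "https://www.googleapis.com/youtube/v3"
key = "YOUR_API_KEY"

# Step 1: search for videos matching "KNIME".
search = requests.get(f"{API}/search", params={
    "part": "snippet", "q": "KNIME", "type": "video",
    "maxResults": 25, "key": key}).json()
ids = [item["id"]["videoId"] for item in search["items"]]

# Step 2: the statistics part carries the view count for each video.
stats = requests.get(f"{API}/videos", params={
    "part": "snippet,statistics", "id": ",".join(ids), "key": key}).json()
for v in sorted(stats["items"],
                key=lambda v: int(v["statistics"]["viewCount"]), reverse=True):
    print(v["statistics"]["viewCount"], v["snippet"]["title"])
```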

The KNIME blog has been around for a few years now and hosts weekly or biweekly content on tips and tricks for KNIME users. Here too, it would be interesting to know which blog posts are the most popular ones among the KNIME users who prefer to read – also in terms of number of views! The numbers for the blog posts can be extracted from the weblog file of the KNIME web site.
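On the weblog side, counting views per post can be sketched as a simple log scan; the following assumes an Apache combined-format access log and a /blog/ path prefix, both of which are assumptions about the KNIME site:

```python
# Hedged sketch: count page views per blog post from a web server access log.
import re
from collections import Counter

# Match 'GET /blog/<slug> ...' requests; pattern and prefix are assumptions.
request_re = re.compile(r'"GET (/blog/[^ ?"]+)[^"]*"')

views = Counter()
with open("access.log") as log:
    for line in log:
        m = request_re.search(line)
        if m:
            views[m.group(1)] += 1

for path, n in views.most_common(10):
    print(n, path)
```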

YouTube with REST API access on one side and blog page with weblog file on the other side. Will they blend?

Topic. Popularity (i.e. number of views) of blog posts and YouTube videos.

Challenge. Extract metadata from YouTube videos and metadata from KNIME blog posts.

Access Mode. WebLog Reader and REST service.

Read more


13 Mar 2017 | knime_admin

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: Kindle epub meets image JPEG: Will KNIME make peace between the Capulets and the Montagues?

Authors: Heather Fyson and Kilian Thiel

The Challenge

“A plague o’ both your houses! They have made worms’ meat of me!” said Mercutio in Shakespeare’s “Romeo and Juliet”, in which tragedy results from the characters’ inability to communicate effectively. It is worsened by the fact that Romeo and Juliet come from the feuding “two households”: Romeo a Montague and Juliet a Capulet.

For this blog article, we decided to take a look at the interaction between the characters in the play by analyzing the script – an epub file – to see just who talks to whom. Are the Montagues and Capulets really divided families? Do they really not communicate? To make the results easier to read, we decided to visualize the network as a graph, with each node representing a character in the play and showing an image of that particular character.

The “Romeo and Juliet” e-book can be downloaded for free in a number of formats from the Gutenberg Project web site. For this experiment, we downloaded the epub file. epub is an e-book file format used by many e-reading devices, such as Amazon Kindle (for more information about the epub format, see https://en.wikipedia.org/wiki/EPUB).

The images for the characters of the Romeo and Juliet play have been kindly made available by Stadttheater Konstanz in JPEG format from a live show. JPEG is a commonly used format for storing images (for more information about the JPEG format, see https://en.wikipedia.org/wiki/JPEG).

Unlike the Montague and the Capulet families – will epub and JPEG files blend?

Topic. Analyzing the graph structure of the dialogs in Shakespeare’s tragedy “Romeo and Juliet”.

Challenge. Blending epub and JPEG files and combining text mining and network visualization.

Access Mode. epub parser and JPEG reader.
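As a hedged Python sketch of the text-mining half of this blend (the post itself relies on KNIME nodes, and the JPEG visualization step is omitted here), the following parses the epub with ebooklib and approximates “who talks to whom” by counting character co-occurrences in a sliding text window; the character list and file name are assumptions:

```python
# Hedged sketch: build a character co-occurrence network from the epub script.
import itertools
from ebooklib import epub, ITEM_DOCUMENT
from bs4 import BeautifulSoup
import networkx as nx

book = epub.read_epub("romeo-and-juliet.epub")
text = " ".join(BeautifulSoup(item.get_content(), "html.parser").get_text()
                for item in book.get_items_of_type(ITEM_DOCUMENT))

characters = ["Romeo", "Juliet", "Mercutio", "Tybalt", "Benvolio", "Nurse"]
words = text.split()

# Co-occurrence within a window as a crude proxy for dialog between characters.
G = nx.Graph()
window = 200
for start in range(0, len(words), window):
    chunk = {w.strip(".,:;!?").capitalize() for w in words[start:start + window]}
    for a, b in itertools.combinations([c for c in characters if c in chunk], 2):
        G.add_edge(a, b, weight=G.get_edge_data(a, b, {"weight": 0})["weight"] + 1)

print(nx.to_pandas_edgelist(G).sort_values("weight", ascending=False))
```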

Read more

