KNIME Blog: general


An Active Learning Cycle to Train Emil the Teacher Bot

Mon, 06/18/2018 - 14:28 admin

Authors: Vincenzo Tursi, Kathrin Melcher, Rosaria Silipo

Remember Emil the Teacher Bot? Well, today we want to talk about how Emil’s brain was created! The goal in creating Emil’s brain was to enable him to associate the most suitable resources on the KNIME website with the keywords extracted from the questions he is asked.

Before we continue, let’s recap what we’ve talked about so far in this Teacher Bot series:

  • The first blog post, “Emil the Teacher Bot”, describes how the workflow behind Emil combines a web browser based GUI with text processing, keyword extraction, and machine learning algorithms.
  • The second post, “Keyword Extraction for Understanding”, discussed the automatic keyword extraction algorithms available in KNIME Analytics Platform and our selection criteria.
  • The third post in the series, “An Ontology for Emil”, stressed the importance of defining a class ontology before starting a classification project.

As we’ve shown in these previous blog posts, our source was the questions posted on the KNIME Forum between 2013 and 2017. Questions were imported and treated as unanswered, since only a few answers contained links to educational resources, and only some of those links referred to up-to-date material. We also adopted a class ontology with 20 classes.

The goal for Emil’s brain then became twofold:

  1. Associate the right class from the ontology with the keywords summarizing each question.
  2. Within each predicted class, explore the educational resources on the KNIME site and extract the four most relevant ones.

Today, let’s concentrate on goal #1: associating the right class from the ontology with the input question, i.e. with the keywords summarizing the question.
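To make goal #1 concrete, here is a minimal sketch of a keyword-to-class classifier in Python with scikit-learn. Everything in it is an assumption for illustration: the training keywords, the class names, and the choice of a naive Bayes model. Emil’s actual brain is built with KNIME nodes and works against the full 20-class ontology.

    # A minimal, hypothetical sketch: classify extracted keywords into
    # ontology classes. Training data and class names are invented.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # extracted keywords (one string per question) -> ontology class
    keywords = ["csv reader file import", "decision tree accuracy scorer"]
    classes = ["Data Access", "Machine Learning"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(keywords, classes)

    print(model.predict(["excel file import"]))  # -> ['Data Access']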

Will They Blend? Experiments in Data & Tool Blending. Today: BIRT meets Tableau and JavaScript. How was the Restaurant?

Mon, 06/11/2018 - 10:00 ScottF

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: BIRT meets Tableau and JavaScript. How was the Restaurant?

The Challenge

How would your favorite restaurant perform in a health inspection? Are you sure you are eating in one of the most reliable and safest restaurants in town? If you live in Austin, Texas, we can check.

Over the last three calendar years, data.austintexas.gov has published a dataset of restaurant inspection scores for Austin (TX) restaurants. Inspection scores range between 0 (theoretically) and 100. An inspection score lower than 70 requires corrective measures; repeated low scores may even necessitate the closure of the restaurant.

We will use this dataset to visually explore:

  • How many restaurants have been inspected in each ZIP code area;
  • Of those restaurants, how many have scored between 70 and 80, 80 and 90, and 90 and 100, as well as how many have achieved a perfect score of 100;
  • And finally, the average scores for ZIP code locations in the Austin area (a sketch of these aggregations follows the chart list below).

For each of these goals we will use, respectively, a:

  • pie chart;
  • grouped bar chart;
  • geolocation map.
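Before we turn to the reporting tools, here is a minimal pandas sketch of the three aggregations, for readers who want to reproduce them outside KNIME. The file name and the column names “Zip Code” and “Score” are assumptions about the dataset’s layout.

    # A minimal pandas sketch of the three aggregations above.
    # File and column names are assumptions about the Austin dataset.
    import pandas as pd

    df = pd.read_csv("Restaurant_Inspection_Scores.csv")  # hypothetical file name

    # 1. how many inspections were recorded in each ZIP code
    per_zip = df.groupby("Zip Code")["Score"].count()

    # 2. counts per score bucket: [70,80), [80,90), [90,100), and exactly 100
    buckets = pd.cut(df["Score"], bins=[70, 80, 90, 100, 101],
                     right=False, labels=["70-80", "80-90", "90-100", "100"])
    per_bucket = buckets.value_counts().sort_index()

    # 3. average score per ZIP code
    avg_per_zip = df.groupby("Zip Code")["Score"].mean()

    print(per_zip.head(), per_bucket, avg_per_zip.head(), sep="\n\n")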

So far so good. Now we need to choose the graphical tool for such representations.

Since data manipulation within a reporting environment might be clunky, many KNIME users prefer to run the data manipulation comfortably from KNIME Analytics Platform, and later export the data to their preferred reporting tool.

There are many ways to produce graphics in a report with KNIME Analytics Platform. Today we will consider three: native graphics in BIRT, JavaScript-based images exported to BIRT, and native graphics in Tableau.

BIRT (Business Intelligence and Reporting Tools) is a reporting tool that, to a certain extent, is distributed as open source. The KNIME Report Designer extension integrates BIRT within KNIME Analytics Platform. The “Open Report” button in the KNIME toolbar takes you to the BIRT report environment; the “KNIME” button in the BIRT report environment takes you back to the KNIME workbench.

Tableau Desktop is a reporting tool that requires a commercial license; 14-day trial licenses are also available from the Tableau Desktop site. The KNIME Tableau extension allows you to communicate with Tableau Desktop via TDE files or directly with a Tableau Server.

It is also possible to produce the graphic image in the KNIME workflow via the JavaScript nodes and subsequently export the image to the reporting tool of your choice.

In today’s Will They Blend, we want to highlight the process of generating a few simple graphics using BIRT, JavaScript, and Tableau. Can we produce three simple charts using those tools? What might that involve? Let’s find out!

Topic. Data Visualization using BIRT, Tableau, and JavaScript

Challenge. Create a pie chart, a bar chart, and a geolocation map using BIRT, Tableau, and JavaScript

Access Mode. KNIME Tableau extension, Report Designer extension, JavaScript Views extension, Open Street Map Integration extension

Solving a Kaggle Challenge using the combined power of KNIME Analytics Platform & H2O

Mon, 06/04/2018 - 14:04 Marten Pfannen…

Some time ago, we set our mind to solving a popular Kaggle challenge offered by a Japanese restaurant chain: predict how many future visitors a restaurant will receive.

This is a classic demand prediction problem: how much energy will be required in the next N days, how many milk boxes will be in demand tomorrow, how many customers will visit our restaurants tonight? We already know how to use KNIME Analytics Platform to solve this kind of time series analytics problem (see the whitepaper on energy prediction). So, this time we decided to go for a different approach: a mixed approach.

Thanks to the open architecture of KNIME Analytics Platform, we can plug in almost any open source analytics tool, such as Python, R, or Weka, to name just three prominent examples - and, more recently, also H2O.

We already developed a cross-platform ensemble model to predict flight delays (another popular challenge). Here, cross-platform means that we trained a model with KNIME, a model with Python, and a model with R. These models from different platforms were then blended together as an ensemble model in a KNIME workflow. Indeed, one of KNIME Analytics Platform’s many qualities consists of its capability to blend data sources, data, models, and, yes, also tools.
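As a rough illustration of what such blending amounts to, the sketch below averages the predictions of three separately trained models into one ensemble score. The arrays are invented placeholders; in the flight-delay example the blending happens inside a KNIME workflow.

    # A minimal sketch of blending model predictions into an ensemble
    # by simple averaging. All values below are hypothetical.
    import numpy as np

    pred_knime  = np.array([0.62, 0.10, 0.85])  # scores from a KNIME-trained model
    pred_python = np.array([0.58, 0.20, 0.90])  # scores from a Python-trained model
    pred_r      = np.array([0.66, 0.15, 0.80])  # scores from an R-trained model

    ensemble = np.mean([pred_knime, pred_python, pred_r], axis=0)
    print(ensemble)  # blended ensemble scores, one per example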

For this restaurant demand prediction challenge we decided to raise the bar and develop a solution using the combined power of KNIME Analytics Platform and H2O.

Stuck in the Nine Circles of Hell? Try Parameter Optimization & A Cup of Tea

Mon, 05/28/2018 - 10:14 daria.goldmann

Every data scientist has been there: a new data set and you’re going through nine circles of hell trying to build the best possible model. Which machine learning method will work best this time? What values should be used for the hyperparameters? Which features would best describe the data set? Which combination of all of these would lead to the best model? There is no single right answer to these questions because, as we know, it’s impossible to know a priori which method or features will perform best for any given data set. And that is where parameter optimization comes in.

Parameter optimization is an iterative search for the set of hyperparameters of a machine learning method that leads to the most successful model based on a user-defined optimization function.

Here, we introduce an advanced parameter optimization workflow that uses four common machine learning methods, individually optimizes their hyperparameters, and picks the best combination for the user. In the current implementation, the choice of features and one hyperparameter per method are optimized. However, we encourage you to use this workflow as a starting point or template, even if you have completely different data, and to customize it by including additional parameters in the optimization loop (we will show where you can do that); a sketch of the underlying idea follows.
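For readers more comfortable in code than in workflows, here is a minimal scikit-learn sketch of the same idea: one illustrative hyperparameter grid per method, tuned by cross-validated grid search. The grids and dataset are assumptions for illustration; the blog’s implementation uses KNIME’s parameter optimization loop nodes instead.

    # A minimal sketch of the optimization idea: tune one hyperparameter
    # per method, then keep the best method/parameter combination.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

    candidates = [
        (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100, 200]}),
        (LogisticRegression(max_iter=5000),      {"C": [0.01, 0.1, 1.0]}),
        (KNeighborsClassifier(),                 {"n_neighbors": [3, 5, 9]}),
        (DecisionTreeClassifier(random_state=0), {"max_depth": [3, 5, 10]}),
    ]

    best = None
    for model, grid in candidates:
        search = GridSearchCV(model, grid, cv=5, scoring="accuracy").fit(X, y)
        if best is None or search.best_score_ > best.best_score_:
            best = search

    print(best.best_estimator_, round(best.best_score_, 3))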

Will They Blend? Experiments in Data & Tool Blending. Today: The GDPR Force Meets Customer Intelligence – Is It The Dark Side?

Tue, 05/22/2018 - 15:17 admin

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: The GDPR Force Meets Customer Intelligence – Is It The Dark Side?

Authors: Rosaria Silipo and Phil Winters

The Challenge

The European Union is introducing the General Data Protection Regulation on May 25, 2018. This new law has been called the world’s strongest and most far-reaching law aimed at strengthening citizens’ fundamental rights in the digital age. It applies to any organization processing personal data about a citizen or resident of the EU, regardless of that organization’s location. It is definitely a force to be reckoned with, as the fines for non-compliance are extremely high.

Many of the law’s Articles deal with how personal data are collected and processed, and they place special emphasis, guidelines, and restrictions on what the EU calls “automatic profiling with personal data”. So what has that to do with us data scientists? Many of us use machine learning methods to create new fact-based insight from the customer or personal data we collect, so that we can make decisions, possibly automatically. So the law applies to us. But that Customer Intelligence we create is fundamental to the successful running of our organization’s business.

Does this mean this new force will put limits on what we do? Or, even worse, does it mean that we have to go over to the dark side to help our businesses? Absolutely not! The new law requires that we gain permission, perform tests, and document our work. In no way does it restrict what honest data scientists can do. If anything, it provides us with opportunities.

Topic. GDPR meets Customer Intelligence

Challenge. Apply machine learning algorithms to create new fact-based insights taking the new GDPR into account

Keyword Extraction for Understanding

Mon, 05/07/2018 - 10:54 Vincenzo

Hi! My name is Emil, I am a Teacher Bot, and I can understand what you are saying.

Remember the first post of this series? There I described the many parts that make me, or at least the KNIME workflow behind me (Fig. 1). A part of that workflow was dedicated to understanding. This is obviously a crucial step, because if I cannot understand your question I will likely be unable to answer it.

Understanding consists mainly of text processing operations: text cleaning, Part-Of-Speech (POS) tagging, tagging of special words, lemmatization and, above all, keyword extraction (a small code sketch of these steps follows the figure).

Figure 1. Emil, the Teacher Bot. Here is what you need to build one: a user interface for question and answer, text processing to parse the question, a machine learning model to find the right resources, and optionally a feedback mechanism.
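As a rough illustration of these preprocessing steps, here is a minimal Python sketch that uses spaCy as a stand-in for the KNIME Text Processing nodes; the example question is invented.

    # A minimal sketch of the 'understanding' steps: crude cleaning,
    # POS tagging, and lemmatization. spaCy stands in for the KNIME
    # Text Processing nodes described in this post.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English model
    doc = nlp("How do I read an Excel file into KNIME Analytics Platform?")

    for token in doc:
        if token.is_stop or token.is_punct:  # crude text cleaning
            continue
        print(token.text, token.pos_, token.lemma_)  # word, POS tag, lemma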

Keywords are routinely used for many purposes, like retrieving documents during a web search or summarizing documents for indexing. Keywords are the smallest units that can summarize the content of a document and they are often used to pin down the most relevant information in a text.

Automatic keyword extraction methods are widespread in Information Retrieval (IR) systems, Natural Language Processing (NLP) applications, Search Engine Optimization (SEO), and Text Mining. The idea is to reduce the set of words representing a text from the full list, i.e. the one that comes out of the bag-of-words technique, to a handful of keywords. The advantage is clear: if the keywords are chosen carefully, the dimensionality of the text representation is drastically reduced, while its information content is not.
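One simple way to pick such keywords is to rank terms by TF-IDF, as in the hedged sketch below. The documents are invented, and this is just one scoring scheme among the several keyword extraction algorithms discussed in this series.

    # A minimal sketch of reducing a bag of words to a few keywords by
    # ranking terms with TF-IDF. The documents are invented examples.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "KNIME workflows blend data from many different sources",
        "keyword extraction reduces a document to its most relevant terms",
        "machine learning models are trained on the extracted keywords",
    ]

    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(docs)

    # keep the three highest-scoring terms per document as its keywords
    terms = vec.get_feature_names_out()
    for row in tfidf.toarray():
        top = row.argsort()[::-1][:3]
        print([terms[i] for i in top])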

An Ontology for Emil the Teacher Bot

Mon, 04/30/2018 - 10:00 rs

Hi! My name is Emil and I am a Teacher Bot. I am here to point you to the right training materials for your early questions on how to use KNIME.

In a previous post, I described the KNIME workflow that built me. In that post the focus was on the details of my deployment (Fig. 1). A brain was mentioned, but very little explanation was given on the how-to for assembling it. The moment has arrived to give you some of these details and, in particular, to explain how my brain was conceptually designed to give answers.

Remember? It all starts with a question, your question. After parsing it and understanding it with the help of the KNIME Text Processing extension, queries are sent to my brain for possible answers.

Figure 1. Emil, the Teacher Bot. Here is what you need to build one: a user interface for question and answer, text processing to parse the question, a machine learning model to find the right resources, and optionally a feedback mechanism.

Educational Resources on the KNIME site

My ultimate goal is to provide you with the one and only web tutorial that perfectly answers your question. Let’s start then with the tutorial material that is available on the KNIME website.

First of all, of course, there is the e-learning course. It consists of 7 chapters (so far), covering everything from installing KNIME Analytics Platform and its extensions to data access, from ETL procedures to machine learning, from data visualization to flow control: there is a good chance the topic you are looking for is here.

Another frequently visited resource is the KNIME blog, which started in 2014 with the mission of producing nuggets of data science, machine learning, and KNIME tools.

Then there is the Node Guide. Similar to the blog, this resource contains even more atomic pieces of information on the usage of single nodes and single use cases.

Finally, there are all the books from KNIME Press, covering basic and advanced KNIME, text processing, data blending, ETL, and more.

Data Chef ETL Battles. What can be prepared with today’s data? Ingredient Theme: A Social Forum. Sentiment vs. Influence

Mon, 04/23/2018 - 08:27 admin

Do you remember the Iron Chef battles?

It was a televised series of cook-offs in which famous chefs rolled up their sleeves to compete in making the perfect dish. Based on a set theme, this involved using all their experience, creativity, and imagination to transform sometimes questionable ingredients into the ultimate meal.

Hey, isn’t that just like data transformation? Or data blending, or data manipulation, or ETL, or whatever new name is trending now? In this new blog series, requested by popular vote, we will ask two data chefs to use all their knowledge and creativity to compete in extracting a given data set’s most useful “flavors” via reductions, aggregations, measures, KPIs, and coordinate transformations. Delicious!

Want to find out how to prepare the ingredients for a delicious data dish by aggregating financial transactions, filtering out uninformative features or extracting the essence of the customer journey? Follow us here and send us your own ideas for the “Data Chef Battles” at datachef@knime.com.

Ingredient Theme: A Social Forum. Sentiment vs. Influence

Authors: Rosaria Silipo & Kilian Thiel
Data Chefs: Haruto and Momoka

Ingredient Theme: A Social Forum

Today we have decided to go vintage and show the analysis implemented in the first KNIME whitepaper, by Tobias Koetter, Kilian Thiel, and Phil Winters, where text processing met network analytics.

We propose the data from the year 1999 of the Slashdot News Forum. Slashdot (sometimes abbreviated as “/.”) is a social news website founded in 1997, focusing on science and technology. Users can post news and stories about diverse topics and receive online comments from other users (cf. Wikipedia).

Some years ago, we started a debate on whether the loudest customers were as important as everybody – the loudest customers themselves included – thought. We started looking for public data on customer interactions about a given product and stumbled upon the Slashdot dataset. Users in the Slashdot data set are not strictly customers; they interact via a social forum about a given topic. If the topic were a product, they would be customers. So, assuming that talking about a product is a particular instance of talking about a generic topic, we decided to adopt the Slashdot data set for the analysis. We propose this same data set again for today’s challenge.

Text Stream Visualization - All the Better to See You With!

Mon, 04/16/2018 - 10:29 admin

Authors: Andisa Dewi and Kilian Thiel

Practically everyone has heard of Little Red Riding Hood, a fairy tale of an encounter between a young girl and the Big Bad Wolf. One of the most popular versions of this tale was written by the Brothers Grimm. But what has a simple folk tale got to do with text stream visualization? And what is stream visualization?

Stacked area charts, or stream visualizations, are very useful for showing how topics in a single document, or in a set of documents, change and develop over time. Each document is assigned to a single point in time, for example its publication date, and then specific topics or keywords can be visualized “stacked” on top of each other based on their frequency. A single chart is usually based on multiple documents.

So what do you think about reading the story of Little Red Riding Hood through a stream visualization? Taking the main characters of the story as our keywords, the visualization would show where the characters occur and co-occur, revealing their respective importance. This would give an idea of the course of the story without actually reading it!

The reason Little Red Riding Hood works well as an example is that it has a fairly simple storyline with only 5 main characters: Little Red Riding Hood, her mother, her grandmother, the Big Bad Wolf, and a hunter. A small sketch of such a chart follows.
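To make the idea tangible, here is a minimal matplotlib sketch of a stacked area chart over character mentions. The counts are invented placeholders, not taken from the Grimm text.

    # A minimal sketch of a stream (stacked area) visualization over
    # keyword frequencies. All counts below are invented placeholders.
    import matplotlib.pyplot as plt

    sections = range(1, 6)  # the story split into five consecutive sections
    freq = {
        "Little Red Riding Hood": [4, 3, 2, 1, 2],
        "Mother":                 [2, 1, 0, 0, 0],
        "Grandmother":            [1, 0, 2, 2, 1],
        "Big Bad Wolf":           [0, 2, 4, 3, 1],
        "Hunter":                 [0, 0, 0, 2, 2],
    }

    plt.stackplot(sections, *freq.values(), labels=list(freq.keys()))
    plt.xlabel("Story section")
    plt.ylabel("Character mentions")
    plt.legend(loc="upper left")
    plt.show()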

Emil the Teacher Bot

Mon, 03/19/2018 - 15:19 admin

Authors: Rosaria Silipo, Vincenzo Tursi, Kathrin Melcher, and Phil Winters

Hi! My name is Emil and I am a Teacher Bot.

I was built to answer your early questions on how to use KNIME. Pardon!

I was built to point you to the right training material to help you answer your early questions on how to use KNIME.

By the way, I myself was built entirely using KNIME. So, I should know where the right answers lie in the midst of all the tutorials, videos, blog posts, whitepapers, example workflows, and more that are available out there.

It was not so hard to build me. You just needed:

  • a user interface - possibly web or speech based - for you to ask your question
  • a text parser for me to understand your question
  • a brain to find the right training material to answer your question
  • a user interface to provide the answer back
  • a nice-to-have - but not necessary - feedback option on whether my answer was of any help.