07 May 2018 | Vincenzo

Hi! My name is Emil, I am a Teacher Bot, and I can understand what you are saying.

Remember the first post of this series? There I described the many parts that make me, or at least the KNIME workflow behind me (Fig. 1). A part of that workflow was dedicated to understanding. This is obviously a crucial step, because if I cannot understand your question I will likely be unable to answer it.

Understanding consists mainly of text processing operations: text cleaning, Part-Of-Speech (POS) tagging, tagging of special words, lemmatization, and finally keyword extraction; especially keyword extraction.
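
As a minimal sketch of these steps outside of KNIME – assuming the spaCy library and its small English model, which my actual workflow does not use – the pipeline could look like this in Python:

```python
# A minimal sketch of the understanding pipeline, assuming the spaCy
# library and its small English model ("en_core_web_sm"); the actual
# workflow uses the KNIME Text Processing nodes instead.
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(question):
    doc = nlp(question.strip().lower())      # rough text cleaning
    tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:  # drop stop words, punctuation
            continue
        tokens.append((token.lemma_, token.pos_))  # lemma + POS tag
    return tokens

print(preprocess("How do I install KNIME extensions?"))
# e.g. [('install', 'VERB'), ('knime', 'PROPN'), ('extension', 'NOUN')]
```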

Figure 1. Emil, the Teacher Bot. Here is what you need to build one: a user interface for question and answer, text processing to parse the question, a machine learning model to find the right resources, and optionally a feedback mechanism.

Keywords are routinely used for many purposes, like retrieving documents during a web search or summarizing documents for indexing. Keywords are the smallest units that can summarize the content of a document and they are often used to pin down the most relevant information in a text.

Automatic keyword extraction methods are widespread in Information Retrieval (IR) systems, Natural Language Processing (NLP) applications, Search Engine Optimization (SEO), and Text Mining. The idea is to reduce the representation word set of a text from the full list of words, i.e. the list that comes out of the Bag of Words technique, to a handful of keywords. The advantage is clear: if keywords are chosen carefully, the dimensionality of the text representation is drastically reduced, while the information content is not.
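
To make the idea concrete, here is a small sketch that approximates keywords by ranking terms with TF-IDF scores via scikit-learn – one common automatic method, and only an illustration, not necessarily the method behind me:

```python
# Sketch: approximating keywords by ranking terms with TF-IDF scores
# (scikit-learn). Illustrative only; not necessarily Emil's method.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "How do I read a CSV file into KNIME?",
    "Which node trains a decision tree in KNIME?",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

# the top-scoring terms of each document serve as crude keywords
for row in tfidf.toarray():
    ranked = sorted(zip(terms, row), key=lambda t: -t[1])
    print([term for term, score in ranked[:3] if score > 0])
```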

Read more


30 Apr 2018 | rs

Hi! My name is Emil and I am a Teacher Bot. I am here to point you to the right training materials for your early questions on how to use KNIME.

In a previous post, I described the KNIME workflow that built me. In that post the focus was on the details of my deployment (Fig. 1). A brain was mentioned, but very little explanation was given on how to assemble it. The moment has arrived to give you some of these details and, in particular, to explain how my brain was conceptually designed to give answers.

Remember? It all starts with a question, your question. After parsing it and understanding it with the help of the KNIME Text Processing extension, queries are sent to my brain for possible answers.

Figure 1. Emil, the Teacher Bot. Here is what you need to build one: a user interface for question and answer, text processing to parse the question, a machine learning model to find the right resources, and optionally a feedback mechanism.

Educational Resources on the KNIME site

My ultimate goal is to provide you with the one and only web tutorial that perfectly answers your question. Let’s start then with the tutorial material that is available on the KNIME website.

First of all, of course, there is the e-learning course. The e-learning course consists of 7 chapters (so far). From installing KNIME Analytics Platform and its extensions to data access, from ETL procedures to Machine Learning, from data visualization to flow control: there is a good chance the topic you are looking for is here.

Another very much visited resource is the KNIME blog. The KNIME blog started in 2014 with the mission of producing nuggets of data science, machine learning, and KNIME tools.

Then, there is the Node Guide. Similar to the blog, this resource contains even more atomic pieces of information, covering the usage of single nodes and single use cases.

Finally, there are all the books from KNIME Press, covering basic and advanced KNIME, text processing, data blending, ETL, and more.

Read more


23 Apr 2018 | admin

Do you remember the Iron Chef battles?

It was a televised series of cook-offs in which famous chefs rolled up their sleeves to compete in making the perfect dish. Based on a set theme, this involved using all their experience, creativity, and imagination to transform sometimes questionable ingredients into the ultimate meal.

Hey, isn’t that just like data transformation? Or data blending, or data manipulation, or ETL, or whatever new name is trending now? In this new blog series, requested by popular vote, we will ask two data chefs to use all their knowledge and creativity to compete in extracting a given data set's most useful “flavors” via reductions, aggregations, measures, KPIs, and coordinate transformations. Delicious!

Want to find out how to prepare the ingredients for a delicious data dish by aggregating financial transactions, filtering out uninformative features, or extracting the essence of the customer journey? Follow us here and send us your own ideas for the “Data Chef Battles” at datachef@knime.com.

Ingredient Theme: A Social Forum. Sentiment vs. Influence

Authors: Rosaria Silipo & Kilian Thiel
Data Chefs: Haruto and Momoka

Ingredient Theme: A Social Forum

Today we have decided to go vintage and show the analysis implemented in the first KNIME whitepaper, where text processing met network analytics by Tobias Koetter, Kilian Thiel, and Phil Winters.

We propose the data from the year 1999 of the Slashdot News Forum. Slashdot (sometimes abbreviated as “/.”) is a social news website founded in 1997, focusing on science and technology. Users can post news and stories about diverse topics and receive online comments from other users (cf. Wikipedia).

Some years ago, we started a debate on whether the loudest customers were as important as everybody – including the customers themselves – thought. We started looking for public data on customer interactions about a given product and stumbled upon the Slashdot dataset. Users in the Slashdot data set are not strictly customers; they interact via a social forum about a given topic. If the topic were a product, they would be customers. So, assuming that talking about a product is a particular instance of talking about a generic topic, we decided to adopt the Slashdot data set for the analysis. We propose this same data set here again for today’s challenge.

Read more


16 Apr 2018 | admin

Authors: Andisa Dewi and Kilian Thiel

Practically everyone has heard of Little Red Riding Hood, a fairy tale of an encounter between a young girl and the Big Bad Wolf. One of the most popular versions of this tale was written by the Brothers Grimm. But what has a simple folk tale got to do with text stream visualization? And what is stream visualization?

Stacked area charts, or stream visualizations, are very useful for showing how topics in a single document, or a set of documents, change and develop over time. Each document is assigned to a single point in time, for example its publication date, and then specific topics or keywords can be visualized “stacked” on top of each other based on their frequency. A single chart is usually based on multiple documents.
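
As a rough sketch of the principle – assuming matplotlib and made-up counts, rather than the KNIME nodes used in this post – such a stacked chart could be drawn like this:

```python
# Sketch of a stacked area ("stream") chart of character mentions per
# story section, assuming matplotlib; the counts are made up purely
# for illustration, and the post itself works in KNIME, not matplotlib.
import matplotlib.pyplot as plt

sections = [1, 2, 3, 4, 5]  # consecutive parts of the tale
mentions = {
    "Red Riding Hood": [4, 3, 2, 1, 3],
    "Wolf":            [0, 2, 4, 3, 1],
    "Grandmother":     [1, 0, 2, 2, 1],
}

plt.stackplot(sections, mentions.values(), labels=mentions.keys())
plt.xlabel("Story section")
plt.ylabel("Mentions")
plt.legend(loc="upper left")
plt.show()
```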

So what do you think about reading the story of Little Red Riding Hood through a stream visualization? Taking the main characters of the story as our keywords, the visualization would show when the characters occur and co-occur, revealing their respective importance. This would give an idea of the course of the story without actually reading it!

Little Red Riding Hood works well as an example because it has a fairly simple storyline with only five main characters: Little Red Riding Hood, her mother, her grandmother, the Big Bad Wolf, and a hunter.

Read more


19 Mar 2018 | admin

Authors: Rosaria Silipo, Vincenzo Tursi, Kathrin Melcher, and Phil Winters

Hi! My name is Emil and I am a Teacher Bot.

I was built to answer your early questions on how to use KNIME. Pardon!

I was built to point you to the right training material to help you answer your early questions on how to use KNIME.

By the way, I was myself entirely built using KNIME. So, I should know where the right answers lie in the midst of all the tutorials, videos, blog posts, whitepapers, example workflows, and more, which are available out there.

It was not so hard to build me. You just needed:

  • a user interface – possibly web- or speech-based – for you to ask your question
  • a text parser for me to understand your question
  • a brain to find the right training material to answer your question
  • a user interface to provide the answer back
  • a nice-to-have – but not necessary – feedback option on whether my answer was of any help.
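
In plain Python, the flow through these parts might be sketched as follows; every function and the toy lookup table are hypothetical placeholders, not components of the real KNIME workflow:

```python
# Hypothetical skeleton of the Teacher Bot pipeline in plain Python.
# Every function and the toy lookup table below are placeholders,
# not parts of the actual KNIME workflow.

def parse_question(question):
    # stand-in for the text parser: a crude keyword split
    return [word for word in question.lower().split() if len(word) > 3]

def brain_lookup(keywords):
    # stand-in for the machine learning model: a toy resource table
    resources = {
        "install": "e-learning course, chapter on installation",
        "visualization": "KNIME blog posts on data visualization",
    }
    for word in keywords:
        if word in resources:
            return resources[word]
    return "the KNIME e-learning course"

question = input("Ask me about KNIME: ")               # question UI
print("Try:", brain_lookup(parse_question(question)))  # answer UI
input("Was my answer helpful? [y/n] ")                 # optional feedback
```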

Read more


12 Mar 2018 | Kathrin

Regularization can be used to avoid overfitting. But what actually is regularization, what are the common techniques, and how do they differ?

Well, according to Ian Goodfellow [1]:

“Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.”

In other words: regularization can be used to train models that generalize better on unseen data, by preventing the algorithm from overfitting the training dataset.

So how can we modify the logistic regression algorithm to reduce the generalization error?

Common approaches I found are Gauss, Laplace, L1, and L2. KNIME Analytics Platform supports Gauss and Laplace and, indirectly, L2 and L1: a Gaussian prior on the coefficients is equivalent to L2 regularization, and a Laplace prior is equivalent to L1.
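
To make the difference tangible, here is a small scikit-learn sketch – an illustration outside KNIME, not the setup used in this post – showing that L1 produces sparse coefficient vectors while L2 only shrinks them:

```python
# Sketch: L2 (Gaussian prior) vs. L1 (Laplace prior) regularization
# for logistic regression in scikit-learn; an illustration outside
# KNIME. C is the inverse regularization strength: smaller C means
# a stronger penalty.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

l2_model = LogisticRegression(penalty="l2", C=0.1).fit(X, y)
l1_model = LogisticRegression(penalty="l1", C=0.1,
                              solver="liblinear").fit(X, y)

# L1 drives many coefficients to exactly zero (a sparse model);
# L2 only shrinks them towards zero.
print("zero coefficients with L2:", int((l2_model.coef_ == 0).sum()))
print("zero coefficients with L1:", int((l1_model.coef_ == 0).sum()))
```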

Read more


05 Mar 2018 | berthold

Systems that automate the data science cycle have been gaining a lot of attention recently. Similar to smart home assistant systems, however, automating data science for business users only works for well-defined tasks. We do not expect home assistants to have truly deep conversations about changing topics. In fact, the most successful systems restrict the types of possible interactions heavily and cannot deal with vaguely defined topics. Real data science problems are similarly vaguely defined: only an interactive exchange between the business analysts and the data analysts can guide the analysis in a new, useful direction, potentially sparking interesting new insights and further sharpening the analysis.

Therefore, as soon as we leave the realm of completely automatable data science sandboxes, the challenge lies in allowing data scientists to build interactive systems that assist the business analyst in her quest to find new insights in data and predict future outcomes. At KNIME we call this “Guided Analytics”. We explicitly do not aim to replace the driver (or totally automate the process) but instead offer assistance and carefully gather feedback whenever needed throughout the analysis process. To make this successful, the data scientist needs to be able to easily create powerful analytical applications that allow interaction with the business user whenever their expertise and feedback are needed.

Read more


26 Feb 2018 | rs

This blog post is an extract from chapter 6 of the book “From Words to Wisdom. An Introduction to Text Mining with KNIME” by V. Tursi and R. Silipo, to be published in March 2018 by the KNIME Press. The book will be premiered at the KNIME Summit in Berlin in March.

 

Word embedding, like document embedding, belongs to the text preprocessing phase – specifically, to the part that transforms a text into a row of numbers.

In the KNIME Text Processing extension, the Document Vector node transforms a sequence of words into a sequence of 0/1 – or frequency numbers – based on the presence/absence of a certain word in the original text. This is also called “one-hot encoding”. One-hot encoding, though, has two big problems:

  • it produces a very large data table, possibly with a very large number of columns;
  • it produces a very sparse data table with a very high number of 0s, which might be a problem for training certain machine learning algorithms.
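
For comparison outside of KNIME, a rough analogue of this one-hot document encoding – assuming scikit-learn, and only as a sketch of the idea, not of the node's implementation – looks like this:

```python
# Rough analogue of the Document Vector node's 0/1 encoding,
# assuming scikit-learn; a sketch of the idea, not of the node's
# actual implementation.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the wolf met the girl", "the girl met the hunter"]

vectorizer = CountVectorizer(binary=True)  # 1 = word present, 0 = absent
matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary = one column per word
print(matrix.toarray())                    # wide, sparse 0/1 table
```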

The Word2Vec technique was therefore conceived with two goals in mind:

  • reduce the size of the word encoding space (embedding space);
  • compress into the word representation the most informative description of each word.

Interpretability of the embedding space becomes secondary.
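
As an illustration of what such an embedding looks like in code – assuming the gensim library (version 4.x), which is not the tooling used in the book – a tiny Word2Vec model can be trained like this:

```python
# Sketch: training a tiny Word2Vec model with gensim (assuming
# gensim 4.x, where the embedding size parameter is `vector_size`);
# this is not the tooling used in the book.
from gensim.models import Word2Vec

sentences = [
    ["little", "red", "riding", "hood", "met", "the", "wolf"],
    ["the", "wolf", "ate", "the", "grandmother"],
    ["the", "hunter", "rescued", "red", "riding", "hood"],
]

model = Word2Vec(sentences, vector_size=16, window=3,
                 min_count=1, epochs=50, seed=1)

print(model.wv["wolf"][:5])           # a dense, low-dimensional vector
print(model.wv.most_similar("wolf"))  # nearest words in embedding space
```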

Read more


19 Feb 2018 | admin

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with NoSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: Chinese meets English meets Thai meets German meets Italian meets Arabic meets Farsi meets Russian. Around the world in eight languages

Authors: Anna Martin, Hayley Smith, and Mallika Bose

The Challenge

No doubt you are familiar with the adventure novel “Around the World in 80 Days” in which British gentleman Phileas Fogg makes a bet that he can circumnavigate the world in 80 days. Today we will be attempting a similar journey. However, ours is unlikely to be quite as adventurous as the one Phileas made. We won’t be riding elephants across the Indian mainland, nor rescuing our travel companion from the circus. And we certainly won’t be getting attacked by Native American Sioux warriors!

Our adventure will begin from our offices on Lake Constance in Germany. From there we will travel down to Italy, stopping briefly to see the Coliseum. Then across the Mediterranean to see the Pyramids of Egypt and on through the Middle East to the ancient city of Persepolis. After a detour via Russia to see the Red Square in Moscow, our next stop will be the serene beaches of Thailand for a short break before we head off to walk the Great Wall of China (or at least part of it). On the way home, we will stop in and say hello to our colleagues in the Texas office.

Like all good travelers, we want to stay up-to-date with the news the entire time. Our goal is to read the local newspapers … in the local language of course! This means reading news in German, Italian, Arabic, Farsi, Chinese, Russian, Thai, and lastly, English. Impossible you say? Well, we’ll see.

The real question is: will all those languages blend?

Topic. Blending news in different languages

Challenge. Will the Text Processing nodes support all the different encodings?

Access Mode. Text Processing nodes and RSS Feed Reader node
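
As a rough plain-Python counterpart to the RSS reading step – assuming the feedparser library and a placeholder URL, not the KNIME RSS Feed Reader node used in the challenge – reading a feed in any language could look like this:

```python
# Rough plain-Python counterpart to the RSS reading step, assuming
# the feedparser library; the URL is a placeholder, not one of the
# feeds used in the challenge. feedparser decodes the feed's declared
# encoding and returns Unicode text, whatever the source language.
import feedparser

feed = feedparser.parse("https://example.com/news.rss")

for entry in feed.entries[:5]:
    print(entry.title)  # Unicode, regardless of the original encoding
```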

Read more


12 Feb 2018 | Jeany

Exploration, analysis, visualization: this article highlights these capabilities in KNIME Analytics Platform using sunburst charts, tag clouds, and networks. We’ll use life-science data for this blog post, but all of this can be applied to diverse kinds of datasets. So if you have a different background, we warmly invite you to keep reading. Fair warning though: if you are snacking in front of your computer, you might want to swallow first.

Before we dive into the workflow itself, here’s a bit of background information on the problem. Many human diseases are caused by genetic factors. Learning more about these factors is important, because the insight we gain can improve the chances of finding cures and help guide treatment decisions. Here we want to show an example of how to investigate disease-related genes.

First, we’ll give a quick general overview of the workflow (see Fig. 1) and then explain each step using a particular disease as an example. The interactive views we describe here are accessible in two ways: via the KNIME WebPortal, and by showing the interactive view of the wrapped metanodes in KNIME Analytics Platform. The interactive views are practical because even someone who is not a KNIME user can use the workflow to interactively explore data and generate knowledge.

Figure 1. Workflow to Interactively Investigate Disease Genes. Interactive views are generated by the wrapped metanodes and are accessible via the KNIME WebPortal or via the node view. The workflow is available on the KNIME EXAMPLES server under 03_Visualization/02_JavaScript/10_Disease_Genes.
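
For readers without KNIME at hand, a sunburst chart similar in spirit can be sketched with plotly – an assumption for illustration, with made-up columns and values, not the workflow's actual data:

```python
# Sketch: a sunburst chart with plotly express, as a stand-in for
# the KNIME view; the columns and values are made up for illustration.
import pandas as pd
import plotly.express as px

df = pd.DataFrame({
    "disease_class": ["neurological", "neurological", "metabolic"],
    "gene":          ["GENE_A", "GENE_B", "GENE_C"],
    "publications":  [12, 7, 20],
})

fig = px.sunburst(df, path=["disease_class", "gene"], values="publications")
fig.show()
```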

Read more

