26 Feb 2018 rs

This blog post is an extract from chapter 6 of the book “From Words to Wisdom. An Introduction to Text Mining with KNIME” by V. Tursi and R. Silipo, to be published in March 2018 by KNIME Press. The book will premiere at the KNIME Summit in Berlin in March.

 

Word embedding, like document embedding, belongs to the text preprocessing phase: specifically, to the part that transforms a text into a row of numbers.

In the KNIME Text Processing extension, the Document Vector node transforms a sequence of words into a sequence of 0/1s (or frequencies), based on the presence or absence of each word in the original text. This is also called “one-hot encoding”. One-hot encoding, however, has two big problems:

  • it produces a very large data table, potentially with a huge number of columns;
  • it produces a very sparse data table, full of 0s, which can be a problem when training certain machine learning algorithms.

The Word2Vec technique was therefore conceived with two goals in mind:

  • reduce the size of the word encoding space (the embedding space);
  • compress the most informative description of each word into its representation.

Interpretability of the embedding space becomes secondary.
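To make the contrast concrete, here is a minimal Python sketch (not from the book) that builds one-hot vectors for a toy vocabulary and then trains a tiny Word2Vec model. The gensim library and its 4.x parameter names (vector_size, sg) are assumptions, as is the toy corpus.

```python
# A minimal sketch contrasting one-hot encoding with Word2Vec embeddings.
# Assumes the gensim library (4.x API); the toy corpus is made up.
import numpy as np
from gensim.models import Word2Vec

corpus = [
    ["knime", "transforms", "text", "into", "numbers"],
    ["word2vec", "embeds", "text", "into", "vectors"],
    ["documents", "become", "vectors", "of", "numbers"],
]

# One-hot encoding: one dimension per vocabulary word -> large, sparse vectors.
vocab = sorted({w for sentence in corpus for w in sentence})
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(len(vocab), one_hot["text"])      # dimension grows with the vocabulary

# Word2Vec: a fixed-size, dense embedding space (here only 8 dimensions).
model = Word2Vec(sentences=corpus, vector_size=8, window=2, min_count=1, sg=1)
print(model.wv["text"])                 # a dense 8-dimensional vector
print(model.wv.most_similar("text"))    # nearest words in the embedding space
```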

Read more


19 Feb 2018 admin

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: Chinese meets English meets Thai meets German meets Italian meets Arabic meets Farsi meets Russian. Around the world in eight languages

Authors: Anna Martin, Hayley Smith, and Mallika Bose

The Challenge

No doubt you are familiar with the adventure novel “Around the World in 80 Days”, in which British gentleman Phileas Fogg makes a bet that he can circumnavigate the world in 80 days. Today we will be attempting a similar journey. However, ours is unlikely to be quite as adventurous as the one Phileas made. We won’t be riding elephants across the Indian mainland, nor rescuing our travel companion from the circus. And we certainly won’t be getting attacked by Sioux warriors!

Our adventure will begin from our offices on Lake Constance in Germany. From there we will travel down to Italy, stopping briefly to see the Colosseum. Then across the Mediterranean to see the Pyramids of Egypt and on through the Middle East to the ancient city of Persepolis. After a detour via Russia to see Red Square in Moscow, our next stop will be the serene beaches of Thailand for a short break, before we head off to walk the Great Wall of China (or at least part of it). On the way home, we will stop in and say hello to our colleagues in the Texas office.

Like all good travelers, we want to stay up-to-date with the news the entire time. Our goal is to read the local newspapers … in the local language of course! This means reading news in German, Italian, Arabic, Farsi, Chinese, Russian, Thai, and lastly, English. Impossible you say? Well, we’ll see.

The real question is: will all those languages blend?

Topic. Blending news in different languages

Challenge. Will the Text Processing nodes support all the different encodings?

Access Mode. Text Processing nodes and RSS Feed Reader node
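If you would like to try the same idea outside KNIME, here is a minimal Python sketch that pulls RSS headlines in several languages with the feedparser library. The feed URLs are hypothetical placeholders, not the feeds used in this post.

```python
# A minimal sketch of multilingual RSS reading with feedparser.
# The feed URLs below are hypothetical placeholders.
import feedparser

feeds = {
    "de": "https://example.com/news/de/rss",
    "it": "https://example.com/news/it/rss",
    "zh": "https://example.com/news/zh/rss",
}

for lang, url in feeds.items():
    parsed = feedparser.parse(url)    # feedparser decodes the feed's
    for entry in parsed.entries[:3]:  # declared encoding into Unicode
        print(lang, entry.title)
```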

Read more


12 Feb 2018 Jeany

Exploration, analysis, visualization: this article highlights these capabilities in KNIME Analytics Platform using sunburst charts, tag clouds, and networks. We’ll use life-science data for this blog post, but everything shown here can be applied to many other kinds of datasets. So if you have a different background, we warmly invite you to keep reading. Fair warning though: if you are snacking in front of your computer, you might want to swallow first.

Before we dive into the workflow itself, here’s a bit of background information on the problem. Many human diseases are caused by genetic factors. Learning more about these factors is important, because the insight we gain can improve the chances of finding cures and help guide treatment decisions. Here we want to show an example of how to investigate disease-related genes.

First, we’ll give a quick general overview of the workflow (see Fig.1) and then explain each step using a particular disease as an example. The interactive views we describe here are accessible in two ways: via the KNIME WebPortal, and by showing the interactive view of the wrapped metanodes in KNIME Analytics Platform. The interactive views are practical because even someone who is not a KNIME user can use the workflow to interactively explore data and generate knowledge.

Figure 1. Workflow to Interactively Investigate Disease Genes. Interactive views are generated by the wrapped metanodes and are accessible via the KNIME WebPortal or via the node view. The workflow is available on the KNIME EXAMPLES server under 03_Visualization/02_JavaScript/10_Disease_Genes*.

Read more


31 Jan 2018 jonfuller

Introduction

In my previous blog post “Learning Deep Learning”, I showed how to use the KNIME Deep Learning - DL4J Integration to predict the handwritten digits from images in the MNIST dataset. That’s a neat trick, but it’s a problem that has been pretty well solved for a while. What about trying something a bit more difficult? In this blog post I’ll take a dataset of images from three different subtypes of lymphoma and classify the image into the (hopefully) correct subtype.

KNIME Deep Learning - Keras Integration brings new deep learning capabilities to KNIME Analytics Platform. You can now use the Keras Python library to take advantage of a variety of different deep learning backends. The new KNIME nodes provide a convenient GUI for training and deploying deep learning models while still allowing model creation/editing directly in Python for maximum flexibility.

The workflows mentioned in this blog post require a fair amount of computation (and waiting), so if you just want to check out the new integration, see the simple workflow here, which recapitulates the results of the previous blog post using the new Keras Integration. There are quite a few more example workflows for both DL4J and Keras in the relevant section of the Node Guide.

Right, back to the challenge. Malignant lymphoma affects many people, and among malignant lymphomas, CLL (chronic lymphocytic leukemia), FL (follicular lymphoma), and MCL (mantle cell lymphoma) are difficult for even experienced pathologists to classify accurately. A typical task for a pathologist in a hospital would be to look at those images and decide what type of lymphoma is present. In many cases, follow-up tests to confirm the diagnosis are required. An assistive technology that can guide the pathologist and speed up their job would be of great value. Freeing up the pathologist to spend their time on tasks that computers can’t do so well has obvious benefits for the hospital, the pathologist, and the patients.
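To give a flavor of what the Keras integration wraps, here is a minimal Python sketch of a small convolutional network for a three-class image problem like this one. The input size, layer sizes, and hyperparameters are illustrative assumptions, not the network used in the actual workflow.

```python
# A minimal sketch of a small Keras CNN for three classes (CLL, FL, MCL).
# Input shape and hyperparameters are assumptions for illustration only.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),        # assumed image tile size
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),    # one probability per subtype
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```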

Figure 1. The modeling process adopted to classify lymphoma images. At each stage the required components are listed.

Read more


22 Jan 2018 Kathrin

Watched all the trilogies on Netflix already? Then it’s time to change channels to KNIME TV on YouTube!

The brand new trilogy, bringing logistic regression with KNIME to your screen, is finally available in its entirety!

Call your friends, grab your popcorn and be the first to watch all three parts!

The first movie introduces the trilogy’s greatest character: the algorithm behind the Logistic Regression Learner node. The second film draws you into the intricacies of the algorithm’s configuration in the KNIME Learner Node, while the final movie sees a happy ending, featuring the various options for memory handling and the output order of coefficients.

Read more


15 Jan 2018 admin

Authors: Daria Goldmann and Greg Landrum

In a recent blog post, we discussed creating web services using KNIME Analytics Platform and KNIME Server - now we want to look at calling web services with KNIME.

Since this post is from the Life Sciences team at KNIME and we’ve been investigating ChEMBL web services recently, we’d like to use them as an example here. Please note that there is a set of community KNIME nodes for accessing ChEMBL and ChEBI and we are intentionally duplicating some of that functionality here.

ChEMBL itself is a great Open Data resource. It provides a large collection of linked information on compounds and their structures, biological targets and their sequences, and biological assays and their experimental details. The data are largely collected from scientific publications, with each entry in the database represented by a unique identifier: a ChEMBL ID. It’s all freely available for download in relational form, or it can be accessed using a REST API. That’s what we look at here.

Don’t stop reading... if you’re from another field and not really interested in ChEMBL or the data it contains! The patterns we use here for interacting with the web services and looking at the results will work for many other RESTful web APIs.
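To illustrate the pattern in plain Python before moving to the KNIME nodes, here is a minimal sketch of a single REST call with the requests library. The molecule endpoint and the response fields follow ChEMBL’s documented layout, but treat them as assumptions.

```python
# A minimal sketch of one RESTful GET against the ChEMBL web services.
# Endpoint path and response fields are assumptions based on the public
# API's documented layout.
import requests

url = "https://www.ebi.ac.uk/chembl/api/data/molecule/CHEMBL25.json"
response = requests.get(url, timeout=30)
response.raise_for_status()                  # fail loudly on HTTP errors

molecule = response.json()                   # JSON body -> Python dict
print(molecule.get("pref_name"))
print(molecule.get("molecule_structures", {}).get("canonical_smiles"))
```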

Read more


08 Jan 2018 admin

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: A Recipe for Delicious Data – Part 2: The new Google Sheets Nodes

Authors: Rene Damyon and Oleg Yasnev

Post Update!

This is the updated version of the original blog post “A Recipe for Delicious Data: Mashing Google and Excel Sheets”, using the new Google Sheets nodes available in KNIME Analytics Platform 3.5.
 

The Challenge

Remember this blog post from July 2017?

A local restaurant had been keeping track of its business in Excel in 2016 and moved to Google Sheets in 2017. The challenge then was to blend data from both sources to compare business trends in 2016 and 2017, both as monthly totals and as Year To Date (YTD) revenues.

The technical challenge of this experiment was of the “Will they blend?” type: mashing the data from the Excel and Google spreadsheets into something delicious… and digestible. The data blending was indeed possible, and easy for public Google Sheets. However, it was more cumbersome for private Google Sheets, which required a few external steps for user authentication.

Based on the experience from that blog post, a few dedicated Google Sheets nodes were built and released with the new KNIME Analytics Platform 3.5. A number of new nodes are now available to connect to, read, write, update, and append cells, rows, and columns in a private or public Google Sheet.

The technical challenge has now become easier: accessing Google Sheets with these new dedicated nodes and mashing that data with the data from an Excel sheet. Will they blend?

Topic. Monthly and YTD revenue figures for a small local business.

Challenge. Retrieve data from Google Sheets using the new Google Sheets nodes available in KNIME Analytics Platform 3.5.

Access Mode. Excel Reader node and Google Sheets Reader node for private and public documents.
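For readers who prefer code to nodes, here is a minimal pandas sketch of the blend itself: 2016 revenues from an Excel file and 2017 revenues exported from a Google Sheet are concatenated, then aggregated into monthly totals and a running YTD sum. The file names and column names are assumptions.

```python
# A minimal sketch of the Excel/Google Sheets blend in pandas.
# File names and column names (date, revenue) are assumptions.
import pandas as pd

rev_2016 = pd.read_excel("restaurant_2016.xlsx")      # Excel source
rev_2017 = pd.read_csv("restaurant_2017_gsheet.csv")  # Google Sheets export

sales = pd.concat([rev_2016, rev_2017], ignore_index=True)
sales["date"] = pd.to_datetime(sales["date"])
sales["year"] = sales["date"].dt.year
sales["month"] = sales["date"].dt.month

monthly = sales.groupby(["year", "month"], as_index=False)["revenue"].sum()
monthly["ytd"] = monthly.groupby("year")["revenue"].cumsum()  # running YTD
print(monthly)
```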

Read more


18 Dec 2017 Mallika Bose

In early November, we hosted the KNIME Fall Summit for the second time in the US, which brought together KNIME users from all over the world. Watch the live video recordings below to learn about what’s new in KNIME, what’s cooking in the KNIME Labs, and much more!

Welcome Speech by Michael Berthold, CEO of KNIME AG

Read more


11 Dec 2017 admin

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: SparkSQL meets HiveQL. Women, Men, and Age in the State of Maine

Authors: Rosaria Silipo and Anna Martin

The Challenge

After seeing the foliage in Maine, I seriously thought about moving up there, to the beauty of nature and the peace of a quieter life. I then started doing some research on Maine, its economy, and its population.

As it happens, I do have the sampled demographics data for the state of Maine for the years 2009-2014, as part of the CENSUS dataset.

I have the whole CENSUS dataset stored in an Apache Hive installation on a Cloudera cluster running on the Amazon cloud. It could then be processed on Apache Hive or on Apache Spark using the KNIME Big Data Extensions.

News!!! KNIME Big Data Extensions have been open sourced with the latest release, KNIME Analytics Platform 3.5. All Big Data nodes in the Node Repository now require no license to run. Check the “What’s new in KNIME 3.5” page for more details on the new release.

KNIME Big Data Extensions offer a variety of nodes to execute Apache Spark or Apache Hive scripts. Hive execution relies on the nodes for in-database processing, while Spark execution has its own dedicated nodes. Spark also provides an SQL integration, so SQL queries can run on the Apache Spark execution engine.

Our goal here is to investigate the age distribution of Maine residents, men and women, using SQL queries. On Apache Hive or on Apache Spark? Why not both? We could use SparkSQL to extract the men’s age distribution and HiveQL to extract the women’s age distribution. We could then compare the two distributions and see whether they show any differences.

But the main question, as usual, is: will SparkSQL queries and HiveQL queries blend?

Topic. Age distribution for men and women in the US state of Maine

Challenge. Blend results from Hive SQL and Spark SQL queries.

Access Mode. Apache Spark and Apache Hive nodes for SQL processing
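As a plain-code illustration of one half of the blend, here is a minimal PySpark sketch of the SparkSQL query for the men’s age distribution. The table and column names (census, state, sex, age) are assumptions; in the post itself, the queries run through KNIME’s Spark and Hive nodes.

```python
# A minimal PySpark sketch of the SparkSQL half of the experiment.
# Table and column names are assumptions for illustration.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("maine-age-distribution")
         .enableHiveSupport()
         .getOrCreate())

men_by_age = spark.sql("""
    SELECT age, COUNT(*) AS n
    FROM census
    WHERE state = 'Maine' AND sex = 'Male'
    GROUP BY age
    ORDER BY age
""")
men_by_age.show()
# The HiveQL half is the same query with sex = 'Female', executed on Hive;
# blending then means joining the two result tables on age.
```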

Read more


04 Dec 2017 Iris

Figure 1. You can find all date and time nodes in the Time Series category of KNIME Analytics Platform

For the KNIME Analytics Platform 3.4 release, we did a full rewrite of our Date and Time support. This blog post will introduce you to the new Date & Time integration and its features.

I will start with the new Date and Time column types, how they differ from the old Date and Time column type, and how they can be converted. Afterwards, I will talk about the features included in the new integration.

Before going into any details, I want to share my personal list of highlights with you. They are (in no particular order):

  • Time zone support
  • Multiple columns support
  • Higher flexibility for measuring time differences
  • Auto-guessing of string formats
  • Flow variable support

 

 

The new Date & Time column types

Before this release, we had a single column type that could be used for date only, time only, or date and time. Now we have a dedicated column type for each of these uses. In addition to these three column types, we now have a column type for date and time with a time zone. That makes four new column types for representing date and time:

  • Date (e.g., the first of November 2017)
  • Time (e.g., 9:30 A.M.)
  • Date & Time (e.g., the first of November 2017, 9:30 A.M.)
  • Date & Time with zone (e.g., the first of November 2017, 9:30 A.M. in Europe/Berlin).

The first three represent exactly what the old column type did; the fourth adds new functionality.
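As a rough analogy in plain Python (not KNIME code), the four column types map onto the standard library’s date and time classes; zoneinfo requires Python 3.9 or later.

```python
# A Python analogy for the four new Date & Time column types.
from datetime import date, time, datetime
from zoneinfo import ZoneInfo  # Python 3.9+

d   = date(2017, 11, 1)                           # Date
t   = time(9, 30)                                 # Time
dt  = datetime(2017, 11, 1, 9, 30)                # Date & Time
zdt = datetime(2017, 11, 1, 9, 30,
               tzinfo=ZoneInfo("Europe/Berlin"))  # Date & Time with zone

print(d, t, dt, zdt, sep="\n")
```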

Read more

