KNIME news, usage, and development

Data Chef ETL Battles. What can be prepared with today’s data? Ingredient Theme: Energy Consumption Time Series

Mon, 08/14/2017 - 11:55 | rs

Do you remember the Iron Chef battles?

It was a televised series of cook-offs in which famous chefs rolled up their sleeves to compete in making the perfect dish. Based on a set theme, this involved using all their experience, creativity, and imagination to transform sometimes questionable ingredients into the ultimate meal.

Hey, isn’t that just like data transformation? Or data blending, or data manipulation, or ETL, or whatever new name is trending now? In this new blog series requested by popular vote, we will ask two data chefs to use all their knowledge and creativity to compete in extracting a given data set's most useful “flavors” via reductions, aggregations, measures, KPIs, and coordinate transformations. Delicious!

Want to find out how to prepare the ingredients for a delicious data dish by aggregating financial transactions, filtering out uninformative features or extracting the essence of the customer journey? Follow us here and send us your own ideas for the “Data Chef Battles” at datachef@knime.com.

Ingredient Theme: Energy Consumption Time Series. Behavioral Measures over Time and Seasonality Index from Auto-Correlation.

Author: Rosaria Silipo
Data Chefs: Haruto and Momoka

Ingredient Theme: Energy Consumption Time Series

Let’s talk today about electricity and its consumption. One of the hardest problems in the energy industry is matching supply and demand. On the one hand, over-production of energy can be a waste of resources; on the other hand, under-production can leave people without the basic commodities of modern life. Predicting electrical energy demand at each point in time is therefore an important task for data analytics.

For this reason, a couple of years ago energy companies started to monitor the electricity consumption of each household, store, or other entity, by means of smart meters. A pilot project was launched in 2009 by the Irish Commission for Energy Regulation (CER).

The Smart Metering Electricity Customer Behaviour Trials (CBTs) took place during 2009 and 2010, with over 5,000 Irish homes and businesses participating. The purpose of the trials was to assess the impact on consumers’ electricity consumption, in order to inform the cost-benefit analysis for a national rollout. Electric Ireland residential and business customers and Bord Gáis Energy business customers who participated in the trials had an electricity smart meter installed in their homes or on their premises and agreed to take part in research to help establish how smart metering can help shape energy usage behaviors across a variety of demographics, lifestyles, and home sizes. The trials produced positive results. The reports are available from the CER, along with further information on the Smart Metering Project. To get a copy of the data set, fill out this request form and email it to ISSDA.

The data set is essentially one very long time series: one column holds the smart meter ID, one the time, and one the amount of electricity used in the previous 30 minutes. The time is expressed as the number of minutes elapsed since 01.01.2009 : 00.00 and has to be converted back into one of the classic date/time formats, for example dd.MM.yyyy : HH.mm. Energy usage is sampled every 30 minutes.

The first data transformations, common to all data chefs, involve the date/time conversion and the extraction of year, month, day of month, day of week, hour, and minute from the raw date.
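For readers who like to see this step spelled out in code, here is a minimal pandas sketch of the same preprocessing, run outside of KNIME. The file and column names are invented for illustration; only the minutes-since-01.01.2009 convention comes from the data set description above.

```python
import pandas as pd

# Illustrative file and column names (not from the original data set docs)
df = pd.read_csv("smart_meter_readings.csv")   # meter_id, minutes_offset, kwh

# Convert the raw offset (minutes since 01.01.2009 00:00) into a timestamp
df["timestamp"] = (pd.Timestamp("2009-01-01 00:00")
                   + pd.to_timedelta(df["minutes_offset"], unit="m"))

# The dd.MM.yyyy : HH.mm representation mentioned above
df["date_time"] = df["timestamp"].dt.strftime("%d.%m.%Y : %H.%M")

# Calendar fields used downstream by both data chefs
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day_of_month"] = df["timestamp"].dt.day
df["day_of_week"] = df["timestamp"].dt.dayofweek   # 0 = Monday
df["hour"] = df["timestamp"].dt.hour
df["minute"] = df["timestamp"].dt.minute

print(df.head())
```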

Topic. Energy Consumption Time Series

Challenge. From time series to behavioral measures and seasonality

Methods. Aggregations at multiple levels, Correlation

Data Manipulation Nodes. GroupBy, Pivoting, Linear Correlation, Lag Column
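To illustrate the Lag Column plus Linear Correlation recipe listed above, here is a small, self-contained Python sketch: shift a consumption series by a range of candidate lags, correlate each shifted copy with the original, and read the seasonality off the lag with the strongest auto-correlation. The series below is synthetic; in the actual challenge the 30-minute readings would first be aggregated, for example to hourly totals per meter.

```python
import numpy as np
import pandas as pd

# Synthetic hourly consumption with a daily (24 h) pattern plus noise
rng = np.random.default_rng(0)
hours = pd.date_range("2009-07-01", periods=24 * 28, freq="H")
consumption = pd.Series(
    1.0 + 0.5 * np.sin(2 * np.pi * hours.hour / 24)
    + 0.1 * rng.standard_normal(len(hours)),
    index=hours)

# Auto-correlation at every candidate lag (the Lag Column + Linear Correlation idea)
candidate_lags = range(1, 24 * 8)            # look up to 8 days back
autocorr = {lag: consumption.autocorr(lag) for lag in candidate_lags}

best_lag = max(autocorr, key=autocorr.get)
print(f"Strongest auto-correlation at lag {best_lag} hours "
      f"(r = {autocorr[best_lag]:.2f})")     # expect 24 or a multiple of it: daily seasonality
```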

Distributed executors in the next major version of KNIME Server

Mon, 08/07/2017 - 11:25 | thor

If you are a KNIME Server customer, you probably noticed that the changelog for the KNIME Server 4.5 release was rather short compared to previous releases. That by no means implies we were lazy! Alongside introducing new features and improving existing ones, we also started working on the next generation of KNIME Server. You can see a preview of what is to come in the so-called distributed executors. In this article I will explain what a distributed executor is and how it can be useful to you. I will also provide some technical details for the geeks among you, and finally I will give you a rough timeline for the distributed executors' final release.

Setting up the KNIME Python extension. Revisited for Python 3.0 and 2.0

Mon, 07/31/2017 - 09:57 | greglandrum

As part of the v3.4 release of KNIME Analytics Platform, we rewrote the Python extensions and added support for Python 3 as well as Python 2. Aside from the Python 3 support, the new nodes aren’t terribly different from a user perspective, but the changes to the backend give us more flexibility for future improvements to the integration. This blog post provides some advice on how to set up a Python environment that will work well with KNIME, and on how to tell KNIME about that environment.

The Python Environment

We recommend using the Anaconda Python distribution from Continuum Analytics. There are many reasons to like Anaconda, but the important things here are that it can be installed without administrator rights, supports all three major operating systems, and provides all of the packages needed for working with KNIME “out of the box”.

Get started by installing Anaconda from the link above. You’ll need to choose which version of Python you prefer (we recommend Python 3 if possible), but this just affects your default Python environment; you can create environments with other Python versions without doing a new install. For example, if I install Anaconda3, I can still create Python 2 environments.
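Before telling KNIME about an environment, a quick sanity check from within that environment can save some head-scratching. The snippet below only assumes that the integration hands tables over as pandas DataFrames, so pandas must be importable; treat the full package list as something to verify against the extension documentation.

```python
import sys

print("Python version:", sys.version.split()[0])

try:
    import pandas
    print("pandas version:", pandas.__version__)
except ImportError:
    print("pandas is missing - install it into this environment "
          "before pointing KNIME at it.")
```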

Will They Blend? Experiments in Data & Tool Blending. Today: A Recipe for Delicious Data: Mashing Google and Excel Sheets

Mon, 07/24/2017 - 10:47 | amartin

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: A Recipe for Delicious Data: Mashing Google and Excel Sheets

The Challenge

Don’t be confused! This is not one of the data chef battles, but a “Will they blend?” experiment - which, just by chance, happens to be on a restaurant theme again.

A local restaurant has been running its business relatively successfully for a few years now. It is a small business. An Excel sheet was enough for the full accounting in 2016. To simplify collaboration, the restaurant owner decided to start using Google Sheets at the beginning of 2017. Now, every month, she faces the same task: calculating the monthly and YTD revenues for the current year (2017, in Google Sheets) and comparing them with the corresponding prior-year values (2016, in Microsoft Excel).

The technical challenge at the center of this experiment is definitely not a trivial matter: mashing the data from the Excel and Google spreadsheets into something delicious… and digestible. Will they blend?

Topic. Monthly and YTD revenue figures for a small local business.

Challenge. Blend together Microsoft Excel and Google Sheets.

Access Mode. Excel Reader and REST Google API for private and public documents.
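To give a feel for the two access modes outside of KNIME, here is a hedged Python sketch: the 2016 workbook is read locally, while the 2017 figures come from the Google Sheets v4 REST “values” endpoint, which works with an API key for publicly shared sheets (private documents need OAuth, which the workflow in the post also covers). The file name, spreadsheet ID, API key, and cell range are placeholders.

```python
import pandas as pd
import requests

# 2016 accounting: local Excel workbook (placeholder file name)
revenue_2016 = pd.read_excel("restaurant_2016.xlsx")

# 2017 accounting: public Google Sheet via the v4 REST API (placeholder IDs)
SPREADSHEET_ID = "YOUR_SPREADSHEET_ID"
API_KEY = "YOUR_API_KEY"
url = (f"https://sheets.googleapis.com/v4/spreadsheets/"
       f"{SPREADSHEET_ID}/values/Sheet1!A1:D400?key={API_KEY}")
rows = requests.get(url, timeout=30).json()["values"]
revenue_2017 = pd.DataFrame(rows[1:], columns=rows[0])   # first row as header

# Blend the two years into one table; monthly and YTD aggregations
# (GroupBy on a parsed date column) would follow from here.
blended = pd.concat([revenue_2016.assign(year=2016),
                     revenue_2017.assign(year=2017)],
                    ignore_index=True)
print(blended.groupby("year").size())
```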

Empower Your Own Experts! Continental Wins the Digital Leader Award

Mon, 07/17/2017 - 10:21 | rs

Continental, a leading automotive supplier, recently won the Digital Leader Award 2017 in the category “Empower People” for bringing big data and analytics closer to its employees with KNIME. Arne Beckhaus is the man behind this project. Today we are lucky enough to welcome him for an interview on the KNIME blog.

Will They Blend? Experiments in Data & Tool Blending. Today: OCR on Xerox Copies meets Semantic Web. Have Evolutionary Theories changed?

Mon, 07/03/2017 - 11:04 | Dario Cannone

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: OCR on Xerox Copies meets Semantic Web. Have Evolutionary Theories changed?

Author: Dario Cannone, Data Analyst at Miriade S.p.A., Italy

The Challenge

Scientific theories are not static over time. As more research studies are completed, new concepts are introduced, new terms are created and new techniques are invented. This is of course also true for evolutionary theories. That is, evolutionary theories themselves have evolved over time!

In today’s challenge we are going to show how the theory of evolution has evolved from Darwin’s first formulation to the most recent discoveries.

The foundation stone of evolutionary biology is considered to be the book “On the Origin of Species” (1859) by Charles Darwin. This book contains the first revolutionary formulation of the theory of evolutionary biology. Even though the book produced a revolution in the approach to species evolution at the time, many of the concepts illustrated there may now seem incomplete or even obsolete. Notice that it was published in 1859, when nothing was known about DNA and very little about genetics.

In the early 20th century, indeed, the Modern Synthesis theory reconciled some aspects of Darwin’s theory with more recent research findings on evolution.

The goal of this blog post is to represent the original theory of evolution as well as the Modern Synthesis theory by means of their main keywords. Changes in the keywords used will reflect changes in the theory presented.

Scanned Xerox copies of Darwin’s book abound on the web, for example at http://darwin-online.org.uk/converted/pdf/1861_OriginNY_F382.pdf. How can we make the contents of such copies available to KNIME? This is where Optical Character Recognition (OCR) comes into play.
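The workflow in the post does this with KNIME’s image reading and OCR nodes; purely as an illustration of the idea, the snippet below runs the Tesseract engine through the pytesseract package on a single scanned page (the image file name is a placeholder for a page exported from the PDF above).

```python
from PIL import Image
import pytesseract

# Placeholder: one page of the scanned book, exported as an image
page = Image.open("origin_of_species_page_001.png")

# Requires a local Tesseract installation with the English language pack
text = pytesseract.image_to_string(page, lang="eng")
print(text[:500])
```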

On the other side, to find a summary of current evolutionary concepts we can simply query Wikipedia, or better DBpedia, using semantic web SPARQL queries.

Xerox copies on one side, read via OCR, and semantic web queries on the other side. Will they blend?

Topic. Changes in the theory of evolution.

Challenge. Blend a Xerox copy of a book with semantic web queries.

Access Mode. Image reading, OCR library, SPARQL queries.
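For the semantic web side, here is a minimal SPARQL query against the public DBpedia endpoint using the Python SPARQLWrapper package, shown only to give a flavor of the kind of request involved; the workflow in the post extracts its keywords from results of this sort.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?abstract WHERE {
      <http://dbpedia.org/resource/On_the_Origin_of_Species> dbo:abstract ?abstract .
      FILTER (lang(?abstract) = "en")
    }
""")

# Print the English abstract returned by DBpedia
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["abstract"]["value"][:300])
```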

Topic Extraction: Optimizing the Number of Topics with the Elbow Method

Mon, 06/19/2017 - 10:56 | knime_admin

Authors: Andisa Dewi and Kilian Thiel

In a social networking era where a massive amount of unstructured data is generated every day, unsupervised topic modeling has become a very important task in the field of text mining. Topic modeling allows you to quickly summarize a set of documents and see which topics appear often; at that point, human input can be helpful to make sense of the topic content. As in any other unsupervised-learning approach, determining the optimal number of topics in a dataset is a frequent problem in the topic modeling field.

In this blog post we will show a step-by-step example of how to determine the optimal number of topics using clustering and how to extract the topics from a collection of text documents, using the KNIME Text Processing extension.

You might have read one or more blog posts from the Will They Blend series. That series discusses blending data from varied data sources. In this article, we’re going to turn that idea on its head. We collected 190 documents from RSS feeds of news websites and blogs for one day (06.01.2017). We know that the documents fall largely into two categories, sports and barbeques. In this blog post we want to separate the sports documents from the barbeque documents by topic and determine which topics were most popular on that particular day. So, the question is: will they unblend?
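As a rough sketch of the elbow idea, independent of the KNIME implementation described in the post, the snippet below vectorizes a placeholder corpus, runs k-means for a range of k, and records the within-cluster sum of squares; the k at the “elbow” of that curve is then a sensible number of topics to hand to the topic extractor. Corpus and parameters are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Tiny placeholder corpus; the post works on 190 RSS documents
documents = [
    "the team won the football match with a late goal",
    "tennis final goes to a fifth set tiebreak",
    "marathon runners set a new course record",
    "slow smoked ribs on a charcoal barbecue grill",
    "the best marinade recipes for summer barbecues",
    "how to control the temperature on a kettle grill",
] * 5

tfidf = TfidfVectorizer(stop_words="english").fit_transform(documents)

# Within-cluster sum of squares for each candidate number of clusters
inertia = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(tfidf)
    inertia[k] = model.inertia_

for k, wss in inertia.items():
    print(k, round(wss, 2))   # plot k vs. wss and pick the elbow by eye
```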

The Wisdom of the KNIME Crowd: the KNIME Workflow Coach

Wed, 06/07/2017 - 10:23 | phil

Everyone who has heard of KNIME Analytics Platform knows that KNIME has nodes. Thousands of them! The resources under the Learning Hub as well as the hundreds of public examples within KNIME Analytics Platform are all designed to get you up to speed with KNIME and its nodes. But those who know best how to use KNIME nodes are KNIME users themselves. What if we could capture all their insight and experience about which nodes to use, when, and in what order, and turn it into recommendations for you? Well, that is exactly what the KNIME Workflow Coach does.

It gathers the usage data of all KNIME users who have registered to have their data collected anonymously and makes recommendations to you, the user, based on that data. Since a picture is worth a thousand words, let’s take a brief look at the Workflow Coach in the following short video:

Data Chef ETL Battles. What can be prepared with today’s data?

Mon, 05/22/2017 - 11:14 | rs

Do you remember the Iron Chef battles?

It was a televised series of cook-offs in which famous chefs rolled up their sleeves to compete in making the perfect dish. Based on a set theme, this involved using all their experience, creativity and imagination to transform sometimes questionable ingredients into the ultimate meal.

Hey, isn’t that just like data transformation? Or data blending, or data manipulation, or ETL, or whatever new name is trending now? In this new blog series requested by popular vote, we will ask two data chefs to use all their knowledge and creativity to compete in extracting a given data set's most useful “flavors” via reductions, aggregations, measures, KPIs, and coordinate transformations. Delicious!

Want to find out how to prepare the ingredients for a delicious data dish by aggregating financial transactions, filtering out uninformative features or extracting the essence of the customer journey? Follow us here and send us your own ideas for the “Data Chef Battles” at datachef@knime.com.

Ingredient Theme: Customer Transactions. Money vs. Loyalty.

Author: Rosaria Silipo
Data Chefs: Haruto and Momoka

Ingredient Theme: Customer Transactions

Today’s dataset is a classic customer transactions dataset. It is a small subset of a bigger dataset that contains all of the contracts concluded with 9 customers between 2008 and now.

The business we are analyzing is a subscription-based business. The term “contracts” refers to 1-year subscriptions for 4 different company products.

Customers are identified by a unique customer key (“Cust_ID”), products by a unique product key (“product”), and transactions by a unique transaction key (“Contract ID”). Each row in the dataset represents a 1-year subscription contract, with the buying customer, the bought product, the number of product items, the amount paid, the payment means (card or not card), the subscription start and end date, and the customer’s country of residence.

Subscription start and end dates usually span one year, the standard duration of a subscription. However, a customer can hold multiple subscriptions for different products at the same time, with license coverages overlapping in time.

What could we extract from these data? Finding out more about customer habits would be useful. What kind of information can we collect from the contracts that would describe the customer? Let’s see what today’s data chefs are able to prepare!

Topic. Customer Intelligence.

Challenge. From raw transactions, calculate each customer’s total payment amount and loyalty index.

Methods. Aggregations and Time Intervals.

Data Manipulation Nodes. GroupBy, Pivoting, Time Difference nodes.
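To make the aggregation concrete, here is a small pandas sketch of what the GroupBy and Time Difference steps compute: the total amount paid per customer plus, as one possible loyalty proxy, the number of contracts and the days covered by subscriptions. The column names other than Cust_ID, product, and Contract ID, and the loyalty definition itself, are illustrative; the data chefs may well define loyalty differently.

```python
import pandas as pd

# Placeholder file; amount_paid, start_date, end_date are assumed column names
contracts = pd.read_csv("contracts.csv",
                        parse_dates=["start_date", "end_date"])

# Time Difference: how many days each contract covers
contracts["coverage_days"] = (contracts["end_date"]
                              - contracts["start_date"]).dt.days

# GroupBy: per-customer totals and a simple loyalty proxy
per_customer = contracts.groupby("Cust_ID").agg(
    total_paid=("amount_paid", "sum"),
    n_contracts=("Contract ID", "nunique"),
    loyalty_days=("coverage_days", "sum"))

print(per_customer.sort_values("total_paid", ascending=False))
```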

The KNIME Model Process Factory

Mon, 05/08/2017 - 11:06 | knime_admin

Authors: Iris Adä & Phil Winters

The benefits of using predictive analytics are now a given. In addition, the data scientist who delivers them is highly regarded, but our daily work is full of contrasts. On the one hand, you can work with data, tools, and techniques to really dive in and understand data and what it can do for you. On the other hand, there is usually quite a bit of administrative work around accessing data, massaging data, and then putting that new insight into production - and keeping it there.

In fact, many surveys say that at least 80% of any data science project is taken up by those administrative tasks. One popular urban legend says that, within a commercial organization trying to leverage analytics, the full-time job of one data scientist amounts to building and maintaining a maximum of four (yes, 4) models in production - regardless of the brilliance of the toolset used. There is a desperate need to automate and scale the modelling process, not just because it would be good for business (after all, if you could use 29,000 models instead of just 4, you would want to!) but also because otherwise we data scientists are in for a tedious life.

At the recent KNIME Spring Summit in Berlin, one of the best-received presentations was the one on the KNIME Model Process Factory, designed to provide you with a flexible, extensible, and scalable application for running and monitoring very large numbers of model processes in an efficient way.

The KNIME Model Process Factory is composed of a white paper, an overall workflow, tables that manage all activities, and a series of workflows, examples, and data for learning to use the Factory.
