Today we look at a dataset that supposedly is already clean, joined with the right additional information, and in the right shape and we want to use it to train a prediction model. Unfortunately, a quick glance at the dataset reveals that it still has tons of missing values, it is not normalized, and contains too many too similar features.
In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?
Follow us here and send us your ideas for the next data blending challenge you’d like to see at firstname.lastname@example.org.
Today: SugarCRM meets Salesforce. Crossing Accounts and Opportunities
Businesses use Customer Relationship Management (CRM) systems to keep track of all their customer related activities – creating leads and opportunities, managing contacts and accounts, sending quotes and invoices, etc. As long as it is somehow related to the stream of revenue, it is (or at least should be) stored in a CRM system.
Since there is more than one CRM solution on the market, there is a distinct chance that your organization uses multiple CRM platforms. While there might be sound reasons for this, it also poses a significant challenge: How do you combine data from several platforms? How do you generate a single, consolidated report that shows you how well the sales activities of your whole company are going?
One option is to export some tables, fire up your spreadsheet software of choice, and paste the stuff together. Then do the same thing next week. And the week after. And the week after that one (you get the point). Doesn’t sound too enticing? Fear not! This is KNIME, and one of our specialties is to save you the frustration of doing things manually. Fortunately, both SugarCRM and Salesforce allow their users to access their services via REST API, and that is exactly what we are going to do in this blog post.
There are a couple of prerequisites here. First of all, you obviously need accounts for SugarCRM and Salesforce. If you don’t have them but still want to try this yourself, you’ll be happy to see that both companies offer free trial licenses:
You can learn more about how to use the REST APIs of SugarCRM and Salesforce here:
Topic. Get a consolidated view of all customer data from two separate platforms
Challenge. Query data from SugarCRM and Salesforce via their APIs
Access Mode. KNIME REST Web Services
We all know that just building a model is not the end of the line. However, deploying the model to put it into production is often also not the end of the story, although a complex one in itself (see our previous Blog Post on “The 7 Ways of Deployment”). Data scientists are increasingly often also tasked with the challenge to regularly monitor, fine tune, update, retrain, replace, and jump-start models - and sometimes even hundreds or thousands of models together.
The aim of this blog post is to highlight some of the key features of the KNIME Deeplearning4J (DL4J) integration, and help newcomers to either Deep Learning or KNIME to be able to take their first steps with Deep Learning in KNIME Analytics Platform.
If you’re new to KNIME, here is a link to get familiar with the KNIME Analytics Platform:
If you are new to the KNIME nodes for deep learning, you can read more in the relevant section of the Node Guide:
With a little bit of patience, you can run the example provided in this blog post on your laptop, since it uses a small dataset and only a few neural net layers. However, Deep Learning is a poster child for using GPUs to accelerate expensive computations. Fortunately DL4J includes GPU acceleration, which can be enabled within the KNIME Analytics Platform.
If you don’t happen to have a good GPU available, a particularly easy way to get access to one is to use a GPU-enabled KNIME Cloud Analytics Platform, which is the cloud version of KNIME Analytics Platform.
In the addendum at the end of this post we explain how to enable KNIME Analytics Platform to run deep learning on GPUs either on your machine or on the cloud for better performance.
The latest release of KNIME Analytics Platform 3.4 has produced many new features, nodes, integrations, and example workflows. This is all to give you a better all-round experience in data science, enterprise operations, usability, learning, and scalability.
Now, when we talk about scalability, the cloud often comes to mind. When we talk about the cloud, Microsoft Azure often comes to mind. That is the reason why KNIME has been integrating some of the Azure products and services.
The novelty of this latest release consists of the example material. If you currently access (or want to access in the future) some Microsoft products, on the cloud, in your KNIME workflow, you can start by having a look at the 11_Partners/01_Microsoft folder in the EXAMPLES server and at the following link on the KNIME Node Guide https://www.knime.com/nodeguide/partners/microsoft.
A little note for the neophytes among us. The KNIME EXAMPLES server is a public KNIME server hosting a constantly growing number of example workflows (see YouTube video “KNIME EXAMPLES Server”). If you are new to a topic, let’s say “churn prediction”, and you are looking for a quick starting point, then you could access the EXAMPLES server in the top left corner inside the KNIME workbench, download the example workflow in 50_Applications/18_Churn_Prediction (50_Applications/18_Churn_Prediction/01_Training_a_Churn_Predictor50_Applications/18_Churn_Prediction/01_Training_a_Churn_Predictor*), and update it to your data and specific business problem. It is very easy and one of the most loved features in the KNIME Analytics Platform.
We built a workflow to train a model. It works fast enough on our local, maybe not so powerful, machine. So far.
The data set is growing. Each month a considerable number of new records is added. Each month the training workflow becomes slower. Shall we start to think of scalability? Shall we consider big data platforms? Could my neat and elegant KNIME workflow be replicated on a big data platform? Indeed it can.
The KNIME Big Data Extensions offers nodes to build and configure workflows to run on the big data platform of choice. The cool feature of the KNIME Big Data Extensions consists in the nodes GUI. The configuration window for each Big Data node has been built as similar as possible to the configuration window of the corresponding KNIME node. The configuration window of a Spark Joiner node will look exactly the same as the configuration window of a Joiner node.
Thus, it is not only possible to replicate your original workflow on a Big Data Platform, it is also extremely easy, since you do not need to learn new scripts or tools instructions. The KNIME Big Data Extensions brings the ease of use of KNIME into the scalability of Big Data.
This video shows how we replicated an existing classical analytics workflow on a Big Data Platform.
The workflows used in the video can be found on the KNIME EXAMPLES server under 50_Applications/28_Predicting_Departure_Delays/02_Scaling_Analytics_w_BigData50_Applications/28_Predicting_Departure_Delays/02_Scaling_Analytics_w_BigData.knwf*
Here's a familiar predicament: you have the data you want to analyze, and you have a trained model to analyze them. Now what? How do you deploy your model to analyze your data?
In this video we will look at seven ways of deploying a model with KNIME Analytics Platform and KNIME Server. This list has been prepared with an eye toward where the output of the deployment workflow goes:
- to a file or database
- to JSON via REST API
- to a dashboard via KNIME's WebPortal
- to a report and to email
- to SQL execution via SQL recoding
- to Java byte code execution via Java recoding
- to an external application
Once you know these options, you will also know which one best satisfies your needs.
The workflows used in the video can be found on the KNIME EXAMPLES server under 50_Applications/27_Deployment_Options50_Applications/27_Deployment_Options*.
Do you remember the Iron Chef battles?
It was a televised series of cook-offs in which famous chefs rolled up their sleeves to compete in making the perfect dish. Based on a set theme, this involved using all their experience, creativity, and imagination to transform sometimes questionable ingredients into the ultimate meal.
Hey, isn’t that just like data transformation? Or data blending, or data manipulation, or ETL, or whatever new name is trending now? In this new blog series requested by popular vote, we will ask two data chefs to use all their knowledge and creativity to compete in extracting a given data set's most useful “flavors” via reductions, aggregations, measures, KPIs, and coordinate transformations. Delicious!
Want to find out how to prepare the ingredients for a delicious data dish by aggregating financial transactions, filtering out uninformative features or extracting the essence of the customer journey? Follow us here and send us your own ideas for the “Data Chef Battles” at email@example.com.
Ingredient Theme: Energy Consumption Time Series. Behavioral Measures over Time and Seasonality Index from Auto-Correlation.
Author: Rosaria Silipo
Data Chefs: Haruto and Momoka
Ingredient Theme: Energy Consumption Time Series
Let’s talk today about electricity and its consumption. One of the hardest problems in the energy industry is matching supply and demand. On the one hand, over-production of energy can be a waste of resources; on the other hand, under-production can leave people without the basic commodities of modern life. The prediction of the electrical energy demand at each point in time is therefore a very important chapter in data analytics.
For this reason, a couple of years ago energy companies started to monitor the electricity consumption of each household, store, or other entity, by means of smart meters. A pilot project was launched in 2009 by the Irish Commission for Energy Regulation (CER).
The Smart Metering Electricity Customer Behaviour Trials (CBTs) took place during 2009 and 2010 with over 5,000 Irish homes and businesses participating. The purpose of the trials was to assess the impact on consumers’ electricity consumption, in order to inform the cost-benefit analysis for a national rollout. Electric Ireland residential and business customers and Bord Gáis Energy business customers who participated in the trials, had an electricity smart meter installed in their homes or on their premises and agreed to take part in research to help establish how smart metering can help shape energy usage behaviors across a variety of demographics, lifestyles, and home sizes. The trials produced positive results. The reports are available from CER (Commission for Energy Regulation) along with further information on the Smart Metering Project. In order to get a copy of the data set, fill out this request form and email it to ISSDA.
The data set is just a very long time series: one column covers the smart meter ID, one column the time, and one column the amount of electricity used in the previous 30 minutes. The time is expressed in number of minutes from 01.01.2009 : 00.00 and has to be transformed back to one of the classic date/time formats, like for example dd.MM.yyyy : HH.mm. The original sampling rate, at which the used energy is measured, is every 30 minutes.
The first data transformations, common to all data chefs, involve the date/time conversion and the extraction of year, month, day of month, day of week, hour, and minute from the raw date.
Topic. Energy Consumption Time Series
Challenge. From time series to behavioral measures and seasonality
Methods. Aggregations at multiple levels, Correlation
Data Manipulation Nodes. GroupBy, Pivoting, Linear Correlation, Lag Column
If you are a KNIME Server customer you probably noticed that the changelog file for the KNIME Server 4.5 release was rather short compared to the one in previous releases. This means by no means that we were lazy! Together with introducing new features and improving existing features, we also started working on the next generation of KNIME Servers. You can see a preview of what is there to come in the so-called distributed executors. In this article I will explain what a distributed executor is and how it can be useful to you.
2019-03-21: We added more comprehensive instructions that will be continually updated. Check out the new documentation here.