KNIME news, usage, and development

15 Jan 2018 - admin

Authors: Daria Goldmann and Greg Landrum

In a recent blog post, we discussed creating web services using KNIME Analytics Platform and KNIME Server - now we want to look at calling web services with KNIME.

Since this post is from the Life Sciences team at KNIME and we’ve been investigating ChEMBL web services recently, we’d like to use them as an example here. Please note that there is a set of community KNIME nodes for accessing ChEMBL and ChEBI and we are intentionally duplicating some of that functionality here.

ChEMBL itself is a great Open Data resource. It provides a large collection of linked information on compounds and their structures, biological targets and their sequences, biological assays and their experimental details. The data are largely collected from scientific publications with each entry in the database represented by a unique identifier - a ChEMBL ID. It’s all freely available for download in relational form or can be accessed using a REST API. That’s what we look at here.

Don’t stop reading if you’re from another field and not really interested in ChEMBL or the data it contains! The patterns we use here for interacting with the web services and looking at the results will work for many other RESTful web APIs.
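The general pattern of building a request URL and picking fields out of a JSON reply can be sketched in Python. This is only an illustration, not the KNIME workflow from the post: the base URL and the `<resource>/<ID>.json` path reflect our understanding of the public ChEMBL REST API and should be checked against its documentation, and the sample reply with its field names is a hand-written placeholder, not real API output.

```python
import json
from urllib.parse import quote

# Assumed base URL of the ChEMBL REST API; verify against the official docs.
BASE_URL = "https://www.ebi.ac.uk/chembl/api/data"


def build_url(resource: str, chembl_id: str) -> str:
    """Build the URL for one record of a resource, requesting JSON."""
    return f"{BASE_URL}/{quote(resource)}/{quote(chembl_id)}.json"


def extract_fields(payload: str, fields: list) -> dict:
    """Parse a JSON reply and keep only the requested top-level fields."""
    record = json.loads(payload)
    return {name: record.get(name) for name in fields}


# Hypothetical, hand-written reply standing in for a real response body.
sample_reply = (
    '{"molecule_chembl_id": "CHEMBL25", "pref_name": "ASPIRIN", "max_phase": 4}'
)

url = build_url("molecule", "CHEMBL25")
info = extract_fields(sample_reply, ["molecule_chembl_id", "pref_name"])
print(url)
print(info)
```

To call a live service, the URL could be passed to `urllib.request.urlopen` and the response body handed to `extract_fields`; the same two steps apply to most RESTful APIs that return JSON.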



08 Jan 2018 - admin

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: A Recipe for Delicious Data – Part 2: The new Google Sheets Nodes

Authors: Rene Damyon and Oleg Yasnev

Post Update!

This is the updated version of the original blog post “A Recipe for Delicious Data: Mashing Google and Excel Sheets”, using the new Google Sheets nodes available in KNIME Analytics Platform 3.5.
 

The Challenge

Remember this blog post from July 2017?

A local restaurant kept track of its business in Excel in 2016 and moved to Google Sheets in 2017. The challenge was to combine data from both sources to compare business trends in 2016 and 2017, both as monthly totals and as Year To Date (YTD) revenues.
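As a rough illustration of the YTD comparison, here is a minimal Python sketch. The revenue figures are made up, and the lists simply stand in for whatever the Excel and Google Sheets readers would return; YTD revenue is computed as the running total of the monthly values.

```python
from itertools import accumulate

# Hypothetical monthly revenue totals (January to April) for the two years.
revenue_2016 = [1200.0, 1350.0, 1100.0, 1500.0]  # e.g. from the Excel file
revenue_2017 = [1300.0, 1250.0, 1400.0, 1600.0]  # e.g. from the Google Sheet


def year_to_date(monthly):
    """Return the running (cumulative) total for each month."""
    return list(accumulate(monthly))


ytd_2016 = year_to_date(revenue_2016)
ytd_2017 = year_to_date(revenue_2017)

# Print the two YTD series side by side, one line per month.
for month, (a, b) in enumerate(zip(ytd_2016, ytd_2017), start=1):
    print(f"month {month}: YTD 2016 = {a:.0f}, YTD 2017 = {b:.0f}")
```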

The technical challenge of this experiment was of the “Will they blend?” type: mashing the data from the Excel and Google spreadsheets into something delicious… and digestible. The data blending was indeed possible and easy for public Google Sheets. For private Google Sheets, however, it was more cumbersome, requiring a few external steps for user authentication.

Building on the experience from that blog post, a few dedicated Google Sheets nodes have been built and released with the new KNIME Analytics Platform 3.5. These nodes connect to a private or public Google Sheet and read, write, update, and append cells, rows, and columns.

The technical challenge has thus become easier: accessing Google Sheets with these new dedicated nodes and mashing the data with data from an Excel sheet. Will they blend?

Topic. Monthly and YTD revenue figures for a small local business.

Challenge. Retrieve data from Google Sheets using the new Google Sheets nodes available in KNIME Analytics Platform 3.5.

Access Mode. Excel Reader node and Google Sheets Reader node for private and public documents.



18 Dec 2017 - Mallika Bose

In early November, we hosted the KNIME Fall Summit for the second time in the US, which brought together KNIME users from all over the world. Watch the live video recordings below to learn about what’s new in KNIME, what’s cooking in the KNIME Labs, and much more!

Welcome Speech by Michael Berthold, CEO of KNIME AG



11 Dec 2017 - admin

Today: SparkSQL meets HiveQL. Women, Men, and Age in the State of Maine

Authors: Rosaria Silipo and Anna Martin

The Challenge

After seeing the foliage in Maine, I gave serious thought to moving up there, into the beauty of nature and the peace of a quieter life. I then started doing some research on Maine, its economy, and its population.

As it happens, I do have the sampled demographics data for the state of Maine for the years 2009-2014, as part of the CENSUS dataset.

I have the whole CENSUS dataset stored on an Apache Hive installation on a Cloudera cluster running on the Amazon cloud. It can then be processed on Apache Hive or on Apache Spark using the KNIME Big Data Extensions.

News!!! KNIME Big Data Extensions have been open sourced with the last release of KNIME Analytics Platform 3.5. All Big Data nodes in the Node Repository now require no license to run. Check the “What’s new in KNIME 3.5” page for more details on the new release.

KNIME Big Data Extensions offer a variety of nodes to execute Apache Spark or Apache Hive scripts. Hive execution relies on the nodes for in-database processing. Spark execution has its own dedicated nodes, and the extensions also provide an SQL integration to run SQL queries on the Apache Spark execution engine.

We set our goal here to investigate the age distribution of Maine residents, men and women, using SQL queries. On Apache Hive or on Apache Spark? Why not both? We could use SparkSQL to extract men’s age distribution and HiveQL to extract women’s age distribution. We could then compare the two distributions and see if they show any difference.

But the main question, as usual, is: will SparkSQL queries and HiveQL queries blend?

Topic. Age distribution for men and women in the US state of Maine

Challenge. Blend results from Hive SQL and Spark SQL queries.

Access Mode. Apache Spark and Apache Hive nodes for SQL processing
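The idea of running one SQL query per group and then blending the two result sets can be sketched in Python. This is a stand-in under stated assumptions: it uses an in-memory SQLite database instead of Hive or Spark, and the table name, column names, and sample rows are hypothetical, since the post does not show the CENSUS schema.

```python
import sqlite3

# Hypothetical sample rows: (sex, age). Not taken from the CENSUS dataset.
rows = [("M", 23), ("M", 27), ("M", 45), ("F", 31), ("F", 38), ("F", 62), ("F", 25)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (sex TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)

# Count people per 10-year age bin for one sex (integer division bins the age).
QUERY = """
    SELECT (age / 10) * 10 AS age_bin, COUNT(*) AS n
    FROM people
    WHERE sex = ?
    GROUP BY age_bin
    ORDER BY age_bin
"""
men = dict(conn.execute(QUERY, ("M",)).fetchall())    # stands in for the SparkSQL result
women = dict(conn.execute(QUERY, ("F",)).fetchall())  # stands in for the HiveQL result
conn.close()

# Blend: one row per age bin with both counts, 0 where a bin is missing.
blended = [(b, men.get(b, 0), women.get(b, 0)) for b in sorted(set(men) | set(women))]
print(blended)
```

Filling missing bins with 0 keeps the two distributions aligned, which makes a side-by-side comparison straightforward.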



04 Dec 2017 - Iris

Figure 1. You can find all date and time nodes in the Time Series category of KNIME Analytics Platform

For the KNIME Analytics Platform 3.4 release we did a full rewrite of our Date and Time support. This blog post introduces the new Date & Time integration and its features.

I will start with the new Date and Time column types, how they differ from the old Date and Time column type, and how they can be converted. Afterwards, I will talk about the features included in the new integration.

Before going into any details, I want to share my personal list of highlights with you. They are (in no particular order):

  • Time zone support
  • Multiple columns support
  • Higher flexibility for measuring time differences
  • Auto-guessing of string formats
  • Flow variable support

 

 

The new Date & Time column types

Before this release we had a single column type that could be used for either date only, time only, or date and time. Now, we have a dedicated column type for each of these uses. In addition to these three column types, we now have a column type for date and time with a time zone. We thus have four new column types for representing date and time:

  • Date (e.g., the first of November 2017)
  • Time (e.g., 9:30 A.M.)
  • Date & Time (e.g., the first of November 2017, 9:30 A.M.)
  • Date & Time with zone (e.g., the first of November 2017, 9:30 A.M. in Europe/Berlin)

The first three cover exactly what the old column type did; the fourth adds new functionality.
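For readers who think in code, the four kinds of value can be mirrored with Python's standard library. This is only an analogy, not how KNIME implements its column types; the fixed-offset fallback is an assumption added so the sketch also runs where no time zone database is installed.

```python
from datetime import date, time, datetime, timedelta, timezone

try:
    from zoneinfo import ZoneInfo
    berlin = ZoneInfo("Europe/Berlin")
except Exception:
    # Fallback if no time zone database is available: fixed UTC+1 offset,
    # which matches Berlin on this particular date only.
    berlin = timezone(timedelta(hours=1), "CET")

d = date(2017, 11, 1)                        # Date
t = time(9, 30)                              # Time
local_dt = datetime.combine(d, t)            # Date & Time (no zone attached)
zoned_dt = local_dt.replace(tzinfo=berlin)   # Date & Time with zone

print(d.isoformat())
print(t.isoformat())
print(local_dt.isoformat())
print(zoned_dt.isoformat())
# Only the zoned value can be converted unambiguously to another zone.
print(zoned_dt.astimezone(timezone.utc).isoformat())
```

The value without a zone says nothing about where the clock was read, which is why the zoned type is the one that adds new functionality.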



27 Nov 2017 - admin

Authors: Dr. Zijlstra and Dr. Angela Eeds

Lu Zheng, a senior from Hume-Fogg Magnet High School in Nashville and a participant in the School for Science and Math at Vanderbilt (SSMV), was selected as a semifinalist in the US national Siemens Competition in Math, Science and Technology for using computational analysis of tissue-based biomarkers in cancer.

Zheng’s research project, “Assessing the Utility of Computational Analysis of Tissue-Based Biomarkers to Predict Recurrence in Bladder Cancer,” was selected from nearly 2,000 projects submitted. Zheng is one of nine semifinalists from Tennessee and 491 semifinalists overall.

Zheng conducted the research under the mentorship of research assistant professor Shanna Arnold, in the laboratory of Andries Zijlstra, an associate professor in the Department of Pathology, Microbiology and Immunology and the Program of Cancer Biology at Vanderbilt University Medical Center.

Zheng’s work focused on predicting the outcome of bladder cancer patients treated by surgical resection of the bladder (cystectomy) at Vanderbilt University Medical Center. Up to 50% of these patients will develop distant disease within two years after complete surgical removal of the bladder and associated tumor. Identifying patients who require more aggressive therapy rather than watchful waiting could improve outcomes in patients whose disease is more likely to spread, prevent overtreatment of patients who will not respond to the intervention, and reduce overall healthcare costs.



20 Nov 2017 - rs

When I talk to young data science graduates, I often feel that they can train a deep learning model in 5 minutes, but have no idea where to go from there. Before and after the model training and evaluation part, there is a big grey area where ideas are confused and directions unclear.

I just returned from the ODSC Europe conference 2017 in London, where I gave a presentation and manned (or better womanned) the KNIME booth in the exhibition hall. One of the gadgets available at the KNIME booth was a series of fridge magnets mimicking the sequence of steps in a data science project. People could choose a magnet and take it back home. Guess what?

Most people chose "Analyze", some chose "Transform", but almost nobody chose "Deploy". I found this scary. Everybody trains models, but nobody makes them work in real life! I think the time has come for a review of the components of a complete data science project.

On this topic, we are planning a series of Learnathon meetup events in various locations around the world. Zurich (CH) Nov 29, London (UK), Berlin (D), New York (USA), Rome/Milan (IT), San Francisco Bay Area (USA), Sao Paulo (BR), Boston (USA), and probably in your city too!

 

Figure 1. 2018 Learnathon locations around the world!



13 Nov 2017 - admin

Today: Google Big Query meets SQLite. The Business of Baseball Games

Author: Dorottya Kiss, EPAM

The Challenge

They say if you want to know American society, first you have to learn baseball. As reported in a New York Times article, America had baseball even in times of war and depression, and it still reflects American society. Whether it is playing, watching, or betting on the games, baseball is in some way always connected to the lives of Americans.

According to Accuweather, different weather conditions play a significant role in determining the outcome of a baseball game. Air temperature influences the trajectory of the baseball; air density has an impact on the distance covered by the ball; temperature influences the pitcher’s grip; cloud coverage affects the visibility of the ball; and wind conditions - and weather in general - have various degrees of influence on the physical wellbeing of the players.

Another interesting article on Crowdhitter describes the fans’ attendance at games and how this affects the home team’s success. Fan attendance at baseball games is indeed a key factor, in terms of both emotional and monetary support. So, what are the key factors determining attendance? On a pleasant day, are fans more likely to show up in the evening or during the day, or does it all just depend on the opposing team?

Some time ago we downloaded the data about attendance at baseball games for the 2016 season from Google’s Big Query Public data set and stored them in our own Google Big Query database. For the purpose of this blending experiment we also downloaded data about the weather during games from Weather Underground and stored these data in a SQLite database.

The goal of this blending experiment is to merge attendance data at baseball games from Google Big Query with weather data from SQLite. Since we have only data about one baseball season, it will be hard to train a model for reliable predictions of attendance. However, we have enough data for a multivariate visualization of the various factors influencing attendance.

Topic. Multivariate visual investigation of weather influence on attendance of baseball games.

Challenge. Blend attendance data from Google Big Query and weather data from SQLite.

Access Mode. Database Connector node with the Simba 4.2 JDBC driver for access to Google Big Query, and the dedicated SQLite Connector node.
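The blending step itself amounts to a join on a shared game key. Here is a minimal Python sketch under stated assumptions: the attendance list stands in for a Google Big Query result, the weather table lives in an in-memory SQLite database, and all table names, column names, and values are hypothetical.

```python
import sqlite3

# Hypothetical attendance records, standing in for the Google Big Query result.
attendance = [
    {"game_id": "g1", "attendance": 31000},
    {"game_id": "g2", "attendance": 24500},
    {"game_id": "g3", "attendance": 40200},
]

# Hypothetical weather table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather (game_id TEXT PRIMARY KEY, temp_f REAL, conditions TEXT)"
)
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [("g1", 72.0, "clear"), ("g2", 55.0, "rain")],
)
cursor = conn.execute("SELECT game_id, temp_f, conditions FROM weather")
weather = {game_id: (temp, cond) for game_id, temp, cond in cursor}
conn.close()

# Inner join on game_id: keep only games that have weather data.
blended = [
    {
        **game,
        "temp_f": weather[game["game_id"]][0],
        "conditions": weather[game["game_id"]][1],
    }
    for game in attendance
    if game["game_id"] in weather
]
print(blended)
```

An inner join drops games without a weather record; a left join that keeps them with empty weather fields would be the alternative if every game must appear in the visualization.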



06 Nov 2017 - rs

Unless it is delayed, in which case, you can relax and read this vlog post.

How many flights are delayed each year?

How many flights are delayed at departure and how many are delayed at arrival?

Are some carriers more often delayed than others?

Are flights leaving on Thursdays more likely to be delayed than flights leaving on Sundays?

Are flights leaving Chicago airport more often delayed than flights leaving San José airport?

Could we use KNIME to interactively and graphically explore the airline data set and answer all - or at least most of - these questions?

Before we start with any kind of model training for more accurate predictions, it is always useful to examine the status quo and explore the kind of problem we are dealing with. This is where graphical interactive exploration comes in handy. Sunburst charts, box plots, line plots, stacked plots, scatter plots, network graphs, and other visualization techniques can offer some insights into the dataset and particularly into our delayed flights problem.
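Several of the questions above boil down to the share of delayed flights per group. The following Python sketch shows that aggregation on made-up records; the record layout, the carrier codes, and the 15-minute threshold are all assumptions for illustration and do not come from the airline data set.

```python
from collections import defaultdict

# Hypothetical flight records: (carrier, weekday, departure delay in minutes).
flights = [
    ("AA", "Thu", 25), ("AA", "Sun", 0), ("AA", "Thu", 5),
    ("UA", "Thu", 40), ("UA", "Sun", 20), ("UA", "Sun", 0),
]
DELAY_THRESHOLD_MIN = 15  # assumption: a flight counts as delayed above this


def delayed_share(records, key_index):
    """Fraction of delayed flights per group (carrier or weekday)."""
    totals = defaultdict(int)
    delayed = defaultdict(int)
    for record in records:
        key = record[key_index]
        totals[key] += 1
        if record[2] > DELAY_THRESHOLD_MIN:
            delayed[key] += 1
    return {key: delayed[key] / totals[key] for key in totals}


by_carrier = delayed_share(flights, 0)
by_weekday = delayed_share(flights, 1)
print(by_carrier)
print(by_weekday)
```

Shares like these are the numbers that a bar chart or sunburst chart would then display per carrier or weekday.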



30 Oct 2017 - rs
  • Can KNIME connect to MySQL databases?
  • Sure! KNIME Analytics Platform has dedicated connectors for a number of databases and MySQL is one of them. We also have a generic connector for many other databases. Provided the JDBC driver file, KNIME can connect to most databases through this generic database connector node.
     
  • What about Microsoft SQL Server?
  • Sure! KNIME Analytics Platform has dedicated connectors for a number of databases, including MS SQL Server. Also, provided the JDBC driver file, KNIME can connect to other databases through a generic database connector node.
     
  • What about Oracle?
  • Sure! Provided the JDBC driver file, KNIME can connect to an Oracle database through the generic database connector node.
     
  • What about MongoDB?
  • Sure! KNIME Analytics Platform has a dedicated connector for MongoDB.
     


