In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source DevOps on the cloud with protected internal legacy tools, SQL with NoSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?
Follow us here and send us your ideas for the next data blending challenge you’d like to see at email@example.com.
Today: Local vs. remote files. Will blending overcome the distance?
Today’s challenge is distance: physical, geographical distance … between people and between compressed files.
Distance between people can easily be solved by any type of transportation. A flight before Christmas can take you back to your family just in time for the celebrations. What happens though if the flight is late? Better choose your airline carrier carefully to avoid undesired delays!
Distance between compressed files can easily be solved by KNIME. A few appropriate nodes can establish the right HTTP connection, download the file, and bring it home to the local files.
The goal is to visualize the ratio of departure delays from Chicago airport by carrier in a classic bar chart. We will take the data from the airline dataset and focus on two years only: 2007 and 2008. I worked on this dataset for another project, so I already have the data for 2008 zipped and stored locally on my laptop. I am missing the data for 2007, but I can get them via the URL of the original web site.
So on the one hand I have a ZIP file with the 2008 data from the airline dataset here on my laptop. On the other hand I have a link to a ZIP file with the 2007 data on some server in some remote location, possibly close to the North Pole. Will KNIME bridge the distance? Will they blend?
Topic. Departure delays by carrier.
Challenge. Collect airline data for 2007 and 2008 and display departure delay ratio by carrier from Chicago airport.
Access Mode. One file is accessed locally and one file is accessed remotely via an HTTP connection.
- Access the LOCAL file for year 2008 of airline data
Airline data for year 2008 were downloaded from http://stat-computing.org/dataexpo/2009/the-data.html onto my machine a few weeks ago for a previous experiment. The data were still zipped, so I used:
- an Unzip Files node to unzip the file content into the knime.workspace folder
- a classic File Reader node to read the content of the unzipped file and import it into the KNIME workflow.
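For readers who want to see the same two steps outside KNIME, here is a minimal Python sketch of "unzip, then read" using only the standard library. It is not what the workflow nodes do internally. The function name `read_zipped_csv`, the archive member `2008.csv`, and the column names are assumptions made for illustration, and a tiny in-memory archive stands in for the real file.

```python
import csv
import io
import zipfile


def read_zipped_csv(zip_source, member=None):
    """Return the rows of a CSV file stored inside a ZIP archive as dicts.

    zip_source may be a path or a file-like object. If member is None,
    the first entry of the archive is read.
    """
    with zipfile.ZipFile(zip_source) as archive:
        name = member or archive.namelist()[0]
        with archive.open(name) as raw:
            # Wrap the binary stream so the csv module receives text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Stand-in archive built in memory (hypothetical content and column names),
# so the sketch runs without the real 2008 file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "2008.csv",
        "Year,UniqueCarrier,Origin,DepDelay,Cancelled\n2008,AA,ORD,12,0\n",
    )
buffer.seek(0)

rows_2008 = read_zipped_csv(buffer)
```

With a real archive on disk, the call would take the path to the ZIP file instead of the in-memory buffer.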
- Access the REMOTE file via HTTP connection for year 2007 of airline data
The data for 2007 still have to be downloaded. They were still available on the original URL http://stat-computing.org/dataexpo/2009/2007.csv.bz2 so I could download them via an HTTP connection.
All nodes that deal with remote files can be found in IO/File Handling/Remote in the Node Repository. This sub-category contains nodes to upload, download, delete, and change files in a remote location. From KNIME Analytics Platform 3.3 you’ll also find connectors for Amazon S3 and Microsoft Azure Blob Storage.
In this case:
- We first established an HTTP connection to the server URL (http://stat-computing.org) through an HTTP Connection node
- We then downloaded the required file using the Download node
- The downloaded file was compressed. So we used the Unzip Files node to extract it to a local location
- Finally, we used a classic File Reader node, as for the local file, to read the file content and import it into a KNIME data table
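The four steps above (connect, download, decompress, read) can be sketched in plain Python as follows. This is an illustration under assumptions, not a description of the KNIME nodes. The URL in this post ends in .csv.bz2, so the sketch assumes bzip2 compression. The helper names `download` and `parse_bz2_csv` and the sample columns are made up for the example, and the URL may no longer resolve.

```python
import bz2
import csv
import io
import urllib.request

# URL quoted in this post; it may no longer be reachable.
URL_2007 = "http://stat-computing.org/dataexpo/2009/2007.csv.bz2"


def download(url, timeout=60):
    """Fetch the raw bytes behind a URL over HTTP."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_bz2_csv(payload, encoding="utf-8"):
    """Decompress bzip2 bytes and parse them as CSV rows (dicts)."""
    text = bz2.decompress(payload).decode(encoding)
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Offline stand-in for the downloaded payload (hypothetical content),
# so the sketch runs without network access.
sample_payload = bz2.compress(
    b"Year,UniqueCarrier,Origin,DepDelay,Cancelled\n2007,EV,ORD,25,0\n"
)
rows_2007 = parse_bz2_csv(sample_payload)
```

For the real file, the call would be `parse_bz2_csv(download(URL_2007))`. Holding a full year of flights in memory is only reasonable for a sketch; a streaming reader would be a better fit for the complete dataset.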
- Blend the two data sets
Now, the lower branch of the workflow (Fig. 1) deals with the 2008 airline data from the local file, while the upper branch handles the 2007 airline data from the remote file. After removing all cancelled flights on both sides, we used a Concatenate node to put both data sets into a single data table.
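As a rough equivalent of the blending step, the Python sketch below drops cancelled flights, concatenates the two years, and computes a delay ratio per year and carrier. The post does not define "departure delay ratio", so the sketch assumes it means the share of flights with a departure delay greater than zero minutes. The column names (`Year`, `UniqueCarrier`, `Origin`, `DepDelay`, `Cancelled`), the airport code `ORD`, and the sample rows are assumptions for illustration, not actual data.

```python
from collections import defaultdict


def delay_ratio_by_carrier(rows, origin="ORD"):
    """Share of delayed departures per (year, carrier) for one origin airport."""
    counts = defaultdict(lambda: [0, 0])  # (year, carrier) -> [delayed, total]
    for row in rows:
        # Keep only flights from the chosen airport that were not cancelled.
        if row["Origin"] != origin or row["Cancelled"] == "1":
            continue
        # Skip rows without a usable delay value.
        if row["DepDelay"] in ("", "NA"):
            continue
        key = (row["Year"], row["UniqueCarrier"])
        counts[key][1] += 1
        if float(row["DepDelay"]) > 0:
            counts[key][0] += 1
    return {key: delayed / total for key, (delayed, total) in counts.items()}


def flight(year, carrier, origin, dep_delay, cancelled):
    """Build one hypothetical flight record."""
    return {"Year": year, "UniqueCarrier": carrier, "Origin": origin,
            "DepDelay": dep_delay, "Cancelled": cancelled}


# Made-up sample rows standing in for the two branches of the workflow.
rows_2007 = [
    flight("2007", "EV", "ORD", "25", "0"),
    flight("2007", "EV", "ORD", "-3", "0"),
    flight("2007", "DL", "ORD", "NA", "1"),   # cancelled, dropped
    flight("2007", "DL", "ATL", "40", "0"),   # other airport, dropped
]
rows_2008 = [
    flight("2008", "EV", "ORD", "0", "0"),
    flight("2008", "DL", "ORD", "5", "0"),
]

blended = rows_2007 + rows_2008          # the "Concatenate" step
ratios = delay_ratio_by_carrier(blended)
```

The resulting dictionary maps each (year, carrier) pair to a value between 0 and 1, which is the shape of data a grouped bar chart needs.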
Figure 1. This workflow successfully blends data from a local and a remote file location. The remote file is downloaded through an HTTP connection and then unzipped and read like the local file.
Figure 2. Bar chart of departure delay ratio by carrier for year 2007 and 2008.
The workflow is available on the KNIME EXAMPLES server under 01_Data_Access/06_ZIP_and_Remote_Files/03_ZIP_Local_vs_remoteHTTP*.
Yes, they blend!
Looking at the chart, we can see that if you had taken an ExpressJet (EV) flight from Chicago in 2007, you would have been delayed at departure one time out of two. Things looked better one year later, in 2008. Delta and Northwest were the most reliable airlines for departures from Chicago O’Hare airport in 2007 and 2008, respectively.
In this post we can safely conclude that KNIME has overcome the distance problem between two compressed files and successfully blended them to build a bar chart about airline departure delay ratios.
Again the most important conclusion is: Yes, they blend!
If you enjoyed this, please share it generously and let us know your ideas for future blends.
We’re looking forward to the next challenge. There we will try to blend data from Amazon S3 with data from Microsoft Azure Blob Storage. Will they blend?
* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)