In this blog series we experiment with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?
Amazon S3 meets MS Azure Blob Storage: A Match Made in the Clouds
Today let’s leave the ground and move into the clouds! When we talk about clouds, two options come immediately to mind: the Amazon cloud and the MS Azure cloud. Both clouds offer proprietary bulk repositories (data lakes) to store data: Amazon cloud uses the S3 data repository and MS Azure has loaded the Blob Storage data repository.
Let’s suppose now that for some unpredictable twist of fate, we have ended up with data on both clouds. How could we make the two clouds communicate, so as to collect all data in a single place? It is a well-known fact that clouds rarely talk to each other.
Today’s challenge is to force the Amazon cloud and the MS Azure cloud to communicate and exchange data. That is, we want to blend data stored in an S3 data lake on the Amazon cloud with data stored in a Blob Storage data lake on the MS Azure cloud. Will they blend?
In the process, we will also put a few Excel files into the blender, just to keep our feet on the ground: after all every data analytics process has to deal with an Excel file or two.
Topic. Analyze the commuting time of Maine workers from the new CENSUS file.
Challenge. Blend together CENSUS file ss13hme.csv about homes in Maine and hosted on S3 on the Amazon cloud with file ss13pme.csv about people in Maine and hosted on Blob Storage on the MS Azure cloud.
Access Mode. Connection to Amazon S3 and connection to MS Blob Storage.
The key nodes here are the Connection and the File Picker nodes; respectively the Amazon s3 Connection node and the Amazon File Picker node for one cloud service and the Azure Blob Store Connection node and the Blob Store File Picker node for the other cloud service.
- From Amazon S3
- First, the Amazon S3 Connection node connects to the Amazon S3 service using the credentials and hostname provided in its configuration window.
- Next, the Amazon S3 File Picker node builds the full path of the resource selected on S3 and exports it as a flow variable. The selected file was ss13hme.csv.
- Finally, a File Reader node uses the content of that flow variable to overwrite the URL of the file to be read, therefore accessing the file on the Amazon S3 platform.
- From MS Azure Blob Storage (same procedure with dedicated nodes as for Amazon S3)
- First, the Azure Blob Store Connection node connects to the Azure Blob Storage service using the credentials and hostname provided in its configuration window.
- Next, the Azure Blob Store File Picker node builds the full path of the resource selected on Blob Storage and exports it as a flow variable. Selected file was ss13pme.csv.
- Finally, a File Reader node uses the content of that flow variable to overwrite the URL of the file to be read, therefore accessing the file on the Azure Blob Storage platform.
- Now both data sets are available as data tables in the KNIME workflow.
- ss13pme.csv contains a number of randomly interviewed people in the state of Maine, which should correspond to 1% of the Maine population.
- ss13hme.csv contains information about the house in which the interviewed people live.
- In both files, the interviewed people are uniquely identified by the SERIAL NO attribute.
- The 2 data sets are then joined on SERIAL NO.
- We focused on the ENG, SEX, and JWMNP attributes. ENG is a code that represents fluency in the English language; SEX is a code for the person’s sex; and JWMNP contains the travel time to work. The idea is to see if there is any interesting pattern across such features. A full explanation of all attributes contained in the dataset can be downloaded from: http://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMSDataDict15.pdf . The new CENSUS dataset is publicly available and can be downloaded from http://www.census.gov/programs-surveys/acs/data/pums.html . More big data sources can be found in this article by Bernard Marr: “Big data: 33 Brilliant and Free data Sources for 2016”.
- Missing values in SEX and ENG are substituted with special code 99; JWMNP is renamed as “Travel Time to Work”.
- A Pivoting node builds the two matrices: avg(JWMNP) of ENG vs. SEX and count of ENG vs. SEX.
- SEX codes (1, 2) are mapped respectively to Male and Female, while ENG codes (1, 2, 3, 4) are mapped to text Strings describing fluency in the English language. Look up tables are stored in local Excel files.
This workflow blends data from Amazon S3 and MS Azure Blob Storage and it is downloadable from the KNIME EXAMPLES server under: 01_Data_Access/06_ZIP_and_Remote_Files/04_AmazonS3-MSBlobStorage_Census_Data01_Data_Access/06_ZIP_and_Remote_Files/04_AmazonS3-MSBlobStorage_Census_Data*
Figure 1. This workflow blends data from Amazon S3 and data from MS Azure Blob Storage. As added bonus, it also throws in some Excel data as well.
(click on the image to see it in full size)
Yes, they blend!
We were able to blend together the data coming from the ss13hme.csv file from Amazon S3 and the data from the file ss13pme.csv from Azure Blob Storage. We were also able to isolate different groups of people based on their gender and their fluency in English. Results have been displayed in the bar plot on the left in figure 2.
According to these results, we can see that if you are a woman in Maine and speak no English, your commute to work seems to be much shorter than the commute of other women who speak better English. If you are a woman who lives in Maine and speaks no English yet, my advice would be not to learn it farther if you want to keep your commute short in the future. Wait a moment … this does not make much sense!
Just to double-check, let’s see how big each one of those groups is. The number of interviewed people by sex and English knowledge is reported in figure 2 on the right. Well, the number of non-English speaking women consists of only 31 subjects. Not really the biggest statistical sample I have ever seen. Maybe, just maybe, the conclusions we drew above are not that reliable after all!
Anyway, our conclusions might have unsubstantiated, but the experiment was still successful, because … yes they blend!
Figure 2. On the left: average Travel Time to Work by English proficiency for female and male people; on the right: number of male and female interviewed people in the data set by English fluency.
(click on the image to see it in full size)
Advice. As this story has been teaching us, always keep a healthy degree of skepticism about the results produced by your analysis. There is always the possibility that a mistake or a too small data group even in a very large data set might have invalidated your results.
If you have believed the first conclusions, do not worry! Everybody can make this mistake. A full web site, named Spurious Correlations http://www.tylervigen.com/spurious-correlations, is indeed dedicated to all such impossible ridiculous conclusions that could be easily unmasked just by using some common sense.
If you want to know more about best practices in the data analytics world, to avoid the rookie mistakes, you might be interested in Dean Abbott’s talk and workshop at the upcoming KNIME Spring Summit 2017 in Berlin on March 15-17.
If you enjoyed this, please share it generously and let us know your ideas for future blends.
We’re looking forward to the next challenge. For the next challenge we will tackle the world of quickly accessible databases, by trying to blend data from MS Access and H2 database. Will they blend?
* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)