In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?
Follow us here and send us your ideas for the next data blending challenge you’d like to see at firstname.lastname@example.org.
Today: SparkSQL meets HiveQL. Women, Men, and Age in the State of Maine
Authors: Rosaria Silipo and Anna Martin
After seeing the foliage in Maine, I seriously gave a thought of moving up there in the beauty of nature and in the peace of a quieter life. I then started doing some research on Maine, its economy and its population.
As it happens, I do have the sampled demographics data for the state of Maine for the years 2009-2014, as part of the CENSUS dataset.
I have the whole CENSUS dataset stored on a Apache Hive installation on a Cloudera cluster running on the Amazon cloud. It could then be processed on Apache Hive or on Apache Spark using the KNIME Big Data Extensions.
News!!! KNIME Big Data Extensions have been open sourced with the last release of KNIME Analytics Platform 3.5. All Big Data nodes in the Node Repository now require no license to run. Check the “What’s new in KNIME 3.5” page for more details on the new release.
KNIME Big Data Extensions offer a variety of nodes to execute Apache Spark or Apache Hive scripts. Hive execution relies on the nodes for in-database processing. Spark execution has its dedicated nodes. However, it also provides an SQL integration to run SQL queries on the Apache Spark execution engine.
We set our goal here to investigate the age distribution of Maine residents, men and women, using SQL queries. On Apache Hive or on Apache Spark? Why not both? We could use SparkSQL to extract men’s age distribution and HiveQL to extract women’s age distribution. We could then compare the two distributions and see if they show any difference.
But the main question, as usual, is: will SparkSQL queries and HiveQL queries blend?
Topic. Age distribution for men and women in the US state of Maine
Challenge. Blend results from Hive SQL and Spark SQL queries.
Access Mode. Apache Spark and Apache Hive nodes for SQL processing