In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?
Today’s challenge is sport related. How well do you know Major League Baseball? Do you know who had the best pitching and batting record between 1985 and 1990? Do you know who has been the highest-paid baseball player of all times?
For a long time, baseball has been, and arguably still is, the most data focused sport. The most famous documentation of data analytics in Major League Baseball is, of course, the Moneyball book (and subsequent movie), but data analysis in baseball has a much longer history.
For this blending challenge, we used batting and pitching statistics for all active players from 1985 to 2015 in the National League and the American League. Such data has been made publicly available through the Sean Lahman’s website. (A loud “thank you” to all of the site contributors for making this standard baseball encyclopedia publicly available.) The Lahman Database stores player statistics as well as data about managers, birthdates, awards, all-star games, and much more.
Now, in most companies each department owns specific data, sometimes even using different separated databases. For instance, salaries and demographic data are often owned by HR, while performance metrics are owned by Operations. In this experiment, we assume the HR Department hosts salaries and demographics data in a MS Access database and the Operations Department stores the performance metrics (batting and pitching stats, among others) inside an H2 database.
MS Access is part of the Microsoft Office package and therefore available on most Windows based PCs. H2 is a relatively new open source database downloadable at http://www.h2database.com/html/download.html.
Today’s technical challenge is to attempt to blend data from a MS Access database and an H2 database. Will they blend?
Afterwards, on the blended data, we will give a short-guided tour of the KNIME WebPortal, to detect the best-paid and/or best-performing baseball players for each decade.
Topic. Best paid and best performing baseball players from 1985 to 2015.
Challenge. Blend data from MS Access and H2 databases and guide the users through the analytics on a web browser.
Access Mode. Database Connectors and KNIME WebPortal.
Accessing MS Access database
Accessing H2 database
Blending data from MS Access and H2 databases.
After some transformation and cleaning in the “Preprocessing and joining data” wrapped node, which involved computation and normalization of the average stats for each player as well as normalization of salaries through the CPI (Consumer Price Indexes), we created a sequence of 4 webpages to be displayed on the KNIME WebPortal:
The final workflow is available for download on the EXAMPLES server under EXAMPLES/01_Data_Access/02_Databases/07_MSAccess_meets_H2EXAMPLES/01_Data_Access/02_Databases/07_MSAccess_meets_H2*.
Note. If you are using a different database, just change the connector node. In the category, Database/Connector in the Node Repository, a number of dedicated connector nodes are available. If you cannot find yours, you can always use the generic Database Connector node. The Database Connector node can connect to all databases, if the appropriate JDBC driver is is provided.
Yes, they blend! The database blending was successful. The question though is: are the players worth the money?
The scatter plot in page 1 on the WebPortal (Fig. 2) shows the growth of the average salary over the years. As said, salaries have been normalized with. That is quite some growth! Was it worth it?
Let’s select a time frame on page 2, for example 1985 to 1990.
On page 3, let’s select 3 of the top 10 best-paid pitchers who were active between 1985 and 1990: Orel Hershiser, Dan Quisenberry and Dwight Gooden. Who was the best pitchers at that time?
The answer to our question is in page 4. The tables in Fig. 3, extracted from page 4, summarize the pitching stats in a radar plot for each player. Clearly, all the three selected pitchers exhibit great pitching stats, but Orel Hershiser seems to have the best records.
Now it’s your turn and you can have a look at who was the best batters for the selected time frame! Who is your favorite baseball player? Using our workflow, you can check out their batting and pitching stats, directly in KNIME Analytics Platform or on a web browser!
This WebPortal application is the proof of the successful blending of data from MS Access and H2 database. The most important conclusion of this experiment is again: Yes, they blend!
The workflow is available on the KNIME EXAMPLES server under EXAMPLES/01_Data_Access/02_Databases/07_MSAccess_meets_H2EXAMPLES/01_Data_Access/02_Databases/07_MSAccess_meets_H2*.
If you enjoyed this, please share it generously and let us know your ideas for future blends.
We’re looking forward to the next challenge. What about blending XML formatted data with JSON formatted data? Will they blend?
* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)