Data scientist has been named the sexiest job of the 21st century, according to an article in Harvard Business Review.
One evening in September 2016, when some colleagues and I were sitting around, amiably cleaning up data (as one often does as a data scientist), we started a discussion about what the sexiest job of the 21st century actually entails. In addition to machine learning and parallel computing, it also involves data cleaning, data integration, data preparation, and other more obscure tasks in the data science discipline.
Data integration or, as it is called nowadays, data blending is a key process to enrich the dataset and augment its dimensionality. And since more data often means better models, it is easy to see how data blending has become such an important part of data science. Data blending is an integration process and, like all integration processes, its problem is the diversity of the players involved. So bring on the players!
In a first experiment, we need to get data from a database. The next experiment might require the same data but from another database, which means speaking a slightly different SQL dialect. Including some data from a REST service API could be useful too, and when we’re talking REST APIs, we need to be able to parse and integrate XML- and JSON-formatted data structures. Of course, we can’t leave out the omnipresent Excel file. But what about a simple text file? It should be possible to integrate the data whatever its original format. The most advanced among us might also need to connect to a big data platform, which raises the question of which one, as they all rely on a slightly different version of Hive and/or Spark.
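To make the idea concrete, here is a minimal sketch in Python of blending three such players: a database table, a JSON payload of the kind a REST service might return, and a plain text (CSV) file. Everything in it is an assumption made for illustration: the `customers` table, the `customer_id` key, the `rating` and `country` fields, and the hypothetical `blend` helper. It is not the approach used in the blog post series, just one way the pieces could fit together.

```python
import csv
import io
import json
import sqlite3


def blend(db_rows, json_text, csv_text):
    """Join three sources on customer_id (hypothetical key) into a list of dicts."""
    # Start from the database rows: one record per customer.
    blended = {cid: {"customer_id": cid, "name": name} for cid, name in db_rows}

    # Enrich with the JSON payload, as a REST service might return it.
    for item in json.loads(json_text):
        cid = item["customer_id"]
        if cid in blended:
            blended[cid]["rating"] = item["rating"]

    # Enrich with the text file; CSV values arrive as strings, so cast the key.
    for row in csv.DictReader(io.StringIO(csv_text)):
        cid = int(row["customer_id"])
        if cid in blended:
            blended[cid]["country"] = row["country"]

    return [blended[cid] for cid in sorted(blended)]


# Source 1: a database (an in-memory SQLite table stands in for a real one).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
db_rows = conn.execute("SELECT customer_id, name FROM customers").fetchall()
conn.close()

# Source 2: a JSON document (made-up sample data).
json_text = '[{"customer_id": 1, "rating": 5}, {"customer_id": 2, "rating": 3}]'

# Source 3: a simple text file in CSV format (made-up sample data).
csv_text = "customer_id,country\n1,UK\n2,US\n"

result = blend(db_rows, json_text, csv_text)
print(result)
```

Even in this toy case, the CSV key has to be cast to an integer before it matches the database key, which is exactly the kind of small mismatch between players that makes blending harder than it looks.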
Integrating data of different types is another thing we data scientists have to take care of: structured and unstructured data, data from a CRM database with texts from a customer experience tool, or perhaps documents from a Kindle with images from a public repository. In these cases, we are dealing with type blending more than with data blending. And mixing data of different types, like images and text frequency measures, can be non-intuitive.
Time blending is another issue. Those of us who are something of a vestige of an older analytics era often have to blend current data with older data from legacy systems. Migrations are costly and resource-intensive, so legacy tools and legacy data easily outlast the hype and survive amidst modern technology.
Lab leaders might dream of having a single data analytics tool for the whole lab, but this is rarely the reality, which quickly takes us from data blending to tool blending. Legacy tools need to interact with a few contemporary tools operating either in the same sectors or in slightly different ones. Tool blending is the new frontier of data blending.
After so much discussion, my colleagues and I came to the conclusion that a series of blog posts to share experiences on the blending topic would help many data scientists who are running a real-life instance of the sexiest job of the 21st century.
Digging up blending memories on YouTube, we decided to experiment with the craziest blending tasks in a “Will they blend?” blog post series. All posts from the series have now been collected in a book to pass data blending know-how on to the next generation of data scientists.
I hope you will enjoy these blending stories as much as we did.