The data science dilemma: Automation, APIs, or custom data science?
As companies place an increasing premium on data science, there is some debate about which approach is best to adopt, and there is no straightforward, one-size-fits-all answer. It really depends on your organization’s needs and what you hope to accomplish.
There are three main approaches that have been discussed over the past couple of years; it’s worth taking a look at the merits and limitations of each as well as the human element involved. After all, knowing the capabilities of your team and who you’re attempting to serve with data science heavily influences how you implement it.
The more researchers (people capable of inventing new algorithms), coders (those who can actually write the underlying code to make data science “real”), and classic data scientists (folks who blend data, tools, and expertise) your organization has, the more options are available to you.
There are also solutions designed for organizations with only casual users, people who probably couldn’t create an analytical workflow from scratch but who could use a template to get started. And sometimes organizations conduct data science only by and for business users who don’t want or need to build anything or understand the data science behind it; they only want to solve or improve a real business case, often as part of an existing application.
Given your people resources and needs, let’s dive into the approaches and which may best suit your business.
Shrink-wrapped data science for business users
About a year and a half ago, we saw a push by companies attempting to automate data science. This movement was designed for business users and basically said organizations didn’t need any of the other groups; an automated solution would just magically tell them what they wanted to know. If you’re a business user, this sounds wonderful, right?
It’s not quite so simple, though. First, you need to hope that the black-box vendor who sells the system will keep up with the latest and greatest technology, so that the system grows with you and continues to provide the insights you want.
Second, and most importantly, your data have to be in good enough shape to run through that system. Surprising as it may sound, this is still one of the biggest hurdles to modern data science. We’ve been talking about the challenge of data wrangling for the past decade and still haven’t solved it. Unless you have a very standard setup, the data won’t be ready to run through the system without extra effort.
Suppose your data are in great shape, though, and you can find an automation solution that is close enough to what you want to know. You don’t need cutting-edge performance, and what you are interested in learning about is not core to your business’s bottom line; it’s OK if the results are a few percent off the optimum. In this case, automated solutions can be fantastic, as long as you recognize the limitations.
Preconfigured, trained models that tackle basic problems
Data science APIs refer to the practice of using preconfigured, trained models. Data science APIs work extremely well for predefined, standard problems; think about things like speech or image classification.
If you are interested in classifying images, for example, you shouldn’t spend the time and energy to collect millions of images to build your own classification system. That’s something you should willingly purchase as a service — you can easily rely on a company that does a great job of it, like Amazon or Google. Just be sure that the data format required by the API is supported; otherwise, things can get a bit complicated.
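As a rough illustration of the data-format check mentioned above, the sketch below sorts image files into those that can be sent as-is and those that need converting first. The `SUPPORTED_FORMATS` set and the `split_by_support` helper are hypothetical; the real list of accepted formats has to come from the documentation of whichever service you use.

```python
from pathlib import Path

# Hypothetical: formats a given image API accepts. Check the vendor's
# documentation for the real list; this set is only an assumption.
SUPPORTED_FORMATS = {".jpg", ".jpeg", ".png"}


def split_by_support(paths, supported=SUPPORTED_FORMATS):
    """Split file paths into (ready, needs_conversion) by file extension."""
    ready, needs_conversion = [], []
    for p in paths:
        suffix = Path(p).suffix.lower()  # compare case-insensitively
        if suffix in supported:
            ready.append(p)
        else:
            needs_conversion.append(p)
    return ready, needs_conversion


ready, needs_conversion = split_by_support(
    ["cat.JPG", "dog.png", "scan.tiff", "koala.bmp"]
)
print("send as-is:", ready)
print("convert first:", needs_conversion)
```

Checking the extension is only a first filter; a production check would likely also look at file size limits and the actual file contents.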
You also need to confirm that the model does what you actually need it to do; that is, that it was trained on the right type of data with the right goal in mind. If this is not the case, you might get results that are only similar to what you thought you wanted. This may or may not be sufficient for the problem at hand, of course. A model trained on European animals will still recognize cats and dogs in Australia. It may struggle with a koala, though.
Additionally, if you’re using APIs in production, you probably want to be sure the results are stable and reproducible. It would be terrible if all of a sudden one of your — so far best — customers was classified as “The Worst Ever” just because the technology underneath changed. With external data science APIs, unfortunately, you often can’t count on continuous, backwards-compatible upgrades.
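One way to notice such a change early is to keep a small set of reference inputs, store the labels the API returned for them, and compare again after every upgrade. The sketch below assumes the predictions are already available as plain dictionaries; the `changed_predictions` helper and the customer IDs are made up for illustration, and no real API is called.

```python
def changed_predictions(baseline, current):
    """Return {item_id: (old_label, new_label)} for every label that changed.

    Items missing from `current` are reported with a new label of None.
    """
    changes = {}
    for item_id, old_label in baseline.items():
        new_label = current.get(item_id)
        if new_label != old_label:
            changes[item_id] = (old_label, new_label)
    return changes


# Labels stored from an earlier run (hypothetical data).
baseline = {"customer_1": "best", "customer_2": "average"}
# Labels returned today, after the service changed underneath.
current = {"customer_1": "worst", "customer_2": "average"}

drift = changed_predictions(baseline, current)
if drift:
    print("Predictions changed since the baseline:", drift)
```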
Customization and all that comes with it
Custom data science basically flips all of this around. In this approach, systems can leverage the really messy data; new fields, sources, and types can be accessed to give you what you want.
This is particularly helpful if you work in an environment where every other month someone says, “We could probably improve performance here if we add in this type of analysis or use this other type of data.” Custom data science is adaptable to ongoing change.
An additional benefit of custom data science is that you can pull from different data sources — legacy systems, on-premises, in the cloud, etc. You don’t have to sit around waiting for some mythical data warehouse to show up and bring all of your data together in a nice, clean way. It can be a true mix.
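To make the "true mix" a little more concrete, here is a minimal sketch that combines a legacy CSV export with a JSON export from a cloud service, using only the Python standard library. The field names, the sample records, and the `combine` helper are assumptions for illustration; real sources would be read from files, databases, or services rather than inline strings.

```python
import csv
import io
import json

# Stand-ins for two sources: a legacy CSV export and a cloud JSON export.
legacy_csv = io.StringIO("customer_id,region\n1,EU\n2,APAC\n")
cloud_json = (
    '[{"customer_id": "1", "orders": 12}, {"customer_id": "3", "orders": 4}]'
)


def combine(csv_file, json_text, key="customer_id"):
    """Merge records from both sources into one dict per key value.

    Records found in only one source are kept, with the fields they have.
    """
    merged = {}
    for row in csv.DictReader(csv_file):
        merged.setdefault(row[key], {}).update(row)
    for record in json.loads(json_text):
        merged.setdefault(str(record[key]), {}).update(record)
    return merged


combined = combine(legacy_csv, cloud_json)
for customer_id, fields in sorted(combined.items()):
    print(customer_id, fields)
```

Keeping records that appear in only one source is a deliberate choice here: with messy data, the gaps are often as informative as the matches.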
One thing worth noting, which is often ignored in the early part of a project, is that you ultimately want to operationalize your work; you want to put this stuff into production. It’s a terrible feeling to run something in a test environment and say, “I trained this model, and it’s validated on my test data. This all looks good,” and then suddenly find that it has to be recoded and handed off to another department to put into production. Instead, you should be able to use the same environment to productionize it immediately.
And for custom data science to work well, you need in-house domain AND data science expertise (or at least great partners). You need people who understand the problem you are trying to solve very well and who can work with data scientists to put the model to work. After all, you don’t want data scientists to create an application and then never refine or learn from it. These teams must be able to collaborate consistently to get bleeding-edge performance.
You also need reliable, reproducible results. This is another point that is often ignored, but in production, you want to be sure that what you did yesterday is at least related to what you do tomorrow. Similarly, you want backwards compatibility, so if you try to use what you built a year or two ago, you still can.
Over time, packages may change, and without backwards compatibility, you can’t run the original program anymore (or worse, it quietly produces totally different results). Adjusting the original blueprint to solve a similar problem also becomes almost impossible. With backwards compatibility in place, custom data science allows you to do both and much more.
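A small safeguard against silently changed packages is to record the package versions a model was built with and compare them to the current environment before rerunning it. The sketch below uses `importlib.metadata.version` from the Python standard library for the lookup; the `version_mismatches` helper, the package name, and the version numbers are hypothetical.

```python
from importlib import metadata


def version_mismatches(recorded, lookup=metadata.version):
    """Return {package: (recorded_version, installed_version)} for mismatches.

    A package that is not installed is reported with an installed version
    of None. `lookup` can be swapped out, for example in tests.
    """
    mismatches = {}
    for package, expected in recorded.items():
        try:
            installed = lookup(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


# Hypothetical versions stored next to a model when it was trained.
recorded = {"example-package": "1.2.0"}
# Simulated current environment, so the example does not depend on
# what happens to be installed on this machine.
fake_env = {"example-package": "2.0.0"}

print(version_mismatches(recorded, lookup=lambda name: fake_env[name]))
```

Called without the `lookup` argument, the function checks the packages actually installed in the running environment.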
Putting it all together
In preparing to make data science decisions for your organization, there is undoubtedly a lot to consider. Just try to remember these basic guidelines:
- Automation helps to optimize the selection of models. If you don’t want to do it all yourself, this can save a lot of time.
- Data science APIs help you reuse what’s proven. It is not necessary to build an image or speech classification system — there are services out there to help. Use and incorporate them as part of your analytical routine.
- Custom data science provides the power of the mix. It is the most flexible and powerful approach, but you need to be able to incorporate at least some of your in-house expertise. At the same time, it enables you to automate the boring stuff and lets human interaction focus on the most complex and nuanced parts of the problem.
As is often the case with data science, it’s about choice. Automation or prepackaged data science is suitable for better-defined problems where standard performance is sufficient.
But if getting the best results is business-critical to you and gives you that competitive edge, you need to invest in custom data science. There is no free lunch here. Cutting-edge data science requires cutting-edge data scientist expertise applied to your data.
As first published in The Next Web.
- The article Principles of Guided Analytics, also by Michael Berthold, looks at the benefits of enabling an interactive exchange between your in-house domain expert and the data scientist.
- Phil Winter's post KNIME Meets KNIME - Will They Blend? tests whether KNIME workflows really are backwards compatible.