Data Science in Times of Change: (Some) Re-Assembly Required

Tue, 06/16/2020 - 09:00 berthold

Most likely, the assumptions behind your data science model or the patterns in your data did not survive the coronavirus pandemic. Here’s how to address the challenges of model drift.

By Michael Berthold (KNIME).

The enormous impact of the current crisis is obvious. What many still haven’t realized, however, is that the impact on ongoing data science production setups can be dramatic, too. Many of the models used for segmentation or forecasting started to fail when traffic and shopping patterns changed, supply chains were interrupted, borders were locked down, and, more generally, the way people behaved changed fundamentally.

Sometimes, data science systems adapt reasonably quickly when the new data starts to represent the new reality. In other cases, the new reality is so fundamentally different that the new data is not sufficient to train a new system or, worse, the base assumptions built into the system just don’t hold anymore, so the entire process, from data science creation to productionizing, must be revisited.

This post describes different scenarios and a few examples of what happens when old data becomes completely outdated, base assumptions are no longer valid, or patterns in the overall system change. I then highlight some of the challenges data science teams face when updating their production systems and conclude with a set of recommendations for a robust, future-proof data science setup.

Impact Scenario: Complete Change

The most drastic scenario is a complete change of the underlying system that requires not only an update of the data science process itself but also a revision of the assumptions that went into its design in the first place. This requires a full new data science creation and productionization cycle: understanding and incorporating business knowledge, exploring data sources (possibly to replace data that doesn’t exist anymore), and selecting and fine-tuning suitable models. Examples include traffic predictions (especially near suddenly closed borders), shopping behaviour under more or less stringent lockdowns, and healthcare-related supply chains.

A subset of the above is the case where the availability of the data has changed. A very illustrative example is weather prediction, where quite a bit of data is collected by commercial passenger aircraft equipped with additional sensors. With the grounding of those aircraft, the volume of data has been drastically reduced. Because base assumptions about weather development itself remain the same (ignoring for a moment that other changes in pollution and energy consumption may affect the weather as well), “only” a retraining of the existing models may be sufficient. However, if the missing data represents a significant portion of the information that went into model construction, the data science team is well advised to rerun the model selection and optimization process as well.

Impact Scenario: Partial Change

In many other cases the base assumptions remain the same. For example, recommendation engines will still work very much the same, but some of the dependencies extracted from the data will change. This is not necessarily very different from, say, a new bestseller entering the charts, but the speed and magnitude of change may be bigger: the way demand for health-related supplies jumped, for instance, outpaces how a bestseller rises in the charts. If the data science process has been designed flexibly enough, its built-in change detection mechanism should quickly identify the change and trigger a retraining of the underlying rules. Of course, that presupposes that change detection was in fact built in and that the retrained system achieves sufficient quality levels.
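To make the idea of a built-in change detection mechanism a little more concrete, here is a minimal Python sketch. It is not taken from any particular product: the function names, the z-score test on the window mean, and the threshold of 3.0 are all illustrative assumptions, and a real setup would likely use a more robust drift test and check the quality of the retrained model before deploying it.

```python
from statistics import mean, stdev


def drift_detected(baseline, recent, z_threshold=3.0):
    """Flag a change when the mean of `recent` is far from the baseline mean.

    Assumption: a simple z-test on the window mean is good enough for a sketch.
    `z_threshold` is a hypothetical tuning parameter.
    """
    base_mean = mean(baseline)
    base_std = stdev(baseline)
    if base_std == 0:
        # Degenerate baseline: any deviation at all counts as a change.
        return mean(recent) != base_mean
    # Standard error of the mean for a window of len(recent) observations.
    standard_error = base_std / len(recent) ** 0.5
    z_score = abs(mean(recent) - base_mean) / standard_error
    return z_score > z_threshold


def maybe_retrain(baseline, recent, retrain):
    """Call the (hypothetical) `retrain` callback only if drift is detected."""
    if drift_detected(baseline, recent):
        return retrain(recent)
    return None


# Example: weekly demand for a product before and after a sudden jump.
baseline_demand = [100, 102, 98, 101, 99, 100, 103, 97]
stable_weeks = [101, 99, 100, 102]
crisis_weeks = [180, 175, 190, 185]

print(drift_detected(baseline_demand, stable_weeks))
print(drift_detected(baseline_demand, crisis_weeks))
```

The retraining step is passed in as a callback so that the detection logic stays independent of how the model is actually rebuilt.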

Impact Scenario: No Change

This brief list is not complete without stressing that many concepts remain the same: predictive maintenance is a good example. As long as the usage patterns stay the same, engines will continue to fail in exactly the same ways as before. But the important question here for your data science team is: Are you sure? Is your performance monitoring setup thorough enough that you can be sure you are not losing quality? This is a predominant theme these days anyway: do you even notice when the performance of your data science system changes?!
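As a rough illustration of what such performance monitoring could look like, the following Python sketch tracks accuracy over a rolling window and compares it with the accuracy measured at deployment time. The class name, the window size, and the tolerance are hypothetical choices for this example, and it assumes that the true outcomes become available some time after each prediction.

```python
from collections import deque


class PerformanceMonitor:
    """Rolling-accuracy monitor (illustrative sketch, not a product API)."""

    def __init__(self, baseline_accuracy, window=100, tolerance=0.05):
        self.baseline_accuracy = baseline_accuracy  # accuracy at deployment time
        self.tolerance = tolerance                  # acceptable drop, an assumption
        self.outcomes = deque(maxlen=window)        # keeps only the latest results

    def record(self, prediction, actual):
        """Store whether a prediction turned out to be correct."""
        self.outcomes.append(prediction == actual)

    def rolling_accuracy(self):
        """Accuracy over the current window, or None if nothing was recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self):
        """True if rolling accuracy fell below baseline minus tolerance."""
        accuracy = self.rolling_accuracy()
        return accuracy is not None and accuracy < self.baseline_accuracy - self.tolerance


monitor = PerformanceMonitor(baseline_accuracy=0.9, window=10, tolerance=0.05)
for _ in range(10):
    monitor.record("fail", "fail")    # ten correct predictions
healthy = monitor.degraded()
for _ in range(5):
    monitor.record("ok", "fail")      # five wrong predictions push out old ones
unhealthy = monitor.degraded()
print(healthy, unhealthy)
```

Even a monitor this simple answers the "Are you sure?" question with a number rather than a feeling, which is the point of the paragraph above.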

A little side note on Model Jump vs Model Shift, terms that are often used in this context but refer to a different aspect: how quickly the change happens. In the first two scenarios above (Complete/Partial Change), the change can happen abruptly (when borders are closed from one day to the next, for example) or gradually over time. Some of the bigger economic impacts will become apparent in customer behaviour only over time. For example, in the case of a SaaS business, customers will not cancel their subscriptions overnight but over the coming months.

What’s the Problem?

In reality, one most often encounters two types of production data science setups. There are the older systems that were built, deployed, and have been running for years without any further refinements, and then there are the newer systems that may have been the result of a consulting project, possibly even a modern automated machine learning (AutoML) type of project. In both cases, if you are fortunate, automatic handling of partial model change has been incorporated into the system, so at least some model retraining is handled automatically. But since none of the currently available AutoML tools allow for performance monitoring and automatic retraining, and one-shot projects usually don’t worry about that either, you may not even be aware that your data science process has failed.

If you are lucky enough to have a setup where the data science team has made numerous improvements over the years, chances are higher that automatic model drift detection and retraining are built in as well. However, even then - and especially in the case of a complete model jump - it is far more likely that the existing system cannot easily be recreated to accommodate the new setup, because the individual steps are not well documented, making it hard to revisit the assumptions and update the process. Often the process also relies on opaque pieces of code, written by experts who have since left the team. The only solution? Start an entirely new project.

What’s Needed?

Obviously, if your data science process was set up by an external consulting team, you don’t have much choice other than to bring them back in. If your data science process is the result of an automated ML/AI service, you may be able to re-engage that service, but especially in the case of a change in business dynamics you should expect to be involved quite a bit - similar to the first time you embarked on this project.

One side note here: Be skeptical when someone is trying to push for super-cool new methods. In many cases these are not needed, and one should rather focus on carefully revisiting the assumptions and data used for the previous data science process. Only in very few cases is this really a “data 0” problem where one tries to learn a new model from very few data points. Even then, one should also explore the option of building on top of the previous models and keeping them involved in some weighted way. Very often, new behavior can very well be represented as a mix of previous models with a sprinkle of new data.
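One simple way to keep previous models involved "in some weighted way" is a convex combination of the old and the new model's predictions, with the weight of the new model growing as more post-change data arrives. The Python sketch below illustrates this; the function names and the linear weighting rule with a `saturation` of 500 points are assumptions made for the example, not a recommended setting.

```python
def weight_for_new_model(n_new_points, saturation=500):
    """Trust the new model more as post-change data accumulates.

    Assumption: a linear ramp that reaches full weight at `saturation`
    data points; this heuristic is hypothetical and should be tuned.
    """
    return min(1.0, n_new_points / saturation)


def blended_prediction(old_model, new_model, x, new_weight):
    """Convex combination of the old and the new model's prediction for `x`."""
    if not 0.0 <= new_weight <= 1.0:
        raise ValueError("new_weight must be between 0 and 1")
    return (1.0 - new_weight) * old_model(x) + new_weight * new_model(x)


# Stand-in models: any callables returning a numeric forecast would do.
def old_model(x):
    return 100.0   # forecast learned from pre-crisis data


def new_model(x):
    return 200.0   # forecast learned from the few new data points


weight = weight_for_new_model(n_new_points=125)           # 125 / 500 = 0.25
forecast = blended_prediction(old_model, new_model, None, weight)
print(weight, forecast)
```

With only 125 new observations, the blend still leans mostly on the old model, and it shifts toward the new one as evidence accumulates.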

But if your data science development is done in-house, now is the time when an integrative and uniform environment such as KNIME comes in very handy. KNIME workflows are 100% backwards compatible and the underlying assumptions are all visually documented in one environment, allowing for well-informed changes and adjustments to be made. Using KNIME Server, you then validate and test the revised workflows and deploy them to production from that same environment without any need for manual translation to a different production environment.