Data science during COVID-19: Some reassembly required
The enormous impact of the COVID-19 pandemic is obvious. What many still haven’t realized, however, is that the impact on ongoing data science production setups has been dramatic, too. Many of the models used for segmentation or forecasting started to fail when traffic and shopping patterns changed, supply chains were interrupted, and borders were locked down.
In short, when people’s behavior changes fundamentally, data science models based on prior behavior patterns will struggle to keep up. Sometimes, data science systems adapt reasonably quickly when the new data starts to represent the new reality. In other cases, the new reality is so fundamentally different that the new data is not sufficient to train a new system. Or worse, the base assumptions built into the system just don’t hold anymore, so the entire process from model creation to production deployment must be revisited.
This post describes different scenarios and a few examples of what happens when old data becomes completely outdated, base assumptions are no longer valid, or patterns in the overall system change. I then highlight some of the challenges data science teams face when updating their production system and conclude with a set of recommendations for a robust and future-proof data science setup.
Data science impact scenario: Data and process change
The most dramatic scenario is a complete change of the underlying system, one that requires not only an update of the data science process but also a revision of the assumptions that went into its design in the first place. This calls for a complete new cycle of data science creation and productionization: understanding and incorporating business knowledge, exploring data sources (possibly to replace data that doesn’t exist anymore), and selecting and fine-tuning suitable models. Examples include traffic predictions (especially near suddenly closed borders), shopping behavior under more or less stringent lockdowns, and healthcare-related supply chains.
A subset of the above is the case where the availability of the data has changed. An illustrative example here is weather predictions, where quite a bit of data is collected by commercial passenger aircraft that are equipped with additional sensors. With the grounding of those aircraft, the volume of available data has been drastically reduced. Because base assumptions about weather systems remain the same (ignoring for a moment that changes in pollution and energy consumption may affect the weather as well) “only” a retraining of the existing models may be sufficient. However, if the missing data represents a significant portion of the information that went into model construction, the data science team would be wise to rerun the model selection and optimization process as well.
Data science impact scenario: Data changes, process remains the same
In many other cases, the base assumptions remain the same. For example, recommendation engines will still work very much the same, but some of the dependencies extracted from the data will change. This is not necessarily very different from, say, a new bestseller entering the charts, but the speed and magnitude of change may be far greater, as we saw with the sudden spike in demand for health-related supplies. If the data science process has been designed flexibly enough, its built-in change detection mechanism should quickly identify the shift and trigger a retraining of the underlying rules. Of course, that presupposes that change detection was in fact built in and that the retrained system achieves sufficient quality levels.
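As an illustration of what such a change detection mechanism might look like, here is a minimal sketch that compares the recent distribution of one input feature against a reference sample using the population stability index (PSI). This is one common technique, not a method prescribed by this article. The function names, the bin count, and the 0.25 threshold (a frequently quoted rule of thumb) are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def population_stability_index(reference: Sequence[float],
                               recent: Sequence[float],
                               bins: int = 10) -> float:
    """PSI between a reference sample and a recent sample of one feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant data

    def shares(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps the logarithm defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, new = shares(reference), shares(recent)
    return sum((n - r) * math.log(n / r) for r, n in zip(ref, new))


def needs_retraining(reference: Sequence[float],
                     recent: Sequence[float],
                     threshold: float = 0.25) -> bool:
    """Hypothetical trigger: flag a retraining run when PSI exceeds the threshold."""
    return population_stability_index(reference, recent) > threshold
```

In practice the threshold would need to be tuned per feature, and a triggered retraining should still be followed by a quality check of the retrained model before it replaces the old one.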
Data science impact scenario: Data and process continue to work
This brief list is not complete without stressing that many data science systems will continue to work just as they always have. Predictive maintenance is a good example. As long as the usage patterns stay the same, engines will continue to fail in exactly the same ways as before. The important question for the data science team is: Are you sure? Is your performance monitoring setup thorough enough that you can be sure you are not losing quality? Do you even know when the performance of your data science system changes?
As noted in the first two impact scenarios above, changes to your data science system could happen abruptly (when borders are closed from one day to the next, for example) or only gradually over time. Some of the bigger economic impacts will become apparent in customer behavior only over time. For example, in the case of a SaaS business, customers may not cancel their subscriptions overnight but over the coming months.
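To make the monitoring question concrete, the following is a minimal sketch of a rolling performance monitor, assuming a regression-style system where predictions can later be compared with actual outcomes. The class name, the window size, and the tolerance are hypothetical choices for illustration, not part of any particular product.

```python
from collections import deque
from statistics import mean


class PerformanceMonitor:
    """Tracks the rolling mean absolute error against a known baseline."""

    def __init__(self, baseline_error: float, window: int = 30,
                 tolerance: float = 0.2) -> None:
        self.baseline_error = baseline_error  # error measured at deployment time
        self.tolerance = tolerance            # allowed relative degradation
        self.errors = deque(maxlen=window)    # most recent absolute errors

    def record(self, predicted: float, actual: float) -> bool:
        """Store one absolute error; return True if performance has degraded."""
        self.errors.append(abs(predicted - actual))
        if len(self.errors) < self.errors.maxlen:
            return False  # not enough observations yet
        return mean(self.errors) > self.baseline_error * (1 + self.tolerance)
```

A short window reacts quickly to abrupt changes, while a longer window smooths out noise and is better suited to spotting gradual degradation. Running two monitors with different window sizes side by side is one simple way to cover both cases.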
Model drift detection is key
One most often encounters two types of production data science setups. There are the older systems that were built, deployed, and have been running for years without any further refinements, and then there are the newer systems that may have been the result of a consulting project, possibly even a modern automated machine learning (AutoML) type of project. In both cases, if you are fortunate, automatic handling of partial model change has been incorporated into the system, so at least some model retraining is handled automatically. However, none of the currently available AutoML tools allow for performance monitoring and automatic retraining, and usually the older, “one shot” projects don’t worry about that either. As a result, you may not even be aware that your data science process has failed.
If you are lucky enough to have a setup where the data science team has made numerous improvements over the years, chances are higher that automatic model drift detection and retraining have been built in. However, even then (and especially in the case where a complete model change is required) it is quite likely that the system cannot easily be recreated. Unless all of the steps of your data science process are well documented, and the experts who wrote the code are still with the company, it will be difficult to revisit the assumptions and update the process. The only solution may be to start an entirely new project.
Reinvention vs. reassembly
Obviously, if your data science process was set up by an external consulting team, you don’t have much of a choice other than to bring them back in. If your data science process is the result of an automated machine learning service, you may be able to re-engage that service, but especially in the case of a change in business dynamics, you should expect to be involved quite a bit, similar to the first time you embarked on this project.
One side note here: Be skeptical when someone pushes for supercool new methods. In many cases, a new approach is not needed. Rather, one should focus on carefully revisiting the assumptions and data used for the previous data science process. Only in very few cases is this really a “data 0” problem where one tries to learn a new model from very few data points. Even then, one should also explore the option of building on top of the previous models and keeping them involved in some weighted way. Very often, new behavior can be well represented as a mix of previous models with a sprinkle of new data.
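The idea of keeping previous models involved in a weighted way can be sketched as follows. This is a minimal illustration, assuming two single-input regression models represented as plain functions and a handful of recent labeled observations. The function names and the simple grid search over the weight are assumptions for the example, not a recommended production method.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[float], float]


def best_blend_weight(old_model: Model, new_model: Model,
                      recent: Sequence[Tuple[float, float]],
                      steps: int = 20) -> float:
    """Grid-search the weight on the old model that minimizes squared error
    on the recent (input, outcome) pairs."""
    def squared_error(w: float) -> float:
        return sum((w * old_model(x) + (1 - w) * new_model(x) - y) ** 2
                   for x, y in recent)

    candidates = [i / steps for i in range(steps + 1)]  # 0.0, 0.05, ..., 1.0
    return min(candidates, key=squared_error)


def blend(old_model: Model, new_model: Model, w: float) -> Model:
    """Return a model that mixes both predictions with weight w on the old one."""
    return lambda x: w * old_model(x) + (1 - w) * new_model(x)
```

With very few new data points, a weight fitted this way can easily overfit, so it is worth re-estimating it as more data from the new reality arrives.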
But if your data science development is done in-house, this is when an integrative and uniform environment that is 100% backward compatible comes in very handy. In such a platform, the assumptions are modeled and documented in one place, allowing well-informed changes and adjustments to be made much more easily. It’s even better if you can validate, test, and deploy the changes into production from that same environment without the need for manual interaction.
Michael Berthold is CEO and co-founder at KNIME, an open source data analytics company. He has more than 25 years of experience in data science, working in academia, most recently as a full professor at Konstanz University (Germany) and previously at University of California (Berkeley) and Carnegie Mellon, and in industry at Intel’s Neural Network Group, Utopy, and Tripos. Michael has published extensively on data analytics, machine learning, and artificial intelligence. Follow Michael on Twitter, LinkedIn and the KNIME blog.
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content.
Copyright © 2020 IDG Communications, Inc.