On the process for building software with ML components

Nov 1, 2020

Software engineering education seems to focus a lot on process models (“software development life cycles”) like waterfall, spiral and agile. A number of models also describe the process of developing ML components (“machine learning workflow”). They don’t seem to fit well together, which causes a lot of confusion about the role that software engineering, MLOps and others play in building software systems with ML components, and where the real engineering challenges arise. Far from complete, here is the status of my current thinking… Feedback welcome.

ML Components in Software Systems

Outside of pure science projects, the ML model created with a machine learning framework is rarely the end of a project; it is a component of a larger system that should eventually be used in production. It may be a smaller component in a larger traditional system, such as the “design ideas” function integrated into PowerPoint, or it may be the innovative ML solution around which a production system is built, say a business like temi.com built around a sophisticated speech recognition component. Even in the latter case, where the ML model is certainly the core of the entire application and probably drives 90% of the effort, it still needs to be integrated with many other non-ML components, such as a web frontend, a payment processing system, a data management solution to upload and process audio files, and means to collect telemetry data from an editor for monitoring and later improvement.

Figure: Sloppy architectural diagram of a possible online transcription service, with speech recognition as the central ML component and several other non-ML components.

While it is hard to specify what the ML component does (see lengthy discussion elsewhere), it usually has a very clear interface and is easy to encapsulate behind a REST API: Given some input (audio, picture, feature vector, …) return a prediction (class, number, text, …). This component now needs to be integrated with the rest of the system, for example, by calling it on all received audio files in the transcription example.
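To make the "input in, prediction out" interface concrete, here is a minimal sketch in Python. All names (`Prediction`, `SpeechModel`, `StubSpeechModel`, `handle_upload`) are hypothetical and chosen for illustration; a real service would likely expose `predict` behind a REST endpoint, which is omitted here to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Prediction:
    text: str          # e.g., the transcript
    confidence: float  # the model's own confidence estimate, 0.0 to 1.0


class SpeechModel(Protocol):
    """Interface of the ML model component: audio in, prediction out."""

    def predict(self, audio: bytes) -> Prediction: ...


class StubSpeechModel:
    """Stand-in for a learned model (assumption: no real model is loaded here)."""

    def predict(self, audio: bytes) -> Prediction:
        return Prediction(text=f"<{len(audio)} bytes of audio>", confidence=0.0)


def handle_upload(model: SpeechModel, audio: bytes) -> dict:
    """Non-ML code (e.g., a request handler) that calls the model on a received file."""
    prediction = model.predict(audio)
    return {"transcript": prediction.text, "confidence": prediction.confidence}
```

Because the rest of the system only depends on `predict`, the stub can be swapped for a learned model later without touching the non-ML code.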

Let’s talk about process

Process models for software development (software development life cycles) tend to have a bad reputation, but they are useful abstractions to think about how to build a system. Many process models exist but they all tend to go through stages from requirements to design to implementation to quality assurance and they all involve some form of iteration (yes, even the original 1970 waterfall paper is explicit about the need for iteration).

Let’s start with a generic iterative model:

Simple iterative process model (software development life cycle)

In practice, we do not want to develop the entire system at once, but break it into components that, if we have clear interfaces, can be mostly developed in parallel (for the sake of everybody, I’m skipping a lengthy lecture on information hiding and its limits here).

System development process by decomposing the problem and integrating the components — Figure from Dogru, Ali H., and Murat M. Tanik. “A process model for component-oriented software engineering.” *IEEE Software* 20, no. 2 (2003): 34–41.

In an ideal setting, we’ll decompose the system into components in an early high-level design step, define interfaces, and then develop those components possibly in parallel.

In practice, nothing will be as clean; we won’t have the perfect decomposition upfront, interfaces won’t remain stable, new components will be added, not everything is done in parallel and there will be iteration within the component development, and teams working on components will probably need to coordinate somewhat.

So, back to ML…

As discussed earlier, ML models are essentially one component of the larger system. In the process above, simply one or more of the components will be an ML component — usually with a fairly obvious interface.

There are a couple of process models specifically for machine learning, such as this one from Microsoft:

Machine learning workflow model from Microsoft — Machine Learning Workflow from Amershi, Saleema, et al. “Software engineering for machine learning: A case study.” *2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)*. IEEE, 2019.

This model also starts with requirements, goes through several stages regarding data collection and learning, and ends with quality assurance, deployment, and monitoring. It also emphasizes iteration.

Several similar models exist, but they all have similar stages and similarly emphasize iteration, for example:

Three different machine learning life cycle models. Left: CRISP-DM (cross-industry standard process for data mining) model; middle: Microsoft’s Team Data Science Process; right: refined CRISP-DM model from a recent study in the FinTech industry.

Interestingly, all these process models clearly go beyond just producing the model and include deployment and monitoring as well, but they do not consider the larger system into which the ML component needs to be integrated, alongside many other non-ML components.

So if we consider the above machine learning workflow model as a mechanism to develop the ML component, we could come up with an overall model like this, where ML components are developed with a different process than other components in the system:

Revision of previous process model, integrating the machine learning workflow into the box of one component

Beyond the ML model…

The problem with the view above is that it does not correspond well to the artifacts that are produced. The machine learning workflow does not just produce and deploy the ML model as a single component. In fact, a key insight into building production ML systems is that one should step away from the focus on the ML model and instead focus on the pipeline that produces the model (see, for example, Katie O’Leary and Makoto Uchida. “Common Problems with Creating Machine Learning Pipelines from Existing Code.” Proc. Third Conference on Machine Learning and Systems, 2020), so that we can reproduce and update the model. Also, the machine learning workflow includes thinking about monitoring, which goes beyond just delivering the ML model as a component.

I think of this as producing at least three artifacts, each of which could be represented as its own component:

The pipeline component, that is, a component that takes data, processes it, and produces and packages an ML model.
The monitoring component, that is, a component that takes telemetry data collected by other parts of the system and produces reports and possibly triggers the pipeline component to rebuild a model.
The actual ML model component produced by the pipeline and integrated into the production system.

Note that the first two components are non-ML components. They are implemented and not learned (sure, they may use some learning internally for data cleaning, labeling, model search and so forth, see AutoML and projects like Snorkel, Overton, or HoloClean).

So we should probably really update the architectural model of our transcription service example with something more like this:

Revised architectural model that highlights monitoring and pipeline as two additional components

Also, these components are probably developed together with a lot of internal iteration, so let’s just put them in the same box in the overall process model. So, while I’m not quite sure how to draw this, maybe this is a better process visualization:

Revised process model that adapts the machine learning workflow in one component to emphasize the outputs

Back to interfaces: While the model component has a really simple and clean interface (data in, prediction out), the interfaces for the pipeline component and the monitoring components require more thought: The pipeline component receives lots of training data and labels and possibly all kinds of other information, that interact heavily with other parts of the system, to produce the model component. The monitoring component receives telemetry data to produce reports, trigger retraining, etc. This seems far less obvious to integrate into a larger system and needs much more consideration, planning, and design at the system level.
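A rough sketch of what those two less obvious interfaces might look like, in Python. Everything here is an assumption made for illustration: the names (`run_pipeline`, `TelemetryEvent`, `Monitor`), the use of user corrections as labels, and the error-rate threshold as retraining trigger. The "learning" step is a lookup table standing in for real training.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical types for illustration only.
Example = Tuple[str, str]     # (input, label), e.g., (audio id, corrected transcript)
Model = Callable[[str], str]  # the model component: data in, prediction out


def run_pipeline(training_data: List[Example]) -> Model:
    """Pipeline component: takes data, produces and packages a model.

    Placeholder "learning": memorize the labels in a lookup table.
    """
    table = dict(training_data)
    return lambda x: table.get(x, "")


@dataclass
class TelemetryEvent:
    input_id: str
    predicted: str
    corrected: str  # what the user changed the prediction to in the editor


class Monitor:
    """Monitoring component: consumes telemetry, reports, and may trigger retraining."""

    def __init__(self, max_error_rate: float):
        self.max_error_rate = max_error_rate
        self.events: List[TelemetryEvent] = []

    def record(self, event: TelemetryEvent) -> None:
        self.events.append(event)

    def error_rate(self) -> float:
        if not self.events:
            return 0.0
        wrong = sum(e.predicted != e.corrected for e in self.events)
        return wrong / len(self.events)

    def needs_retraining(self) -> bool:
        return self.error_rate() > self.max_error_rate

    def as_training_data(self) -> List[Example]:
        # Telemetry flows back into the pipeline as new labeled data.
        return [(e.input_id, e.corrected) for e in self.events]
```

Even in this toy version, the monitor depends on what the editor reports as telemetry, and the pipeline depends on what the monitor hands back, which is exactly the kind of coupling that has to be negotiated at the system level.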

A way to unroll the model and make it maybe more readable might be something like this:

Unrolled process model showing an ML and a non-ML component in parallel

Here we clearly develop ML components and non-ML components in parallel. At the planning and high-level design stage (as well as the requirements stage), we need to consider not only the ML models, but also the pipeline components and the monitoring components and how they all interact with other parts of the system.

In practice, in many ML-heavy projects, the model is developed first, before stepping back and thinking about the other components in the system and also before developing the pipeline component and the monitoring components. It will include a lot of iteration in any case.

Thinking through Engineering Challenges for ML-Enabled Systems

The model above will certainly not be the final design, but it helps me to separate a lot of concerns and understand better where the challenges lie in building production ML systems that go beyond “just” learning a model (or how to teach this). Lots of discussions do not fit neatly into the traditional machine learning workflow models, but make much more sense when considering the various ML and non-ML components of the system in a larger process. For example:

Designing user interfaces has direct impact on what data can be collected as telemetry for retraining and monitoring. This requires careful thinking about how multiple non-ML and ML components interact and what the interfaces for the pipeline and monitoring components are.
Quality assurance needs to be planned holistically: Evaluating accuracy of the ML model is only one step, similar to unit testing. We also need to create unit tests for the pipeline component and the model component, as emphatically argued by this excellent paper from Google. And of course, we need integration testing and system testing that looks at the interactions of all these components, including one or more ML components.
Data quality is really at the interface between the ML pipeline component and the rest of the system. It requires planning and negotiation at the system level and likely requires describing data quality requirements as part of the interface. The same holds for assumptions about data schema and drift.
Fairness considerations are really about requirements engineering and require synchronizing the system requirements with those in the machine learning pipeline, with consequences for how the pipeline component and probably also the monitoring component are implemented, but also with consequences for other components, say regarding what data is collected. This requires paying attention to fairness as part of system requirements, as part of high-level design, and then as part of several ML and non-ML components.
Safety is a system-level property and cannot, despite a lot of research on robustness of ML models, be achieved at the level of the ML model alone. It’s really about designing how the ML model component interacts with other (safety) components of the system, such as fallback mechanisms, user interactions, and much more.
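As one illustration of the data quality point above, data quality requirements could be written down as an explicit contract at the pipeline's interface and checked when data is handed over. The following is only a sketch under assumptions: the field names, the bounds, and the helpers `FieldSpec` and `validate_record` are all made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FieldSpec:
    type_: type
    required: bool = True
    min_value: Optional[float] = None
    max_value: Optional[float] = None


# Hypothetical contract for records handed to the pipeline component.
TRAINING_RECORD_SCHEMA: Dict[str, FieldSpec] = {
    "audio_id": FieldSpec(str),
    "transcript": FieldSpec(str),
    "duration_sec": FieldSpec(float, min_value=0.0, max_value=4 * 3600.0),
    "language": FieldSpec(str, required=False),
}


def validate_record(record: dict, schema: Dict[str, FieldSpec]) -> List[str]:
    """Return a list of violations; an empty list means the record meets the contract."""
    problems = []
    for name, spec in schema.items():
        if name not in record or record[name] is None:
            if spec.required:
                problems.append(f"missing required field: {name}")
            continue
        value = record[name]
        if not isinstance(value, spec.type_):
            problems.append(
                f"{name}: expected {spec.type_.__name__}, got {type(value).__name__}"
            )
            continue
        if spec.min_value is not None and value < spec.min_value:
            problems.append(f"{name}: {value} below minimum {spec.min_value}")
        if spec.max_value is not None and value > spec.max_value:
            problems.append(f"{name}: {value} above maximum {spec.max_value}")
    return problems
```

The check itself is ordinary non-ML code; the hard part is agreeing across teams on what goes into the schema, which is why it belongs to the interface rather than to the model.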

As usual, I’d be curious to hear your thoughts. Thanks to Nadia Nahar, Shurui Zhou, and Grace Lewis for discussions on these ideas. We hope to use these in a paper somewhat soon. Stay tuned.

PS: While I have your attention, have a look at our course on building production ML systems and at my annotated bibliography on the topic.