Machine Learning in Production: From Models to Systems

Christian Kästner
Jan 5, 2022

This chapter covers part of the “From Models to AI-Enabled Systems (Systems Thinking)” lecture of our Machine Learning in Production course. For other chapters see the table of contents.

In production systems, machine learning is almost always used as a component in a larger system — often a very important component, but usually still just one among many components. Yet, most education and research regarding machine learning focuses narrowly on the learning techniques and models, maybe on the pipeline to create, deploy, and operate models, but rarely on the entire system including ML and non-ML components.

Yet a focus on the entire system is important in many ways. Many relevant decisions concern the entire system and how it interacts with users and the environment more broadly. For example, to what degree should the system automatically act based on predictions from machine-learned models, and should the system be designed to keep humans in the loop? Such decisions matter substantially for how a system can cope with mistakes and have implications for usability, safety, and fairness. Before the rest of the book looks at various facets of building software systems with ML components, let us dive a little deeper into how machine learning relates to the rest of the system and why a system-level view is so important.

ML and Non-ML Components in a System

In production systems, machine learning is used to train models to make predictions that are used in the system. In some systems, those predictions are the very core of the system, whereas in others they provide only an auxiliary feature.

In the transcription service startup from the previous chapter, machine learning provides the very core functionality of the system that converts uploaded audio files into text. Yet, to turn the model into a product, many other (usually non-ML) parts are needed, such as (a) a user interface to create user accounts, upload audio files, and show results, (b) a data storage and processing infrastructure to queue and store transcriptions and process them at scale, (c) a payment service, and (d) monitoring infrastructure to ensure the system is operating within expected parameters.

Architecture sketch of a transcription system, illustrating the central ML component for speech recognition and many non-ML components.
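To make the relationship between ML and non-ML components concrete, the following is a minimal, hypothetical sketch of the non-ML glue code in such a transcription service; all names and interfaces are invented for illustration, and the ML model appears only as a single callable passed in as the transcribe parameter.

```python
# Hypothetical sketch: non-ML parts of the transcription service around the ML component.
import uuid
from queue import Queue
from typing import Callable

transcription_jobs: Queue = Queue()  # processing infrastructure (non-ML)
results_store: dict = {}             # data storage (non-ML)

def handle_upload(user_id: str, audio_bytes: bytes) -> str:
    """Called by the user interface after an audio file is uploaded."""
    job_id = str(uuid.uuid4())
    transcription_jobs.put((job_id, user_id, audio_bytes))
    return job_id

def process_next_job(transcribe: Callable[[bytes], str]) -> None:
    """Background worker: invokes the ML component and stores the result."""
    job_id, user_id, audio = transcription_jobs.get()
    results_store[job_id] = transcribe(audio)     # the ML component
    print(f"billing {user_id} for job {job_id}")  # stand-in for payment and monitoring
```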

At the same time, many traditional software systems use machine learning for some extra “add-on” functionality. For example, a traditional end-user tax software may add a module to predict the audit risk for a specific customer; a word processor may add a better grammar checker; a graphics program may add smart filters, or a photo album application may add automated tagging of friends. In all these cases, machine learning is added as an often relatively small component to provide some added value in an existing system.

Architecture sketch of the tax system, illustrating the ML component for audit risk as an addition to many non-ML components in the system.

Traditional Model Focus (Data Science)

Much of the attention in machine learning education and research has been on learning accurate models from given data. Machine learning education typically focuses on how specific machine learning algorithms work (e.g., the internals of SVM or deep neural networks) or how to apply them to train accurate models from provided data. Similarly, machine learning research focuses primarily on the learning steps, trying to improve prediction accuracy of models trained on common datasets (e.g., exploring new deep neural network architectures, new embeddings).

Typical steps of a machine learning process. Mainstream machine-learning education and research focus on the modeling steps themselves, with provided datasets.

Comparatively little attention is paid (a) at the one end to how data is collected and labeled, and (b) at the other end to how the learned models might actually be used for a real task. Rarely is there any discussion of the larger system that might produce the data or use the model’s predictions. Many researchers and practitioners have expressed frustration with this somewhat narrow focus on model training, fostered by various incentives in the research culture; see, for example, Wagstaff’s 2012 essay “Machine Learning that Matters” and Sambasivan et al.’s 2021 study “Everyone wants to do the model work, not the data work”. Outside of BigTech organizations with lots of experience, this also leaves machine learning practitioners who want to turn models into products with little guidance, as can often be observed in teams and startups that struggle to productionize initially promising models (and as reflected in Gartner’s claim that over half of the projects with machine-learning prototypes in 2020 did not make it to production).

Automating Pipelines and MLOps (ML Engineering)

With the increasing use of machine learning in production systems, engineers have noticed various practical problems of deploying and maintaining machine-learned models. Traditionally, a model might be learned in a notebook or with some script, then serialized (“pickled”), and then embedded in a web server that provides an API for making predictions (which we will discuss in more detail in chapter Deploying a Model). However, when used in production systems, scaling the system with changing demand, often in cloud infrastructures, and monitoring service quality in real time become increasingly important. Similarly, with larger datasets and deep learning jobs, model training itself can become challenging to scale. Also, when models need to be updated regularly, either due to continuous experimentation and improvement (as we will discuss in the Quality Assurance in Production chapter) or due to routine updates to handle various forms of distribution shift (as we will discuss in the Data Quality chapter), manual steps in learning and deploying models become tedious and error-prone. Experimental data science code is often derided as being of low quality by software engineering standards: often monolithic, with minimal error handling, and barely tested, none of which fosters confidence in regular or automated deployments. All this has put increasing attention on distributed training, deployment, quality assurance, and monitoring, supported with automation of machine-learning pipelines, often under the label MLOps.
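As a concrete illustration of this traditional pattern, here is a minimal sketch of serving a serialized model behind a web API. Flask is used only as an example framework; the file name model.pkl, the JSON input format, and the scikit-learn-style predict interface are assumptions for illustration, not anything prescribed by a particular tool.

```python
# Minimal sketch: load a serialized model and expose it behind a prediction API.
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model.pkl", "rb") as f:   # assumed: model trained elsewhere and pickled
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.json["features"]           # assumed JSON input format
    prediction = model.predict([features])[0]     # assumed scikit-learn-style interface
    return jsonify({"prediction": float(prediction)})  # assumes a numeric prediction

if __name__ == "__main__":
    app.run()  # in production, this would run behind a proper WSGI server with monitoring
```

Manual setups like this are exactly where scaling, monitoring, and regular updates become tedious, which motivates the pipeline automation discussed next.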

Widening the focus from modeling to the entire ML pipeline, including deployment and monitoring, with a heavy focus on automation.

This focus on entire machine-learning pipelines that include model deployment and monitoring has received significant attention in recent years. It addresses many common challenges of wrapping models into scalable web services, regularly updating them, and monitoring their execution. Increasing attention is also paid to scaling model training and model serving through massive parallelization in cloud environments. While many teams originally implemented this infrastructure for each project and maintained substantial amounts of infrastructure code (described prominently in the 2015 technical debt article from a Google team), these days many competing open-source and commercial solutions exist for many of these steps as MLOps tools (more details in later chapters).

Figure from Google’s 2015 technical debt paper, indicating that the amount of code for actual model training is comparatively small next to the substantial infrastructure code needed to automate model training, serving, and monitoring. These days, much of this infrastructure is readily available through competing MLOps tools (e.g., serving infrastructure, feature stores, cloud resource management, monitoring).

Researchers and consultants report that shifting a team’s mindset from models to machine-learning pipelines is challenging. Data scientists are often used to working with private datasets and local workspaces (e.g., in computational notebooks) to create models. Migrating code toward an automated machine-learning pipeline, where each step is automated and tested, requires a substantial shift in mindset and a strong engineering focus. This is not necessarily valued by all team members; for example, data scientists frequently report resenting having to do too much engineering work, which prevents them from focusing on their models, though many eventually appreciate the benefits of being able to experiment more rapidly in production and deploy improved models with confidence.

ML-Enabled Systems (ML in Production)

Notwithstanding the increased focus on automation and engineering, the broader view of automated machine-learning pipelines and MLOps is still entirely model-centric. It starts with model requirements and ends with deploying the model as a reliable and scalable service, but it usually does not consider other parts of the system and how the model interacts with those. Zooming out, the entire purpose of the machine-learning pipeline is to create a model that will be used as one component of a larger system (and potentially additional components for training and monitoring the model).

The ML pipeline corresponds to all activities for producing, deploying, and updating the ML component that is part of a larger system.

As we will discuss throughout this book, key challenges of building production systems with machine-learning components (ML-enabled systems) arise at the interface between these ML components and non-ML components of the system and how they, together, achieve system goals. There is constant tension between the goals and requirements of the overall system and the requirements and design of individual ML and non-ML components:

  • Requirements for the entire system influence model requirements as well as requirements for model monitoring and pipeline automation. For example, in the transcription scenario, user-interface designers may request confidence scores for individual words and alternative plausible transcriptions to provide a better user experience (see the interface sketch after this list); operators may set expectations for latency and memory demand during inference that constrain what models data scientists can choose; and advocates and legal experts may suggest what fairness constraints to enforce during training.
  • Conversely, capabilities of the model influence the design of non-ML parts of the system and assurances we can make about the entire system. For example, in the transcription scenario, the accuracy of predictions may influence the user-interface design and to what degree humans are kept in the loop to check and fix automated transcriptions; it may also limit what promises about system quality we can make to customers.
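The following sketch illustrates how such system-level requirements might shape the interface of the ML component; all names are hypothetical and meant only to show how requests from user-interface designers (word-level confidence scores, alternative transcriptions) become part of the component’s contract.

```python
# Hypothetical interface contract between the ML component and the user interface.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TranscribedWord:
    text: str
    confidence: float                  # e.g., 0.0-1.0; requested by user-interface designers
    alternatives: List[str] = field(default_factory=list)  # plausible alternative transcriptions

@dataclass
class TranscriptionResult:
    words: List[TranscribedWord]

    def low_confidence_words(self, threshold: float = 0.8) -> List[TranscribedWord]:
        """Words the user interface might highlight for human review."""
        return [w for w in self.words if w.confidence < threshold]
```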

Systems Thinking

Given how machine learning is part of a larger system, it is important to pay attention to the entire system, not just the machine-learning components. We need a holistic approach with an interdisciplinary team that involves all stakeholders.

Systems thinking is the name for a discipline that focuses on how systems interact with the environment and how components within the system interact. For example, Donella Meadows defines a system as “a set of inter-related components that work together in a particular environment to perform whatever functions are required to achieve the system’s objective.” Systems thinking postulates that everything is interconnected; that combining parts often leads to new emergent behaviors that are not apparent from the parts themselves; and that it is essential to understand the dynamics of a system, where actions have effects and may form feedback loops (as in the YouTube conspiracy and gaming examples in the Introduction chapter).

A system consists of components working together toward the system goal. The system is situated in and interacts with the environment.

As we will explore throughout this book, many common challenges in building production systems with machine-learning components are really system challenges that require understanding the interaction of ML and non-ML components and the interaction of the system with the environment.

Beyond the Model

A model-centric view of machine learning allows data scientists to focus on the hard problems involved in training more accurate models and allows MLOps engineers to build infrastructure that enables rapid experimentation and improvement in production. However, this common model-centric view misses many facets of building high-quality production systems.

System Quality versus Model Quality

Outside of machine-learning education and research, model accuracy is almost never a goal in itself but a means to support the goal of a system. A system typically has the goal of satisfying some user needs and making money (we will discuss this in more detail in chapter System and Model Goals). The accuracy of a machine-learned model can directly or indirectly support such a system goal. For example, a better audio transcription model is likely to attract more users and sell more transcriptions; predicting the audit risk in tax software provides value to the users of that software and may hence encourage more sales and sales of additional services. In both cases, the system goal is distinct from the model’s accuracy but supported more or less directly by it.

Interestingly, improvements in model accuracy do not even have to translate into improvements toward system goals. For example, experience at Booking.com has shown that improving the accuracy of models that predict different aspects of a customer’s travel preferences, which influence hotel suggestions, does not necessarily improve hotel sales, and in some cases improved accuracy may even negatively impact sales. One possible explanation offered by the team was that the model becomes too good, up to the point where it becomes creepy: it seems to know too much about a customer’s travel plans that they have not actively shared. In the end, more accurate models were not adopted if they did not support the system goals.

Observations from online experiments at Booking.com, showing that model accuracy improvement (“Relative Difference in Performance”) does not necessarily translate to improvements of system goals (“Conversion Rate”). From Bernardi et al. “150 successful machine learning models: 6 lessons learned at Booking.com.” In Proc. KDD, 2019.

Accurate predictions are important for many uses of machine learning in production systems, but accuracy is not always the most important goal. In many cases accurate predictions are not critical for the system goal, and “good enough” predictions may actually be good enough. For example, for the audit prediction feature in the tax system, roughly approximating the audit risk is likely sufficient for many users (and for the software vendor, who might try to upsell users on additional services or insurance). In other cases, marginally better predictions may come at excessive costs, for example, in terms of acquiring or labeling much more data, longer training times, and privacy concerns; a simpler, cheaper, though less accurate model might often be preferable when considering the entire system. Finally, other parts of the system, such as a better user-interface design explaining the predictions, keeping humans in the loop, or system-level non-ML safety features (as we will discuss in the Requirements and Safety chapters), can mitigate many problems from inaccurate predictions.

A narrow focus only on model accuracy that ignores how the model interacts with the rest of the system will miss opportunities to design the model to better support the overall system goals and balance various desired qualities.

User Interaction Design

System designers have powerful tools to shape the system through user interaction design. For example, Geoff Hulten distinguishes between different levels of forcefulness with which predictions can be integrated into user interactions with the system:

  • Automate: The system can take an action on the user’s behalf. For example, a smart home automation system may automate actions to turn off lights when the occupants leave.
  • Prompt: The system may ask a user whether an action should be taken. For example, the tax software may show a recommendation based on the predicted audit risk, but leave it to the user to decide whether to buy some “audit protection insurance” or change some reported tax data.
  • Organize: The system may organize and order items based on predictions. For example, a hotel reservation service may order and group hotels based on predicted customer preferences.
  • Annotate: The system may add information based on predictions to the display. For example, the transcription service may underline uncertain words in the transcript or the tax software may highlight entries that are correlated with high audit risk, suggesting but not requiring actions from the user.

These design choices decrease in forcefulness, from taking direct action, to interrupting the user with a prompt, to simply annotating information. While full automation can appear magical when it works well, many designs keep humans in the loop to better cope with possible mistakes. How forceful an interaction should be will depend on the specifics of the actual system, including the confidence of its predictions, the frequency of the interaction, the benefit of automating a correct prediction, and the cost of mistakes. We will discuss this further in the Human-AI Interaction chapter.
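As a rough illustration of this trade-off, the sketch below maps prediction confidence and mistake cost to one of Hulten’s forcefulness levels; the thresholds and the three-way cost categories are invented for illustration, not recommendations.

```python
# Illustrative sketch: choosing a forcefulness level based on confidence and mistake cost.
from enum import Enum

class Forcefulness(Enum):
    AUTOMATE = "automate"   # act on the user's behalf
    PROMPT = "prompt"       # ask the user before acting
    ORGANIZE = "organize"   # order or group items by prediction
    ANNOTATE = "annotate"   # only add information to the display

def choose_forcefulness(confidence: float, mistake_cost: str) -> Forcefulness:
    """mistake_cost is 'low', 'medium', or 'high' (a deliberate simplification)."""
    if mistake_cost == "high":
        return Forcefulness.ANNOTATE            # keep humans fully in the loop
    if confidence > 0.95 and mistake_cost == "low":
        return Forcefulness.AUTOMATE
    if confidence > 0.8:
        return Forcefulness.PROMPT
    return Forcefulness.ORGANIZE
```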

A smart safe browsing feature uses machine learning to warn of malicious web sites. In this case, the design is fairly forceful, prompting the user to make a choice, but stops short of fully automating the action.

Beyond forcefulness, another common user interface design question is to what degree to explain predictions to users. For example, shall the tax software simply report an audit risk score, explain how the prediction was made, or explain which inputs are most responsible for the predicted audit risk? As we will discuss in chapter Interpretability and Explainability, the need for explaining decisions depends heavily on the confidence of the model and the potential impact of mistakes on users: in high-risk situations, such as medical diagnosis, it is much more important that a human expert can understand and check a prediction based on an explanation than in a routine and low-risk situation such as ranking hotel offers.

A model-centric approach that does not consider the rest of the system misses many important design decisions and opportunities regarding interactions with users. Model qualities, including accuracy and the ability to report confidence or explanations, shape possible and necessary user interface design decisions, and user interface considerations may in turn influence model requirements.

Data Acquisition and Anticipating Change

A model-centric view often assumes that data is given and representative, even though system designers often have substantial flexibility in deciding what data to collect and how, and even though production data often drifts.

Compared to feature engineering and modeling, and even deployment and monitoring, data collection and data quality work is often undervalued (explored in depth in Sambasivan et al.’s 2021 study “Everyone wants to do the model work, not the data work”). System designers should not only focus on building accurate models and integrating them into a system, but also on how to collect data, which may include educating the people collecting and entering data and providing labels, planning data collection upfront, documenting data quality standards, considering cost, setting incentives, and establishing accountability.

User interaction design can influence what data is generated by the system to be potentially used as future training data. For example, providing an attractive user interface to show and edit transcription results would allow us to observe how users change the transcript, thereby providing insights about probable transcription mistakes. More invasively, we could directly ask users which of multiple possible transcriptions for specific audio snippets is correct. Both designs could potentially be used as a proxy to measure model quality and also to collect additional labeled training data from observing user interactions.

Screenshot of the transcription service Temi, which provides an excellent editor to show and edit transcripts. By encouraging users or in-house experts to edit transcripts in such an observable environment, such a system could collect fine-grained telemetry about mistranscribed words.
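A minimal sketch of this idea, under the assumption that both the model’s transcript and the user-edited version are available as plain text: comparing the two yields (predicted, corrected) word pairs that could be logged as telemetry and used as candidate labels for future training.

```python
# Sketch: derive implicit feedback from user edits to a transcript.
import difflib
from typing import List, Tuple

def edited_words(model_transcript: str, user_transcript: str) -> List[Tuple[str, str]]:
    """Return (predicted, corrected) word-span pairs where the user changed the text."""
    predicted = model_transcript.split()
    corrected = user_transcript.split()
    matcher = difflib.SequenceMatcher(a=predicted, b=corrected)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            changes.append((" ".join(predicted[i1:i2]), " ".join(corrected[j1:j2])))
    return changes

# Example: the user corrected a classic misrecognition.
print(edited_words("it is hard to recognize speech", "it is hard to wreck a nice beach"))
# [('recognize speech', 'wreck a nice beach')]
```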

In general, it can be very difficult to acquire representative and generalizable training data. Different forms of data drift are common in production, when the distribution of data changes over time and may no longer align well with the training data (we will discuss different forms of drift in chapter Data Quality). For example, in the transcription service, the model needs to be continuously updated to support new names that recently started to occur in recorded conversations (e.g., of new products or newly popular politicians). Further model development and new machine-learning techniques may also improve models over time. Anticipating the need to continuously collect training data and evaluate model and system quality in production will allow developers to prepare a system proactively.

Again, a focus on the entire system rather than a model-centric focus encourages a more holistic view of aspects of data collection and encourages design for change, preparing the entire system for constant updates and experimentation.

Predictions Have Consequences

As already discussed in the introduction, most software systems, including those with machine-learning components, interact with the environment. They aim to influence how people behave or directly control physical devices, such as self-driving cars. As such, predictions have consequences in the real world, positive as well as negative. Reasoning about interactions between the software and the environment outside of the software (including humans) is a system-level concern and cannot be done by reasoning about the software or the machine-learned component alone.

From a software engineering perspective, it is prudent to consider every machine-learned model as an unreliable function within a system that sometimes will return unexpected results. Given how we learn models, by fitting a function to match observations rather than writing specifications, such mistakes seem unavoidable (it is not even clear that we can always clearly determine what constitutes a mistake). Since we have to accept eventual mistakes, it is up to other parts of the system, including user interaction design or safety mechanisms, to compensate for such mistakes.

Consider the safety concerns of wrong predictions in a smart toaster and how to design a safe system regardless.

Consider Geoff Hulten’s example of a smart toaster that uses sensors and machine learning to decide how long to toast some bread, achieving consistent outcomes at the desired level of toastedness. As with all machine-learned models, we should anticipate eventual mistakes and the consequences of those mistakes for the system and the environment. Even a highly accurate machine-learned model may eventually suggest toasting times that would burn the toast or even start a kitchen fire. While the software is not unsafe itself, the way it actuates the heating element of the toaster can be a safety hazard. More training data and better machine-learning techniques may make the model more accurate and robust and reduce the rate of mistakes, but they will not eliminate mistakes entirely. Hence, the system designer should look for means to make the system safe despite the unreliable component. For the toaster, this can include (1) non-ML code that caps toasting at a maximum duration regardless of the model’s prediction, (2) an additional non-ML component that uses a temperature sensor to stop toasting at some set point, or (3) simply installing a thermal fuse (a cheap hardware component that shuts off power when it overheats) as a non-ML, non-software safety mechanism to ensure that the toaster does not overheat. With these safety mechanisms, the toaster may occasionally burn some toast when the machine-learned model makes mistakes, but it will not burn down the kitchen.
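The following sketch shows how such non-ML safety mechanisms might wrap the model in code; the model, sensor, and heater interfaces are invented for illustration, and the numeric limits are placeholders rather than real design values.

```python
# Illustrative sketch: non-ML code enforcing safety limits around the ML component.
MAX_TOAST_SECONDS = 240          # (1) hard cap enforced regardless of the prediction
MAX_SAFE_TEMPERATURE_C = 180.0   # (2) set point for the sensor-based shutoff

def toast(model, sensors, heater) -> None:
    suggested = model.predict_toasting_seconds(sensors.read_bread_state())  # ML suggestion
    duration = min(suggested, MAX_TOAST_SECONDS)        # never exceed the hard cap
    heater.on()
    for _ in range(int(duration)):
        if sensors.read_temperature_c() > MAX_SAFE_TEMPERATURE_C:
            break                                        # stop early on overheating
        sensors.wait_one_second()
    heater.off()
# A thermal fuse (3) would add a final, non-software safety layer not visible in code.
```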

Feedback loops are another important way in which the consequences of predictions show up in systems. As people react to a system at scale, predictions of the model may reinforce or amplify the behavior the model initially learned, thus producing more data to support the model’s predictions. Feedback loops can be positive, such as in public health campaigns (e.g., against smoking) when people adjust their behavior in intended ways, providing role models for others and more data to support the intervention. However, many feedback loops are negative, reinforcing bad outcomes. Beyond the YouTube recommendation system suggesting more and more conspiracy theory videos mentioned in the introduction, feedback loops have frequently been argued to reinforce historical bias. For example, a system predicting more crime in an area that is overpoliced due to historical bias may lead to more policing and more arrests in that area, which provides additional data reinforcing the discriminatory prediction even if the area does not have more crime than others. Understanding feedback loops requires reasoning about the entire system and how it interacts with the environment.

Just as safety is a system-level property that requires understanding how the software interacts with the environment, so are many other qualities of interest, including security, privacy, fairness, accountability, energy consumption, and user satisfaction. In the machine-learning community, many of these qualities are now discussed under the umbrella of Responsible AI. A model-centric view that focuses only on the analysis of a machine-learned model without considering how it is used in a system cannot make any assurances about system-level qualities such as safety (we will discuss this in more detail in the Safety and Security chapters) and will have a hard time anticipating feedback loops. Responsible engineering requires a system-level approach.

Interdisciplinary Teams

In the introduction, we already argued that building production systems requires a wide range of skills, typically brought together by team members with different specialties. Taking a holistic system view of ML-enabled systems reinforces this notion further: machine-learning expertise alone is not sufficient, and even the engineering skills to build machine-learning pipelines and deploy models cover only small parts of a system. Software engineers also need to understand the basics of machine learning to know how to integrate machine-learned components and plan for mistakes. When considering how the model interacts with the rest of the system and how the system interacts with the environment, we need to bring together diverse skills. For collaboration and communication in these teams, AI literacy is important, but so is understanding system-level concerns like user needs, safety, or fairness.

On Terminology

Unfortunately, there is no standard term for referring to building production systems with machine-learning components. In this quickly evolving field, there are many terms and they are largely not used consistently. In this book, we adopt the term “ML-enabled system” or simply the descriptive “production system with machine-learning components” to emphasize the broad focus on the entire system, in contrast to a more narrow model-centric focus of data science education or even MLOps pipelines. The terms “ML-infused system” or “ML-based system” have been used with similar intentions.

In this book, we talk about machine learning and largely focus on supervised learning. Technically, machine learning is a subfield of artificial intelligence, where machine learning refers to systems that learn functions from data (“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.” — Tom Mitchell, 1997). There are many other artificial intelligence approaches that do not involve machine learning, such as constraint satisfaction solvers, expert systems, and probabilistic programming, but many of them do not share the same challenges arising from missing specifications of machine-learned models. In colloquial conversation and media, machine learning and artificial intelligence are used largely interchangeably and AI is the favored term among public speakers, media, and politicians. For most terms discussed here, there is also a version that uses AI instead of ML, e.g., “AI-enabled system” rather than “ML-enabled system.”

The software-engineering community sometimes distinguishes between “Software Engineering for Machine Learning” (short SE4ML, SE4AI, SEAI) and “Machine Learning for Software Engineering” (short ML4SE, AI4SE). The former refers to applying and tailoring software-engineering approaches to problems related to machine learning, which includes challenges of building ML-enabled systems but also more model-centric challenges like testing machine-learned models as isolated components. The latter refers to using machine learning to improve software engineering tools, such as using machine learning to detect bugs in code or to automatically generate textual summaries of code fragments. While software engineering tools with machine-learned components are also ML-enabled systems, they are not necessarily representative of the typical end-user focused ML-enabled system discussed in this book, such as transcription services or tax software.

The term “AI Engineering” and the job title of an “ML engineer” are gaining popularity to highlight a stronger focus on engineering in data-science projects. They most commonly refer to building automated pipelines, deploying models, and MLOps and hence tend to skew model-focused rather than system-focused, though some people use the terms also with a broader meaning. The terms ML System Engineering and MLSys (and sometimes also AI Engineering) refer to the engineering of infrastructure for machine learning and serving machine-learned models, such as building efficient distributed learning algorithms.

To further complicate terminology, the terms AIOps and DataOps have also been suggested and are distinct from MLOps. AIOps tends to refer to the use of artificial intelligence (mostly machine learning) techniques in the operation of software systems, for example, using models to decide when and how to scale deployed systems. DataOps tends to refer to agile methods and automation in business data analytics.

Summary

Turning machine-learned models into production systems requires a shift of perspective that focuses not only on the model but on the entire system, including many non-ML parts, and on how the system interacts with the environment. This requires zooming out from a focus on model training; a broader and more engineering-heavy focus on machine-learning pipelines emphasizes important aspects of deploying and updating machine-learned components but is still model-centric; considering the entire system (including non-ML components) and how it interacts with the environment is essential for building production systems responsibly. The nature of machine-learned components, which can roughly be characterized as unreliable functions, places a heavy emphasis on understanding and designing the rest of the system to achieve the system goals despite occasional wrong predictions, and without serious harm or unintended consequences when those wrong predictions occur.

Further Readings

  • Book discussing the design and implementation of ML-enabled systems, including coverage of considerations for user interaction design and planning for mistakes beyond a purely model-centric view: Hulten, Geoff. Building Intelligent Systems: A Guide to Machine Learning Engineering. Apress. 2018
  • Essay arguing that the machine-learning community focuses on ML algorithms and improvements on benchmarks, but should do more to focus on impact and deployments (as part of systems): Wagstaff, Kiri. “Machine Learning that Matters.” In Proceedings of the 29th International Conference on Machine Learning, 2012.
  • Interview study revealing how the common model-centric focus undervalues data collection and data quality and how this has downstream consequences for the success of ML-enabled systems: Sambasivan, Nithya, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M. Aroyo. ““Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI.” In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–15. 2021.
  • Interview study revealing conflicts at the boundary between ML and non-ML teams in production ML-enabled systems, including differences in how different organizations prioritize models or products: Nahar, Nadia, Shurui Zhou, Grace Lewis, and Christian Kästner. “Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process.” In Proceedings of the 44th International Conference on Software Engineering (ICSE), May 2022.
  • On ML pipelines: Short paper reporting how machine-learning practitioners struggle with switching from a model-centric view to considering and automating the entire ML pipeline: O’Leary, Katie, and Makoto Uchida. “Common problems with Creating Machine Learning Pipelines from Existing Code.” Proc. Third Conference on Machine Learning and Systems (MLSys) (2020).
  • On ML pipelines: A well known paper arguing for the need of paying attention to engineering of machine-learning pipelines: Sculley, David, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. “Hidden technical debt in machine learning systems.” In Advances in neural information processing systems, pp. 2503–2511. 2015.
  • Experience report from teams at booking.com, with a strong discussion about the difference between model accuracy improvements and improving system outcomes: Bernardi, Lucas, Themistoklis Mavridis, and Pablo Estevez. “150 successful machine learning models: 6 lessons learned at Booking.com.” In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1743–1751. 2019.

As all chapters, this text is released under Creative Commons 4.0 BY-SA license.
