Architectural Components in ML-Enabled Systems

Christian Kästner
13 min read · Feb 11, 2022

This chapter introduces and organizes the various considerations that go into the design of ML-enabled systems and all their components. It sets the stage for the deeper dive into quality tradeoffs, model serving, ML pipelines, MLOps, distributed systems, and more in subsequent chapters that can be found in the table of contents.

Components of a system and their assembly.

There are many texts that cover how to implement specific pieces of software systems and ML pipelines. However, when building production systems with machine-learning components (and pretty much all software systems really), it is important to plan what pieces are needed and how to fit them together into a system that achieves the overall system goals. Programmers have a tendency to jump right into implementations, but this risks building systems that do not meet the users’ needs, do not achieve the desired qualities, do not meet the business goals, and are hard to change when needed. Machine learning introduces further complications through additional ML components and their interactions with the rest of the system. The importance of taking time to step back to plan and design the entire system is one of the key lessons of software engineering as a discipline (and engineering more broadly); such emphasis on design is arguably even more important when building systems that use machine learning.

Every System is Different

Ideally, we would like to describe a common architecture of an ML-enabled system, but systems differ widely in their needs and quality expectations. Hence designers and engineers need to understand the specific requirements of a particular problem and design a solution that fits this problem. To illustrate the point, let us just contrast four different examples:

  • A personalized recommendation component for a music streaming service can deploy the model as a microservice within a larger system of many non-ML components. User activities on the service provide plenty of data to train the model and to observe how well the recommendations keep users engaged. To serve the large number of recommendation requests, each of which is relatively cheap to compute, with reasonable response time, the recommendation service can be hosted with many instances in a cloud infrastructure. The recommendation model itself will be regularly updated without human intervention. Being in an online environment with millions of users makes it easy to experiment with different models. If the recommendation component fails, it is easy to fall back on cached previous recommendations or non-personalized global top-10 recommendations.
  • The transcription service from the introduction will also be hosted in a cloud-based environment. Each user only occasionally interacts with the system, uploading audio files at irregular intervals when needed. Here, the model is large and model inference for actually transcribing audio is computationally expensive but not very time-sensitive; a work queue is likely a better fit than trying to return results immediately when audio files are uploaded. Model updates are not as frequent, so human steps and manual testing in the update and deployment process may be acceptable if safeguards against mistakes are in place.
  • A self-driving car typically uses dozens of models that need to work on on-board computers in real-time. Locally available hardware sets hard constraints on what is computationally feasible. Mistakes can be fatal, so non-ML components will provide significant logic to ensure safety, interacting closely with the ML components. While model updates are possible, their rollout will likely be slow and inconsistent across the entire fleet of cars. Experimenting with different model versions in practice is limited by safety concerns, but lots of data can be collected for later analysis and future training — so much data in fact that collecting and processing the data may be a challenge.
  • A privacy-focused smart keyboard on a mobile phone may continuously train a model for autocompletion and autocorrection locally on the phone. Here, not only inference, but also data processing and training happens locally on a battery-powered device, possibly using novel federated machine-learning algorithms. When collecting telemetry to monitor whether updates lead to better predictions, developers have to make careful decisions about what data to collect and share.

These few examples already illustrate many kinds of differences that will inform many design decisions, including (1) different degrees of importance of machine learning to the system’s operation, (2) different frequency, latency requirements, and cost of model inference requests, (3) different amounts of data to process during model inference or model training, (4) different server-side or client-side deployment of model inference and model training, (5) different frequency of model updates, (6) different opportunities and requirements for collecting telemetry data and conducting experiments in production, and (7) different levels of privacy and safety mechanisms. As a consequence, these systems all face very different challenges and will explore different design decisions that serve their specific needs.

With every system being different, designers, software architects, and engineers have the responsibility to design a solution that fits the problem. Software architecture and design is the process of understanding the different design choices and common solutions and reasoning about which of them fit the specific needs of the system at hand. Instead of starting entirely from scratch, skilled software architects will investigate the needs of a system and build on knowledge and experience of common designs and their tradeoffs to customize a solution for a given system. For example, for a system needing to achieve massive scale but not immediate responses, they may build on distributed data storage and stream-processing infrastructure. To this end, it is important to understand the various design decisions in building ML-enabled systems and the common solutions with their benefits and drawbacks. Throughout the next chapters, we will explore many of those.

Common Components in ML-Enabled Systems

As discussed in the chapter From Models to Systems, a machine-learned model is typically a component among many other components in an ML-enabled system. In the following, we provide a brief overview of common components that are used as building blocks of ML-enabled systems.

Machine learning in ML-enabled systems typically manifests in two key forms:

  1. A model inference service called by other parts of the system to use a machine-learned model to make predictions for input data. For instance, in the transcription service scenario, a model inference service turns audio into text.
  2. The machine-learning pipeline used to train or update the model. In the transcription scenario, this would be all the code and infrastructure to train the transcription model from training data.

In addition, there are usually several supporting components to collect data, to share functionality between the model inference service and the ML pipeline (e.g., a feature store), to store and process data at scale, and to monitor the model inference service and the ML pipeline, as well as various non-ML components for processing predictions of the model inference service (e.g., for safety checks) and for interacting with the environment (e.g., through a user interface or actuators).

Extended architecture sketch of the transcription system from chapter From Models to Systems. Instead of a single ML component (the model inference service “Speech Recognition”), it now also includes the pipeline to train and deploy model updates, shared code between training and inference in the feature store, the user interface and an internal labeling tool as sources of training data, some monitoring infrastructure for the model, and infrastructure for large-scale data storage and processing.

Of course, many ML-enabled systems use not only a single model but may compose multiple models to achieve a single larger task or use multiple models for different tasks. For example, implementations for autonomous driving are often broken down into multiple specialized models for detecting traffic lights, street markings, and pedestrians, and the predictions of those models are combined for the larger task either through handwritten non-ML logic or through other machine-learned models. In such cases, a system would usually have multiple model inference service components and multiple ML pipeline components, among all the other components of the system.
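As a rough illustration of such composition, the following sketch combines the outputs of two separate vision models with handwritten non-ML logic to decide whether a car may proceed. The model classes, thresholds, and decision rule are purely hypothetical stand-ins, not a real self-driving architecture.

```python
# Hypothetical sketch: composing multiple specialized models with
# handwritten non-ML logic. All classes and thresholds are illustrative.

class TrafficLightDetector:
    """Stand-in for a vision model classifying the traffic light state."""
    def predict(self, camera_frame) -> str:
        return "green"  # a real model would run inference on the frame

class PedestrianDetector:
    """Stand-in for a vision model estimating pedestrian probability."""
    def predict(self, camera_frame) -> float:
        return 0.002  # a real model would run inference on the frame

def may_proceed(camera_frame, light_model, pedestrian_model) -> bool:
    # Handwritten non-ML logic combines both predictions conservatively:
    # proceed only on a green light and when a pedestrian is very unlikely.
    light = light_model.predict(camera_frame)
    pedestrian_probability = pedestrian_model.predict(camera_frame)
    return light == "green" and pedestrian_probability < 0.01

if __name__ == "__main__":
    frame = object()  # placeholder for a camera frame
    print(may_proceed(frame, TrafficLightDetector(), PedestrianDetector()))
```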

Data Sources

Machine learning requires data. Depending on the system, the data can come from many different origins, such as manual entry and collected telemetry.

In some cases, data may be collected, curated, and labeled manually, for example, when nurses are asked to enter additional data into electronic records when seeing patients in a hospital. Crowdsourcing data or labels is also common, paying many workers small amounts for each contribution or even encouraging users to enter data voluntarily for free, for example, through various incentives or gamification. For manual entry and labeling, considering user interface design, process integration, incentives, and quality control is essential.

To collect data at scale, especially in systems with many users, it is common to collect data during the system’s operation, in particular from how users interact with the system and from sensors observing the environment. This includes purposefully designed telemetry, where the system is instrumented to collect specific data, as well as log data that was previously collected without a specific machine-learning goal in mind. Typical challenges around telemetry and log data include massive data volumes, proxy measures rather than direct observations in noisy data, and privacy considerations. Designing telemetry collection to gather training data and insights into model and system quality in production is a key design challenge of many ML-enabled systems.
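As a small, hypothetical illustration of telemetry design for the transcription scenario, the following sketch logs each served transcription and whether the user later edited the result, which can serve as a (noisy) proxy measure of transcription quality. The event names and fields are assumptions made up for this example.

```python
import json
import logging
import time

telemetry = logging.getLogger("telemetry")  # routed to a log aggregator in production

def log_transcription_served(request_id: str, model_version: str, audio_length_sec: float):
    # Record each model inference; at massive volumes, sampling may be needed.
    telemetry.info(json.dumps({
        "event": "transcription_served",
        "request_id": request_id,
        "model_version": model_version,
        "audio_length_sec": audio_length_sec,
        "timestamp": time.time(),
    }))

def log_transcript_edited(request_id: str, edit_distance: int):
    # Users editing the transcript is a proxy measure, not a direct
    # observation of transcription mistakes.
    telemetry.info(json.dumps({
        "event": "transcript_edited",
        "request_id": request_id,
        "edit_distance": edit_distance,
        "timestamp": time.time(),
    }))
```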

We will discuss telemetry design primarily in chapter Quality Assurance in Production, but it will also come up in many other design discussions in chapters Automating the ML Pipeline, Scaling the System, and Planning for Operations. Except for a brief discussion in chapter Data Quality, we will largely skip a discussion of manual forms of data collection and suitable interfaces and processes, but recommend the Data Cascades paper as a good starting point for further readings on this topic.

Automated ML Pipeline

Training a model consists of many steps, including data cleaning, feature engineering, selecting and tuning machine-learning algorithms, and evaluating the trained models. Depending on the amount of data and the activities involved, each step can be computationally expensive, beyond the scope of a single machine. Designing a system so it can be changed more easily is a key focus in traditional software engineering, and it seems prudent to make this a focus in ML-enabled systems too, where data scientists may continuously improve models or models may need to be updated with new data. When models are only updated occasionally, performing manual steps to package and deploy models may be acceptable, but when updates happen frequently or the project wants to experiment with different models in production, more automation of the machine-learning pipeline is desired. Ideally, all steps from loading the data to deploying a tested model are fully automated. Versioning and tracking data provenance may also be important considerations to facilitate reproducibility, accountability, and debugging.
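To make the idea of an automated pipeline concrete, here is a minimal sketch in which every step, from loading data to deploying a tested model, is a plain function called from a single entry point. The dataset, quality threshold, and file-based "deployment" are hypothetical placeholders; a production pipeline would pull data from a data lake, run on dedicated infrastructure, and push models to a registry or inference service.

```python
import pickle

from sklearn.datasets import load_breast_cancer  # stand-in dataset
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

ACCURACY_THRESHOLD = 0.9  # assumed quality gate before deployment

def load_data():
    data = load_breast_cancer()
    return data.data, data.target

def train(X_train, y_train):
    model = LogisticRegression(max_iter=5000)
    model.fit(X_train, y_train)
    return model

def evaluate(model, X_test, y_test) -> float:
    return accuracy_score(y_test, model.predict(X_test))

def deploy(model, path="model.pkl"):
    with open(path, "wb") as f:  # stand-in for pushing to a model registry
        pickle.dump(model, f)

def run_pipeline():
    X, y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = train(X_train, y_train)
    accuracy = evaluate(model, X_test, y_test)
    if accuracy >= ACCURACY_THRESHOLD:
        deploy(model)  # only deploy models that pass the quality gate
    else:
        raise RuntimeError(f"Model rejected: accuracy {accuracy:.2f}")

if __name__ == "__main__":
    run_pipeline()
```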

We will discuss the choices particularly around automating the ML pipeline in chapter Automating the ML Pipeline.

Model Inference

Conceptually, machine-learned models are just functions that are called by other parts of the system to make predictions. The component offering this function to other parts of the system is called the model inference service. There are many different deployment options for a model, from embedding it directly into an application, to using it within a batch job over massive amounts of data, to deploying it as a web service with many instances in a cloud infrastructure. Different deployment decisions have different consequences for inference latency, hardware requirements, scalability, cost, observability, the ease of updating, and the ease of experimenting in production. In addition, model predictions need to be integrated with non-ML business logic and safety mechanisms within the system, and often multiple models need to work together. Given the inductive nature of machine learning, documenting model inference functions is notoriously challenging.
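As one common deployment option among these, a model can be wrapped in a small web service. The following sketch uses Flask to expose a prediction endpoint; the model file, route, and input format are assumptions for illustration, for example, a model serialized by the pipeline sketched earlier.

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model once at startup; "model.pkl" is a hypothetical artifact
# produced by the training pipeline.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"features": [0.1, 3.2, ...]}.
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": prediction.item()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Scaling such a service is then largely a matter of running many instances behind a load balancer, which is one reason this deployment style is popular for cheap, high-volume inference requests.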

We will discuss the various design options and tradeoffs regarding how to deploy model inference services in the Deploying a Model chapter.

Feature Store

The machine-learning pipeline and the model inference service often share some code and perform the same preprocessing steps for data cleaning and feature engineering. In some cases, these preprocessing steps may involve the collection of data from various systems or substantial computations. Some computed features may also be reused by multiple models. Feature stores are a common infrastructure component to share such data transformations and handle them at scale.
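A lightweight way to avoid divergence between training and serving is to place the feature computation in one shared module that both the ML pipeline and the model inference service import, as in the hypothetical sketch below; dedicated feature stores additionally precompute, cache, and serve such features at scale. The feature names are made up for the streaming-service example.

```python
# features.py: shared feature-encoding code imported by both the training
# pipeline and the model inference service, so both apply identical
# transformations to raw data.

from datetime import datetime, timezone

def encode_user_features(user: dict) -> list[float]:
    """Turn a raw user record into the model's feature vector.

    Assumes the record contains a timezone-aware signup_date, a
    songs_played_last_week count, and an is_premium flag.
    """
    account_age_days = (datetime.now(timezone.utc) - user["signup_date"]).days
    return [
        float(account_age_days),
        float(user["songs_played_last_week"]),
        1.0 if user["is_premium"] else 0.0,
    ]
```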

We will discuss feature stores as part of the Deploying a Model chapter.

Data Management

While software systems without machine learning also often handle large amounts of data, machine learning pushes data to the forefront and often encourages the collection of large amounts of data, just in case some of it might be useful for future analyses or models. With this, large-scale data storage and data processing have gained in importance, as has the job role of data engineers who specialize in designing and operating data storage and processing systems. Distributed data storage and distributed batch or stream-processing systems are important infrastructure components used as building blocks in many ML-enabled systems.
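As a small taste of such infrastructure, the following sketch uses PySpark to run a batch aggregation over telemetry logs too large for a single machine. The storage paths and schema are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transcription-log-stats").getOrCreate()

# Hypothetical telemetry logs stored as JSON files in a data lake.
logs = spark.read.json("s3://example-bucket/telemetry/transcriptions/")

# Distributed aggregation: request volume and average audio length per model version.
stats = (logs
         .groupBy("model_version")
         .agg(F.count("*").alias("requests"),
              F.avg("audio_length_sec").alias("avg_audio_length_sec")))

stats.write.mode("overwrite").parquet("s3://example-bucket/reports/transcription_stats")
```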

The chapter Scaling the System will discuss basic building blocks for distributed data storage and processing.

Non-ML Components

Finally, every ML-enabled system will also have non-ML components. Even in the simplest systems built around a machine-learned model, such as a sales prediction system or a smart thermostat controller, there will usually be at least some non-ML code to gather input data (e.g., user inputs, databases, sensors), to then invoke model inference, and to finally present the result or initiate an action, even if that is as simple as sending an email or turning on a switch.

To build systems that are usable, safe, and secure, such non-ML components are essential. For example, developers can deliberately design how the system interacts with users and the environment in a way that wrong model predictions do not lead to disasters. Typical ML-enabled systems also integrate substantial non-ML infrastructure for deploying and monitoring the various components.
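For instance, a smart thermostat could wrap the model output in simple non-ML safety logic that bounds what the system will actually do, regardless of what the model predicts. The bounds and function below are made up for illustration.

```python
# Hypothetical non-ML safety logic around a model prediction in a smart
# thermostat: hard-coded bounds limit the consequences of bad predictions.

MIN_TEMP_C = 5.0   # never let the home get cold enough for pipes to freeze
MAX_TEMP_C = 30.0  # never overheat the home

def choose_setpoint(predicted_preference_c: float) -> float:
    """Clamp the model-suggested temperature to a safe range."""
    return max(MIN_TEMP_C, min(MAX_TEMP_C, predicted_preference_c))

# Even if the model predicts something absurd, the actuator only ever
# receives a value within the safe range.
assert choose_setpoint(80.0) == 30.0
assert choose_setpoint(-10.0) == 5.0
```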

In chapter Human-AI Interaction we will discuss some considerations for user interface design. The later chapter on Safety will dive deeper into (usually non-ML) safety mechanisms. In chapter Planning for Operations, we will discuss monitoring infrastructure and other system design decisions that facilitate operations of the various components and the entire system.

Common System-Wide Challenges in Designing ML-Enabled Systems

There are many design challenges in every system and the specifics will be different from system to system. Yet there are some common themes prominent in many ML-enabled systems that require the deliberate design of multiple components and will hence appear repeatedly in discussions of specific design decisions.

The following design considerations will feature prominently in the following chapters:

  • Separating concerns and understanding interdependencies: The various ML and non-ML components in a system interact, often in subtle ways. For example, predictions of a model are used in non-ML business logic, influencing user interactions, which produce telemetry, which may be used for monitoring and updating models. Also, local design decisions in individual components influence other components, such as a small enough trained model enabling different deployment choices than a large one. Designers need to plan how machine-learning components are used within a system by other components, for example, to understand and limit the consequences of poor predictions. Importantly, needs and interdependencies may evolve over time, as system requirements and understanding of what is technically feasible evolve. Designers need to assign responsibilities and understand interdependencies, commonly through identifying, documenting, monitoring, and enforcing data flows and qualities at interfaces between components.
  • Facilitate experimentation and updates with confidence: Designers need to navigate the tension between exploratory model development and robust deployments. Many ML-enabled systems expect frequent updates of ML and non-ML components. Preparing for experimentation and updates can provide key advantages, but also puts a larger emphasis on rigorous engineering, quality assurance, and operations practices. It usually comes with substantial infrastructure investment for building robust, automated, and modular machine-learning pipelines and infrastructure for conducting experiments and monitoring outcomes in production.
  • Separating training and inference and closing the loop: Most systems deliberately separate model training from model inference, while managing shared feature engineering code. In addition, many modern ML-enabled systems are supposed to continuously learn and continuously collect data, often from user interactions in the system that the machine-learned models help to shape. When operating in a continuously evolving system, automation and observability become essential: Designers need to focus on telemetry design and automate the entire machine-learning pipeline to continuously collect data, to continuously update models, and to continuously observe whether the system operates as intended.
  • Learn, serve, and observe at scale or with resource limits: Many ML-enabled systems operate with very large amounts of data and should be deployed at massive scale in the cloud, whereas others may be deployed on embedded devices with restricted resources. In the former case, scalable designs using distributed systems are common to balance latency, throughput, and fault tolerance. In the latter, creative solutions may be needed to operate with low resources and possibly even entirely offline.

Summary

There is no single one-size-fits-all design for software systems, but there are many common solutions to recurring design challenges from which skillful developers can choose. The key is understanding the needs of the system as a whole and of its components, and making informed design decisions to support these needs. Here, we outlined the additional components often found in ML-enabled systems that should be considered when designing such systems, including components for data collection, for training and serving models, and for data management. We also emphasized common design challenges in ML-enabled systems around managing component interdependencies, supporting experimentation, closing the loop, and operating at scale, which will be common themes in the following chapters.

Further Readings

  • Concrete example of an architecture of a chatbot involving both ML and non-ML components: Yokoyama, Haruki. “Machine learning system architectural pattern for improving operational stability.” In 2019 IEEE International Conference on Software Architecture Companion (ICSA-C), pp. 267–274. IEEE, 2019.
  • Interview study on data quality challenges in ML-enabled systems, including incentive and process challenges, with an extensive related work section pointing also to other work in this field: Sambasivan, Nithya, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M. Aroyo. ““Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI”. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–15. 2021.
  • An interview study with software architects in ML-enabled systems describing common challenges they face: Serban, Alex, and Joost Visser. “An Empirical Study of Software Architecture for Machine Learning.” In Proceedings of the International Conference on Software Analysis, Evolution and Reengineering (SANER), 2022.
  • An interview study discussing the vast differences between different production ML-enabled system projects and the teams involved, highlighting that many problems occur often at the interfaces between components: Nahar, Nadia, Shurui Zhou, Grace Lewis, and Christian Kästner. “Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process.” In Proceedings of the 44th International Conference on Software Engineering (ICSE), 2022.

Like all chapters, this text is released under the Creative Commons BY-SA 4.0 license.
