Deploying a Model

Christian Kästner
Feb 16, 2022

This chapter covers content from the “Software Architecture of AI-enabled Systems” lecture of our Machine Learning in Production course. It focuses on the deployment of a machine-learned model for inference tasks within a system; for the pipeline for training the model see the next chapter Automating the ML Pipeline, for scaling model inference in distributed systems see the chapter Scaling the System, and for operations aspects see the chapter Planning for Operations. For other chapters see the table of contents.

Production systems that use machine learning will at some point make predictions using the machine-learned models. The step of using a machine-learned model to make a prediction for some given input data is typically called model inference. Model inference is typically a function within a larger system, called by parts of the system requesting a prediction.

While seemingly simple, there are lots of design decisions and tradeoffs when considering how to deploy a model, depending on system requirements, model qualities, and interactions with other components. A common form is to provide a model inference service as a microservice, called through remote procedure calls by other components of the system, but models may also be embedded as libraries on client devices or even deployed as precomputed tables. The need to scale to many model inference requests or to frequently update or experiment with models are common drivers for architectural choices.

Example: Stock Photo Search

Consider the example of a stock photo site like Shutterstock, Pexels, or Pixabay where users may look for photos to illustrate their work. Searching for images with keywords can be challenging; often authors have to manually tag their images with relevant keywords. Here machine-learned object detection models can help to automatically identify what is in an image to enable textual search. Object detection is a fairly standard machine-learning problem where state-of-the-art models achieve high levels of accuracy. Similarly, a second model could be trained to detect common landmarks of various cities. The models may be integrated into the stock photo site automatically in the background to create (hidden or visible) tags for each image or could be used to suggest tags to authors when they upload new images, among many other design choices. Either way, the models would be used as parts inside a larger system.

Search in a stock photo site

Model Inference Function

At their bare essentials, most machine-learned models have a really simple interface for prediction, which can be encapsulated as a simple side-effect-free function. They take as input a feature vector of values for the different features and produce as output a prediction using the previously learned model. The feature vector is typically a sequence of numbers that represent the input, say age and gender of a person or colors of pixels of an image. A machine-learned model internally typically follows a relatively simple structure composed of matrix multiplications, additions, and if statements, often in long sequences, where the constants and cutoffs in those computations (the model parameters) have been determined during model training. The resulting prediction can take the shape of a boolean value or probability score for each possible outcome (classification problems), a number (regression problems), or some text, image, or audio for various kinds of more specialized models. In our stock photo example, the object detection model would take as an input an image (a vector of numbers representing colors for the individual pixels) and possibly some other metadata (e.g., geolocation, year) and might return a probability score (and possibly bounding boxes) for each detected object. Models are typically stored in a serialized format in files and are loaded into memory when needed for inference tasks.

For example, a simple classification model taking two numeric feature inputs in scikit-learn is as simple to use as:

from sklearn.linear_model import LogisticRegression
import numpy as np

model = … # learn model or load serialized model …

def infer(feature1, feature2):
    return model.predict(np.array([[feature1, feature2]]))

For object detection with TensorFlow this might look like the following (see the TensorFlow documentation for a more complete example):

import tensorflow as tf

detector_model = … # load serialized model…

def detect_objects_in_image(path):
    # load image and encode it as a float32 tensor batch
    img = tf.image.decode_jpeg(tf.io.read_file(path), channels=3)
    converted_img = tf.image.convert_image_dtype(img, tf.float32)[tf.newaxis, ...]
    # make prediction
    result = detector_model(converted_img)
    return set(result['detection_class_entities'].numpy())

# example use (download_and_resize_image is a helper from the TensorFlow documentation)
image_url = ("https://upload.wikimedia.org/wikipedia/"
             "commons/6/60/Naxos_Taverna.jpg")
downloaded_image_path = download_and_resize_image(image_url, 1280, 856, True)
detect_objects_in_image(downloaded_image_path)
# returns {b'House', b'Chair', b'Flower', b'Window', b'Tree', b'Toy', b'Houseplant', b'Building', b'Tent', b'Coffee table', b'Flowerpot', b'Plant', b'Kitchen & dining room table', b'Umbrella', b'Furniture', b'Table'}

Model inference can easily be encapsulated as a function that takes some user inputs and returns predictions in a format to be used by the application.

Feature Encoding

Feature encoding is an important step in the inference process that may happen within (dotted box) or outside (dashed box) the scope of the inference service.

In almost all settings, there is a feature encoding step that takes the original input (e.g., a jpeg image, a GPS coordinate, a sentence, a row from a database table) and converts it into the feature vector in the format that the model expects. Feature engineering can be simple, such as just normalizing a value representing a person’s age to the range between -1 and 1, or can be more sophisticated, doing nontrivial transformations, nontrivial data aggregation, or even applying other machine-learned models.

In our stock photo scenario, we need to at least crop or resize the image to the size that the object-detection model expects (e.g., 300 x 300 pixels) and then convert it into a vector representing the three color components of each pixel (e.g., a vector of 270,000 bytes representing the three colors of the 300 x 300 pixels). In this example, we might additionally (1) write code to parse an image file for metainformation to extract EXIF geo-coordinates, if present, (2) use a saliency model to identify the focus points of an image to crop it automatically before object detection, or (3) collect data about which other stock photos a given stock photo was often viewed or “liked” with, using expensive queries over large log files or databases. That is, even in our simple stock photo scenario there are several plausible nontrivial feature encoding steps.
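For illustration, a minimal sketch of the resizing and vectorization step of such an encoding function, assuming the 300 x 300 pixel input size mentioned above and using the Pillow library (the function name encode_image is made up for this example):

import numpy as np
from PIL import Image

def encode_image(path):
    # resize the raw image to the input size the model expects
    img = Image.open(path).convert("RGB").resize((300, 300))
    # encode as a 300 x 300 x 3 array of color values, scaled to [0, 1]
    return np.asarray(img, dtype=np.float32) / 255.0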

When thinking of model inference as a component within a system, feature encoding can happen within the model-inference component or can be the responsibility of the client. That is, the client either provides the raw inputs (e.g., image files; dotted box in the figure above) to the inference service or the client is responsible for computing features and provides the feature vector to the inference service (dashed box). Feature encoding and model inference could even be two separate services that are called by the client in sequence. Which alternative is preferable is a design decision that may depend on a number of factors, for example, whether and how the feature vectors are stored in the system, how expensive computing the feature encoding is, how often feature encoding changes, how many models use the same feature encoding, and so forth. For instance, in our stock photo example, making feature encoding part of the inference service is convenient for clients and makes it easy to update the model without changing clients, but we would have to send the entire image over the network instead of just the much smaller feature vector for the reduced 300 x 300 pixel image.

Feature Encoding During Training and Inference

Importantly, the encoding of inputs as features must be the same for training data during training as for inference data at runtime in production. Using an inconsistent encoding between training and inference is dangerous and can lead to wrong predictions as the runtime data does not match the training data. Problems can range from minor inconsistencies, such as inconsistent scaling of a single feature, to completely scrambled inputs, for example, if the order in which colors are encoded changes between training and inference time. Inconsistent encoding between training and inference is known as training–serving skew. To prevent this problem, a common practice is to write feature encoding code as a reusable function that is used both in training and at inference time. This could be as simple as a Python function imported both in the training and in the inference code. Ideally that code is versioned together with the model (see also chapter Versioning, Provenance, and Reproducibility).

The feature encoding code used for training must be the same as the code used for encoding runtime data during inference on that model. It is useful to factor the shared code into a reusable library.
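A minimal sketch of this practice, with hypothetical file and function names, using the age-normalization example from above; both the training pipeline and the inference service import the same function instead of reimplementing it:

# features.py -- single source of truth for feature encoding,
# imported by both the training pipeline and the inference service
def encode_age(age):
    # scale an age in years (assumed range 0..100) to the range [-1, 1]
    return (age - 50) / 50

# training code:        from features import encode_age
# inference service:    from features import encode_age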

Feature Stores

In larger organizations it may additionally be desirable to collect and reuse code for feature encoding, so that it can be reused across different projects. For example, multiple models may need to resize and encode images or to extract and encode GPS coordinates from image metadata. Rather than having multiple teams reimplement the same or similar feature encoding implementations, it can be beneficial to collect them in a central place and make them easily discoverable. This is particularly true for more challenging encodings that may involve machine-learned models themselves, such as using a saliency model to identify and crop the key region in an image, or encodings that require nontrivial database queries, such as collecting metadata about all other images uploaded by the same author. Documenting the different pieces of feature engineering code may further help to discover and reuse them across multiple models and projects.

In addition, if features are reused across multiple projects or need to be extracted from the same data repeatedly over time, it may be worth precomputing or caching the feature vectors. For example, if we regularly update our object detection model and reapply it to all our images in an inference step, we could store the vectorized 300x300x3 arrays after cropping rather than recomputing them every time. We would only recompute feature vectors if the data or the feature encoding code itself changes.
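A minimal caching sketch with hypothetical names, reusing the encode_image function sketched earlier (a feature store provides this functionality with better tooling, persistence, and scale):

feature_cache = {}  # in practice a database or feature store, not an in-memory dict

def get_features(image_id, image_path, encoder_version="v1"):
    # recompute the feature vector only if it was not already computed
    # with this version of the encoding code
    key = (image_id, encoder_version)
    if key not in feature_cache:
        feature_cache[key] = encode_image(image_path)
    return feature_cache[key]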

Many companies have adopted feature stores to solve both problems. Feature stores can catalog reusable feature engineering code and can help to compute, precompute, or cache feature vectors, especially when operating with large amounts of data and expensive feature encoding operations.

There are many competing open-source projects, such as Feast and Tecton, and many commercial ML platforms have feature stores built in, such as AWS SageMaker Feature Store.

Short video from Tecton providing an overview of the different features of their feature store solution.

Model Serving Infrastructure

There are several different approaches to integrate model inference into a system. In the simplest case, the model can just be embedded as part of the application and used with an inference function as described above. However, the model itself may be of nontrivial size and each inference may require nontrivial computational cost, making it worth planning deployment and scalability. Often more decoupling and independent scaling are desired for operating at scale in production, whether to answer individual requests with low latency or to apply the model over and over in a batch job with high throughput. There are many considerations when actually deploying a model inference function as part of a production system.

What makes all of this comparably straightforward is that feature encoding code and inference functions are almost always side-effect-free (also called stateless or pure). They do not store any state between multiple invocations, so all function invocations are independent of each other. This makes it easy to parallelize inference code, to answer separate requests from different processes without any required synchronization (known as embarrassingly parallel).

Common Forms of Serving a Model

At least four designs of offering model inference within a system are common.

Model inference as a library. In simple cases, the model and inference function can simply be embedded in the target system like a library. The model and inference function are loaded into memory in the component using the function and simply accessed as function calls. The model may be shipped as part of the component or loaded over the network. In some cases, such as mobile apps, a single executable may contain all application code and the model as well. There are many libraries that facilitate embedding of serialized models of specific formats, such as TensorFlow Lite (mobile apps) and TensorFlow.js (client-side browser deployment).

Model inference as a service. Likely the most common form of deploying a machine-learned model is to provide a dedicated server that responds to networked client requests over a REST API. Internally, the server will load the model into memory to answer requests quickly. As discussed, the service can receive raw inputs and perform feature encoding before inference, or it can directly receive the encoded feature vectors as inputs. Multiple instances of such a model inference service can be run locally or in some cloud infrastructure to scale the system, often using some container virtualization.

Batch processing. If model inference is performed over large amounts of input data, it is usually preferable to develop worker processes that can perform the inference near the data using batch processing, say as a map task in a MapReduce framework. Using inference in batch processing avoids sending all data over network connections to inference services, but instead sends the smaller models and inference code to the data. See the chapter Scaling the System for more details.

Cached Predictions. In some settings, when predictions are expensive but not many distinct inputs need to be predicted, it can be worth caching or even precomputing all predictions, such that inference can be reduced to looking up previous predictions of the model in some database. If all predictions are precomputed, often batch processing is used whenever input data or model change.
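For example, a minimal sketch of the cached-predictions design, reusing the detect_objects_in_image function from above (the variable all_images is hypothetical): a batch job fills a lookup table whenever the model or data changes, and the application only performs lookups.

# batch job, rerun whenever the model or the set of images changes
prediction_table = {image_id: detect_objects_in_image(path)
                    for image_id, path in all_images}

# application code: model inference is reduced to a table lookup
def get_objects(image_id):
    return prediction_table.get(image_id, set())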

Of course hybrid approaches for all of these exist.

Model Inference as a Service in the Cloud

In recent years, much infrastructure has been developed to deploy and scale computations in the cloud, adding more computational resources as needed and matching computational needs with different hardware specifications (e.g., high memory or high CPU or available GPU). Particularly microservices and serverless functions are common patterns for deploying computations to the cloud that are a good match for providing model inference as a service.

We will explore this in more detail in chapter Scaling the System, but in a nutshell, a microservice is an independently deployable process that provides functions that can be called by other components of the system over the network (usually REST-style HTTP requests). A microservice is typically small, focusing on a single function or a set of closely related functions; it can be developed independently from other services within a system, by a different team and with a different technology stack. An inference service for a single model fits well into the idea of microservices given the small scope of the service — it provides a single function for making predictions. Scalability is achieved by running multiple instances of a service behind a load-balancer; common cloud computing infrastructure provides the means to automatically scale a service by elastically deploying more or fewer instances to meet changing demand. In addition, serverless computing is a common pattern to deploy services in the cloud, where the cloud infrastructure fully automatically scales the service based on the number of requests.

Offering model inference as a service is straightforward with modern infrastructure. Typically, developers wrap the model inference function behind an API that can be called remotely, set up that service in a container (e.g., Docker), and deploy the service’s container to virtual machines or cloud resources.

Writing our own model inference service. While many frameworks automate many of these steps, it is illustrative to understand how we could build our own: A simple Python program can load the model and implement the model inference function as discussed above (with or without feature encoding code, depending on the design). It can then use the library flask to accept HTTP requests on a given port, and for each request run the model inference function and return the result as the HTTP response. For our object detection setting, this may look something like the following code (executed with flask run --host 0.0.0.0 --port 4040), which creates an API to which image files can be sent via POST requests to http://<ip>:4040/get_objects.

from flask import Flask, request, jsonify
app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = '/tmp/uploads'
detector_model = … # load model…

# inference API that returns JSON with classes found in an image
@app.route('/get_objects', methods=['POST'])
def pred():
    uploaded_img = request.files["images"]
    converted_img = … # feature encoding of uploaded_img
    result = detector_model(converted_img)
    return jsonify({"response": result['detection_class_entities']})

In this setting, the model is loaded only once when the process is started and can then serve inference requests with the same model. Multiple processes can be started to share the load.

Of course, such an API can also be designed to make multiple predictions in a single call to save network overhead of individual calls. For example, a client could send multiple images in a single request and receive the detected objects for all images in the result.
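For example, a client might call the single-image API above with the requests library roughly as follows (the server address is a placeholder, and the field name "images" matches the flask code above):

import requests

with open("photo.jpg", "rb") as f:
    response = requests.post("http://<ip>:4040/get_objects",
                             files={"images": f})
print(response.json()["response"])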

To make the service portable, it is common to virtualize it in a container that can be easily deployed on different machines. For example, the following Dockerfile can be used to virtualize our simple flask-based inference server:

FROM python:3.8-buster
RUN pip install uwsgi==2.0.20
RUN pip install numpy==1.22.0
RUN pip install tensorflow==2.7.0
RUN pip install flask==2.0.2
RUN pip install gunicorn==20.1.0
COPY models/model.pf /model/
COPY ./serve.py /app/main.py
WORKDIR ./app
EXPOSE 4040
CMD ["gunicorn", "-b 0.0.0.0:4040", "main:app"]

Infrastructure to deploy model inference services. Model serving is a common problem and much setup is similar between different model inference services: (1) We need to write the inference functions, which almost always have the same structure of loading a model, possibly invoking feature encoding, and passing feature vectors into the model to return its predictions. (2) We need to write the code that accepts REST requests and answers them by invoking the inference function. (3) We need to write the container specifications to wrap all of it and need to figure out deployment locally or in the cloud. (4) We may want to invest further in code for monitoring, autoscaling, and A/B testing. All of these tasks are fairly standard and repetitive, thus a good candidate for abstraction and automation.

There are many infrastructure projects that standardize and make it easy to deploy models without having to write our own inference functions, flask code, or Dockerfiles. For example, many cloud providers offer infrastructure for deploying serverless functions that take code written in a programming language (e.g., Python, Go, Java, C) and wrap it with automatically generated code for serving requests through a REST API inside a container, usually already optimized for throughput, latency, robustness, monitorability, and other qualities.

Some projects specialize in serving machine-learned models directly, so that even the inference function can be generated automatically from metadata describing input and output formats without writing a single line of code. Some of these even directly integrate with model stores and discovery services that show lists of available models. Examples include BentoML (low-code service creation, deployment, model registry, dashboards), Cortex (automated deployment and scaling of models on AWS), TensorFlow’s TFX model serving infrastructure (high-performance framework for TensorFlow gRPC services), and Seldon Core (no-code model service and many additional services for monitoring and operations on Kubernetes). Many commercial services compete in this space as well.

Deployment Architecture Tradeoffs

While the technical deployment of model inference as a library or service is fairly straightforward, questions like whether to deploy as a library or service and where to deploy can be nontrivial design decisions in many systems that involve trading off various qualities.

Server-Side Versus Client-Side Deployment

First, we need to distinguish different locations where the product’s code and model inference will be executed. While there are many different and complicated possible topologies, for simplicity let’s start with (a) servers under the control of the application provider (own servers or cloud infrastructure), and (b) a user-facing device through which the user interacts with the system, such as a desktop computer, a mobile phone, a public terminal, or a smart home appliance.

In most cases, computations are split across server and user-facing device. On one end of the spectrum, all computation happens on the user-facing device, without any connection to other servers. For example, a smart watch’s functionality to detect activities (walking, swimming, and cycling) may be executed entirely on the watch without a network connection. At the other end, almost all computations are performed on some servers and the user-facing device only collects inputs and shows outputs, as in our stock photo example where photos are stored and searched on a server and users access the web page through a browser on a laptop or a mobile app, which has little responsibility beyond rendering content provided by the server. In some settings, some computations will be performed on the user-facing device while others will be delegated to a server, such as a smart spell checker that performs checks of individual words locally on a smartphone but more complex grammar checks on a server.

Server-side model deployment. If model inference is deployed as a service on a server, other components on the server and also components on the user-facing device can perform predictions with remote procedure calls.

Server-side deployment makes it easy to provide dedicated resources for model inference and scale it independently, e.g., autoscaling with a cloud infrastructure. This can be particularly useful if models are large or have resource demands (e.g., memory, energy, GPU) that exceed those of the user-facing devices, such as performing expensive object detection or face recognition tasks. On the flip side, it places all computational cost for model inference on the server infrastructure, rather than distributing it to the user-facing devices.

For this approach to work, the user-facing devices need a network connection to the server. Hence, model inference will not work offline, and network latency and bandwidth can be a concern. If inputs for model inference are large, such as images, audio, and videos, this design can cause significant network traffic, both for clients and server operators. The network latency (roundtrip time for packets on the network) adds to the inference latency of the model itself and can vary significantly based on the network connection. For example, a pedestrian detection model for a safety feature in a car cannot afford to constantly send all video input to a server.

Deploying a model as a library within other server-side application code is also possible, thus avoiding remote procedure calls for model inference. For example, the model may be embedded in the same container that serves the rest of the application logic, such as performing object detection in our stock photo scenario right within the component that receives photo uploads. In practice, however, most server-deployed applications need to deal with the complexities of distributed systems anyway, and deploying the model as a service provides flexibility to decouple the system during operations, for example, scaling model inference and business logic independently on different servers with different hardware specifications (e.g., making use of GPUs for model inference).

Client-side model deployment. Alternatively, the model can be deployed on the client-facing device, typically as a serialized model file read by a library that is bundled with the application.

Client-side model deployment takes advantage of the computational resources of the user-facing device, thus saving costs for operating inference services on servers. Since no network connection is needed, model inference can be performed offline, and there are no concerns about network latency or bandwidth. Offline, on-device inference is also commonly seen as protecting privacy, since user data is not (necessarily) leaving the user-facing device. For example, we may not want to send all text processed by a smart spell checker to a company’s servers.

When models are deployed client-side, the target devices need to have sufficient computational resources for model inference; inference itself is often slower than on a dedicated server. For example, smartphones usually do not have powerful GPUs for expensive model inference tasks. If the hardware of client devices is heterogeneous, for example, for applications deployed on many different mobile phones, inference latency and user experience may vary substantially between users, leading to inconsistent user experiences. A homogeneous environment with predictable inference latency is usually only possible if the client-facing device is managed, such as public kiosks or self-driving cars, or the software is run only in select environments, such as only two product generations of smartwatches of a single manufacturer. The embedded models also increase the download size and storage requirements for the applications that contain them, for example, sometimes increasing the size of mobile apps for downloads and updates substantially. For battery-powered devices, the energy cost of model inference may also be a challenge. For memory- and computation-intensive deep neural networks, model size is therefore often intentionally compressed after training, in a step called knowledge distillation. Libraries like TensorFlow Lite and TensorFlow.js are specifically designed to execute deep neural networks on mobile devices or in browsers.
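For illustration, a minimal sketch of on-device inference with TensorFlow Lite’s Python interpreter; the model file name is made up, and a mobile app would use the corresponding Android or iOS APIs instead:

import numpy as np
import tensorflow as tf

# load the compressed model bundled with the application
interpreter = tf.lite.Interpreter(model_path="detector.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

def detect_on_device(encoded_image):
    # encoded_image: numpy array matching the model's expected input shape
    interpreter.set_tensor(input_details[0]['index'],
                           encoded_image.astype(np.float32))
    interpreter.invoke()
    return interpreter.get_tensor(output_details[0]['index'])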

If users have access to the client device, for example, on their desktop computers or mobile phones (rather than on public kiosks), sophisticated users can gain access to the models themselves (often stored in standard formats), which can raise concerns about intellectual property and adversarial attacks (see also chapter Security and Privacy). For example, researchers have extracted thousands of Tensorflow models embedded with Android applications and web pages; with access to model internals it is much easier to craft malicious inputs than when having to repeatedly probe a server-side model inference service.

Hybrid model deployments. There are many possible designs where models are deployed both on servers and user-facing devices (or other infrastructure in a network). A common scenario is to deploy a smaller or special-purpose model on the client side and a larger model on the server, where the smaller model is used in many settings as a first step and the larger server-side model is used for more challenging inputs. For example, voice-activated smart speakers often have a small client-side model to detect an activation sequence such as “Alexa” or “Hey Siri” and send the subsequent audio with the main command to a server for analysis.

Hybrid models can be beneficial for several reasons. First, they can provide services even when offline, albeit possibly at lower quality, for example providing a basic spell checker offline but an advanced one when online. Second, they may protect privacy by processing many inputs locally on the device, for example, only sending voice recordings to a server after the activation sequence is detected by a smart speaker. Third, they can provide low-latency answers to many queries, but update them with more accurate predictions from a server-side model a little later, for example, when having to react quickly in a self-driving car. Fourth, they can balance computational and network costs, pushing some computational costs to the clients and reducing network traffic, for example only using servers when the client has a fast network connection but low computational resources.

The deployment approach of precomputed and cached predictions (see above) can be particularly useful in a hybrid setting: Instead of deploying the model to a client, only a lookup table with precomputed predictions is deployed. If an input of interest is covered by the table, the client can simply look up the result; if not, a request is sent to the server. The lookup table may be smaller than the original model and thus easier to deploy to client-facing devices; model inference is reduced to a computationally-cheap and near instantaneous lookup in a table. This design also prevents leaking model internals to anybody inspecting the application on a user’s device.
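A minimal sketch of this design with hypothetical names; the lookup table precomputed_predictions is shipped with the application, and the server is contacted only for inputs it does not cover:

import requests

def predict(input_key, raw_input):
    if input_key in precomputed_predictions:
        return precomputed_predictions[input_key]  # instant, works offline
    # fall back to the server-side model inference service for uncovered inputs
    response = requests.post(INFERENCE_SERVICE_URL, files={"input": raw_input})
    return response.json()["response"]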

Deploying feature encoding code. As discussed above, the feature encoding functionality of converting the raw input data into a feature vector expected by the model can be done in the model inference service (or library) or delegated to the client calling the service (or library). The choice whether to compute feature encodings server-side or client-side is independent of where model inference is performed. Feature encoding itself may be computationally intensive in some settings and the encoded feature vector may be substantially smaller than the raw input, so similar design considerations regarding computational costs and network bandwidth apply.

Considerations for Updates and Online Experimentation

Decisions about where to deploy a model can influence significantly how the model can be updated and how easy it is to experiment with different model variants. Conversely, a requirement for frequent model updates or flexible experimentation can influence what deployment decisions are feasible.

Server-side updates. For server-side model deployments, it is easy to deploy a new version of a model without having to modify the clients. Assuming that input and output types of the different model versions are compatible, developers can simply replace the model behind an API endpoint or change the network configuration to route requests to an updated model inference service. For example, with an object detector deployed server side in our stock photo example, we can simply swap out the object detector without changing any other part of the system. This way it is also possible to ensure that all users use the same version of the model, even if some users use an older release of the application.

This flexibility to update the model independently from the application is also routinely used to experiment with different model variants in production by serving different users different variants of the model (again without any changes to client code) — see the chapter Quality Assurance in Production for more details. Since model inference is already performed on the server, it is easy to collect telemetry data about inference and monitor the model, including request frequency and the distribution of inputs and outputs — for example, to test whether the objects detected in new stock photos correspond roughly to the training distribution or contain many unknowns.
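As an illustration, a minimal sketch of such server-side experimentation with hypothetical model handles; real systems typically use load balancers or dedicated experimentation infrastructure, but the idea is the same: deterministically route a fraction of users to a candidate model.

import hashlib

def pick_model(user_id, experiment_fraction=0.1):
    # deterministically assign each user to one of 100 buckets
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return candidate_model if bucket < experiment_fraction * 100 else production_model

def predict_for_user(user_id, features):
    return pick_model(user_id).predict(features)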

Client-side updates. For client-side model deployments, updates and experimentation are more challenging and may require more planning. The key problems are (1) that updates may require transferring substantial amounts of data to the user-facing devices (especially with very big models) and (2) that the devices may not all be connected with equally reliable and fast network connections and may not always be online.

There are a couple of typical designs:

  • Entirely offline, no updates: In some cases, software is embedded in hardware, not networked, and never updated. For example, this may happen in cheap embedded devices, such as toys. Here a model may be embedded as a library and never updated. Since the system is offline, we also cannot collect any feedback about how the model is behaving in production.
  • Application updates: If models are embedded in the application deployed to the target device, users may be able to update the entire application including the model. This could be done manually or through an auto-update mechanism. For example, unless explicitly deactivated, most environments with app stores automatically download application updates, and many applications have their own update mechanism. Each update downloads the entire application with the embedded model, ensuring that only deliberately deployed combinations of model version and application version are used in production.
  • Separate model updates: Some applications design update mechanisms specifically for their machine-learned models. Here it is possible to update the model without updating the application itself — or conversely to update the application without re-downloading the model. For example, a translation application may download and update translation models for individual languages without updating the app or other models.

Update strategy considerations. Several design considerations may influence the deployment and update choices.

First, users may not be willing to install updates on client devices frequently. The amount of data downloaded for updates can be a concern, particularly on slow or metered network connections. If manual steps are required for an update, it may be challenging to convince users to perform them regularly. If updates interfere with the user experience, for example, through forced reboots, users may also resist frequent updates or intentionally delay updates, as often seen with security updates. For example, Facebook decided to update their mobile apps only once every two weeks, even though they update models and application code on their servers much more frequently, because they do not want to burden users with constant large updates.

Second, applications and models may be updated at different rates. If model updates are much more common than application updates, separate model updates avoid re-downloading the application for every model change. Conversely, if application code and model are updated independently (possible both with client-side and server-side models), there are additional challenges in that model and application versions may be combined in use that have not been tested together, exposing possible inconsistencies. This is particularly concerning if feature encoding code and model drift apart.

Again, notice how system-level requirements can fundamentally influence system design and can trigger challenging tradeoffs. For example, if model updates are very frequent, client-side deployments may not be feasible, but if offline operation is required it is not possible to ensure that clients receive frequent and timely updates.

Experimentation and monitoring on client devices. Experimentation is much easier with server-side model deployments, as different models can easily be served to different users without changing the application. While it is possible to deploy different versions to different client devices, there is little flexibility for aborting unsuccessful experiments by rolling back updates quickly. While it is possible to deploy multiple alternatives guarded by feature flags, this further increases the download size (especially if multiple model versions are deployed within the same application) and still relies on network connections to change the feature flag that decides which model is used.

Similarly, telemetry is much easier to collect on server-side deployments. However, unless the target device is entirely offline, telemetry data can also be collected from client-side deployments. Telemetry data is collected on the device and sent back to some server. If the device is not permanently connected, telemetry data might be collected locally until a sufficiently fast network connection is available. Furthermore, telemetry data might be curated or compressed on the device to minimize network traffic. Hence, fidelity and timeliness of telemetry may be significantly reduced with client-side deployments.
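A minimal sketch of such client-side telemetry collection with hypothetical names; predictions are summarized and buffered locally and uploaded in batches when a suitable connection is available:

import requests

telemetry_buffer = []

def log_prediction(input_summary, prediction):
    # store only a compact summary locally to limit storage and network traffic
    telemetry_buffer.append({"input": input_summary, "prediction": prediction})

def maybe_upload_telemetry(has_fast_connection):
    global telemetry_buffer
    if has_fast_connection and telemetry_buffer:
        requests.post(TELEMETRY_URL, json=telemetry_buffer)  # hypothetical endpoint
        telemetry_buffer = []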

Scenario Analysis: Augmented-Reality Translation

Since all computations are performed server-side in the stock photo scenario in the first place, let us discuss a second example where computations are scattered across multiple devices. To illustrate some tradeoffs and engineering decisions necessary in the design process of ML-enabled systems, let us walk through the nontrivial example of deploying models in a scenario of augmented-reality translation on smart glasses. Consider the following scenario: You are walking around as a tourist in a foreign country and have difficulty navigating, not understanding the language. A translation app on your phone helps and you can even take pictures where the app identifies and translates text on signs. However, we want to take this a step further and project live translations over the original text directly in smart glasses, making it convenient to see translated text while moving around in the real world, as a form of augmented reality.

Korean signs in central Seoul (by 8minwoo). Foreigners would likely benefit from translations.

While we are not aware of any existing products like this, the ingredients are there. For example, Google Translate can automatically recognize and translate text in images and can even do live translation based on a live video coming from the phone’s camera, replacing the original text in the filmed scene with an overlaid translated text. Smart glasses are currently probably best known from Google’s 2013–2015 Google Glass, where the video input is recorded from a camera in the glasses and images can be projected with an optical head-mounted display. While Google Glass is no longer available, currently several companies are working on related technology.

Google Glass, 2014 (CC BY-SA 3.0 Mikepanhu)

Our augmented reality translation scenario is a little more complex than the simple client versus server decision discussed above. In addition to the cloud backend and the user-facing glasses themselves, typically the glasses will be connected to a smartphone over Bluetooth, which mediates access to the internet in mobile settings (the original Google Glass had Bluetooth and Wi-Fi connectivity, but no mobile data).

Also, this kind of system likely needs multiple machine-learned models. Among others, we likely need at least (1) an optical character recognition (OCR) model that detects what text is in an image and where, (2) a translation model to translate the recognized text, and (3) some motion tracking or image stabilization model that can identify how objects move in a video, so that we can move the translated text without having to run OCR and translation on every single frame of the video.

For each of these three models we will need to make a deployment decision: should we deploy it to our cloud infrastructure, to the phone, or to the glasses themselves?

As in every architectural discussion it is important to start by understanding the quality requirements for the system. Of course, we want highly accurate translation of text in the video inputs. In addition, we need to offer translations relatively quickly (translation latency) as text comes in and out of view and we need to update the location of where text is displayed very quickly to avoid motion artifacts and cybersickness — a technology-induced version of motion sickness common in virtual and augmented reality occurring when perception and motion do not align. We want translations to provide an acceptable user experience offline or at least in situations with slow mobile data connectivity. Due to privacy concerns, we do not want to constantly send captured videos to the cloud. Given that both smartphone and glasses are battery powered, we need to be cognizant of energy consumption; we also need to cope with available hardware resources in the smart glasses and with a large diversity of hardware resources in smartphones. In contrast, let us assume that we are not too concerned about operating cost of cloud resources and we do not have a need for frequent model updates (monthly updates or fewer could be enough). Telemetry is nice to have, but not a priority.

These requirements and quality goals already indicate that full server-side deployment of the OCR model is unlikely a good choice, unless we compromise on the privacy goal and the goal of being able to operate offline. At the same time, low priority for update frequency and telemetry allow us to explore client-side deployments as plausible options.

To further understand how different design decisions influence the qualities of interest, we will likely need to conduct some research and perform some experiments. This may include:

  • Understanding hardware constraints: The 2014 Google Glass had a 5-megapixel camera, a 640x360 pixel screen, 1 gigabyte of RAM, and 16 gigabytes of storage. Performance testing can reveal hardware capabilities and energy consumption for model inference of various kinds of models. For example, how often can we run a standard OCR model on an image of the glasses’ camera on the smart glasses with one battery charge, and how often can we do this on a low-budget smartphone?
  • Understanding latency requirements: We can do online research to get a rough understanding of what latency we need to pursue: 200 milliseconds latency is noticeable as speech pauses, 20 milliseconds is perceivable as video delay, and 5 milliseconds is sometimes referenced as the cybersickness threshold for virtual reality. In our augmented reality setting, we might get away with 10 milliseconds latency, but more testing is needed. It is likely prudent to conduct our own tests with simple demos or other applications on the target hardware, modifying the update frequency to test what is acceptable to test users, before designing and building the actual system.
  • Understanding network latency: Online documentation suggests expecting Bluetooth latency between 40 milliseconds and 200 milliseconds for roundtrip messages. Latency of internet connections (wifi or mobile data) can vary significantly. We can likely get a good sense of what latency is realistic in good and less ideal settings.
  • Understanding bandwidth constraints: Quick online research might tell us that Bluetooth has a bandwidth of up to 3 megabits per second, but that video streaming requires 4 to 10 megabits per second for low to medium quality video and that high-quality still images are 4 to 8 megabytes per image.

It can be useful to create specific diagrams for architectural views for specific concerns to facilitate reasoning (see chapter Thinking like a Software Architect). For example, even a simple diagram as the following to summarize observations and measurement results can support understanding the system for discussing alternatives.

Our initial analysis reveals that full server-side analysis of video input is not realistic. We simply do not have the bandwidth and also latency is likely prohibitively high. At the same time, the low resources available on the glasses may prevent us from performing inference for all three models continuously on the glasses on every frame of the video.
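A rough back-of-envelope calculation with the numbers collected above illustrates the problem: a single 4-megabyte still image corresponds to 32 megabits, which takes about 32 / 3 ≈ 11 seconds to transfer over a 3 megabit-per-second Bluetooth connection, before any internet latency or server-side inference time. That is far too slow for live translation of a video stream.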

Likely we want to explore a hybrid solution: Use the image stabilization model on the glasses to move displayed text with the video with very low latency. This way, we do not need to perform OCR on every frame of the video and can also accept lower latency. Translation only needs to be invoked whenever new text is detected, here local caching of results may be very effective, but we may also be able to accept latency of up to a second or two until the translation is shown on the glasses (experiments needed). Translation itself could be delegated to a server-side model, though having a client-side model (possibly smaller and with lower accuracy) on the phone or glasses allows the system to work also offline.

Whether to conduct OCR on individual frames of the video on the glasses or on the smartphone comes down to questions of model size, hardware capabilities, energy consumption, and the accuracy of different model alternatives. It may be possible to perform detection of text in image snapshots on the glasses to crop the image to relevant pieces with untranslated text and send those to the phone or cloud for better OCR. A final decision might only be made after extensive experimentation with alternative designs, including alternative deployments and alternative models.

While we cannot make a final recommendation without knowing the specifics of models and hardware, this kind of architectural design thinking — soliciting requirements, understanding constraints, collecting information, and discussing design alternatives — is common and representative of architectural reasoning. It informs requirements and goals for the models, for example, to explore not only the most accurate models but also models that could efficiently run on the glasses’ hardware constraints. It also helps to decide the entire system design, not just model requirements, such as exploring solutions with lower-latency translations but ensuring that translations, once available, are moved with the video at very low latency to avoid cybersickness.

Model Inference in a System

Model inference will be an important component of an ML-enabled system, but it needs to be integrated with other components of a system, including non-ML components that request and use the predictions, but often also other ML components. It is worth being deliberate about how to divide responsibility in a system.

Separating Models and Business Logic

Designers of ML-enabled systems typically recommend separating models and business logic within the system. This aligns with classic approaches to system design that typically strive to separate concerns, such as the separation of frontend, business logic, and data storage in a three-tier architecture mentioned in chapter Thinking like a Software Architect. Such separation, if done well, makes it easier to encapsulate change and evolve key parts of the system independently.

In ML-enabled systems, separation between models and business logic typically comes easy because the model inference component is naturally modular and the code that implements the logic interpreting a model prediction uses different technology than the model in the first place. For example, in the stock photo scenario a component other than the model inference service would send a request to the model, apply some sanitation checks or transformation rules to the retrieved identified objects, and write results to a database. In this example, the non-ML logic could discard predictions that find river and sky in the same region of the image and it could add the tags Pittsburgh and Pennsylvania whenever a landmark of the city has been identified — these are hardcoded rules written by developers (domain knowledge or business rules), not machine learned from some data. In many systems, business logic can be much more sophisticated than in the stock photo scenario — for example, an inventory management system may automate orders based on current inventory and forecasted sales using some hand-written rules; an autonomous driving system may plan steering and acceleration based on rules integrating predictions from multiple components with hard-coded safeguards.

Similarly, the infrastructure for training a model (the machine-learning pipeline) and for collecting and processing training data can, and usually should, be separated from the non-ML parts of the system to isolate ML-related code and infrastructure and evolve it independently.

Example architecture, adapted from “Machine Learning System Architectural Pattern for Improving Operational Stability” by Haruki Yokoyama, emphasizing the separation of ML-specific and business-logic-specific components in a three-tier architecture. Notice how the inference engine is clearly separated from the business logic. Also the infrastructure for collecting data is separated and data processing is reused for serving and training.

Composing Multiple Models

Many ML-enabled systems use not just one, but multiple models. Composition comes in many shapes and forms, often nontrivial, but there are several common patterns to compose related models.

Ensembles and metamodels. Multiple models are independently trained on the same problem. For a given input, all models are asked for a prediction in parallel and the results are integrated. For ensemble models, typically the average or majority result is returned or the most conservative result — this is best known for random forest models where multiple decision trees are averaged, but this can also be used to combine different kinds of models. In our stock photo setting, we could also simply take the set union of the objects detected by multiple models. Beyond just averaging results, metamodels are models that learn how to combine the predictions of multiple other models, possibly learning when to trust which model or model combination. This is also known as model stacking. For example, a metamodel in our stock photo scenario could learn which of multiple object detection models to trust on landscape photography and which model to trust for photos with people.

Ensemble models combine the predictions of multiple models according to an aggregation function. Metamodels combine the predictions using another machine-learned model.
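A minimal sketch of such an ensemble in the stock photo setting, with hypothetical model handles and the set union as the aggregation function:

def detect_objects_ensemble(image, models):
    # ask each object-detection model independently and merge the results
    predictions = [model.detect_objects(image) for model in models]
    return set().union(*predictions)

# example use with two hypothetical models
# detect_objects_ensemble(img, [everyday_object_model, landmark_model])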

Decomposing the problem sequentially. Multiple models work together toward a goal, where each model contributes a different aspect. For example, in image captioning, an object detector might first detect objects, followed by a language model generating plausible sentences, followed by another model to pick the best sentence for the picture (see example). In our stock photo scenario, as mentioned, we might use a saliency model to identify the focus points of an image to crop it automatically before object detection. With such an approach, we decompose a more complex problem into parts that can be learned independently, possibly infusing domain knowledge about how to solve the problem and generally making the problem more approachable. For a given input, inference of multiple models is performed sequentially, where the outputs of one model are used as inputs for the next. It is also common to interleave models with non-ML business logic, such as hand-written rules and transformations between the various models. This is common in many approaches to autonomous driving, as visible in the architecture of Apollo previously shown in chapter Thinking like a Software Architect.

Sequential composition of three models for image captioning as described in Fang et al. “From captions to visual concepts and back.” In Proc. Conference on Computer Vision and Pattern Recognition, pp. 1473–1482. 2015. Instead of learning captions in a single step, the problem is decomposed into problems that can be learned individually.
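A minimal sketch of such a sequential composition in the stock photo scenario, with hypothetical model handles: a saliency model crops the image, the object detector runs on the cropped image, and hand-written business logic cleans up the result.

def tag_image(image):
    # step 1: a saliency model identifies the region of interest
    focus_region = saliency_model.find_focus(image)
    cropped = crop(image, focus_region)
    # step 2: the object detector runs only on the cropped region
    objects = object_detection_model.detect_objects(cropped)
    # step 3: non-ML business logic filters or augments the predictions
    return {tag for tag in objects if tag not in BLOCKED_TAGS}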

Cascade/two-phase prediction. Multiple models are trained for the same problem, but instead of performing inference in parallel as in ensemble models, inference is performed sequentially, where the output of one model decides whether inference is performed on the next model. This is primarily used to reduce inference costs, including the cost associated with sending requests to a cloud service. In a sequential two-phase design, typically most predictions are made with a small and fast model, but a subset of the inputs is forwarded to a larger and slower model. This is particularly common with hybrid deployments where the small model is deployed on a client device but the larger model in the cloud — for example, voice-activated speakers often use a fast and specific model to detect an activation phrase (e.g., “siri”, “alexa”, “okay google”) on the device, sending only select audio afterward to the larger models.

Example of a two-phase prediction approach to detecting which instrument is audible in an audio snippet. A first model, small enough to be deployed in an app, classifies whether the audio contains any instrument (rather than voices or noise) and only if it detects any is a second larger model asked to identify the type of instrument. Note that the result of the first model is not used as input for the second, but it influences whether the second is asked at all. Example taken from Lakshmanan, Valliappa, Sara Robinson, and Michael Munn. Machine learning design patterns. O’Reilly Media, 2020.
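A minimal sketch of the two-phase idea from the instrument example, with hypothetical model handles; the small on-device model decides whether the larger server-side model is contacted at all.

def identify_instrument(audio_snippet):
    # phase 1: small, fast model deployed on the device
    if not small_instrument_detector.predict(audio_snippet):
        return None  # no instrument detected, no server request needed
    # phase 2: larger server-side model, invoked only for the interesting subset
    return large_instrument_classifier_service.predict(audio_snippet)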

Documenting Model Inference Interfaces

Whether provided as a library or web service, a model inference service has an interface with which users can get predictions from the model. In many systems, this interface is right at the boundary between different teams — the team that trains and operates the model and the team that builds the software using the model. These teams often have different backgrounds and need to coordinate. In some projects these teams are part of the same organization with little bureaucracy and can coordinate and negotiate requirements, but in other projects organizational distance may be high or a model created by others is integrated with little chance of coordinating with the model creators or even asking them support questions.

In all cases, documentation at this interface is important, though often neglected. Using a model without understanding its quality or limits can lead to unfortunate quality problems in the resulting system.

In traditional software components, documentation for interfaces between components (also often called APIs) is more common. Traditionally, interface descriptions include the type signature, but also a description of the functionality of the component’s functions (typically textually, but also more formal descriptions of pre- and post-conditions are possible). Qualities of traditional software components may be part of the interface description or described in service level agreements (SLA). For example, a service level agreement may characterize expected latency or throughput of a component’s function, availability if hosted as a service, or reliability.

{
  "mid": string,
  "languageCode": string,
  "name": string,
  "score": number,
  "boundingPoly": {
    object (BoundingPoly)
  }
}

Excerpt of the kind of technical documentation often used to describe the interface of a model inference service, here from Google’s public object detection API. The API documentation describes the URL for the endpoint, authentication, and the format in which requests need to be sent (a JSON object in which images are base64-encoded strings), and that the API will return identified objects as LocalizedObjectAnnotation objects, shown in this figure, describing the object, confidence, and bounding box.

Documentation for model inference components and their quality attributes. In current practice, model requirements may be discussed among stakeholders but are rarely explicitly documented. While it is fairly straightforward to give a technical explanation of how to invoke a model inference function or service, what parameters it expects, and what it returns, it can be very challenging to provide comprehensive documentation.

  • The intended use cases of a model and its capabilities may be obvious to the creator but not to an application developer looking for a model to support a specific task. For example, an object detection model in our stock photo scenario may be intended by its developers for detecting everyday objects but not landmarks and it should not be relied upon to distinguish objects such as weapons in mission critical settings. Such a description can also foster deliberation about ethical concerns of using the model.
  • The supported target distribution for which the model was trained is often difficult to capture and describe. The target distribution is roughly equivalent to pre-conditions in traditional software: it characterizes which kinds of inputs the model supports and which are out of scope. For example, should the object detector work on all kinds of photographs, or does it expect the object of interest to be well lit and centered? Is it intended to work on drawings too? It is common to describe the target distribution with a brief textual characterization or indirectly by describing the distribution of the training data.
  • As we will discuss in later chapters, the accuracy of a model is difficult to capture and communicate. Yet interface documentation should contain some description of expected accuracy for different use cases or target distributions. Typically this requires a description of how the model was evaluated (especially on what data). This might include reporting accuracy separately for different kinds of inputs, such as separate accuracy results of a facial-recognition model across gender, age, and ethnic groups.
  • Latency, throughput, and availability of model inference services can be documented through traditional service level agreements if the inference component is hosted as a service. If it is provided as a library, documenting timing-related qualities and resource requirements provides similarly useful information for users.
  • Other technical model qualities such as explainability, robustness, and calibration should be documented if available and assessed.
  • Considering the various risks of using machine learning in production discussed later, responsible engineers should consider ethics broadly and fairness, safety, security, and privacy more specifically. Communicating the considerations and evaluations made on the model side helps users make similar considerations for the components that use the model and for the system as a whole.
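The following sketch illustrates how some of these aspects could be summarized in the documentation of an inference function for our stock photo scenario; the function name and all accuracy and latency figures are made up purely for illustration and are no substitute for a full model card or fact sheet:

def suggest_tags(image: bytes) -> list[dict]:
    """Detect everyday objects in a photograph to suggest searchable tags.

    Intended use: tag suggestion for stock photo search; not intended for
    landmark recognition or for safety-critical detection tasks.

    Target distribution: well-lit photographs with the objects of interest
    reasonably prominent; drawings and heavily edited images are out of scope.

    Accuracy: 0.83 mean average precision on a held-out set of 10,000 stock
    photos (illustrative figure only); see the evaluation report for results
    broken down by object category.

    Latency: typically below 200 ms per image on a single CPU core
    (illustrative figure only).

    Returns a list of {"name": str, "score": float, "bounding_box": tuple}
    entries, one per detected object.
    """
    ...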

Academics and practitioners have proposed several formats for documenting models. Best known is likely the Model Cards proposal out of Google’s Ethical AI group: It suggests a template for a brief (1–2 page) description of models including intended use, training and evaluation data, considered demographic factors, evaluation results broken down by demographics, ethical considerations, and limitations. This proposal particularly emphasizes considerations around fairness (discussed in later chapters). The FactSheets proposal from IBM similarly recommends a template for model documentation including intended applications, evaluation results, and safety, explainability, fairness, security, and lineage considerations.

Example of a 1-page model documentation following the model card template, from the original model card paper: Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. “Model cards for model reporting.” In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 220–229. 2019.
Excerpt of the Limitations section of the model card documenting Google’s public Object Detection API at https://modelcards.withgoogle.com/object-detection. The same model card also breaks down prediction accuracy of the model by many different criteria, helping users assess whether the model fits their use case.

At the time of this writing, documentation practices for models are fairly poor. Our own interviews with practitioners showed that model inference components are almost never explicitly documented in production projects within or between organizations, yet frictions between the teams developing a model and the teams using it abound at this interface. Public APIs for models, such as Google’s Object Detection API, usually document well how to invoke the service and what data types are transferred, but only a few provide detailed information about other aspects, including intended use, target distribution, evaluation, robustness, and ethical considerations. The model card and fact sheet proposals for model documentation are frequently discussed but only rarely adopted — our own research showed that, aside from fewer than a dozen flagship examples, even when the model card template is adopted, the resulting documentation is often shallow.

We believe that documenting models (and other ML components) is one of the big open practical challenges in building ML-enabled systems today. Documentation is undervalued yet essential at this coordination point between teams. The inductive nature of machine learning, which lacks clear specifications (see chapters Introduction and Model Quality: Defining Correctness and Fit), raises new challenges when describing target distributions and expected accuracy, as well as many additional concerns around fairness, robustness, and security that can be hard to capture precisely — if they have been considered and evaluated at all.

Summary

Conceptually, model inference is straightforward and side-effect free, and hence easy to scale given enough hardware resources. Careful attention should be paid to keeping feature-encoding code consistent between training and inference; feature stores provide a reusable solution for this. Wrapping a model with a REST API and deploying it as a microservice on a server or in a cloud infrastructure is conceptually straightforward, and a large number of infrastructure projects have been developed to make deployment and management of the service in the cloud easier. With the right infrastructure, it is now possible to deploy a machine-learned model as a scalable web service with a few lines of configuration code and a single command-line instruction.
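As a rough illustration of how little code the basic wrapping step requires, the following sketch exposes a pickled scikit-learn-style model through a small Flask service — the file name and request format are made up for illustration, and a production deployment would add input validation, monitoring, and containerized, auto-scaled hosting:

import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical serialized model; any object with a predict() method would work here.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"features": [0.3, 1.2, ...]}.
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": prediction.tolist()
                    if hasattr(prediction, "tolist") else prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)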

Beyond just infrastructure, deciding where to deploy a model, how to integrate it with non-ML code and other models, and how to update it can be challenging when designing ML-enabled systems. Many choices influence operating cost, latency, privacy, ease of updating and experimentation, the ability to operate offline, and many other qualities. Sometimes design decisions are obvious, but many projects require careful consideration of alternatives, including experiments with those alternatives in early prototyping stages to collect data before committing to a design. Stepping back and approaching this problem like a software architect fosters design thinking beyond local model-focused optimizations.

Finally, many of the concepts and solutions discussed throughout this chapter have been described as patterns. In writing about this topic, you may encounter the serverless serving function pattern (i.e., deploying model inference in an auto-scalable cloud infrastructure), the batch serving pattern (i.e., applying model inference at scale as part of a batch job), and the cached/precomputed serving pattern (i.e., using tables to store predictions for common inputs) as deployment approaches, corresponding to the approaches discussed above. The idea of feature stores is occasionally described as the feature store pattern. The two-phase prediction pattern and the decouple-training-from-serving pattern refer to the approaches for composing models and non-ML components discussed above.
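To illustrate the cached/precomputed serving pattern with a few lines of code, the following sketch serves predictions for previously seen inputs from a lookup table and falls back to model inference only for new inputs; the in-memory dictionary stands in for whatever key-value store or database a real system would use:

import hashlib

# In-memory stand-in for a precomputed prediction table (e.g., a key-value store).
prediction_table: dict[str, list[str]] = {}

def image_key(image_bytes: bytes) -> str:
    # A hash of the input serves as the lookup key.
    return hashlib.sha256(image_bytes).hexdigest()

def get_tags(image_bytes: bytes, model) -> list[str]:
    key = image_key(image_bytes)
    if key in prediction_table:
        return prediction_table[key]      # cheap table lookup, no model inference
    tags = model.predict(image_bytes)     # slower path: actual model inference
    prediction_table[key] = tags          # cache the result for future requests
    return tags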

Further Reading

  • Extensive discussions of different deployment strategies and telemetry strategies, as well as model composition approaches: Hulten, Geoff. “Building Intelligent Systems: A Guide to Machine Learning Engineering.” Apress, 2018
  • The MLOps community has developed a vast number of tools and infrastructure projects for the easy deployment of machine-learned models. For a good introduction see https://ml-ops.org/ and https://github.com/visenger/awesome-mlops and https://github.com/kelvins/awesome-mlops
  • Book discussing several common solutions around deployment, including stateless serving functions, batch serving, feature stores, and model versioning as design patterns with illustrative code examples: Lakshmanan, Valliappa, Sara Robinson, and Michael Munn. Machine learning design patterns. O’Reilly Media, 2020.
  • Specific architecture suggestion for ML-enabled systems including separating business logic from model inference, complemented with an illustrative concrete example: Yokoyama, Haruki. “Machine learning system architectural pattern for improving operational stability.” In 2019 IEEE International Conference on Software Architecture Companion (ICSA-C), pp. 267–274. IEEE, 2019.
  • Proposals for templates for model documentation: Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. “Model cards for model reporting.” In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 220–229. 2019. and Arnold, Matthew, Rachel K. E. Bellamy, Michael Hind, Stephanie Houde, Sameep Mehta, Aleksandra Mojsilović, Ravi Nair, Karthikeyan Natesan Ramamurthy, Darrell Reimer, Alexandra Olteanu, David Piorkowski, Jason Tsay, and Kush R. Varshney. “FactSheets: Increasing trust in AI services through supplier’s declarations of conformity.” IBM Journal of Research and Development 63, no. 4/5 (2019): 6–1.
  • Interview study finding model documentation challenges at the interface between teams: Nahar, Nadia, Shurui Zhou, Grace Lewis, and Christian Kästner. “Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process.” In Proceedings of the 44th International Conference on Software Engineering (ICSE), May 2022.

As with all chapters, this text is released under the Creative Commons BY-SA 4.0 license.
