Planning for Operations of ML-Enabled Systems
This chapter covers content from the “Infrastructure Quality, Deployment, and Operations” lecture of our Machine Learning in Production course. For other chapters see the table of contents.
The term operations describes all tasks, infrastructure, and processes involved in deploying, running, and updating a software system and the hardware it runs on in production. Operations is often considered a distinct activity from the development of the software and is often performed by team members with different expertise. Operating a software system reliably at scale often takes a lot of skill, preparation, and infrastructure. Systems that use machine learning introduce additional tasks, such as making sure that model training and model inference operate well. They usually put more emphasis on moving and processing very large amounts of data when operating the system. They also often emphasize updates and experimentation more heavily than traditional software systems due to constant exploration in developing models and as a reaction to drifting data.
As the community collectively has gained and shared experience on how to operate software systems and a lot of reusable infrastructure is available, reliable operation has become attainable. Nonetheless, the learning curve can be steep and the necessary infrastructure investment can be extensive. When ML-enabled projects reach a certain scale or complexity, organizations should consider bringing in operations specialists, who may have role titles such as system administrators, release engineers, site reliability engineers, DevOps engineers, MLOps engineers, data engineers, and AI engineers.
Since operations benefits from certain design qualities of the software system, supporting future operations is another important design task that should be considered early on in a system’s development, just as the other design and architecture considerations discussed before. Building a system that is hard to scale, hard to update, and hard to observe will be an operations nightmare. In contrast, by building observability into the system’s infrastructure early on and virtualizing the software for easy deployment, the operations team will have a much easier time and can focus on providing value back to developers in terms of fewer interruptions or faster experimentation.
Again, following the theme of T-shaped team members (see chapter Introduction), software engineers, data scientists, managers, and other team members in ML-enabled systems benefit from understanding key concerns of operators, which allows them to anticipate their requirements, involve them early on, and design the system to be more operations friendly. Conversely, operators need a basic level of literacy in machine-learning concepts to select and operate infrastructure to support machine-learning components.
While we could dedicate an entire book to operating software systems generally and ML infrastructure specifically (and several books cover many of these topics in depth, see further readings at the end), here we provide only a brief overview of key concepts and how they influence design and architectural considerations for the system, with pointers to other materials to go deeper. We will briefly return to operations later when we discuss teamwork and cultural aspects in chapter DevOps and MLOps Culture.
Case Study: Blogging Platform with Spam Filter
As a running example, let us use a hosted blogging platform and a spam filter based on some advanced machine-learning algorithm. The system is developed as a microservice architecture with different services for showing a blog, for adding and editing posts, for comments, for user accounts, and for our spam filter.
Service Level Objectives
As usual, it is a good idea to start with requirements and then design key aspects of the system needed to achieve these requirements, before diving into infrastructure, implementation, and actual day-to-day operation. Based on system requirements, we can identify quality requirements for operations of the system. Such requirements are typically stated in terms of service level objectives that describe desired qualities. Typical service level objectives for operating a service include maximum response latency, minimum system throughput, targeted availability or maximum error rate, and time to deploy an update. For storage systems, we might additionally set durability objectives; for big data processing systems, there might be throughput and job latency objectives. Notice how operators usually cannot achieve these goals alone but depend on the capabilities of the software they deploy and operate, hence collaboration with developers and data scientists is essential to deploy systems that achieve their service level objectives reliably in operation. Metrics and tools to measure these objectives are usually well established and often stated as averages or distributions over time periods, such as “able to serve at least 5000 requests per second” or “99 percent of all requests should respond within 100 milliseconds.”
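To make such objectives concrete, consider how they would be checked against observed telemetry. The following is a minimal sketch in Python with hypothetical data, checking the two example objectives above over one window of observed requests; in practice, monitoring infrastructure computes such measures continuously over sliding time windows rather than by hand.

# Minimal sketch: checking example service level objectives against one
# window of observed telemetry (data and thresholds are hypothetical).

# One entry per request in the observation window: (latency in ms, success flag)
observed_requests = [(87, True), (12, True), (340, False), (95, True)]  # ...

latencies = [latency for latency, _ in observed_requests]
successes = sum(1 for _, ok in observed_requests if ok)

# "99 percent of all requests should respond within 100 milliseconds"
fast_enough = sum(1 for latency in latencies if latency <= 100)
latency_objective_met = fast_enough / len(latencies) >= 0.99

# availability objective, measured here as the fraction of successful requests
availability = successes / len(observed_requests)
availability_objective_met = availability >= 0.999

print(f"fast requests: {fast_enough / len(latencies):.2%}, objective met: {latency_objective_met}")
print(f"availability: {availability:.4f}, objective met: {availability_objective_met}")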
Service level objectives can and should be defined for the system as a whole as well as for individually operated components. For example, our blogging platform will care about response latency and availability for serving the web site to users, but also about the response latency and availability of the spam filtering service. Service quality measures of the system and its components may be related, but do not have to be, since error-handling mechanisms can preserve system operation even when individual components fail. In the blogging platform, the system can keep operating while the spam filter component is down, either showing or hiding new comments by default and processing missed comments once the spam filter component is back up. Anticipating such failures helps to recover from them, for example, by having already designed a queue of messages to be filtered when the filter is not immediately available.
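Such graceful degradation can be as simple as wrapping the call to the spam filter component and queuing comments on failure. The sketch below illustrates the idea in Python; the service URL, endpoint, response format, timeout, and queue mechanism are made up for this example.

# Sketch of graceful degradation when the spam-filter service is unavailable.
# The URL, endpoint, response format, and timeout are hypothetical.
import queue
import requests

pending_comments = queue.Queue()   # comments to re-check once the filter is back

def check_comment(comment_text: str) -> bool:
    """Returns True if the comment should be shown immediately."""
    try:
        response = requests.post("http://spam-filter:4040/predict",
                                 json={"text": comment_text}, timeout=0.5)
        return not response.json()["spam"]
    except requests.exceptions.RequestException:
        # Spam filter is down or slow: keep the system running, hide the
        # comment by default, and queue it for filtering once the service
        # recovers, preserving the availability of the overall system.
        pending_comments.put(comment_text)
        return False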
Service level objectives are typically negotiated and renegotiated between operators and other stakeholders, reasoning about constraints, priorities, and tradeoffs as usual, often after some research and experimentation in the design phase of a project (see chapter Thinking like a Software Architect). For example, the increased infrastructure cost to raise availability from 99.9 to 99.99 percent (i.e., from 9 hours to 52 minutes of acceptable downtime per year) in the blogging example would simply outweigh the benefit; even lower availability might be acceptable for this service if it allows the operations team to focus on more important goals. Operations is often portrayed as focusing on risk management rather than total risk avoidance.
In more formal arrangements, service level objectives may be codified as contracts called service level agreements, which typically include a description of consequences and penalties if the objectives are not met. They may become part of the interface documentation between components, such as the documentation of a model inference service discussed in chapter Deploying a Model.
Observability
Being responsible for keeping systems running and performing well with changing demands, operators need visibility into how the system is doing. This is often described as monitoring the system and broadly captured by the term observability.
Most monitoring infrastructure supports monitoring the status of hardware out of the box, such as CPU and memory use and uptime. Beyond that, operators usually want to observe more application-specific and component-specific information. At a minimum, a system will typically automatically collect and monitor data for the service level objectives, such as response time, throughput, and availability of the spam filter component. More sophisticated monitoring will observe the system at the more granular level of individual components or even internals of components, which can be useful to identify bottlenecks, unreliable components, and other problems during operation. For example, we might identify which other components call the spam filter how often and to what degree slow responses from the spam filter slow down other parts of the system.
Observing a system typically consists of three distinct steps: Instrumenting the system to produce telemetry signals that can be monitored, collecting and storing the telemetry signals, and analyzing the telemetry signals:
- Producing telemetry with instrumentation: In the simplest case, the application code simply includes print statements to write information into log files. For example, the spam filter component in our blogging scenario could print a log entry for each analyzed comment, including analysis time and predicted spam score. In addition, various libraries for software metrics and telemetry provide APIs to record counters and distributions in an efficient way, such as Prometheus and OpenTelemetry; our spam filter might simply increase a counter for each request and for each predicted spam message (see the instrumentation sketch after this list). Much infrastructure used in modern software systems already has built-in capabilities to log important events. For example, web servers and stream processing infrastructure will log many events out of the box that may provide useful insights. Similarly, most libraries used for remote procedure calls between microservices, such as Uber’s TChannel or Twitter’s Finagle, will record information about all calls that can later be used to analyze call structures between services; adopting such a library for all calls between microservices in our blogging scenario could produce a lot of useful telemetry almost for free.
- Collecting telemetry: Complex systems may create a lot of signals (log files, metrics, traces) in many forms and in many locations. In our blogging scenario, multiple instances of various components each export their own logs and metrics. Many infrastructure projects have been developed to collect and store such signals in a unified place for subsequent analysis. For example, Prometheus will regularly request updated metrics from all configured systems (pull design) and store them in a time-series database, whereas statsd acts as a server receiving metrics from systems producing them (push design); Apache Flume provides a system to collect and aggregate log files at scale from many sources; and LogStash provides features to ingest signals from many different sources at scale and process them before passing them on for storage or analysis to a different system. The focus in the collecting phase is typically on moving large amounts of telemetry data efficiently and with low latency from many sources to a place where it can be analyzed.
- Analyzing telemetry: Many different forms of analyses can be performed on the collected signals. Typically, information is extracted from the raw signals, such as computing the service level measure of average response time from a server log, just in time as new lines of the log file arrive. The most common forms of analysis are real-time trends of measures shown in dashboards and automated alerts that are triggered when the observed values exceed certain thresholds, such as calling a developer’s cell phone if the entire blogging service is unreachable to end users for more than 90 seconds. Grafana is a popular choice for creating dashboards and alert systems, but many other projects exist, often integrated with infrastructure for telemetry analysis. If trace information of system internals is collected, analyses can also show detailed information about which services frequently call which other services and where bottlenecks are, as discussed under performance monitoring in chapter Scaling the System.
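As an illustration of the instrumentation step, the sketch below shows how the spam filter component could produce telemetry with the Prometheus Python client; the metric names and the simplistic placeholder prediction function are made up for this example.

# Sketch of instrumenting the spam filter with the Prometheus Python client.
# Metric names and the placeholder prediction function are hypothetical.
from prometheus_client import Counter, Histogram, start_http_server

requests_total = Counter("spamfilter_requests_total",
                         "Total number of spam-filter requests")
spam_total = Counter("spamfilter_spam_total",
                     "Number of comments predicted to be spam")
latency_seconds = Histogram("spamfilter_latency_seconds",
                            "Time spent producing a prediction")

def predict_spam(comment_text: str) -> bool:
    return "viagra" in comment_text.lower()   # placeholder for real model inference

def handle_request(comment_text: str) -> bool:
    requests_total.inc()                      # count every request
    with latency_seconds.time():              # record prediction latency
        is_spam = predict_spam(comment_text)
    if is_spam:
        spam_total.inc()                      # count predicted spam messages
    return is_spam

# Expose all metrics on a port from which a Prometheus server can pull them.
start_http_server(9100)

A Prometheus server would then regularly scrape these metrics, and dashboards or alert rules can aggregate them into the service level measures discussed earlier.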
Modern systems to collect and analyze telemetry are flexible enough to allow adding metrics later and to scale when needed, but identifying early on what analyses should be supported and planning instrumentation accordingly will make downstream analysis steps much simpler. For example, observability can be baked into the system by building on libraries (e.g., for remote procedure calls) that support logging or tracing out of the box, whereas switching to such a library later might be a substantial undertaking. Investing in observability early in a project also fosters a culture of data-driven decision making and can help to detect potential problems even in the very early releases of a project.
A solid telemetry and monitoring system is usually a strong asset for many aspects of system development, not just for operators to identify outages and bottlenecks in production. Developers typically value insights into how systems perform in production and what features are used. For example, telemetry might show how many comments are filtered on average and whether comment spam associates with certain times of day or certain topics. Telemetry can be used to measure how the system achieves some of its objectives, such as monitoring how many users sign up for or renew their paid plan on the blogging platform. And telemetry is the key to test and experiment in production, both for ML and non-ML components, as we will discuss in chapter Quality Assurance in Production.
Release Management and Automating Deployments
Unscheduled patches and partial updates applied live in production to fix a problem are an operator’s nightmare. It is easy to break the system with a hotfix applied inconsistently or without sufficient testing. Deploying updates has long been a pain point during operations that caused conflicts between developers and operators.
Many different practices, often under the umbrella of DevOps and MLOps, help to reduce friction of deploying new releases, but they need to be planned and infrastructure needs to be set up for this.
Version control. It is standard these days to develop software in version control systems to track all changes. Version control ensures that files are in a consistent state and that changes can be rolled back to older revisions if needed. Version control also increasingly includes dependencies and configuration files, such as requirements.txt files describing versioned package dependencies in Python or Dockerfile files declaring how to package software. With the introduction of machine learning, there is also a push to version data, machine-learning pipelines, and models, as we will discuss in chapter Versioning, Provenance, and Reproducibility.
Feature branches and feature flags. At the level of version control systems, it is common to separate current development from releasable and tested code, for example, such that the latest revision of the main branch could be pushed into production at any time or such that releasable code is always in a release branch. Unfinished, unreviewed, or insufficiently tested development must hence be performed in a separate development branch or guarded by feature flags (if statements in the program that disable execution of the unfinished code; see chapter Quality Assurance in Production).
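As a minimal illustration, a feature flag is often little more than a configuration value checked before executing a new code path. The sketch below uses a made-up flag name, environment variables as the configuration source, and placeholder prediction functions; real systems typically read flags from a configuration service so that they can be changed without redeploying.

# Minimal feature-flag sketch; the flag name, configuration source, and the
# placeholder prediction functions are made up for illustration.
import os

def flag_enabled(name: str) -> bool:
    return os.environ.get(f"FLAG_{name}", "off") == "on"

def predict_spam_stable(comment: str) -> bool:
    return "viagra" in comment.lower()        # placeholder: deployed model

def predict_spam_experimental(comment: str) -> bool:
    return "viagra" in comment.lower()        # placeholder: model in development

def filter_comment(comment: str) -> bool:
    if flag_enabled("EXPERIMENTAL_SPAM_MODEL"):
        return predict_spam_experimental(comment)   # unfinished, guarded code path
    return predict_spam_stable(comment)             # current production path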
Test automation. Continuous integration infrastructure makes sure that all code changes integrated into the version control system can be built independently (not just on the developer’s machine), pass all tests, and pass all other automated quality checks (e.g., static analysis). This way, it is not possible to “forget” to include a dependency or “forget” to execute tests before a commit, even when in a rush. We will cover build and test automation more in chapter QA automation.
Automate deployments. Moving from continuous integration to continuous deployment aims to automate all steps from pushing code to a version control system to seeing it deployed on a server, such that no manual steps remain in the release process. For the spam filter component, this could include compiling the serving code, packaging it with the model in a container, running regression tests, copying it to a server from where it is pushed to a cloud service, and finally adjusting the network configuration to route requests to the new service revision. With automation, some organizations release new code within minutes of it being pushed to version control, providing developers with the satisfaction of seeing their work in production and fostering rapid feedback and experimentation. Beyond automated testing, continuous deployment typically involves automatically building deployable packages or containers and pushing them to production machines. Typically, canary release infrastructure (discussed in chapter Quality Assurance in Production) will gradually roll out releases, observe the system through telemetry, and make automated decisions about whether to continue the rollout or revert back to the last stable release. Recently, as discussed in chapter Deploying a Model, many MLOps tools, such as BentoML, Cortex, and many others, support automation of packaging and deployment of machine-learned models.
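To make the automated steps more tangible, the sketch below lists them as a small Python script calling the pytest and Docker command lines; the image name is made up, and a real setup would encode these steps in a CI/CD service with canary releases and automated rollback rather than a hand-written script.

# Sketch of the steps a continuous-deployment pipeline automates for the
# spam filter; the image name is hypothetical, and real pipelines run these
# steps in CI/CD infrastructure rather than in a plain script.
import subprocess

IMAGE = "registry.example.com/blog/spam-filter:latest"   # hypothetical registry

def run(command):
    print("+", " ".join(command))
    subprocess.run(command, check=True)    # abort the release if any step fails

run(["python", "-m", "pytest", "tests/"])       # run regression tests
run(["docker", "build", "-t", IMAGE, "."])      # package code and model as a container
run(["docker", "push", IMAGE])                  # publish the container to a registry
# The remaining steps (starting the new container in production, routing traffic
# to it, rolling back on failed health checks) would be handled by deployment
# and canary-release infrastructure.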
Infrastructure as Code and Virtualization
Installing software with many dependencies on a new machine can be challenging. In the past, operators have often complained that they failed to install a release that worked fine on the developer’s machine, due to unnoticed and undocumented dependencies or system configurations. It is even harder to ensure that hundreds or thousands of machines in a cluster or cloud deployment have consistent configurations, including the same version of the code, of the model, and of all libraries, but also the same configuration of service accounts, file permissions, network services, and firewalls. Planning deployment early on and preparing it with suitable infrastructure technologies can substantially reduce friction during operation, which again enables more frequent releases and thus more experimentation and more rapid feedback.
Virtualization. Virtualization technologies, especially modern lightweight containers like Docker, have significantly helped with reducing friction between developers and operators. Instead of operators deploying a binary to many machines and setting system configurations on each machine, developers now routinely set up, configure, and test their code in a container. In this process, they essentially install their code on a minimal fresh operating system and would immediately notice missing dependencies or incompatible versions. System configurations like file permissions are also set only once when creating the container. The operator can then take this container as a single large binary artifact and push copies to the machines that will run it, focusing on hardware allocation and networking rather than on installing libraries. All machines that execute a container share the same consistent environment within the container. Furthermore, the code in the container is entirely separated from other code that may be run on the same machine. Rolling back a release is also as easy as switching back to the previous container. Containers come at the cost of some small runtime overhead and some initial complexity for setting up the container.
As an example, consider the Dockerfile from chapter Deploying a Model, shown again below, which looks similar to what we would expect for serving the spam filter component in our blogging scenario. It specifies explicitly which version of Python to use, which dependencies to install in which version, which code and model files to deploy, how to start the code, and what network port to expose (all others are blocked). The operator does not need to figure out any of these tasks or manage version conflicts that could arise because other language or library versions are already installed on the system.
FROM python:3.8-buster
RUN pip install uwsgi==2.0.20
RUN pip install numpy==1.22.0
RUN pip install tensorflow==2.7.0
RUN pip install flask==2.0.2
RUN pip install gunicorn==20.1.0
COPY models/model.pf /model/
COPY ./serve.py /app/main.py
WORKDIR ./app
EXPOSE 4040
CMD ["gunicorn", "-b 0.0.0.0:4040", "main:app"]
Just as the application code and individual components are virtualized with containers, it is common to also virtualize other infrastructure components, such as databases, stream processing engines, or metrics dashboards. Many providers of such infrastructure components already offer ready-to-use containers.
Notice how with containers, developers take on more responsibilities of the deployment process. Rather than just providing the source code or binary of their program, they now set up a ready-to-use virtual image to run their code. We will explore the cultural implications on teamwork later, in chapter Learning from DevOps and MLOps Culture.
Infrastructure as code. To avoid repetitive manual system configuration work and potential inconsistencies from manual changes, there is a strong push toward automating configuration changes. That is, rather than remotely accessing a system to install a binary and add a line to a configuration file, an operator will write a script to perform these changes and then run the script on the target machine. Even if the configuration is applied only to a single system, having a script will enable setting up the system again in the future, without having to remember all manual steps done in the past. In addition, by automating the configuration changes, it is now possible to apply the same steps to many machines to configure them consistently, for example, to ensure that they all run the same version of Python and use the same firewall configuration.
In practice, rather than writing scripts from scratch, provisioning and configuration management software such as Ansible and Puppet provides abstractions for many common tasks. For example, see the Ansible script below to install and configure Docker, which can be executed uniformly on many machines.
- hosts: all
  become: true
  tasks:
    - name: Install aptitude using apt
      apt: name=aptitude state=latest update_cache=yes
    - name: Install required system packages
      apt: name={{ item }} state=latest update_cache=yes
      loop: [ 'apt-transport-https', 'ca-certificates', ...]
    - name: Add Docker GPG apt Key
      apt_key:
        url: https://download.docker.com/linux/ubuntu/gpg
        state: present
    - name: Add Docker Repository
      apt_repository:
        repo: deb https://... bionic stable
        state: present
    - name: Update apt and install docker-ce
      apt: update_cache=yes name=docker-ce state=latest
    - name: Install Docker Module for Python
      pip:
        name: docker
Excerpt of an Ansible script to install Docker and Docker bindings for Python on Ubuntu, from https://github.com/do-community/ansible-playbooks
Since the configuration process is fully scripted (often in domain-specific languages), this approach is named infrastructure as code. The scripts and all configuration values they set are also tracked in version control, just like the source code of the project.
With the increased adoption of containers, configuration scripts are usually used less for setting up applications and more for setting up the infrastructure in which the containers can be executed (including installing the container technology itself, as in the Ansible script above), for installing instrumentation for monitoring, and for configuring the network.
Orchestrating and Scaling Deployments
In many modern systems, implementations are deployed across many networked machines and scaled dynamically with changing demands, adding or removing machines as needed. In such settings, after virtualizing all the components of a system, operators still need to deploy the containers to specific hardware, make decisions about how to allocate container instances to machines, and route requests. The cloud-computing community has developed much infrastructure to ease and automate such tasks, which is collectively known as orchestration.
The best-known tool in this space right now is Kubernetes, a container orchestration tool. Given a set of containers and machines, the central Kubernetes management component allocates containers to machines, automatically launching and killing container instances based on either (a) operator commands or (b) automated tools that scale services based on load. Kubernetes also automates various other tasks, such as restarting unresponsive containers, routing and load-balancing requests, and managing updates of containers. In our blogging scenario, Kubernetes would be a good match to deploy the various microservices and scale them when spikes of traffic hit; for example, when a blog post goes viral, it can rapidly scale the web server capacity without needing to also scale the spam detector component.
To make automated decisions, such as when to launch more instances of a container, orchestration software like Kubernetes typically integrates directly with monitoring infrastructure. It also integrates with the various cloud service providers to request virtual machines or other services as needed. Other approaches to cloud computing like serverless functions delegate the entire scaling of a service in a container to the cloud infrastructure.
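As a toy illustration of how such automated decisions connect monitoring and orchestration, the sketch below adjusts the number of spam filter instances based on an observed request rate; the metric source, thresholds, and deployment name are made up, and in practice Kubernetes’ built-in autoscaling performs this kind of logic.

# Toy autoscaler sketch: scale the spam-filter deployment based on load.
# The metric source, thresholds, and deployment name are hypothetical;
# production systems would rely on Kubernetes' own autoscaling instead.
import subprocess
import time

def current_requests_per_second() -> float:
    return 120.0   # placeholder: would query the monitoring system (e.g., Prometheus)

def scale_spam_filter(replicas: int) -> None:
    subprocess.run(["kubectl", "scale", "deployment/spam-filter",
                    f"--replicas={replicas}"], check=True)

while True:
    load = current_requests_per_second()
    # Assume each instance comfortably handles about 50 requests per second.
    scale_spam_filter(max(1, round(load / 50)))
    time.sleep(60)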
This kind of orchestration and auto-scaling infrastructure is also commonly used for large data processing and machine-learning jobs, where Kubernetes, tools built on Kubernetes like Kubeflow, and various competitors all offer solutions to automatically allocate resources for large model training jobs and to elastically scale model inference services.
Of course, many modern tools that decide when and how to reconfigure the system may themselves rely on machine learning and artificial intelligence components, learning to scale the system just at the right time or optimizing the system to achieve service level objectives at the lowest cost. Earlier work was often discussed under the terms auto-scaling or self-adaptive systems; more recently, the term AIOps has been fashionable for systems using machine learning for automated decisions during operations.
Orchestration services come with substantial complexity and a steep learning curve. Most organizations benefit from delaying adoption until they really need to scale across many machines and need to frequently allocate and reallocate resources to various containers. Commercial cloud offerings often hide a lot of this complexity, at the cost of buying into their specific infrastructure. While much investment into orchestration may be delayed until scale is really needed, designing the system architecture such that it can be scaled with more machines later, as discussed in chapter Scaling the System, will ease a migration to more sophisticated container and orchestration infrastructure once needed.
Elevating Data Engineering
Many modern systems move and store vast amounts of data; this is particularly common for ML-enabled systems, where data is often central. Handling large amounts of data places particular emphasis on data storage and data processing systems (discussed in chapter Scaling the System).
Data engineers are specialists with expertise in big data systems, in the extraction of information from data from many sources, and in the movement of data through different systems. Their role becomes increasingly important: data engineers contribute not only to the initial system development and to building data processing pipelines for machine-learning tasks, but usually also to the continuous operation of the system, as data drifts and both the rate of new data and the total amount of data continuously grow. For example, with large jobs that move large amounts of data into and out of data lakes, failures and inconsistencies become more likely. Understanding the various data processing tools, data quality assurance tools, and data monitoring tools requires a substantial investment, but adopting such tools provides valuable infrastructure in the long run. Many tools are established and emerging in this field, such as EvidentlyAI for data drift detection and Datafold for data regression testing. In our scenario, these kinds of tools might help data engineers detect when the blogging system is overwhelmed with a new kind of spam strategy or when a minor change to data cleaning code in the ML pipeline accidentally discards 90 percent of the spam detector’s training data.
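To give a flavor of the latter, a very simple automated check, sketched below with made-up file and column names and a placeholder cleaning step, could catch such a data-cleaning regression by comparing row counts before and after cleaning; dedicated data validation and data monitoring tools provide far richer versions of such checks.

# Sketch of a simple data sanity check in an ML pipeline; file name, column
# name, cleaning step, and threshold are made up for illustration.
import pandas as pd

def clean_training_data(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder cleaning step: drop rows with missing or empty comment text.
    return df[df["text"].notna() & (df["text"].str.strip() != "")]

raw = pd.read_csv("comments_labeled.csv")   # training data before cleaning
cleaned = clean_training_data(raw)

# Alert if cleaning discards a suspiciously large fraction of the data,
# for example, because of a recently introduced bug in the cleaning code.
dropped_fraction = 1 - len(cleaned) / len(raw)
if dropped_fraction > 0.2:
    raise ValueError(f"Cleaning dropped {dropped_fraction:.0%} of the training data")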
Incident Response Planning
Even the best systems will run into problems in production at some point. Even with anticipating mistakes and planning safety mechanisms, machine-learned models will surprise us eventually and may lead to bad outcomes. While we hope to never face a serious security breach, cause unsafe behavior (ranging from stress to bodily injury to loss of life), or receive bad press for discriminatory decisions made by our spam filter, it is prudent to plan how to respond to bad events. Such plans are usually called incident response plans.
While we may not foresee the specific problem, we can anticipate that certain kinds of problems may happen, especially with unreliable predictions from machine-learned components. For each anticipated class of problems, we can then think about possible steps to take in responding when a problem occurs. The steps vary widely with the system and problems, but examples include:
- Prepare channels to identify problems. This can include offering an explicit contact address inside and outside of the organization, as well as proactive monitoring and anomaly detection of the system. For example, give users who discover discriminatory spam filtering decisions a way to reach your data scientists directly or indirectly.
- Have experts on call. Prepare to have somebody with technical knowledge of the relevant system on call and prepare a list of experts for various parts of the system to contact in emergency situations. For example, know who to reach if the system is overwhelmed with spam after a broken update; have contact information of a security specialist if suspecting a breach; report all safety concerns to a compliance office in the organization.
- Design processes and infrastructure for anticipated problems. Plan what process to use in case of a problem before the problem occurs and develop technical infrastructure to help mitigate the problem. For example, track used dependencies and their versions to quickly identify whether parts of a system are affected when a new vulnerability is detected in an open source project and make it a priority for developers to patch or replace the dependency. This may include functionality for short-term and long-term containment, such as mechanisms to shut down the system or specific features of the system quickly. For example, turn off the spam filter without taking down the entire system if anomalous or data-corrupting behavior is detected.
- Prepare for recovery. Design processes to restore previous states of a system, such as rolling back updates or restoring from backups. For example, keep a backup of the history of blog posts and comments in case a bug in the editor destroys user data and prepare for the ability to rerun the spam filter on recent comments after rolling back a problematic model.
- Proactively collect telemetry. If telemetry is available, it can often be used to later diagnose the root of a problem. The famous black box of airplanes is an example of this; in our blogging scenario having a log of predictions of the model can help to identify when and why a specific model performed poorly. It is advisable to store a copy of telemetry in a location separated from the system in operation.
- Investigate incidents. Have a process to follow up on incidents, discover the root cause, learn from the incident, and implement changes to avoid future incidents of the same kind.
- Plan public communication. Plan who is responsible for communicating with the public about the incident. For example, should operators maintain a status page or should the CEO or lawyers be involved before posting about the incident on Twitter?
While many aspects of incident response planning relate to processes, many can be supported with technical design decisions, such as telemetry design, versioning, and robust and low-latency deployment pipelines. Automation can be key to reacting quickly, though unfortunately mistakes in automation can also be the cause of the incident.
DevOps and MLOps Principles
DevOps and MLOps are labels applied to the wide range of practices and tools discussed in this chapter. DevOps has a long tradition of bringing developers and operators together to improve the operations of software products. MLOps is the more recent variant of DevOps that focuses on operating machine-learning pipelines and deploying models.
Neither DevOps nor MLOps have a single widely agreed-upon definition, but they can be characterized by a set of common principles designed to reduce friction in workflows:
- Consider the entire process and toolchain holistically. Rather than focusing on local development and individual tools, consider the entire process of creating, deploying, and operating software. Dismantle barriers between the process steps. For example, containers remove friction in deployment steps between development and operations, and data versioning tools track changes to training data within machine-learning pipelines.
- Automation, automation, automation. Avoid error-prone and repetitive manual steps and instead automate everything that can be automated. Automation targets everything, including packaging software and models, configuring firewalls, data drift detection, and hyperparameter tuning. Automation increases reliability, enhances reproducibility, and improves the time to deploy updates.
- Elastic infrastructure. Invest in infrastructure that can be scaled if needed. For example, design a system architecture with flexible and loosely coupled components that can be scaled independently as in our blogging platform scenario.
- Document, test, and version everything. Document, test, and version not only code, but also scripts, configurations, data, data transformations, models, and all other parts of system development. Support rapid rollback if needed. For example, test the latency of models before deployment and test the deployment scripts themselves.
- Iterate and release frequently. Work in small incremental iterations, release frequently and continuously to get rapid feedback on new developments and enable frequent experimentation. For example, experiment multiple times a day with different spam-detection prompts rather than preparing a big release once a month.
- Emphasize observability. Monitor all aspects of system development, deployment, and operation to make data-driven decisions, possibly automatically. For example, automatically stop and roll back a model deployment if user complaints about spam suddenly rise.
- Shared goals and responsibilities. The various techniques aim to break down the barriers between disciplines such as developers, operators, data scientists, and data engineers, giving them shared infrastructure and tooling, shared responsibilities, and shared goals. For example, get data scientists and operators to share responsibility for successful and frequent deployment of models.
A recent interview study of MLOps practitioners characterized their primary goals as Velocity, Validation, and Versioning: Enabling quick prototyping and iteration through automation (velocity), automating testing and monitoring and getting rapid feedback on models (validation), and managing multiple versions of models and data for recovering from problems (versioning).
Discussions of DevOps and MLOps often focus heavily on the tooling that automates various tasks, and we list some tools below. However, beyond tools, DevOps and MLOps are often a means to change the culture in an organization toward more collaboration; in this context, tools are supportive and enabling but secondary to process. We return to the cultural aspect and how DevOps can be a role model for interdisciplinary collaboration in chapter Interdisciplinary Teams.
Also notice that DevOps and MLOps are not necessarily a good fit for every single project. They are beneficial particularly in projects that are deployed on the web, that evolve quickly, and that depend heavily on continuous experimentation. But as usual, there are tradeoffs: Infrastructure needs to be designed and tailored to the needs of a specific project and the same techniques that benefit the operation of a massively distributed web service are likely not a fit for developing autonomous trains or systems with models packaged inside mobile apps. Seriously considering a system’s quality requirements (not just those of the ML components) and planning for operation already during the system design phase can pay off when the system is deployed and continues to evolve.
DevOps and MLOps Tooling
There is a crowded and evolving market for DevOps and MLOps tooling. At the time of writing, there is a constant stream of new MLOps tools with many competing open-source and commercial solutions and constant efforts to establish new terms like LLMOps. Here, we can only provide an overview of the kind of tools on the market.
For DevOps, the market is more established. Many well-known development tools serve important roles in a DevOps context. The following kinds of tools are usually associated with DevOps:
- Communication and collaboration: Many tools for issue tracking, communication, and knowledge sharing are pervasively used to communicate within and across teams, such as Jira, Trello, Slack, wikis, and Read The Docs.
- Version control: While there are many different version control systems, for source code, git is dominant and GitHub and Bitbucket are popular hosting sites.
- Build and test automation: Build tools like make, maven, and Rake automate build steps. Test frameworks and runners like JUnit, JMeter, and Selenium automate various forms of tests. Package managers like packr and pip bundle build results as executable packages, and for building containers, Docker dominates the market.
- Continuous integration and continuous deployment (CI/CD): The market for CI/CD tools that automate build, test, and deployment pipelines is vast. Jenkins is a popular open-source project for self-hosted installations, various commercial solutions such as TeamCity, JFrog, and Bamboo compete in this space, and numerous companies specialize in hosting build and deployment pipelines as a service, including CircleCI, GitHub Actions, and Azure DevOps.
- Infrastructure as code: Multiple tools automate configuration management tasks across large numbers of machines with scripts and declarative specifications, such as Ansible, Puppet, Chef, and Terraform.
- Infrastructure as a service: Many cloud providers provide tooling and APIs to launch virtual machines and configure resources. This is a vast and competitive market with well known providers like Amazon’s AWS, Google’s Cloud Platform, Microsoft’s Azure, and Salesforce’s Heroku. Infrastructure like Openstack provides open-source alternatives.
- Orchestration: In recent years, Kubernetes and related tools have dominated the market for orchestrating services across many machines. Docker Swarm and Apache Mesos are notable competitors, and cloud providers often have custom solutions, such as AWS Elastic Beanstalk.
- Measurement and monitoring: The market for observability tools has exploded recently, and different tools specialize in different tasks. Logstash, Fluentd, and Loki are examples of tools that collect and aggregate log files, often connected to search and analysis engines. Prometheus, statsd, Graphite, and InfluxDB are examples of time series databases and collection services for measurements. Grafana and Kibana are common for building dashboards on top of such data sources. Many companies specialize in providing end-to-end observability solutions as a service that integrate many such functions, including Datadog, NewRelic, AppDynamics, and Dynatrace.
For MLOps, we see a quickly evolving tooling market that addresses many additional operations challenges, many of which we discuss in more detail in other chapters. MLOps covers many forms of tooling that explicitly address the collaboration across multiple teams, but also many that focus on providing scalable and robust solutions for standard machine learning tasks.
- Pipeline and workflow automation: A large number of tools provide scalable infrastructure to define and automate pipelines and workflows, such as Apache Airflow, Luigi, Kubeflow, Argo, Semantic, Ploomber, and DVC (see chapter Automating the Pipeline). This can make it easier for data scientists to automate common tasks at scale.
- Data management: Many different kinds of tools specialize in managing and processing data in machine-learning systems. Data storage solutions include traditional databases but also various data lake engines and data versioning solutions, such as lakeFS, Delta Lake, Quilt, Git LFS, and DVC. Data catalogs, such as DataHub, CKAN, and Apache Atlas, index metadata to keep an overview of many data sources and enable teams to share datasets. Data validation tools, such as Great Expectations, DeepChecks, and TFDV, enforce schemas and check for outliers; drift detection libraries like Alibi Detect find distribution changes over time. All these tools help to negotiate sharing large amounts of data, data quality, and data versioning across teams (see also chapters Scaling the System, Data Quality, and Versioning, Provenance, and Reproducibility).
- Automated and scalable model training: Many tools and platforms address scalability and automation in data transformation and model training steps, such as dask, Ray, and most machine-learning libraries like TensorFlow. Many hyperparameter-optimization and autoML tools attempt to make model training easier by (semi-)automating steps like feature engineering, model selection, and hyperparameter selection with automated search strategies, such as hyperopt, talos, autoKeras, and NNI. These tools help scale the ML pipeline and make machine learning more accessible to non-experts.
- Fine-tuning and prompt engineering: For foundation models, tools increasingly support fine-tuning, prompt engineering, building prompt pipelines, optimizing prompts, and securing prompts. For example, LangChain is popular for sequencing prompts and reusing templates, DSPy abstracts prompt details and supports automated prompt optimization, PredictionGuard abstracts from specific language models and postprocesses their output, and Lakera Guard is one of many tools detecting prompt injection.
- Experiment management, model registry, model versioning, and model metadata: To version models together with evaluation results and other metadata, several tools provide central platforms, such as MLFlow, Neptune, ModelDB, and Weights and Biases. These tools help teams navigate many experiments and model versions, for example, when deciding which model to deploy (see chapter Versioning, Provenance, and Reproducibility).
- Model packaging, deployment, and serving: Making it easier to create high-performance model inference services, several tools, including BentoML, TensorFlow Serving, and Cortex, automate various packaging and deployment tasks, such as packaging models into services, providing APIs in various formats, managing load balancing, and even automating deployment to popular cloud platforms (see chapter Deploying a Model). With automation, such tools can reduce the need for explicit coordination between data scientists and operators.
- Model and data monitoring: Beyond general-purpose DevOps observability tooling, several tools specialize in model and data monitoring. For example, Fiddler, Hydrosphere, Aporia, and Arize are all positioned as ML observability platforms with a focus on detecting anomalies and drift, alerting for model degradation in production, and debugging underperforming input distributions (see chapters Data Quality and Quality Assurance in Production).
- Feature store: Feature stores have emerged as a class of tools to share and discover feature engineering code across teams and serve feature data efficiently. Popular tools include Feast and Tecton.
- Integrated machine-learning platform: Many organizations integrate many different ML and MLOps tools in a cohesive platform, typically offered as a service. For example, AWS’s SageMaker offers integrated support for model training, feature store, pipeline automation, model deployment, and model monitoring within the AWS ecosystem. Some of these integrated platforms, such as H2O, specialize in easily accessible user interfaces that require little or no coding skills. Many other organizations offer integrated platforms, including Databricks, Azure Machine Learning, and Valohai.
- ML developer tools: Many tools aimed at the core work of data scientists are also often mentioned under the label MLOps, even if they do not necessarily relate to operations, such as data visualization, computational notebooks and other editors, no-code ML frameworks, differential privacy learning, model compression, model explainability and debugging, and data labeling. Many of these tools try to make data-science tasks easier and more accessible and to integrate them into routine workflows.
Summary
Operating a system reliably, at scale, and through regular updates is challenging, but a vast amount of accumulated experience and infrastructure can help to reduce friction. Just as with all other architecture and design decisions, some planning and upfront investment that is tailored to the needs of the specific system can ease operations. At a certain scale and velocity, it is likely a good idea to bring in experts, but an understanding of key ideas such as observability, release management, and virtualization can help to establish a shared understanding and awareness of the needs of others in the project. Machine learning in software projects creates extra uncertainty, extra challenges, and extra tools during operations, but does not fundamentally change the key strategies of how different team members can together design the system for reliable operation and updates.
Further Reading
- Standard book for operators and engineers on how to deploy systems reliably and at scale based on experience at Google: Beyer, Betsy, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site reliability engineering: How Google runs production systems. O’Reilly, 2016. See also Google’s general portal for Site Reliability Engineering with many additional and more recent resources.
- A popular introduction to DevOps and the DevOps mindset: Kim, Gene, Jez Humble, Patrick Debois, John Willis, and Nicole Forsgren. The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations. IT Revolution, 2nd ed, 2021. The book The Phoenix Project provides a fictional story that conveys many of these principles in an engaging way: Kim, Gene, Kevin Behr, and George Spafford. The Phoenix Project. IT Revolution, 2014.
- The MLOps community has developed a vast amount of tools and infrastructure for easy deployment of machine-learned models. For a good introduction see https://ml-ops.org/ and https://github.com/visenger/awesome-mlops and https://github.com/kelvins/awesome-mlops
- A book diving deep into MLOps specifically, including concrete case studies: Treveil, Mark, Nicolas Omont, Clément Stenac, Kenji Lefevre, Du Phan, Joachim Zentici, Adrien Lavoillotte, Makoto Miyazaki, and Lynn Heidmann. Introducing MLOps: How to Scale Machine Learning in the Enterprise. O’Reilly, 2020.
- Book covering, among others, some MLOps patterns for deploying and updating models: Lakshmanan, Valliappa, Sara Robinson, and Michael Munn. Machine Learning Design Patterns. O’Reilly Media, 2020.
- Many books cover specific technologies, such as the dozens of books each on Docker, Puppet, Prometheus, Kubernetes, and KubeFlow.
- An argument to focus on observability as a much more important quality in ML-enabled systems: Lewis, Grace A., Ipek Ozkaya, and Xiwei Xu. “Software Architecture Challenges for ML Systems.” In 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 634–638. IEEE, 2021.
As with all chapters, this text is released under the Creative Commons BY-SA 4.0 license.