Quality Assurance in Production for ML-Enabled Systems

Quality assurance of machine-learning models is difficult. The standard approach is to evaluate prediction accuracy on held-out data, but this is prone to many problems, including testing on data that’s not representative of production data, overfitting due to different forms of leakage, and not capturing data drift. Instead, many developers of systems with ML components focus on testing in production. — This post is a summary of the “quality assessment in production” lecture of our Machine Learning in Production course; for other chapters see the table of contents.

Testing in Production — A Brief History

Testing in production is not a new idea and has been used in systems without machine learning components for a long time. Testing on site will only get you so far, whether it’s developers writing unit tests, or testers clicking through an application, or invited users evaluating usability in the lab. Testing will never guarantee correctness (it can only show the presence of bugs, not their absence, as stated in a famous quote by Dijkstra), and hence we will never be able to test all possible things that a user might do in production.

Alpha testing and beta testing were traditionally common ways of getting users to test a product for you at very early stages (alpha testing) and near-release stages (beta testing). Companies would recruit hundreds or thousands of volunteers who would use early versions of the product and hopefully report bugs and answer surveys. For example, 50,000 beta testers helped with Windows 95, many of whom actually paid $20 for the privilege.

Note that in pre-Internet days, beta testers would have to phone or mail bug reports or survey answers. With internet connectivity, companies slowly started to put telemetry into their beta versions and finally also their main products, that is, integrate mechanisms to automatically send information about how the application is used to the company’s servers.

The first big telemetry step was probably crash reports, where the system would send information about crashes (e.g., stack traces) home to the company, so that developers would gain insights into bugs that occur in production. Companies like Microsoft eventually received so many crash reports that they invested in research to automatically deduplicate and triage them.

Soon, many products started collecting telemetry not only about crashes but about all kinds of usage behavior, e.g., what features users use or how much time they spend on certain screens. To developers and product managers, such information about user behavior is exciting: it can help identify where users struggle or which features they benefit from most, and can help make the products better. A whole market developed for tools to instrument programs and apps with telemetry and to analyze telemetry data. For desktop applications, most companies allow you to opt out of sending telemetry.

As products move online into the cloud, lots of usage data is collected directly on the server side. Server logs can provide huge amounts of insights into what users are doing, without sending data from a client and without the corresponding pesky consent forms of desktop applications (instead you have privacy policies that nobody reads).

Two different layouts of a website shown to different potential customers. Customers who have seen the second version were more likely to buy the product. Example from https://designforfounders.com/ab-testing-examples/

Once engineers figured out how to observe user behavior, they started experimenting on users. They started giving different users slightly different versions of the product to see whether users behave differently (e.g., buy more stuff) or whether a version crashes more frequently. Several forms of experiments emerged, most prominently canary releases and A/B tests. Canary releases provide a new release to an initially small subset of users to observe whether the new release is stable (based on crash-report telemetry), before releasing it incrementally to more and more users. A/B tests show different variants of a product to different users (e.g., different layouts or font sizes) to see whether user behavior (from usage-data telemetry) differs between the user groups. Companies with many users often run hundreds of experiments in parallel and can detect even small effects of design and implementation changes on users’ chances of clicking on things or buying things.

A final, maybe most daring step of testing in production, is chaos engineering. Chaos engineering is primarily a form of testing robustness of distributed systems, such as the AWS-based cloud infrastructure of Netflix where the concept was first introduced. To see how the whole system reacts to specific outages, an automated tool named Chaos Monkey would deliberately take down servers and observe through telemetry whether overall system stability or user experience was affected (e.g., how many users’ Netflix stream was interrupted). While it may initially seem crazy to deliberately inject faults into a production system, this strategy instills a culture in developer teams to anticipate faults and write more robust software (e.g., error recovery, failover mechanisms), and it allows robustness testing of distributed systems at a scale and realism that is difficult to mirror in a test environment.

Assessing Model Quality in Production

Production data is the ultimate unseen test data. It also obviously does not suffer from a potential mismatch between the distributions of test and production data. Offline evaluations are always prone to overfitting, due to repeated evaluations or leakage between training and evaluation data (see the model quality lecture), and we have to worry whether our validation data actually represents usage data in production. It is also dangerous to evaluate on test data from a distribution that no longer matches the distribution observed in production, say due to sampling biases or data drift. By testing in production, we avoid these issues. With production data, we can get a more accurate picture of whether our model generalizes and how well it does in production — at least how well it did in production so far.

While it is easy to simply log all data that goes into a model and all the predictions the model makes, the key question for measuring some notion of prediction accuracy is how we know what the right prediction would have been. That is, what data can we log or collect to assess whether a model’s prediction was right? This is where we need to be creative.

The key problem here is to collect some telemetry information that would give us insights into the correctness of predictions. There is no single recipe, but there are common patterns:

  • Wait and see: In many situations where we predict the future (weather, tomorrow’s flight ticket price, sales price of a home), we can simply wait and record the right answer later. If we then compare the right answer with our prediction, we can compute a prediction accuracy score. The problem is that our accuracy results will not be available until later, but that may be okay for many purposes. One also needs to be careful if the prediction might influence the outcome — it is unlikely that a model’s weather prediction will influence the weather, but predicting rising ticket prices may lead more people to buy tickets thus raising the prices even more.
  • Ask users: Just ask users whether the prediction was correct. For example, an ML model for tagging friends in images on a social media platform can simply ask whether the tags are correct and whether people are missing. Users might be willing to provide the labels for you and tag the missing friends. Of course you probably do not want to ask users for every single prediction, but if you have enough users, even just asking in 1 of 1000 predictions might give you a lot of labeled data to work with.
  • Crowd-source labeling: If you don’t want to ask your users, you can always ask others to look at a prediction and judge whether it was right or wrong. Just as crowd workers can provide labels for your training data, you can ask crowd workers to label a sample of production data and use that for evaluation. This should work for all settings where crowd workers can label training data in the first place, but it can be expensive to continuously label samples of production data over extended periods. In some contexts, most commonly discussed in testing self-driving cars, shadow execution may be an opportunity too: Here you deploy the system without actively using predictions of the model, but instead let humans operate as they traditionally would. By recording what humans did, you can compare to what degree the model would have predicted the same action.
  • Allow users to complain: You may not want to bug users by asking how you are doing, but you may give users an easy way to complain. You won’t learn about every single problem and you will never get positive feedback, but in the grand scheme of things you may be able to observe whether the rate of problem reports decreases after a model update.
  • Observe user reactions: And finally, there are many ways to learn about the quality of your predictions simply from observing how users react to them. For example, if users (without being prompted) tag or remove additional friends in their photos, this is a sign that the model missed something. If users make two very similar requests, it might indicate that they weren’t happy with the first prediction (e.g., speech recognition). If users don’t click on the recommended movies, maybe your recommendations weren’t that great. There are lots of possibilities here to infer something about the quality of the predictions from how users react to them, but what data to collect depends a lot on how users interact with the system.
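The “wait and see” pattern from the list above can be sketched in a few lines: log predictions now, join them with the true outcomes once those arrive, and score only the matched pairs. All names and the dict-based “storage” below are made up for illustration:

```python
# Sketch of the "wait and see" telemetry pattern (hypothetical names).
predictions = {}  # request_id -> predicted value
outcomes = {}     # request_id -> actual value, recorded later

def log_prediction(request_id, predicted_value):
    predictions[request_id] = predicted_value

def log_outcome(request_id, actual_value):
    outcomes[request_id] = actual_value

def accuracy_so_far():
    # only predictions whose outcome has already arrived can be scored
    scored = [(pred, outcomes[rid]) for rid, pred in predictions.items()
              if rid in outcomes]
    if not scored:
        return None
    return sum(pred == actual for pred, actual in scored) / len(scored)

log_prediction("r1", "rain")
log_prediction("r2", "sun")
log_outcome("r1", "rain")    # the outcome for r2 has not arrived yet
print(accuracy_so_far())     # 1.0, based only on r1
```

Note how the accuracy measure is always delayed and always partial: requests whose outcomes have not arrived yet simply cannot be scored.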

Importantly, you may design your system in a way that makes it more likely that users interact with the system in such a way that you can learn something meaningful from it. My favorite example is the transcription service Temi.com, which transcribes audio files: They could have just offered .docx files with the transcription for download, but they instead implemented a very useful user interface in which one can read the transcript and listen to the corresponding audio snippets (audio and text are synced). This keeps users on the site and encourages them to make corrections to the transcript in the text editor on the site, rather than in an external text editor. Highlighting words the system is less sure about further encourages users to stay on the site and look at and revise possibly problematic parts. When users correct the transcript, they provide very fine-grained information about which words were incorrectly transcribed and what the correct transcription would have been — valuable telemetry information without ever asking the user explicitly or paying a crowd worker.

Notice that different scenarios and different telemetry designs can yield insights with varying qualities. In some cases it’s really easy to get very accurate labels on production data, in some cases we only get information about mistakes (i.e., can approximate precision, but not recall), sometimes we get labels and thus accuracy results only delayed, often we only evaluate samples, and sometimes we only get weak proxies that are hopefully correlated with model quality. However, even with samples and proxies, we can often think of ways to at least approximate some recall and precision measure, and even if it is not directly comparable to accuracy measures evaluated offline on held-out test data, it may still provide insights as part of trend analysis or A/B experiments.

The machine learning flywheel (graphic by CBInsights, from “The Data Flywheel: How Enlightened Self-Interest Drives Data Network Effects”)

Aside: Data that we are collecting at runtime is not only useful for assessing model quality and debugging, it can also often be used as training data for future versions of the model. This is often an important part of what’s called the machine learning flywheel: If we build better products, we get more users; if we get more users, we can collect more production data; if we have more production data, we can build better models; with better models, our products get better, and the cycle repeats.

Assessing System Quality/Goals in Production

An ML model is just a component in a system that contributes toward some larger system or business goals (“key performance indicators” in management speak). In many settings it may be difficult to get good telemetry that provides direct insights into the correctness of an ML model component, but it may be easier to collect data on how the system is doing overall. Understanding how the system performs overall and contributes to the business goals is usually also the more important measure, and one that essentially must be assessed in production.

It is of course useful to be clear about what the system goals are and what leading indicators or user outcomes are worth observing (see goals lecture). Many of them can be quantified with suitable telemetry easily, such as: (1) how many new users sign up for the service every week and how many stop using it, (2) how many products are we selling/what’s our daily revenue, (3) how many ads are users clicking on, (4) how much time do users spend on our page, or (5) what’s the average user rating in reviews for our app. Many of these kinds of metrics can be collected from databases or logs or with simple extra instrumentation of the apps. Note that these system/business metrics can be measured entirely without having to decide whether a ML model inside the system was accurate or not.

For many companies, optimizing business metrics is the main goal; improving machine learning models is just a means to improving business goals. Read, for example, this paper from Booking.com that discusses how they develop various ML models just to get people to book more hotels with them. It doesn’t really matter whether those models are particularly accurate, as long as they contribute to sales. They even discuss several cases where more accurate ML models hurt sales, for example, speculating that too-good predictions are creepy to customers (uncanny valley effect). This is another reason why testing in production is so important.

Designing Telemetry

To measure any quality in production, whether it is model accuracy or business goals, we need three ingredients:

  1. Metric: First, we need to define the quality metric of interest. Prediction accuracy? Recall? Revenue per day? Average time spent on page?
  2. Data collection: Next, we need to identify what data we can collect in production. User activity from log files? Number of complaints through a button in the app? Number of times users ask for multiple predictions in quick succession? Actual ticket prices a week later? We also need to decide whether we are collecting all data or aggregated data, and whether we are sampling in some form.
  3. Operationalization: Finally, we need a description of how we compute the goal metric from the collected data. For example, how we measure the time spent on the page based on log files. Operationalization is sometimes straightforward (e.g., collecting revenue from sales logs) and sometimes highly nontrivial (e.g., how we distinguish retry attempts indicating poor predictions from users simply trying predictions for different inputs).

It is not always possible to directly collect the metric we would most like to measure, so often we need to resort to proxy metrics. Still, one should be precise about defining the metric actually measured, and then be aware when interpreting it that it is just a proxy for the actual quality of interest.

As an example, let’s say we want to see how well our transcription service transcribes an audio file into text. We likely won’t have the ground truth of the actually correct transcription, so computing traditional accuracy measures like the word error rate will be infeasible in production. Instead, we can come up with the proxy measures of “average star rating given by users” and “average percentage of words corrected by users who make any corrections” (note that we may want to exclude users who do not engage with correcting their transcript at all, rather than assuming that they are perfectly satisfied). As telemetry we would collect star ratings and all word corrections done in the user interface described above, where we count any sequence of added or removed characters in any location of the document as a correction. We probably do not need to sample, but can just collect all ratings and all corrections without overwhelming our servers. With this data, it is then relatively straightforward to operationalize the metrics: For the average star rating, we could use a sliding-window approach and compute the average of all stars given in any 24-hour time window, or always the average of the last 100 star ratings. For the corrections, we could count corrections and the total number of words by document and report the average ratio of corrections to words across all documents with non-zero corrections, again for any 24-hour time window. There is significant flexibility in how to design these metrics and telemetry. For example, we could improve the above approach by tracking which parts of the document a user has looked at and counting only corrections in those parts.
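A minimal sketch of operationalizing these two proxy metrics from raw telemetry events (the event schema and all names are made up):

```python
from datetime import datetime, timedelta

# Hypothetical telemetry stores for the transcription-service example.
star_ratings = []   # list of (timestamp, stars) telemetry events
documents = {}      # doc_id -> {"words": total words, "corrections": count}

def avg_stars_last_24h(now):
    # sliding 24-hour window over all star-rating events
    cutoff = now - timedelta(hours=24)
    recent = [stars for ts, stars in star_ratings if ts >= cutoff]
    return sum(recent) / len(recent) if recent else None

def avg_correction_ratio():
    # exclude documents with no corrections at all, rather than assuming
    # those users were perfectly satisfied
    corrected = [d for d in documents.values() if d["corrections"] > 0]
    if not corrected:
        return None
    return sum(d["corrections"] / d["words"] for d in corrected) / len(corrected)

now = datetime(2024, 1, 2, 12, 0)
star_ratings = [(now - timedelta(hours=1), 5),
                (now - timedelta(hours=3), 4),
                (now - timedelta(hours=30), 1)]   # outside the window
documents = {"d1": {"words": 200, "corrections": 10},
             "d2": {"words": 100, "corrections": 0}}
print(avg_stars_last_24h(now))   # 4.5 (the 30-hour-old rating is excluded)
print(avg_correction_ratio())    # 0.05 (d2 excluded: no corrections)
```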

Identifying what data to collect to measure quality in the first place can be a significant challenge that may require a lot of creativity. Beyond that, there are many more engineering challenges…

First of all, the amount of data produced in production can be huge, and it is not clear that all of it should be stored for telemetry. While in some cases storing logs and model inputs of all predictions is feasible, in other cases (especially with audio and video) the amount of data can quickly become too large. For example, a video chat system with an ML model to create virtual backgrounds may not want to store recordings of all video chats on the platform, just to possibly later evaluate the quality of the virtual backgrounds. To reduce the amount of data, one typically either has to sample and collect only a subset of the data or extract and store only summaries (i.e., store extracted features rather than raw data). Sampling can be quite sophisticated when the target events are rare or when adaptive targeting is needed to focus on specific problematic subpopulations.
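Two common sampling strategies can be sketched as follows (a minimal illustration with made-up names; real telemetry pipelines are more involved):

```python
import random

stored = []

def sample_event(event, rate=0.01):
    # simple random sampling: keep roughly `rate` of all events
    if random.random() < rate:
        stored.append(event)

def reservoir_sample(stream, k):
    # keep a uniform random sample of k items from a stream of unknown
    # length (Algorithm R), useful when events cannot all be buffered
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), k=100)
print(len(sample))  # always exactly 100, no matter how long the stream
```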

The deployment architecture is a key driver for how to design telemetry, too. In a cloud-based system, data is already on the company’s servers, whereas desktop applications, mobile apps, or IoT devices may use ML models locally and may not want to (or cannot) upload all raw data as telemetry.

Privacy is also often a big issue when considering what data to collect. For example, Amazon reserves the right in its privacy policy to collect and store voice recordings from its Alexa devices, but has repeatedly gotten bad press about it. On the other hand, if one promises that ML models and data never leave the user’s device (as with Apple’s FaceID or various smart keyboard implementations), it limits the amount of telemetry one can collect. Collecting anonymized, summarized information rather than raw data may be a compromise.

Another big challenge is how to design a user interface that facilitates collecting meaningful data without being intrusive to the user experience. Simple changes to the interface, such as encouraging users to make corrections within the user interface or allowing them to easily discard useless suggestions, can significantly enhance what telemetry information can be collected.

Finally, it’s worth considering how to handle telemetry while the system is temporarily offline (e.g., a mobile app in an airplane), has a bad or expensive internet connection, or is even permanently offline. Should telemetry be collected locally and sent later? Is late telemetry still useful? Should something be collected for local diagnostics?

Overall designing good telemetry systems is an important and nontrivial engineering problem that may require substantial creativity and may even influence how the product overall is designed (as in the example of designing a better frontend to collect better telemetry in the transcription example above).

As a simple exercise, think through possible strategies to collect telemetry to assess model and system quality for the following scenarios; be explicit about metrics, what data to collect and how much, and how to operationalize the metric:

  • Was the sales price/value of a house on Zillow predicted correctly?
  • Did the ShoeGazer app correctly recognize the shoe brand from a photo?
  • Did the profanity filter remove the right blog comments?
  • Was there cancer in the medical image?
  • Was a generated Spotify playlist good?
  • Was the ranking of search results good?
  • Were friends tagged correctly in the photo collection?
  • Did the self-driving car brake at the right moment? Did it detect the pedestrians?
  • SmartHome: Does it automatically turn off the lights/lock the doors/close the window at the right time?
  • News website: Does it pick the headline alternative that attracts a user’s attention most?

Monitoring and Experimenting in Production

With a solid telemetry system in place, we now have the foundation for monitoring and experimentation. All of these approaches work equally for assessing model quality and for observing system success, depending on what telemetry is collected.

The first step is usually to set up a monitoring infrastructure to plot qualities over time and possibly also for different populations. For example, in the plot below, we can see that we observe fairly similar model quality for a movie recommendation service across 5 subpopulations, until at a certain point in time there is a huge jump for one subpopulation. Maybe an update has happened or something really suspicious is going on.

Another quite common pattern is slow degradation over time due to some form of drift, that is, the distribution of inputs or the expected outputs changes over time, say because customers’ tastes change or new products are released that need to be considered. Retraining on more recent training data (often from telemetry, see the flywheel above) is a mitigation, and observing quality trends in telemetry can provide insights into when retraining is needed.

Image source: Joel Thomas and Clemens Mewald. Productionizing Machine Learning: From Deployment to Drift Detection. Databricks Blog, 2019
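A minimal monitoring check along these lines can be sketched as follows (all names, window sizes, and thresholds below are made up for illustration):

```python
from collections import deque

# Sketch of a drift check: track a correctness proxy in a sliding window
# and alert when it falls below a threshold.
class QualityMonitor:
    def __init__(self, window=1000, threshold=0.9):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def record(self, was_correct):
        self.window.append(1 if was_correct else 0)

    def check(self):
        if len(self.window) < self.window.maxlen:
            return None  # not enough data yet
        rate = sum(self.window) / len(self.window)
        if rate < self.threshold:
            return f"ALERT: accuracy {rate:.2f} below {self.threshold}"
        return None

monitor = QualityMonitor(window=100, threshold=0.9)
for _ in range(100):
    monitor.record(True)
print(monitor.check())   # None: windowed accuracy is fine
for _ in range(20):
    monitor.record(False)
print(monitor.check())   # alert: windowed accuracy dropped to 0.80
```

In practice, tools like Prometheus and Grafana provide this kind of windowed aggregation and alerting out of the box; the point here is only the underlying idea.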

Technically, there are lots of tools that make it easy to create nice-looking dashboards, such as Grafana. In addition, lots of infrastructure tools have been built to collect, analyze, and aggregate log data at scale, including Prometheus, Loki, Logstash, Elasticsearch, and many more. It’s usually easy to set up alert systems as well that send messages when metrics exceed certain thresholds. All of this is also readily available through cloud offerings such as AWS. The initial learning curve can be a bit steep, but these are powerful and scalable tools that are worth investigating. As a starting point, have a look at Cohen’s “Monitor! Stop Being A Blind Data-Scientist.”

Classic human-subject experiments in science or medicine divide participants into two groups, give one group the treatment and use the other group without the treatment as control group, and then compare whether there are differences in outcomes between the groups. Running experiments is challenging, because it is usually expensive and difficult to recruit enough participants and it is important but nontrivial to ensure that the two groups are comparable (i.e., not more healthy individuals in one group for a medical trial).

Cloud-based and internet-connected services with many users provide perfect conditions to run controlled experiments. One can simply divide users into two groups and show them different versions of the product. No recruitment costs! Often thousands or millions of participants in trials! (Also typically no informed consent and no institutional review board to review the ethics of experiments, but let’s leave this discussion for another time.) With a large number of participants randomly assigned to groups, one also has to worry less about unbalanced groups or about missing small effects in random noise. With enough users, it becomes feasible to run hundreds of experiments at the same time (as many big tech companies do these days), to constantly experiment, and to validate most development decisions with data from experiments. This is another example of the flywheel effect: the more users you have, the easier it is to experiment in production, and the better the foundation to further improve the product and attract more users.

The idea behind A/B experiments is straightforward: Simply assign some percentage of your users to a treatment group and serve them a different variant of the product that you want to evaluate, typically differing only in a small change, say a different layout or a newer ML model.

Implementations are usually also not too hard. We need three ingredients:

  1. two or more alternative implementation variants to be tested,
  2. a mechanism to map users to those variants, and
  3. a mechanism to collect telemetry separately for each variant.

Implementing variants. To serve two alternative variants, one can either deploy two separate variants and decide which users to route to which variant with a load balancer, or one can encode the differences as a simple control-flow decision within a single application — a practice called feature flags.

if (features.enabled(userId, "new_model_experiment5")) {
    // new feature extraction
    // predict with new model
} else {
    // old feature extraction
    // predict with old model
}

Feature flags are simply boolean options that are used to decide between two control flow paths in the implementation. Ideally they are tracked explicitly, documented, and removed once the experiment is done.

Mapping users to variants. Now we need to decide which users will see which variants. Random assignment is usually a good idea to avoid bias in how groups are formed (e.g., all experienced users in one group). It is also usually a good idea to keep specific users in stable groups, so that they don’t see different variants whenever they reload the web page (or open the app or request a prediction, …). Even without user accounts it’s worth trying to identify repeat users to keep them in the same group.

The mapping can be as simple as this function, randomly selecting 10% of all users:

def isEnabled(userId): return hash(userId) % 100 < 10

It’s also possible to divide users into groups and randomly assign within groups, say 50% of beta-users, 90% of developers (dogfooding), and 0.1% of all normal users.
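Such a stratified mapping might look like this (a sketch with made-up groups and percentages; note the use of a stable hash, since Python’s built-in hash() is randomized per process and would reshuffle groups on every restart):

```python
import hashlib

# percent of each user group assigned to the treatment (hypothetical values)
ROLLOUT = {"beta": 50, "developer": 90, "normal": 0.1}

def bucket(user_id):
    # stable hash: the same user always lands in the same 0..9999 bucket
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 10000

def is_enabled(user_id, group):
    # percent -> number of buckets (each bucket is 0.01% of users)
    return bucket(user_id) < ROLLOUT[group] * 100

print(is_enabled("user42", "developer"))
```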

Telemetry. Finally, it is important that telemetry about model or service quality can be mapped to experimental conditions. That is, we need to know which sales were associated with users who saw predictions by the new model vs users who saw predictions by the old model. We might be able to reverse engineer this from the user-variant mapping + time of the experiment or we might include the variant used as part of the log file.

Now it should be straightforward to compute metrics for both groups.
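With the variant recorded in each telemetry event, computing per-group metrics becomes a simple group-by. A sketch with a hypothetical log format:

```python
from collections import defaultdict

# Hypothetical telemetry records, each tagged with the variant the user saw
events = [
    {"user": "u1", "variant": "old_model", "time_on_page": 190},
    {"user": "u2", "variant": "new_model", "time_on_page": 210},
    {"user": "u3", "variant": "old_model", "time_on_page": 200},
    {"user": "u4", "variant": "new_model", "time_on_page": 220},
]

by_variant = defaultdict(list)
for event in events:
    by_variant[event["variant"]].append(event["time_on_page"])

for variant, times in sorted(by_variant.items()):
    print(variant, sum(times) / len(times))
# new_model 215.0
# old_model 195.0
```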

Statistics of A/B experiments. It seems tempting to just compare the quality metrics for both groups, say the average time a user spends on the page. We might see that 2158 users in the control group with the old recommendation system spend on average 3:13 minutes on the page, whereas the 10 users in the treatment group with the new prediction model spend on average 3:24 minutes on the page — great, a 5% improvement from the new model!

The obvious problem is that each user is different and spends a different amount of time on the site. Maybe we were just lucky to have users who usually spend more time in the treatment group? This is where statistics can help us separate noise from actual effects.

In each group, we will see a distribution with some users spending more time on the website and others less. The key question we want to answer is whether the difference in the A/B experiment (the new model) is responsible for the difference between the distributions, or whether we may just have two samples from the same distribution that look slightly different by chance. This is exactly what statistical tests do.

In a nutshell, the more samples one has in each group, the easier it is to be confident that an observed difference is not due to random noise. Larger differences are easier to detect than smaller ones, but with large enough sample sizes even small differences can be detected reliably and confidently. Medicine and psychology researchers usually run experiments with dozens or maybe a few hundred participants if they are lucky; big tech companies can grab 0.05% of their users and still have tens of thousands of users in the treatment group.

The more samples we collect, the narrower our confidence interval gets. Notice that we have very little confidence about the mean with fewer than 5 samples, but get a fairly narrow interval after about 70 samples.

Two concepts are helpful to get familiar with: confidence intervals and the t-test. Confidence intervals provide a range in which to expect the true mean of a distribution from which we have only a sample (given a confidence level, often 95%). The statistically correct definition is a bit complicated, but a rough approximation for practical use is that we expect the true mean of the distribution to be within the interval with 95% probability. The more samples we have, the narrower the confidence interval usually gets, because we gain more confidence about the mean of the distribution. If we observe two measurements with confidence intervals that do not overlap, we can be pretty confident that the difference is not due to chance.
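For illustration, a large-sample 95% confidence interval can be computed with Python’s standard library (a sketch using the normal approximation, which is fine for typical A/B sample sizes; for small samples a t-distribution would be more appropriate):

```python
from statistics import mean, stdev, NormalDist

def confidence_interval(sample, confidence=0.95):
    # normal-approximation CI for the mean: m +/- z * s / sqrt(n)
    m = mean(sample)
    stderr = stdev(sample) / len(sample) ** 0.5
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # 1.96 for 95%
    return m - z * stderr, m + z * stderr

# hypothetical time-on-page measurements (minutes) for one group
minutes_on_page = [3.1, 3.4, 2.9, 3.3, 3.6, 3.0, 3.2, 3.5, 3.1, 3.4]
low, high = confidence_interval(minutes_on_page)
print(f"mean {mean(minutes_on_page):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```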

The t-test is a statistical test to check whether two samples likely come from different distributions, that is, whether we observe an actual difference between the treatment and control groups rather than just random noise (again, the correct statistical explanation is more complicated, but an intuitive understanding is sufficient here). The typical convention is again that if we can rule out with 95 percent confidence that a result comes from random noise, we consider the difference statistically significant. For example, below we see a t-test computed in R, which shows that the two samples with means 51.42 and 49.37 are most likely from different distributions:

> t.test(x, y, conf.level=0.95)

Welch Two Sample t-test

t = 1.9988, df = 95.801, p-value = 0.04846
alternative hypothesis: true difference in means is
not equal to 0
95 percent confidence interval:
0.3464147 3.7520619
sample estimates:
mean of x mean of y
51.42307 49.3738

The t-test technically has a number of assumptions about the data, but given the large sample sizes we typically have during A/B testing, we usually do not need to worry here. If you are comparing small samples, ask a statistician for help.

Dashboards showing results of A/B experiments commonly show both confidence intervals for the observed metrics (e.g., for “conversion rates” in the screenshot below) and an indication of whether the observed difference is statistically significant (“chance to beat Baseline” in the screenshot below):

Example of a dashboard showing the result of an A/B experiment from https://conversionsciences.com/ab-testing-statistics/

Finally, it’s possible to run multiple experiments and look for interactions among multiple treatments. For this, it’s worth working with somebody with experience in scientific experiments and statistics and to look into multi-factorial designs.

Number of participants needed. Typical A/B experiments are run with a substantial number of participants in the treatment group (thousands) and over multiple hours or days. On the one hand, one needs enough participants in the treatment group to see a statistically significant effect, and the smaller the expected improvement we want to detect, the more participants we need. On the other hand, we want to keep the treatment group relatively small, so that in the case of a failed experiment (e.g., the new model is actually much worse) we do not expose too many users; this is also known as “minimizing the blast radius.”

In theory, statistical power analysis provides the means to compute the needed sample sizes. In practice, most people running A/B tests develop some intuition, use a fairly small percentage of their user base (say, 0.5 to 10%) for the treatment group, and run experiments until they see statistically significant results. Big tech companies with millions of active users can run hundreds of experiments in parallel, each with thousands of users, without ever exposing more than 1% of their users to a given treatment.
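For a rough sense of the numbers involved, the standard large-sample approximation for comparing two proportions (e.g., conversion rates) can be sketched as follows; the formula and parameter defaults here are textbook conventions, not specific to any particular platform:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect a change from rate p1 to p2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha=0.05
    z_power = NormalDist().inv_cdf(power)          # 0.84 for 80% power
    p_bar = (p1 + p2) / 2                          # pooled proportion
    return ceil(2 * (z_alpha + z_power) ** 2 * p_bar * (1 - p_bar) / (p1 - p2) ** 2)

# Detecting a lift from a 10% to an 11% conversion rate already needs
# roughly 15,000 users per group under these assumptions.
n = sample_size_per_group(0.10, 0.11)
```

The quadratic dependence on the effect size `p1 - p2` is why detecting small improvements requires so many participants.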

Complications. A/B experiments are simple, but there are many further pitfalls, known from traditional experimental design, statistics, and psychology. Examples include novelty effects, where users respond positively to the treatment for a short period just because it is new, not because it is better; this can show positive effects in an experiment that wear off quickly in practice. Conversely, many long-term users resist change and may not use the new feature in the short term, leading the experiment to underestimate the real benefit of the treatment. Beyond that, there are various forms of interference, where control and treatment groups are not truly independent. For instance, an experiment may influence the behavior of the treatment group in a way that is observable by users in the control group; e.g., Facebook users in the control group see users in the treatment group posting more cat memes and mirror that behavior, or a new YouTube recommendation algorithm favors videos with “positive sentiment,” which might encourage video producers to change their behavior without being part of the experiment.

All these problems are generally well understood in science and psychology and there are solutions in terms of experimental design and statistics to address them. As experimentation in an organization matures, it is a good idea to look deeper into relevant research methods and recruit a specialist.

Infrastructure. It is fairly straightforward to set up an A/B experiment by introducing a few feature flags, a user-variant mapping, and suitable telemetry, and maybe throw in a t-test. However, it is often a good idea to invest in some infrastructure to make experimentation easy and flexible to encourage data scientists and software engineers to experiment frequently (“continuous experimentation”) and make evidence-based decisions and improvements based on experiment results.

To this end, a feature flag library, a flexible mechanism for user-variant mapping, and a solid dashboard to show experimental results are worthy investments. To push this further, it’s possible to build infrastructure to schedule and queue multiple experiments, automatically stop experiments when results can be reported with confidence, automatically stop experiments if something goes bad, and so forth. For inspiration, have a look at the experience shared by both Google and Facebook about their internal infrastructure in academic papers. Many open source projects provide foundations for individual parts, but such infrastructure is also commercially available as a service, for example, from launchdarkly and split.io.
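A core building block of such infrastructure is a deterministic user-variant mapping, so the same user always sees the same variant without storing any per-user state; a common sketch (function names are illustrative) hashes the user ID together with an experiment name:

```python
import hashlib

def in_treatment(user_id, experiment, treatment_fraction):
    """Deterministically assign a user to the treatment group of an experiment.

    Hashing user and experiment together gives each experiment an independent
    assignment, so treatment groups of parallel experiments do not
    systematically overlap.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000  # bucket in 0..9999
    return bucket < treatment_fraction * 10_000

in_treatment("user42", "new-ranker", 0.10)  # same answer on every call
```

Because the mapping is stateless, any server in a fleet computes the same assignment, and ramping an experiment up only requires changing `treatment_fraction`.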

Recently, the MLOps community has invested in a lot of infrastructure to make it easy to deploy different ML model variants and to experiment in production.

The term canary release comes from the practice of bringing canary birds into coal mines: the specially bred birds were more sensitive to carbon monoxide and would die before the gas levels became lethal for the human miners. Canary releases try to detect release problems early.

Canary releases are the idea that new revisions of a product or ML model are released incrementally to an increasingly larger user base, while monitoring the release so that it can be rolled back as soon as problems are observed, before too many users are affected (often as part of a continuous deployment pipeline). That is, at first the new revision is released only to, say, 1% of the users, then 5%, then 20%, and then all. At each step, we monitor whether the new revision is stable and at least not performing worse than the previous one. The entire process, including rollbacks of bad releases, can be automated.
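The staged rollout logic can be sketched as a simple controller loop (the stages, health check, and threshold here are illustrative assumptions, not a standard):

```python
def canary_rollout(stages, error_rate_at, max_error_rate):
    """Increase the rollout percentage stage by stage; roll back on bad health.

    error_rate_at(percent) returns the observed error rate while `percent`
    of users see the new revision (in practice: read from telemetry).
    Returns the final rollout percentage (0 means rolled back).
    """
    for percent in stages:
        if error_rate_at(percent) > max_error_rate:
            return 0  # roll back: no users remain on the new revision
    return stages[-1]  # fully rolled out

# A healthy release reaches 100%; a broken one is rolled back at the first stage.
canary_rollout([1, 5, 20, 100], lambda p: 0.001, max_error_rate=0.01)  # -> 100
canary_rollout([1, 5, 20, 100], lambda p: 0.20, max_error_rate=0.01)   # -> 0
```

Real controllers would additionally wait at each stage until enough telemetry has accumulated for a statistically meaningful health check.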

Canary releases are very similar to A/B experiments at a technical level, but they are used with different goals. Whereas A/B experiments analyze an experimental change to see whether it improves the product, canary releases are usually designed as a safety net for releasing intended changes. Still, they use very similar techniques to encode variants (load balancing or feature flags), similar ways of mapping users to variants, and similar ways to track success and make decisions through telemetry. The key differences are that the user-variant mapping is adjusted over time to show the new release to more users and decisions based on telemetry (rollout to more users or rollback to old version) are automated.

Experimental infrastructure can often be designed to handle A/B tests and canary releases together. Services like launchdarkly and split.io support both. Canary releases are usually integrated into a larger continuous deployment pipeline, where changes are first tested offline and then scheduled for automatic canary releases.

A note on terminology: Revisions refer to versions over time, where one revision succeeds another. Variants refer to versions that exist in parallel, for example, models for different countries or models still to be evaluated as part of an A/B test. Version is the general term that refers to both revisions and variants.

A shadow release, also known as traffic teeing, runs two variants in parallel but shows only the predictions of the old model to the user. That is, we do not see the effect of the new model on the user and the system, but we can observe whether the new model is stable and how often its predictions differ from those of the old model. If we can gain some ground-truth labels for the production data, we can even compare the accuracy of the two models.

Shadow releases are traditionally used to test stability and latency regressions, but have also been used in testing self-driving cars.
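A minimal sketch of the pattern (class and names are illustrative): call both models on every request, return only the old model's prediction, and log both outputs for offline comparison:

```python
class ShadowDeployment:
    """Serve the old model; run the new model in the shadow and log differences."""

    def __init__(self, old_model, new_model):
        self.old_model, self.new_model = old_model, new_model
        self.log = []  # in practice: telemetry, not an in-memory list

    def predict(self, features):
        old_pred = self.old_model(features)
        new_pred = self.new_model(features)  # result is never shown to the user
        self.log.append((features, old_pred, new_pred, old_pred == new_pred))
        return old_pred  # users only ever see the old model's prediction

shadow = ShadowDeployment(old_model=lambda x: "cat", new_model=lambda x: "dog")
shadow.predict([1.0, 2.0])  # returns "cat"; the disagreement is logged
```

Note that the shadow model adds latency and compute cost on every request, which is the price paid for risk-free observation.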

Blue/green deployment is a release strategy in which both the old and the new variant of a model or system are deployed and a load balancer (or feature flag) switches all traffic to the new version at once. That is, in contrast to incremental deployments in canary releases, the new variant is released to everybody at the same time (avoiding potential consistency challenges from running different variants in parallel), but the infrastructure allows us to immediately undo the release if something goes wrong.
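The mechanics can be sketched with a tiny router that keeps both deployments alive, so a rollback is a single pointer switch (names are illustrative):

```python
class BlueGreenRouter:
    """Route all traffic to one of two live deployments; switch atomically."""

    def __init__(self, blue, green):
        self.deployments = {"blue": blue, "green": green}
        self.active = "blue"  # the old version serves all traffic initially

    def switch_to(self, name):
        self.active = name  # instant cutover; also serves as the rollback

    def handle(self, request):
        return self.deployments[self.active](request)

router = BlueGreenRouter(blue=lambda r: "v1", green=lambda r: "v2")
router.switch_to("green")  # release: everyone gets v2 at once
router.switch_to("blue")   # something went wrong: instant rollback
```

Keeping the old deployment running (rather than tearing it down) is what makes the rollback instantaneous.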

Chaos engineering was originally developed to test the robustness of a distributed system against server outages, network latency, and other distributed-system problems (see above). The idea can naturally be extended to machine-learning components, especially when testing infrastructure robustness (see lecture on infrastructure quality), say, “would we detect if we released a stale model?” or “would system quality suffer if this model were 10% less accurate or had 15% more latency?”
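As a sketch of what such a chaos experiment for an ML component might look like (the wrapper and fault type are made up for illustration): wrap the model so faults can be injected deliberately, then check whether monitoring and alerting notice the degradation:

```python
import random

def with_fault_injection(model, flip_fraction, rng=None):
    """Wrap a binary-classification model so a fraction of its predictions
    are deliberately flipped.

    Useful to check whether downstream monitoring and alerting would notice
    a model that suddenly became less accurate.
    """
    rng = rng or random.Random()

    def faulty_model(features):
        prediction = model(features)
        if rng.random() < flip_fraction:
            return not prediction  # inject a wrong (flipped) prediction
        return prediction

    return faulty_model

# Simulate a model that became 10% less accurate than the original.
chaos_model = with_fault_injection(lambda x: True, flip_fraction=0.1)
```

In a real chaos experiment, the injection would target a small, isolated slice of production traffic and would be paired with an expectation of which alert should fire.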

Experimenting in production can be extremely powerful once an active user base is available.

Whenever one experiments in production, things can go wrong. Experiments can fail, releases can break functionality, and the small injected faults of a chaos experiment can take the entire service down. A key principle is always to minimize the blast radius, that is, to plan experiments such that the consequences of bad outcomes are limited. That is why the treatment group for an A/B experiment is usually fairly small, why canary releases increase the rollout slowly from few to more users, and why people conducting chaos experiments usually think carefully about how to isolate effects to few users or subsystems.

For the same reason it is important to be able to quickly change deployments; for example, abort a release or redirect all traffic of the treatment group back to the original code version. Many good engineering practices and DevOps tools are helpful here, including rigorous versioning, containerization, and load balancers.

Automation is similarly important to allow systems to act quickly on insights, for example, not having to wait for humans to react to alerts and then look up which shell scripts they need to run to roll back a release, possibly creating an even larger mess in the process by changing or deleting the wrong files. Automation also allows us to run experiments for exactly as long as needed and then immediately start the next one, especially in organizations that run a huge number of experiments. Good infrastructure tooling will automate steps, take friction out of the process, and reduce turnaround time, thus encouraging developers to experiment more and to base more decisions on empirical observations in the production system.

Any observations from a production system will be noisy, so performing statistically rigorous analyses is helpful to reduce the chance of making decisions based on experiment results that stem from pure chance.

Consider building on the extensive experience and infrastructure shared by others. The DevOps community has long experience with monitoring, A/B experiments, and canary releases. The MLOps community brings in specialized tools for machine-learning pipelines and artifacts and integrates with many traditional DevOps practices — see pointers below. (AIOps is another buzzword that tends to refer to using machine learning to make decisions during operations in a data center, e.g., when to scale up a deployment to more machines; DataOps usually refers to agile data analytics, reporting, and data-driven decision making in a business setting.)

And finally, whenever experimenting in production, be aware that you are experimenting on your users, real humans. Even with a minimized blast radius and quick rollbacks, depending on the system, there is a chance of real harm to some users. Extensive A/B testing is also powerful at identifying how to best design a manipulative system that exploits human weaknesses and biases and fosters addiction, leading to systems that may optimize some success measure (e.g., sales, ads clicked, time spent on the page) but may not be good for users as individual persons. Considering the ethics and safety of experiments in production is important for responsible engineering.


In summary, quality assurance and experimenting in production can be extremely powerful.

Production data is the ultimate unseen test data and does not suffer from many of the problems of traditional offline evaluations of machine-learning models.

Designing good telemetry that can capture key qualities of the model or the system is the key challenge. It is a design and engineering problem but also an opportunity.

Monitoring helps to observe how the system is doing in production. Dashboards and automated alerts can be powerful mechanisms to provide insights into how the model or the system is doing in production.

There are many powerful forms of experiments (A/B testing, canary releases, shadow releases, chaos engineering) each with their own challenges and benefits.

Statistics can help to gain confidence in results of online experiments despite typical noise.

Solid infrastructure and automation are highly beneficial. Many DevOps and MLOps tools make it easy for developers and data scientists to run experiments in production.

Additional resources
