This chapter covers material from the introductory lecture of our Machine Learning in Production course. For other chapters see the table of contents.

Video recording of the first lecture of our corresponding course, which covers many of the same topics.

Recently, machine learning has enabled incredible advances in the capabilities of software systems, allowing us to design systems that would have seemed like science fiction only one or two decades ago, such as medical assistance, personalized recommendations, interactive voice-controlled bots, and self-driving cars. Machine learning components are routinely integrated into web applications, mobile phone apps, and off-the-shelf devices, such as…


This post covers content from the “process and technical debt” lecture of our Machine Learning in Production course. For other chapters see the table of contents.

Technical debt is a powerful metaphor to think about trading off short-term benefits with later or long-term costs. In monetary terms, going into debt by taking out a loan provides the borrower with money that can be used immediately but needs to be repaid later with accumulated interest. In the software development analogy, certain actions provide short-term benefits, such as faster progress and earlier releases, but come at the cost of lower productivity, needed…


This post covers content from the “Interpretability and Explainability” lecture of our Machine Learning in Production course. For other chapters see the table of contents.

Machine-learned models are often opaque and make decisions that we do not understand. As discussed, we use machine learning precisely when we do not know how to solve a problem with fixed rules and instead try to learn a solution from data; there are many examples of systems that seem to work and outperform humans, even though we have no idea how they work. So, how can we trust models that we do not understand…


This post is the third and final chapter in a series on the “model quality” lectures of our Machine Learning in Production course. The previous chapters discussed what correctness actually means for a machine-learned model (link) and how data scientists traditionally use various accuracy measures to evaluate model quality on test data (link). This chapter focuses on more advanced testing strategies, many of them developed only recently and inspired by testing techniques from traditional software. For other topics see the table of contents. …


The data scientist’s toolbox and its pitfalls

This post is the second of three chapters on model quality in our Machine Learning in Production course. The previous chapter discussed what correctness actually means for a machine-learned model. This chapter focuses on traditional measures of prediction accuracy for machine-learned models and the pitfalls of such an evaluation strategy. The next chapter will then look more broadly at other testing techniques.

We start with an overview of traditional data science measures of prediction accuracy that are usually well covered in any machine learning course and textbook, before looking into the limitations and pitfalls of such an evaluation strategy. …
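As a quick refresher of that starting point, here is a minimal sketch (using scikit-learn and a toy dataset, not an example from the chapter itself) of the standard held-out evaluation from which such accuracy measures are usually computed:

```python
# Minimal sketch: the standard held-out evaluation behind traditional
# accuracy measures, using scikit-learn and a toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

X, y = load_breast_cancer(return_X_y=True)

# Hold out data that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Traditional single-number quality measures on the held-out data.
print("accuracy: ", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
```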


What’s wrong with i.i.d. evaluations? How do we identify capabilities? How do we create tests? Could the model generalize to out-of-distribution data?

Triggered by two papers [1, 2], I have been reading and thinking a lot recently about testing of machine-learned models. There seems to be a growing trend to test capabilities of a model in addition to just measuring its average prediction accuracy, which reminds me a lot of unit testing and strongly shifts how we think about model evaluation.

In short, instead of just measuring generalization within a certain target population, we could try to evaluate whether the model learned certain capabilities, often key concepts of the problem, patterns of common mistakes to avoid, or strategies of how humans would…
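To make this concrete, a capability check can look much like a unit test. The sketch below is only an illustration: predict_sentiment is a hypothetical stand-in for the model under test, and the negation examples are written in the spirit of CheckList-style tests rather than taken from the papers above.

```python
# Sketch of a unit-test-style capability check (negation handling) for a
# hypothetical sentiment model. `predict_sentiment` is a placeholder for
# whatever model is being evaluated; it should return "pos" or "neg".
def predict_sentiment(text: str) -> str:
    raise NotImplementedError("call the model under test here")

def test_negation_capability():
    # Curated examples probing one capability, not an i.i.d. sample.
    examples = [
        ("The food was good.", "pos"),
        ("The food was not good.", "neg"),
        ("I do not recommend this place.", "neg"),
        ("I would not say the service was bad.", "pos"),
    ]
    failures = [(text, expected) for text, expected in examples
                if predict_sentiment(text) != expected]
    # Like a unit test, the capability either passes or fails, independently
    # of the model's average accuracy on a held-out test set.
    assert not failures, f"negation capability failed on: {failures}"
```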


All models are wrong, but some are useful. — George Box

This post is the first in a series of three that correspond to the “model quality” lectures of our Machine Learning in Production course, followed by Measuring Model Accuracy and Testing Models (Beyond Accuracy). A shorter version of these three chapters has been previously published as a blog post. For other topics see the table of contents.

For data scientists, evaluations usually start with measuring prediction accuracy on a held-out dataset. While an accurate model does not guarantee that the overall system performs well in production when interacting with…


“Data cleaning and repairing account for about 60% of the work of data scientists.” — Gil Press. “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says.” Forbes Magazine, 2016.

Machine learning approaches rely on data to train models. In production, user data, possibly enhanced by data from other sources, is fed into the model to make predictions. Production machine learning systems additionally collect telemetry data and make decisions based on the collected data. With all that reliance on data, system quality can suffer massively if data is of low quality. Models may be performing poorly…
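As a small illustration of what that reliance implies in practice, even low-effort data-quality checks can catch many problems before they silently degrade a model. The sketch below uses pandas with made-up column names; it is an assumption for illustration, not an example from the chapter.

```python
# Sketch of simple data-quality checks (schema, completeness, plausibility,
# uniqueness) that could run before training or before serving, using pandas.
# Column names are hypothetical.
import pandas as pd

def check_data_quality(df: pd.DataFrame) -> list[str]:
    problems = []
    # Schema: the columns the pipeline expects must be present.
    expected = {"user_id", "age", "signup_date", "country"}
    missing = expected - set(df.columns)
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems
    # Completeness: key fields should not be null.
    if df["user_id"].isna().any():
        problems.append("null user_id values")
    # Plausibility: values should fall in a sensible range.
    if not df["age"].between(0, 120).all():
        problems.append("implausible age values")
    # Uniqueness: identifiers should not be duplicated.
    if df["user_id"].duplicated().any():
        problems.append("duplicate user_id values")
    return problems

# Example: flag low-quality rows before they reach training or the model.
df = pd.DataFrame({"user_id": [1, 2, 2], "age": [34, -5, 29],
                   "signup_date": ["2021-01-01"] * 3, "country": ["US"] * 3})
print(check_data_quality(df))  # ['implausible age values', 'duplicate user_id values']
```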


This post is a summary of the “versioning, provenance, reproducibility” lecture of our Machine Learning in Production course; for other chapters see the table of contents.

In production systems with machine learning components, updates and experiments are frequent. New updates to models may be released every day or every few minutes, and different users may see the results of different models as part of A/B experiments or canary releases (see QA in Production lecture). It may be hard to later figure out which model made a certain prediction or how that model was produced.
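One low-tech way to make such questions answerable later is to record provenance metadata with every prediction. The sketch below is only an illustration with made-up field names and an sklearn-style model interface, not a prescription from the lecture:

```python
# Sketch: log which model version and feature-pipeline revision produced each
# prediction, so a later debugging session can reconstruct what happened.
# Field names and the model interface (sklearn-style .predict) are assumptions.
import json, time, uuid

def predict_with_provenance(model, model_version, pipeline_rev, features, log_file):
    prediction = model.predict([features])[0]
    record = {
        "prediction_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,        # e.g., a registry tag or git hash
        "feature_pipeline_rev": pipeline_rev,  # code/data version used to build features
        "features": features,
        "prediction": prediction,
    }
    # Append-only log; in practice this would go to a log service or database.
    with open(log_file, "a") as f:
        f.write(json.dumps(record, default=str) + "\n")
    return prediction
```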

Motivating Example: Debugging a Prediction Problem

Let’s imagine we are responsible…


After teaching our Machine Learning in Production class (formerly “Software Engineering for AI-Enabled Systems”) four times, we stupidly made a decision we are soon going to regret, given how many chapters are still left to write: We are going to write a book with our collected material.

We will eventually release the book publicly under a Creative Commons license. While we work on it, we are releasing individual chapters here, one at a time. We hope that more of the chapters below will be filled with links soon.

Table of Contents

Part 1: ML in Production: Going beyond the model

  • Introduction
  • From model to production system
  • Challenges of production machine learning systems

Part 2: Engineering Production AI Systems

  • Requirements engineering
  • 1. System…

Christian Kästner

associate professor @ Carnegie Mellon; software engineering, configurations, open source, SE4AI, juggling
