“Data cleaning and repairing account for about 60% of the work of data scientists.” — Gil Press. “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says.” Forbes Magazine, 2016.

Machine learning approaches rely on data to train models. In production, user data, possibly enhanced by data from other sources, is fed into the model to make predictions. Production machine-learning systems additionally collect telemetry data and make decisions based on it. With all that reliance on data, system quality can suffer massively if the data is of low quality. Models may be performing poorly…
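To make that concrete, here is a minimal sketch of the kind of automated checks one might run on training or inference data before trusting it; the column names, thresholds, and rules are hypothetical, not taken from the lecture.

```python
import pandas as pd

def check_data_quality(df: pd.DataFrame) -> list[str]:
    """Return human-readable descriptions of data-quality issues (illustrative checks only)."""
    issues = []
    # Completeness: flag columns with missing values
    for col in df.columns:
        missing = df[col].isna().mean()
        if missing > 0:
            issues.append(f"{col}: {missing:.1%} missing values")
    # Plausibility: a hypothetical domain rule for an 'age' column
    if "age" in df.columns and ((df["age"] < 0) | (df["age"] > 120)).any():
        issues.append("age: values outside the plausible range 0-120")
    # Uniqueness: duplicate rows often point to upstream ingestion or join bugs
    if df.duplicated().any():
        issues.append(f"{df.duplicated().sum()} duplicate rows")
    return issues

# Usage with a hypothetical data file:
# print(check_data_quality(pd.read_csv("training_data.csv")))
```

Returning descriptions rather than failing on the first violation makes it easy to report all issues in a dataset at once.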


This post is a summary of the “versioning, provenance, reproducibility” lecture of our Machine Learning in Production course; for other chapters see the table of contents.

In production systems with machine learning components, updates and experiments are frequent. New updates to models may be released every day or every few minutes, and different users may see the results of different models as part of A/B experiments or canary releases (see QA in Production lecture). It may be hard to later figure out which model made a certain prediction or how that model was produced.
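As a minimal sketch of what provenance tracking at inference time can look like (the metadata fields and the idea of a registry tag are assumptions, not the course’s infrastructure), logging a model identifier and an input fingerprint with every prediction already answers the “which model produced this?” question:

```python
import hashlib
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("predictions")

def predict_with_provenance(model, model_version, features):
    """Serve a prediction and log enough metadata to trace it back later.

    `model` is any object with a predict() method; `features` is assumed to be
    a JSON-serializable feature vector or dict; `model_version` could be a git
    commit hash or a model-registry tag.
    """
    prediction = model.predict([features])[0]
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "input_hash": hashlib.sha256(
            json.dumps(features, sort_keys=True).encode()).hexdigest(),
        "prediction": prediction,
    }
    logger.info(json.dumps(record, default=str))
    return prediction
```

With logs like these, tracing a problematic prediction back to the exact model version (and, via the version, to the training code and data) becomes a lookup rather than a reconstruction effort.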

Motivating Example: Debugging a Prediction Problem

Let’s imagine we are responsible…


After teaching our Machine Learning in Production class (formerly “Software Engineering for AI-Enabled Systems”) four times, we stupidly made a decision we are soon going to regret when there are still so many chapters left: We are going to write a book with our collected material.


We will eventually release the book publicly under a Creative Commons license. While we work on it, we are releasing individual chapters here, one at a time. We hope that more of the chapters below will be filled in with links soon.

Table of Contents

Part 1: ML in Production: Going beyond the model

  • Introduction
  • From model to production system
  • Challenges of production machine learning systems

Part 2: Engineering Production AI Systems

  • Requirements engineering
  • 1. System…


Quality assurance of machine-learning models is difficult. The standard approach is to evaluate prediction accuracy on held-out data, but this is prone to many problems, including testing on data that is not representative of production data, overfitting due to different forms of leakage, and not capturing data drift. Instead, many developers of systems with ML components focus on testing in production. This post is a summary of the “quality assessment in production” lecture of our Machine Learning in Production course; for other chapters see the table of contents.
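As a rough sketch of the contrast (the variable names and the threshold are made up for illustration), the same metric can be computed offline on held-out data and online from telemetry, i.e., from predictions logged at serving time and joined with labels observed later:

```python
from sklearn.metrics import accuracy_score

def compare_offline_online(model, X_heldout, y_heldout, telemetry):
    """Contrast held-out accuracy with accuracy computed from production telemetry.

    `telemetry` is assumed to be a DataFrame with one row per served prediction,
    containing the prediction that was served and the label observed later
    (e.g., whether the user accepted the recommendation).
    """
    offline_acc = accuracy_score(y_heldout, model.predict(X_heldout))
    online_acc = accuracy_score(telemetry["observed_label"],
                                telemetry["served_prediction"])
    if offline_acc - online_acc > 0.05:  # arbitrary threshold, illustration only
        print(f"Held-out accuracy ({offline_acc:.3f}) overstates production "
              f"accuracy ({online_acc:.3f}); check for drift or leakage.")
    return offline_acc, online_acc
```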

Testing in Production — A Brief History

Testing in production is not a new idea and has been…


AI Alignment and Safety

Researchers in multiple communities (machine learning, formal methods, programming languages, security, software engineering) have embraced research on model robustness, typically cast as safety or security verification. They make continuous and impressive progress toward better, more scalable, or more accurate techniques to assess robustness and prove theorems, but the many papers I have read in the field essentially never discuss how this would be used to actually make a system safer. I argue that this is an example of the streetlight effect: we focus research on robustness verification because it has well-defined measures to evaluate papers, whereas actual safety and security…
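For readers unfamiliar with the property in question: robustness here usually means that a model’s prediction does not change for any input within a small neighborhood of a given input. The sketch below is only a crude empirical probe with random perturbations (it can find violations but never prove their absence), which is exactly the gap verification techniques aim to close; the radius and sample count are arbitrary.

```python
import numpy as np

def probe_robustness(model, x, epsilon=0.05, samples=1000, seed=0):
    """Check whether the prediction for x stays stable under sampled
    perturbations within an L-infinity ball of radius epsilon.
    Random sampling can only refute robustness, never prove it."""
    rng = np.random.default_rng(seed)
    baseline = model.predict(x.reshape(1, -1))[0]
    for _ in range(samples):
        perturbed = x + rng.uniform(-epsilon, epsilon, size=x.shape)
        if model.predict(perturbed.reshape(1, -1))[0] != baseline:
            return False  # found a perturbation that flips the prediction
    return True  # no violation found (not a proof of robustness)
```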


Software engineering education seems to focus a lot on process models (“software development life cycles”) like waterfall, spiral and agile. A number of models also describe the process of developing ML components (“machine learning workflow”). They don’t seem to fit well together, which causes a lot of confusion about the role that software engineering, MLOps and others play in building software systems with ML components, and where the real engineering challenges arise. Far from complete, here is the status of my current thinking… Feedback welcome.

ML Components in Software Systems

Outside of pure science projects, the ML model created with a machine learning framework is…


(this article corresponds to parts of the lecture “Risk and Planning for Mistakes” and the recitation “Requirements Decomposition” in our lecture Software Engineering for AI-Enabled Systems)

For exploring risk, feedback loops, safety, security, and fairness in production systems with machine learning components, the single most useful perspective for me is to look at the system through the lens of “The World and the Machine,” Michael Jackson’s classic ICSE 1995 requirements engineering paper.

The paper makes very clear (what should maybe already be obvious) what a software system can and cannot reason about and where it needs to…


I’m planning a loose collection of summaries of lectures from my Software Engineering for AI-Enabled Systems course, starting here with my Model Quality lecture.

Model quality primarily refers to how well a machine-learned model (i.e., a function predicting outputs for given inputs) generalizes to unseen data. Data scientists routinely assess model quality with various accuracy measures on validation data, which seems somewhat similar to software testing. As I will discuss, there are significant differences, but also many directions in which a software-testing view can likely provide insights beyond the classic methods taught in data science classes.
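The routine data-science view mentioned above looks roughly like the following sketch (a binary classifier and this particular set of metrics are assumptions for illustration, not the lecture’s code):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

def validation_report(model, X_val, y_val):
    """Report a handful of standard accuracy measures on held-back validation
    data; a binary classification task is assumed here."""
    y_pred = model.predict(X_val)
    return {
        "accuracy": accuracy_score(y_val, y_pred),
        "precision": precision_score(y_val, y_pred),
        "recall": recall_score(y_val, y_pred),
    }
```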


Dagstuhl, Feb 28, 2020

Disclaimer: The following discussion is aimed at clarifying terminology and concepts, but may be disappointing to readers expecting actionable insights. In the best case, it may provide inspiration for how to approach quality assurance in ML-enabled systems and what kind of techniques to explore.

TL;DR: I argue that machine learning corresponds to the requirements engineering phase of a project rather than the implementation phase and, as such, terminology that relates to validation (i.e., do we build the right system, given stakeholder needs) is more suitable than terminology that relates to verification (i.e., do we build the system right, given a specification)…


(Blog post migrated over from https://www.cs.cmu.edu/~ckaestne/featureflags/; we have since published a paper on this discussion, including updates from interviews we conducted, at ICSE-SEIP 2020.)

Having spent over a decade researching configurable systems and software product lines, the phenomenon of feature flags is interesting but also deeply familiar. After reading papers and blogs, watching talks, and interviewing several developers, we see some nuance but also many issues that we had thought were solved long ago.

Several research communities have looked into features, flags, and options: In the 90s researchers were investigating feature interactions in telecommunication systems and were worried about weird interaction…
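For readers who have not run into the mechanism itself: at its core, a feature flag is just a runtime-configurable conditional guarding an alternative code path, as in the sketch below (the flag name and the two ranking functions are hypothetical).

```python
# In practice the flag values come from a config file or a flag-management service.
FLAGS = {"new_ranking": False}

def legacy_rank(results):
    return sorted(results)

def experimental_rank(results):
    return sorted(results, reverse=True)

def rank_results(results):
    # The flag can be flipped in production without redeploying code, which is
    # also why stale flags tend to accumulate as configuration complexity.
    return experimental_rank(results) if FLAGS["new_ranking"] else legacy_rank(results)
```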

Christian Kästner

associate professor @ Carnegie Mellon; software engineering, configurations, open source, SE4AI, juggling
