This post contains the third and final chapter in a series on the “model quality” lectures of our Machine Learning in Production course. The previous chapters discussed what correctness actually means for a machine-learned model (link) and how data scientists traditionally use various accuracy measures to evaluate model quality on test data (link). This chapter focuses on more advanced testing strategies, many of them developed only recently and inspired by testing techniques for traditional software. For other topics see the table of contents. …

The data scientist’s toolbox and its pitfalls

This post is the second of three chapters on model quality in our Machine Learning in Production course. The previous chapter discussed what correctness actually means for a machine-learned model. This chapter focuses on traditional measures of prediction accuracy for machine-learned models and the pitfalls of such an evaluation strategy. The next chapter will then look more broadly at other testing techniques.

We start with an overview of traditional data science measures of prediction accuracy that are usually well covered in any machine learning course and textbook, before looking into the limitations and pitfalls of such an evaluation strategy. …
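To make this concrete, here is a minimal sketch (my illustration, not taken from the chapter) of how such accuracy measures are typically computed on held-out test data with scikit-learn; the synthetic dataset and the decision-tree model are placeholders chosen only for the example.

```python
# Minimal sketch: traditional accuracy measures on held-out test data.
# The synthetic data and the model choice are placeholders for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Stand-in for a real labeled dataset
X, y = make_classification(n_samples=1000, random_state=0)

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```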

What’s wrong with i.i.d. evaluations? How do we identify capabilities? How do we create tests? Could the model generalize to out-of-distribution data?

Triggered by two papers [1, 2], I have recently been reading and thinking a lot about testing of machine-learned models. There seems to be a growing trend to test capabilities of a model in addition to just measuring its average prediction accuracy, which reminds me a lot of unit testing and substantially shifts how we think about model evaluation.

In short, instead of just measuring generalization within a certain target population, we could try to evaluate whether the model learned certain capabilities, often key concepts of the problem, patterns of common mistakes to avoid, or strategies of how humans would…
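As a rough illustration of what such a capability test might look like in practice, here is a sketch in the style of a unit test. It is my own example rather than one from the papers cited above, and it assumes a hypothetical sentiment model exposed through a predict_sentiment function that returns "positive" or "negative"; the capability probed is handling of negation.

```python
# Sketch of a unit-test-style capability test for a hypothetical sentiment model.
# The module and function names are illustrative placeholders.
import unittest
from my_sentiment_model import predict_sentiment  # hypothetical model wrapper

class TestNegationCapability(unittest.TestCase):
    def test_negated_positive_phrases(self):
        # Negating a clearly positive statement should flip the predicted sentiment.
        examples = [
            "The food was not good.",
            "I did not enjoy this movie at all.",
            "This is not what I would call a great product.",
        ]
        for text in examples:
            self.assertEqual(predict_sentiment(text), "negative")

    def test_plain_positive_phrases(self):
        # Sanity check: the un-negated versions should still be positive.
        examples = ["The food was good.", "I enjoyed this movie."]
        for text in examples:
            self.assertEqual(predict_sentiment(text), "positive")

if __name__ == "__main__":
    unittest.main()
```

Unlike an aggregate accuracy number, a failing test like this points at a specific capability the model has not learned, much as a failing unit test points at a specific piece of broken functionality.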

All models are wrong, but some are useful. — George Box

This post is the first in a series of three that correspond to the “model quality” lectures of our Machine Learning in Production course, followed by Measuring Model Accuracy and Testing Models (Beyond Accuracy). A shorter version of these three chapters has previously been published as a blog post. For other topics see the table of contents.

For data scientists, evaluations usually start with measuring prediction accuracy on a held-out dataset. While an accurate model does not guarantee that the overall system performs well in production when interacting with…

“Data cleaning and repairing account for about 60% of the work of data scientists.” — Gil Press. “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says.” Forbes Magazine, 2016.

Machine learning approaches rely on data to train models. In production, user data, possibly enhanced by data from other sources, is fed into the model to make predictions. Production machine learning systems additionally collect telemetry data in production and make decisions based on the collected data. With all that reliance on data, system quality can suffer massively if data is of low quality. Models may be performing poorly…

This post is a summary of the “versioning, provenance, reproducibility” lecture of our Machine Learning in Production course; for other chapters see the table of contents.

In production systems with machine learning components, updates and experiments are frequent. New updates to models may be released every day or every few minutes, and different users may see the results of different models as part of A/B experiments or canary releases (see QA in Production lecture). It may be hard to later figure out which model made a certain prediction or how that model was produced.
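One simple (and admittedly partial) way to make predictions traceable is to record provenance metadata alongside every prediction. The sketch below is my own illustration, not from the lecture; the model identifiers, the data-version label, and the log format are placeholders.

```python
# Sketch: log which model version produced which prediction, so it can be
# traced back later. Identifiers and log format are illustrative placeholders.
import json
import time
import uuid

MODEL_ID = "fraud-detector"
MODEL_VERSION = "2024-03-17T08:00"      # e.g., a training-run or registry identifier
TRAINING_DATA_VERSION = "dataset-v42"   # pointer to the exact data used for training

def predict_and_log(model, features, log_file="prediction_log.jsonl"):
    prediction = model.predict([features])[0]
    record = {
        "prediction_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_id": MODEL_ID,
        "model_version": MODEL_VERSION,
        "training_data_version": TRAINING_DATA_VERSION,
        "features": features,
        "prediction": prediction,
    }
    # Append one JSON record per prediction for later debugging and auditing
    with open(log_file, "a") as f:
        f.write(json.dumps(record, default=str) + "\n")
    return prediction
```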

Motivating Example: Debugging a Prediction Problem

Let’s imagine we are responsible…

After teaching our Machine Learning in Production class (formerly “Software Engineering for AI-Enabled Systems”) four times, we stupidly made a decision we are soon going to regret, with so many chapters still left to write: We are going to write a book with our collected material.

We will eventually release the book publicly under a Creative Commons license. While we work on it, we are releasing individual chapters here, one at a time. We hope that more of the chapters listed below will be linked soon.

Table of Contents

Part 1: ML in Production: Going beyond the model

  • Introduction
  • From model to production system
  • Challenges of production machine learning systems

Part 2: Engineering Production AI Systems

  • Requirements engineering
  • 1. System…

Quality assurance of machine-learning models is difficult. The standard approach is to evaluate prediction accuracy on held-out data, but this is prone to many problems, including testing on data that is not representative of production data, overfitting due to different forms of leakage, and not capturing data drift. Instead, many developers of systems with ML components focus on testing in production. This post is a summary of the “quality assessment in production” lecture of our Machine Learning in Production course; for other chapters see the table of contents.
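To illustrate one small ingredient of such monitoring in production, the sketch below checks whether a numeric feature has drifted away from its training-time distribution using a two-sample Kolmogorov-Smirnov test from scipy; it is my own example rather than the lecture’s, and the threshold, window, and sample values are arbitrary placeholders.

```python
# Sketch: a simple drift check comparing production values of one numeric
# feature against the values seen at training time. Thresholds are placeholders.
from scipy.stats import ks_2samp

def check_feature_drift(training_values, production_values, alpha=0.01):
    """Return True if the production distribution differs significantly."""
    result = ks_2samp(training_values, production_values)
    drifted = result.pvalue < alpha
    if drifted:
        print(f"Possible drift detected (KS statistic={result.statistic:.3f}, p={result.pvalue:.4f})")
    return drifted

# Example usage with made-up numbers: last week's production values vs. training-time values
training_sample = [12.1, 13.4, 11.9, 12.8, 13.1, 12.5, 12.9, 13.0]
production_sample = [15.2, 16.1, 15.8, 14.9, 16.4, 15.5, 15.9, 16.0]
check_feature_drift(training_sample, production_sample)
```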

Testing in Production — A Brief History

Testing in production is not a new idea and has been…

AI Alignment and Safety

Researchers in multiple communities (machine learning, formal methods, programming languages, security, software engineering) have embraced research on model robustness, typically cast as safety or security verification. They make continuous and impressive progress toward better, more scalable, or more accurate techniques to assess robustness and prove theorems, but the many papers I have read in the field essentially never talk about how this would be used to actually make a system safer. I argue that this is an example of the streetlight effect: we focus research on robustness verification because it has well-defined measures to evaluate papers, whereas actual safety and security…

Software engineering education seems to focus a lot on process models (“software development life cycles”) like waterfall, spiral and agile. A number of models also describe the process of developing ML components (“machine learning workflow”). They don’t seem to fit well together, which causes a lot of confusion about the role that software engineering, MLOps and others play in building software systems with ML components, and where the real engineering challenges arise. Far from complete, here is the status of my current thinking… Feedback welcome.

ML Components in Software Systems

Outside of pure science projects, the ML model created with a machine learning framework is…

Christian Kästner

associate professor @ Carnegie Mellon; software engineering, configurations, open source, SE4AI, juggling
