Model Quality: Defining Correctness and Fit

All models are wrong, but some are useful. — George Box

This post is the first in a series of three that correspond to the “model quality” lectures of our Machine Learning in Production course, followed by Measuring Model Accuracy and Testing Models (Beyond Accuracy). A shorter version of these three chapters has previously been published as a blog post. For other topics, see the table of contents.

For data scientists, evaluations usually start with measuring prediction accuracy on a held-out dataset. While an accurate model does not guarantee that the overall system performs well in production when interacting with users and the environment, assuring model quality is an important building block in any quality assurance strategy for production systems with machine-learning components. Even without looking at data quality, system quality, and infrastructure quality (covered in other chapters), analyzing and measuring the quality of a model is challenging, in part because it differs significantly from traditional software testing.

We will proceed in three parts: First, in this chapter, we discuss more fundamentally what correctness or a model bug even means, to highlight the difficulties and differences compared to traditional software testing. In the next chapter, we look at traditional measures of prediction accuracy used in data science and discuss typical pitfalls to avoid. In a third chapter, we discuss how more traditional software testing strategies can be applied for model quality assurance.

Running example: Cancer detection

Let’s consider a system that takes radiology images and information about a patient (e.g., age, gender, medical history, other tests performed) as inputs, to predict whether the patient has cancer. In a nutshell, we are learning a classifier represented as a single learned function:

function hasCancer(image: byte[][], age: int, …): boolean

In practice, such a model would be integrated into a larger system, where it has access to patient records and interacts with physicians who will confirm the final diagnosis. The model would likely support physicians to help them make better decisions, not replace them. Designing such a system so that physicians trust it is extremely challenging, as illustrated well in the recent Hello AI study by Google; the system would likely need to explain how it made a decision and where exactly it suspects cancer. For now, however, let us focus on the binary classification model without considering the rest of the system.

What is correctness?

It is surprisingly difficult to define what it means for a machine-learned model to be “correct” or to be “buggy”. Before we get to machine-learned models, let’s briefly look at how we traditionally evaluate the correctness of software implementations.

In traditional software, we evaluate functional correctness by comparing the software’s behavior against the intended behavior. Ideally, we have a clear specification of the intended behavior that can be used to determine whether a computed output for a given input is correct.

Let’s take a simple, well-understood non-ML example: computing the date of the day following a given day. This is not a particularly challenging problem, but also not entirely trivial given months of different lengths and the rules for leap years in the Gregorian calendar.

/**
 * Given a year, a month (range 1–12), and a day (1–31),
 * the function returns the date of the following calendar day
 * in the Gregorian calendar as a triple of year, month, and day.
 * Throws InvalidInputException for inputs that are not valid dates.
 */
def nextDate(year: Int, month: Int, day: Int) = …

The process of checking an implementation against a specification is known as verification, in contrast to validation, which checks whether the specification or implementation meets the customer’s problem and needs. These two terms are standard in the software engineering literature, but they are also heavily overloaded with different meanings in other contexts.

Verification relates an implementation to a specification. Validation compares the specification (or the corresponding implementation) to the user’s needs.

Testing is the most common approach for verifying that an implementation correctly implements a specification: To test the software, we run the function with provided inputs in a controlled environment and check whether the outputs match the expected outputs, according to the specification. If any test case fails, we have found a bug, assuming the test matched the specification.

@Test
void testNextDate() {
    assert nextDate(2010, 8, 20) == (2010, 8, 21);
    assert nextDate(2024, 7, 15) == (2024, 7, 16);
    assert nextDate(2011, 10, 27) == (2011, 10, 28);
    assert nextDate(2024, 5, 4) == (2024, 5, 5);
    assert nextDate(2013, 8, 27) == (2013, 8, 28);
    assert nextDate(2010, 2, 30) throws InvalidInputException;
}
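For readers who want to run the example, here is a minimal Python sketch of the specification and a test in the same spirit. The names next_date, days_in_month, is_leap_year, and InvalidInputError are our own rendering for illustration and are not part of the pseudocode above; the sketch assumes Python 3.9 or later.

```python
class InvalidInputError(ValueError):
    """Stand-in for InvalidInputException in the specification above."""


def is_leap_year(year: int) -> bool:
    # Gregorian rule: every 4th year, except centuries not divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def days_in_month(year: int, month: int) -> int:
    if month == 2:
        return 29 if is_leap_year(year) else 28
    return 30 if month in (4, 6, 9, 11) else 31


def next_date(year: int, month: int, day: int) -> tuple[int, int, int]:
    """Return the calendar day following (year, month, day)."""
    if not 1 <= month <= 12 or not 1 <= day <= days_in_month(year, month):
        raise InvalidInputError(f"not a valid date: {year}-{month}-{day}")
    if day < days_in_month(year, month):
        return (year, month, day + 1)
    if month < 12:
        return (year, month + 1, 1)
    return (year + 1, 1, 1)


def test_next_date() -> None:
    assert next_date(2010, 8, 20) == (2010, 8, 21)
    assert next_date(2024, 2, 28) == (2024, 2, 29)  # leap year
    assert next_date(2023, 2, 28) == (2023, 3, 1)   # not a leap year
    assert next_date(2023, 12, 31) == (2024, 1, 1)  # year rollover
    try:
        next_date(2010, 2, 30)
    except InvalidInputError:
        pass
    else:
        raise AssertionError("expected InvalidInputError")


test_next_date()
```

Every assertion here is derived from the specification, so any failing assertion would point to a bug in either the implementation or the test.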

We usually cannot test all possible inputs, because there are simply too many. So, while a good testing regime might give us some confidence in correctness, it will not provide guarantees and we may miss certain bugs in testing. There are formal techniques that can prove that a program matches the specification for all inputs, but adoption of formal methods is rare and mostly restricted to a few small, highly critical software components, due to substantial costs. Functional correctness of most software is evaluated through testing.

Importantly, we have strong correctness expectations. Specifications determine which outputs are correct for a given input. We do not judge the output to be “pretty good” or “95% accurate”, nor are we happy with an algorithm that computes the next day “correctly for 95% of all inputs in practice.” We have no tolerance for occasional wrong computations, for approximations, or for nondeterministic outputs unless explicitly allowed by the specification. A single wrong computation of a date would be considered a bug. In practice, developers may decide not to fix certain bugs because doing so is not economical in some settings and instead accept that users cope with a buggy implementation, but we still consider the wrong computation a bug.

Correctness of Models without Specifications

How we evaluate machine-learned models is very different from traditional software testing. Generally speaking, a model is also just an algorithm that takes inputs to compute an output; it’s just that this algorithm has been learned from data rather than manually coded.

/**
* ????
*/
function hasCancer(scan: byte[][]): boolean;

The key issue is that we have no specification that we could use to determine which outputs are correct for any given input. We use machine learning precisely because we do not have such specifications, typically because the problem is too complex to specify or the rules are simply unknown. We have ideas, goals, and intuitions, say “detect whether there is cancer in the image”, which could be seen as implicit specifications, but nothing we can write down as concrete rules as to when an output is correct for any given input.

Importantly, we usually accept that we cannot avoid some wrong predictions. When using machine learning, we cannot and do not expect all outputs to be correct. A single wrong prediction is usually not a reason to reject a model, whereas in traditional testing a single wrong output would be reason enough to reject the implementation. Failing the model due to a single wrong prediction seems wrong:

@Test
def testPatient1() {
    assertEquals(true, hasCancer(loadImage("patient1.jpg")));
}
@Test
def testPatient2() {
    assertEquals(false, hasCancer(loadImage("patient2.jpg")));
}

Overall, we have weak correctness assumptions. We evaluate models with examples, but not against specifications of what the model should do. We accept that mistakes will happen and try to judge the frequency of mistakes and how they may be distributed. A model that’s correct on 90% of all predictions might be pretty good.
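To make the contrast with per-patient assertions concrete, the following Python sketch aggregates over many labeled examples instead of failing on the first mistake. Everything in it is an assumption for illustration: has_cancer is a toy stand-in that thresholds a made-up suspicion score rather than reading a scan, the data is invented, and the 80% threshold is arbitrary.

```python
from typing import Callable, Sequence


def accuracy(model: Callable[[float], bool],
             inputs: Sequence[float],
             labels: Sequence[bool]) -> float:
    """Fraction of inputs for which the model's prediction matches the label."""
    correct = sum(model(x) == y for x, y in zip(inputs, labels))
    return correct / len(labels)


def has_cancer(suspicion_score: float) -> bool:
    # Hypothetical stand-in for a learned model; a real model would take a scan.
    return suspicion_score > 0.5


# Made-up labeled examples, for illustration only.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.6, 0.45, 0.05]
labels = [True, True, True, True, False, False, False, False, False, False]

acc = accuracy(has_cancer, scores, labels)
print(f"accuracy: {acc:.0%}")

# The model is wrong for two patients, yet the evaluation does not "fail":
# we compare the aggregate against a threshold chosen for the problem.
assert acc >= 0.8
```

Whether 80% is good enough is not something the code can answer; that depends on the problem, which is the topic of the next section.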

Evaluating Model Fit, not Correctness (“All Models are Wrong”)

A better way to think about how to evaluate models is to think about models not as correct or wrong or buggy, but about whether they fit a problem. As stated, we do not usually expect a model to always make correct predictions, but we accept a certain level of incorrect outputs. The key question is whether the model’s predictions, in the presence of occasional mistakes, are good enough for practical use.

A good way of thinking about this is the aphorism “All models are wrong, but some are useful”, generally attributed to statistician George Box (see Wikipedia for an extended discussion of the quote):

All models are approximations. Assumptions, whether implied or clearly stated, are never exactly true. All models are wrong, but some models are useful. So the question you need to ask is not “Is the model true?” (it never is) but “Is the model good enough for this particular application?” — George Box

While the quote maps nicely to machine-learned models, it originally referred to modeling in science more broadly, for example, in creating models of real-world phenomena, such as gravitational forces. An illustrative example is Newton’s laws of motion, e.g., “the rate of change of momentum of a body over time is directly proportional to the force applied, and occurs in the same direction as the applied force: F = dp/dt.” These models provide general explanations of observations and can be used to make predictions about how objects will behave. In fact, these laws have been experimentally verified repeatedly over 200 years and have been used for all kinds of practical computations and scientific innovations. However, the laws are known to be approximations that do not generalize to very small scales, very high speeds, or very strong gravitational fields. They cannot explain semiconductors, GPS errors, superconductivity, and many other phenomena, which require general relativity and quantum field theory. So, technically, we could consider Newton’s laws wrong, since we can find plenty of inputs for which they predict the wrong outputs. At the same time, the laws are incredibly useful for making predictions at the scales and speeds of everyday life.
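As a small worked illustration of “wrong but useful”, the following Python sketch compares the Newtonian momentum p = mv with the momentum from special relativity, p = γmv with γ = 1/√(1 − v²/c²), at an everyday speed and at 90% of the speed of light. The function names and the two chosen speeds are our own assumptions; the relativistic formula serves here only as a more accurate reference model, not as ground truth.

```python
import math

C = 299_792_458.0  # speed of light in m/s (exact by definition)


def newtonian_momentum(mass_kg: float, v: float) -> float:
    return mass_kg * v


def relativistic_momentum(mass_kg: float, v: float) -> float:
    gamma = 1.0 / math.sqrt(1.0 - (v / C) ** 2)  # Lorentz factor
    return gamma * mass_kg * v


def relative_error(mass_kg: float, v: float) -> float:
    """How far the Newtonian prediction is from the relativistic reference."""
    reference = relativistic_momentum(mass_kg, v)
    return abs(reference - newtonian_momentum(mass_kg, v)) / reference


for label, v in [("car at 30 m/s", 30.0), ("particle at 0.9 c", 0.9 * C)]:
    print(f"{label}: relative error {relative_error(1.0, v):.2e}")
```

At everyday speeds the discrepancy is far below anything we could measure in practice, whereas at 0.9 c the Newtonian prediction is off by more than half. The same model is a good fit for one problem and a poor fit for another.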

Similarly, with machine-learned models, we do not expect to learn models that are “correct” in the sense that they always produce the correct prediction, but we are looking for models that are useful for a given problem, that are mostly making right predictions for inputs that matter.

This still raises the problem of how to evaluate usefulness or fit. Clearly, a model fits better when it makes more correct predictions; such a model is likely more useful. But what level of fit is sufficient? How many and what kind of wrong predictions can we accept? This clearly depends on the problem.

Another George Box quote also points us toward how to think about evaluating fit:

“Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.” — George Box, 1976

That is, a model must always be evaluated in the context of the problem we are trying to solve. We must also realize that not all mistakes are equal and some are much more problematic than others. If we could understand under which situations we could rely on a model (e.g., not to use Newton’s laws near the speed of light, not to use the cancer detector on children), we might be able to trust the model more.
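A minimal Python sketch of why not all mistakes are equal: two hypothetical models make the same number of mistakes on the same ten patients, but one raises false alarms while the other misses actual cancer cases. The labels, predictions, and cost weights below are made up for illustration; realistic costs would have to come from the problem domain.

```python
from collections import Counter
from typing import Sequence


def error_counts(predictions: Sequence[bool], labels: Sequence[bool]) -> Counter:
    """Count false positives and false negatives separately."""
    counts: Counter = Counter()
    for predicted, actual in zip(predictions, labels):
        if predicted and not actual:
            counts["false_positive"] += 1
        elif actual and not predicted:
            counts["false_negative"] += 1
    return counts


def total_cost(predictions: Sequence[bool], labels: Sequence[bool],
               fp_cost: float, fn_cost: float) -> float:
    counts = error_counts(predictions, labels)
    return counts["false_positive"] * fp_cost + counts["false_negative"] * fn_cost


# Made-up ground truth: 3 patients with cancer, 7 without.
labels = [True, True, True, False, False, False, False, False, False, False]
# Model A: two false alarms. Model B: two missed cancer cases.
model_a = [True, True, True, True, True, False, False, False, False, False]
model_b = [True, False, False, False, False, False, False, False, False, False]

# Hypothetical weights: a missed cancer is assumed 20x as costly as a false alarm.
for name, preds in [("A", model_a), ("B", model_b)]:
    cost = total_cost(preds, labels, fp_cost=1.0, fn_cost=20.0)
    print(f"model {name}: errors {dict(error_counts(preds, labels))}, cost {cost}")
```

Both models are wrong on two of ten patients, so accuracy alone cannot tell them apart; only the context of the problem tells us which mistakes are the tigers.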

Deductive vs. Inductive Reasoning

Another framework that can help us clarify the thinking about evaluating model quality is the distinction between deductive and inductive reasoning.

(Daniel Miessler, CC SA 2.0)

Classic programming and quality assurance are deeply rooted in deductive reasoning. Deductive reasoning is the form of reasoning that combines logical statements following agreed-upon rules to form new statements. It allows us to prove theorems from axioms. It allows us to reason from the general to the particular. It is a kind of mathematical reasoning, familiar from formal methods and classic rule-based AI expert systems; the kind of reasoning we use for analyzing program behavior; the kind of formality we use to write specifications and determine whether a program violates them.

In contrast, machine learning exploits inductive reasoning, where we try to generalize from observations. Overall, we reason from particular observations to general rules. Instead of defining rules and axioms upfront, we try to infer them from data. Strong evidence in the data suggests a rule but does not guarantee that the rule will hold generally. This is the kind of scientific reasoning familiar from the natural sciences, where we try to discover the laws of nature, often with the tools of statistics.

The inductive reasoning in machine learning is why a strict notion of correctness is so hard to come by, and it is why we should reason about fit and usefulness instead.
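The difference can be sketched with the leap-year rule from the date example. Deductively, we would write down the Gregorian rule and apply it. Inductively, in the Python sketch below, we only observe which years between 1901 and 2099 were leap years and pick the best-fitting rule from a deliberately tiny, hypothetical family of rules of the form “year is divisible by k”. The function fit_divisor_rule and this rule family are our own illustration, not an established learning algorithm.

```python
import calendar

# Observations: (year, is_leap) pairs, labeled with the true Gregorian rule.
observations = [(year, calendar.isleap(year)) for year in range(1901, 2100)]


def fit_divisor_rule(data):
    """Pick the divisor k for which 'year % k == 0' agrees with most observations."""
    def agreement(k):
        matches = sum((year % k == 0) == is_leap for year, is_leap in data)
        return matches / len(data)

    best_k = max(range(2, 13), key=agreement)
    return best_k, agreement(best_k)


k, fit = fit_divisor_rule(observations)
print(f"learned rule: leap year iff year % {k} == 0 (agreement {fit:.0%})")

# The induced rule matches every observation, yet fails outside the observed range.
print("learned rule says 2100 is a leap year:", 2100 % k == 0)
print("Gregorian calendar says 2100 is a leap year:", calendar.isleap(2100))
```

The learned rule fits all 199 observations, which is strong evidence but not a proof: the century exception never shows up in the data, so the rule is wrong for the year 2100 while remaining perfectly useful for dates in the observed range.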

Verification vs. Validation

As discussed above, software testing is a verification approach that compares an implementation against a specification. In traditional software engineering, verification is only part of the story: Validation is the other part of understanding whether the specification (or the software that is verified to meet the specification) fits the problem.

It is possible to write a program that perfectly meets a specification (i.e., all tests pass, maybe it’s even formally verified), but that is entirely useless when the specification does not match the problem. For example, implementing the nextDate function for the Julian calendar may be entirely correct according to the specification of that calendar, but if we want to use it for date calculations for dates in the Gregorian calendar (different primarily in the handling of leap years), it is not particularly useful. Validation problems are often more subtle, where the specified function does not fully meet the user’s needs or ignores important assumptions about how the system is used. For example, the flight software of the Ariane 5 rocket was implemented correctly as specified but did not account for the higher speed of the rocket that pushed software inputs beyond the expected and safe boundaries, leading to the accident at the 1996 launch that resulted in the destruction of the rocket.

Validation problems are typically in the realm of requirements engineering: We need to find out what the needs of the users are, as well as environmental interactions, and potential legal and safety constraints, in order to then specify a solution that best fits the needs. Users often have conflicting needs and a requirements engineer will have to resolve them and come up with a specification that best fits the needs of most users (or of the important users) and that also meets other constraints. Validation is essentially about judging the fit of a system for a problem.

Interestingly, machine learning plays a role very similar to that of requirements engineering in traditional software engineering. Both perform inductive reasoning: Requirements engineers identify the specification that best matches a user’s needs from interviews and other inputs, and machine learning identifies rules that best fit the data. We might check the learned model against other constraints that we may have, such as fairness constraints and various invariants (see below), but the main part of evaluating a model is checking how well it fits the problem, not whether the model meets a specification. In this context, questions like whether the training data was representative, sufficient, and of high quality become really important.

A possible interpretation here is to see the machine-learned model as the specification that is derived from data. Going from the model to an implementation is typically straightforward as the corresponding code is generated and we do not really need to be concerned about implementation problems that our traditional verification tools could find. Instead, we should focus on the validation challenge of ensuring that the model fits the problem and possibly the challenge of whether the model and other specifications are compatible or consistent. One of the essential new challenges, compared to traditional requirements engineering, is that most machine-learned models are not easily interpretable and way too complex for humans to easily judge whether they fit a problem.

Note that we might be worried about whether the machine-learning algorithm that learns a model is implemented correctly. Indeed many bugs have been found in machine learning libraries (just like how many bugs have been found in compilers). The machine-learning algorithms are typically well-specified and implemented manually; though clearly far from trivial, especially with nondeterminism in many learning algorithms, standard software testing techniques can be used to verify their correctness. In this book, we do not focus on testing the machine learning algorithm, but on evaluating the models they produce.

On Terminology: Bugs, Correctness, and Performance

We recommend avoiding the term model bug and avoiding talking about the correctness of a model, because these terms bring too much baggage from traditional software testing and deductive reasoning that does not map easily onto evaluating models, given that a single wrong prediction is not usually a problem in isolation. Instead, terms that refer to the fit or accuracy of a model are better suited.

Machine learning practitioners usually speak of the “performance of a model” when they refer to prediction accuracy. For example, “one model performs better than another” indicates that the first model produces more accurate predictions, on average, according to some evaluation.

We avoid the term “performance” in this book, because it has very different meanings in different communities and can cause confusion when used casually in interdisciplinary collaborations. Most software engineers think about the execution time of an algorithm when talking about performance. Other communities attach yet other meanings to the term, such as performance in art, job performance and company performance in business, or performance test (bar exam) in law.

We deliberately use the term prediction accuracy to refer to how well a model does when evaluated on test data, and time or latency when speaking about durations, such as learning time or inference latency.

Aside: Performance testing (execution time). It may be appealing to compare model evaluation to testing the execution time of an algorithm in classical software engineering. Tests of execution time usually do not come with hard specifications, but mostly with partial specifications of upper bounds. More commonly, we add only regression tests, ensuring that execution time does not degrade as the system evolves, or we simply benchmark alternative implementations without having a specific target. Timing behavior is not usually deterministic, hence we often aggregate over repeated executions (e.g., “90% of all executions shall terminate in 1s”).

@Test(timeout=100)
def testCompute() {
    expensiveComputation(…);
}

While there are some similarities in the absence of hard specifications, the analogy to evaluating execution time is not a good fit though, because the sources of uncertainty differ between a software’s execution time and a machine-learned model’s accuracy. In testing execution times, we need to manage nondeterminism within the program or environment, whereas most machine-learned models are deterministic in their predictions (i.e., the learning algorithm is often non-deterministic, but most models are deterministic). Hence, in timing evaluations, we repeatedly execute the same program with the same inputs, whereas in machine learning we average across predictions of different inputs, not repeated predictions for the same input.
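The repeated-execution style of a timing evaluation can be sketched in Python as follows. The workload expensive_computation, the number of runs, and the input size are hypothetical stand-ins chosen for illustration; the 1-second bound mirrors the example requirement quoted in the aside.

```python
import statistics
import time


def expensive_computation(n: int) -> int:
    # Hypothetical workload standing in for expensiveComputation above.
    return sum(i * i for i in range(n))


def measure_latencies(runs: int, n: int) -> list[float]:
    """Run the same computation on the same input repeatedly and time each run."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        expensive_computation(n)
        latencies.append(time.perf_counter() - start)
    return latencies


latencies = measure_latencies(runs=50, n=10_000)
# quantiles(n=10) returns the 9 decile cut points; the last one is the 90th percentile.
p90 = statistics.quantiles(latencies, n=10)[-1]
print(f"90th percentile latency: {p90 * 1000:.3f} ms")

# "90% of all executions shall terminate in 1s"
assert p90 < 1.0
```

Note that the loop calls the function with the same input every time, because the variation comes from the environment. A model accuracy evaluation would instead loop over many different inputs and call the (typically deterministic) model once per input.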

Summary

In the absence of specifications, model quality always refers to fit, not correctness. We do not reject an entire model because we disagree with a single prediction; instead, we accept that some predictions will be wrong and rather try to quantify how often a model is wrong for data in our problem. That is, an evaluation will determine whether a model fits a problem and is useful for solving a problem. This is a significant difference from how software is traditionally tested and is much more similar to software validation activities like acceptance and user testing.

In the next chapter, we’ll look at traditional accuracy evaluation strategies that aim to measure fit, and their pitfalls, before we look into how traditional software testing techniques can be adopted for a machine learning context.

Readings and References

Like all chapters, this text is released under a Creative Commons 4.0 BY-SA license.
