Model Quality: Slicing, Capabilities, Invariants, and other Testing Strategies

Christian Kästner
34 min read · Mar 25, 2021

This post contains the third and final chapter in a series on the “model quality” lectures of our Machine Learning in Production course. The previous chapters discussed what correctness actually means for a machine-learned model (link) and how data scientists traditionally use various accuracy measures to evaluate model quality on test data (link). This chapter focuses on more advanced testing strategies, many developed only recently and inspired by testing techniques for traditional software. For other topics see the table of contents. A shorter version of this content was previously published as a blog post, and a longer exploration of capability testing appeared in another blog post.

For data scientists, evaluations usually start with measuring prediction accuracy on a held-out dataset, but many other testing techniques can be applied to provide a more nuanced view of quality, to assure specific important invariants, to evaluate quality on different subsets of the target population, or even to ensure that the model has learned specific capabilities we intend it to learn, which might help it generalize better to out-of-distribution data. In this chapter we focus on two broad themes: (1) curating validation datasets through slicing and generators and (2) automated testing with invariants and simulation. We conclude with a brief discussion of adequacy criteria (or the lack thereof).

Curating Validation Sets

As discussed in the first model quality chapter, model quality does not match the traditional view of software verification, where we test whether an implementation meets a specification; instead, we are measuring how well a model fits a problem. Nonetheless, there are many parallels to software testing that provide insights and strategies for evaluating fit beyond just standard prediction accuracy measures on a single test set. Here we focus on how to (manually) curate test data, before we look at automated testing in the next section.

Motivating example

As a motivating example, let’s go back to our nextDate function from the previous chapter that computes the date of the day following a given date. The problem is well specified, even if we may not have a written formal specification at hand.

def nextDate(year: Int, month: Int, day: Int) = …

A developer tasked with testing this function would now create a series of test cases by providing inputs and expected outputs. For example:

assert nextDate(2021, 2, 8) == (2021, 2, 9);

We could certainly not provide tests for all possible inputs, so testers need to select which inputs to try to gain confidence in the correct implementation of the function. If this was a machine-learned model, we would randomly sample from the distribution of relevant inputs. For example, we could randomly pick 500 dates between January 1st, 1900 and December 31st, 2099. When manually writing tests, software testers usually do not pick random inputs though, but purposefully select inputs to test different facets of the program.

Typically a tester would analyze the specification and create tests to gain confidence in different parts of the specification. For example, on top of testing a simple example like the one above about correctly incrementing the day within a month, we may test the transition from the last day of a month to the first day of the following month, the transition from the last day of the year to the next, and the handling of leap-year exceptions. We might also try invalid inputs, like negative numbers for days, or try unusual years very far in the past or future. If we are suspicious about parts of the implementation, we might even test potential anticipated problems (e.g., the Year 2038 problem). We might create one or a few test cases for each part of the specification we consider, but we likely won’t test every single year or every single day of every month and won’t just pick inputs randomly.
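
For illustration, here is a minimal pytest-style sketch of such purposefully selected tests, assuming a hypothetical Python implementation next_date(year, month, day) in a made-up module that returns a (year, month, day) tuple and raises ValueError for invalid inputs:

import pytest
from mycalendar import next_date  # hypothetical module under test

def test_day_increment_within_month():
    assert next_date(2021, 2, 8) == (2021, 2, 9)

def test_month_transition():
    assert next_date(2021, 4, 30) == (2021, 5, 1)

def test_year_transition():
    assert next_date(2021, 12, 31) == (2022, 1, 1)

def test_leap_year_handling():
    assert next_date(2020, 2, 28) == (2020, 2, 29)  # 2020 is a leap year
    assert next_date(2100, 2, 28) == (2100, 3, 1)   # 2100 is not a leap year

def test_invalid_inputs_rejected():
    with pytest.raises(ValueError):
        next_date(2021, 13, 1)
    with pytest.raises(ValueError):
        next_date(2021, 2, -1)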

Many strategies for creating test cases effectively have been researched and codified in the software engineering community, some of which also are useful in a machine learning context.

Not every input is equal

Traditional accuracy evaluation in machine learning is done with a single dataset containing many samples drawn from the same distribution as the training set, which hopefully matches the distribution in production. However, not all inputs in that distribution may be equally important for the problem we are trying to solve and not all mistakes are equally costly. In addition, the way that many datasets are constructed, not all concerns or groups may be equally represented.

This idea that not all mistakes of a model are equally important is also reflected in the George Box quote mentioned earlier:

“Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.” — George Box, 1976

Important and less important mistakes: Consider the case of a voice assistant. Some commands are used far more commonly, such as “What’s the weather tomorrow” or “Call mom”, whereas mistakes on less common phrases, say “Add asafetida to my shopping list”, might be more forgivable. Thus, instead of treating all inputs as equal, we should identify important use cases that may be worth testing separately and make sure that a model update does not break them. In the cancer detection scenario, we might expect that certain kinds of cancer are more important to detect than other, more treatable kinds.

Fairness and systematic mistakes: A vast amount of literature on fairness in machine learning (see also fairness chapter) has pointed out bias in models and systems. In the same voice assistant setting, systems make many more mistakes when speakers are from a minority group (e.g., reported as 35% more misidentified words from people who were black; see New York Times article). In the cancer detection example, accuracy may severely differ across gender, age, and ethnic groups. Minorities are often underrepresented in the training and test data (often even more so than they are underrepresented in the user population in production) and such disparities may not even be noticed.

In general, systematic mistakes of a model are problematic even if they affect only a few users: even if the system works well on average, a model that does not just make random mistakes affecting everybody equally, but regularly makes certain kinds of mistakes on certain kinds of inputs, can be highly problematic. For example, the voice assistant may not detect most Asian cooking ingredients, a smart door camera may not detect short people, a spam filter may systematically delete security alerts, or the cancer detection system may miss certain classes of rare cancers or may not work at all for patients from certain minority populations. In all these cases, mistakes happen frequently for subpopulations but may be barely noticeable in an overall evaluation across a large randomly drawn dataset. We may not even know about these cases until we observe them in production, because they were simply not part of our test data.

Setting goals: Finally, we may know that certain classes of predictions are particularly challenging and we do not expect to perform well on them now, but hope to improve in the future. In this case, it may be useful to collect specific test sets to evaluate stretch goals. For example, we might track voice recognition quality on particularly noisy audio or track cancer detection accuracy for hard-to-detect early-stage cancer — even though we accept low accuracy on those results for now, it is still useful to track improvement on these difficult tasks.

Input slicing

After realizing that not all inputs are equal, the question is how to test for mistakes on certain kinds of inputs. One way is to identify subpopulations of interest. Importantly, we track a group of inputs for a subpopulation (drawn from a select part of the target distribution) as a test set, rather than a single test input as in traditional testing and rather than averaging over all test data as in traditional accuracy evaluations. For each test set, we can then measure accuracy separately with a suitable accuracy metric.

The question now is how to identify and curate suitable test sets.

A good way to approach looking for important subpopulations and imbalances between subpopulations is to slice (or partition) the entire input distribution and curate samples for every partition. If necessary, we may need to collect more data for certain partitions than we usually would if we just randomly sampled from the target population.

For example, in cancer prediction, we might slice the test data of patients by gender, age group, and ethnicity and evaluate accuracy for each subpopulation separately. For voice assistants, we might slice by gender, dialect, or problem domain (e.g., shopping vs weather vs scheduling). We can also look at the interaction of multiple slicing criteria, such as evaluating the accuracy of cancer detection for young Hispanic women.

The criteria used for slicing will differ depending on the use case. For fairness, we will usually slice the data by demographic attributes of the user like gender, ethnicity, or age group. We can also use aspects of the problem or business criteria, such as the kind of cancer to be detected or the cost of a false prediction.
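
As a minimal sketch (with made-up column names and a scikit-learn-style model interface), per-slice accuracy can be computed by grouping an existing labeled test set, for example with pandas:

import pandas as pd
from sklearn.metrics import accuracy_score

def sliced_accuracy(model, test_df: pd.DataFrame, feature_cols, slice_col):
    """Return accuracy and number of examples for each value of the slicing column."""
    results = {}
    for value, subset in test_df.groupby(slice_col):
        predictions = model.predict(subset[feature_cols])
        results[value] = (accuracy_score(subset["label"], predictions), len(subset))
    return results

# hypothetical usage:
# for group, (acc, n) in sliced_accuracy(cancer_model, test_data,
#                                        feature_cols, "age_group").items():
#     print(f"{group}: accuracy {acc:.3f} on {n} examples")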

The mere act of slicing the test data and looking at results for subpopulations will often help to identify problems and might encourage a team to plan mitigations, such as anticipating potential problems, collecting more training and test data for neglected subpopulations, or adjusting how the system uses predictions for these subpopulations to compensate for the reduced confidence (e.g., prompt the user rather than automate the action).

For example, Barash et al. at IBM explored how a sentiment analysis classifier works on movie reviews from IMDB and found that predictions for reviews of older movies were less accurate, but also that there was less support for that accuracy measure because the test data contained only a few such reviews (paper). This could be a useful starting point to investigate the causes of the accuracy differences and to encourage curating more test (and training) data for less supported groups if they are considered important. They similarly show how this approach can be extended to interactions of multiple slices, such as movie genre, rating category, and review length.

Accuracy results from a sentiment analysis tool for movie reviews, sliced by decade of the release of the movie that was reviewed. From Barash, Guy, Eitan Farchi, Ilan Jayaraman, Orna Raz, Rachel Tzoref-Brill, and Marcel Zalmanovici. “Bridging the gap between ML solutions and their business requirements using feature interactions.” In Proc. Symposium on the Foundations of Software Engineering, pp. 1048–1058. 2019.
Accuracy results from a sentiment analysis tool for movie reviews, showing results for the interactions among partitions for genre, rating, and length. From Barash, Guy, Eitan Farchi, Ilan Jayaraman, Orna Raz, Rachel Tzoref-Brill, and Marcel Zalmanovici. “Bridging the gap between ML solutions and their business requirements using feature interactions.” In Proc. Symposium on the Foundations of Software Engineering, pp. 1048–1058. 2019.

Conceptually, this slicing of the input space is very similar to the traditional testing strategy of equivalence class testing. When using this technique, one analyzes the specification of the program to group possible inputs into equivalence classes along multiple dimensions. By analyzing the specification and domain knowledge around the nextDate example, one might group the possible values for the day into eight groups: negative numbers (e.g., -1), 0, 1–27, 28, 29, 30, 31, and numbers larger than 31. Furthermore, one might group years into leap years and non-leap years and into pre-2038 and post-2038 years, group months by length, and so forth. Generally, equivalence classes can be derived from specifications in different ways: by tasks, by input ranges, by error conditions, by expected faults, and so forth. In traditional software testing, a tester then creates a test case for each group (known as weak equivalence class testing) or for each interaction of groups (known as strong equivalence class testing). In a machine learning context, instead of creating a single test for each, we would create a test set with multiple labeled inputs for each slice to measure prediction accuracy in each slice.
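
For illustration, a small sketch of deriving test inputs from such equivalence classes with one representative value per class; weak equivalence class testing covers each class at least once, whereas strong equivalence class testing covers every combination of classes (the representative values here are chosen arbitrarily):

from itertools import product

# one representative value per equivalence class, per input dimension
day_classes = [-1, 0, 15, 28, 29, 30, 31, 32]   # negative, zero, 1-27, 28, 29, 30, 31, >31
month_classes = [2, 4, 12]                      # February, a 30-day month, a 31-day month
year_classes = [2020, 2021, 2037, 2039]         # leap, non-leap, pre-2038, post-2038

# weak equivalence class testing: cover each class at least once
weak_inputs = ([(2021, 1, d) for d in day_classes]
               + [(2021, m, 15) for m in month_classes]
               + [(y, 6, 15) for y in year_classes])

# strong equivalence class testing: cover every combination of classes
strong_inputs = [(y, m, d) for y, m, d in product(year_classes, month_classes, day_classes)]

# In a machine-learning context, each slice would instead get a whole set of labeled inputs.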

In addition, combinatorial testing and decision tables are common techniques to explore interactions among inputs in test data that may provide some inspiration on how to design test data sets without exploring all combinations of all slices.

Real-world application: The Overton system at Apple. Apple internally uses a system called Overton that focuses a data scientist’s attention on improving model accuracy for different slices of the input space, one slice at a time. It provides data scientists with ways to programmatically slice data based on different criteria and creates separate accuracy reports for each slice. Data scientists would then typically focus on improving the model for one slice at a time. For example, they could investigate particularly challenging or underperforming tasks in voice assistants, typically by improving training data for that slice (these systems rely on weak supervision, so training data is primarily improved by adding more labeling functions to label more production data for training). The data scientists could write a slicing function to detect “nutrition-related queries” or “queries with complex disambiguations” among the inputs. On finding slices with low accuracy results, they might then invest effort in providing better labeled training data for that slice. That is, this approach allows the data scientists to focus on the more tractable problem of improving accuracy for individual slices of the user inputs, rather than trying to improve the model for the entire input space at once.
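
Overton’s internal interfaces are not the point here, but conceptually a slicing function is just a predicate over inputs; a sketch of that idea (with made-up heuristics and a hypothetical single-input model interface) might look as follows:

# Hypothetical slicing functions: predicates that decide whether an input belongs to a slice.
def nutrition_related(query: str) -> bool:
    return any(term in query.lower() for term in ("calories", "protein", "vitamin", "nutrition"))

def very_short_query(query: str) -> bool:
    return len(query.split()) <= 3

SLICING_FUNCTIONS = {"nutrition": nutrition_related, "short queries": very_short_query}

def report_per_slice_accuracy(model, labeled_examples):
    """labeled_examples: list of (query, expected_label) pairs."""
    for name, in_slice in SLICING_FUNCTIONS.items():
        subset = [(q, y) for q, y in labeled_examples if in_slice(q)]
        if not subset:
            continue
        correct = sum(model.predict(q) == y for q, y in subset)
        print(f"{name}: {correct / len(subset):.2%} accuracy on {len(subset)} examples")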

Overview of the Overton system from the paper Ré, Christopher, Feng Niu, Pallavi Gudipati, and Charles Srisuwananukorn. “Overton: A Data System for Monitoring and Improving Machine-Learned Products.” arXiv preprint arXiv:1909.05372 (2019).

Testing Model Capabilities

Another strategy to curate test data that recently received increasing attention in the machine learning community, especially in natural language processing, is to test specific capabilities of a model. Rather than sampling test data from the target distribution, the idea is to identify capabilities of the model and then curate data to specifically test that capability. This is essentially a form of unit testing, where test data sets are created to test specific aspects of the model.

In the context of evaluating models, capabilities refer to tasks or concepts that we expect a model to learn as part of a larger task. For example, in a sentiment analysis task, we would expect a model to be able to handle negation (e.g., “the food was great” vs “the food was not great”) or learn to understand that the author’s sentiment is more important than that of other actors in the sentence (e.g., “Tom liked the food, but I thought it was really bad”). We might expect our cancer detector not to be distracted by certain dark patterns in scans that come from known technical issues of the scanning equipment, or we might expect it not to depend on the darkness or contrast of the picture overall but on the shapes of darker areas as physicians do. We would expect that inputs that require these capabilities would naturally occur in the target distribution, but by teasing them apart and specifically testing capabilities, we can focus on understanding whether the model learns specific concepts we might be expecting it to learn.

Once we identify the capability of interest, we can curate suitable test data. If enough test data is available, we might simply slice it from the general pool of test data as discussed above. More commonly, we will specifically curate data to test a capability. For example, we could generate lots of sentences with negations of positive sentences and negations of negative sentences and maybe even sentences with double negations to test the model’s capability of understanding negation. Similarly, we can specifically create sentences where the author and other subjects have differing sentiments to test whether the model correctly learned the concept of prioritizing the author. We can also create variations of scans in the cancer detection scenario by adding artificial noise to parts of the scan or adjusting the brightness. Again, we are not looking for a single incorrect prediction, but we want to curate test sets for each capability consisting of multiple inputs — especially those inputs that are challenging to get right without having learned the target capability.

Researchers have demonstrated success in testing certain capabilities by automatically generating test data and corresponding expected labels with more or less sophisticated generators. For example, we can simply insert negations in existing negative, positive, or neutral sentences following standard grammatical rules and expect the sentiment to flip. Alternatively, we can instruct crowd workers to create test sentences, for example by creating new sentences or modifying existing ones in specific ways (e.g., “starting with the sentence above explaining somebody’s opinion of a movie, extend the sentence to express that you disagree and have the opposite opinion”). Which approaches are feasible, and to what degree they can be automated, depends on the prediction problem and the capability to be tested.
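
As a simple illustration of such a generator, the following sketch produces labeled test sentences for the negation capability of a sentiment model, where the expected label is known by construction (the templates, lexicon, and single-input model interface are made up for illustration):

from itertools import product

subjects = ["The food", "The service", "This movie"]
positive_adj = ["great", "excellent", "wonderful"]
negative_adj = ["terrible", "awful", "disappointing"]

def generate_negation_tests():
    """Yield (sentence, expected_sentiment) pairs; 1 = positive, 0 = negative."""
    for subj, adj in product(subjects, positive_adj):
        yield f"{subj} was {adj}.", 1       # plain positive sentence
        yield f"{subj} was not {adj}.", 0   # negation flips the sentiment
    for subj, adj in product(subjects, negative_adj):
        yield f"{subj} was {adj}.", 0
        yield f"{subj} was not {adj}.", 1

def capability_accuracy(model, tests):
    tests = list(tests)
    correct = sum(model.predict(sentence) == label for sentence, label in tests)
    return correct / len(tests)

# capability_accuracy(sentiment_model, generate_negation_tests())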

Identifying capabilities. The key challenge with testing capabilities of a machine learning model is to identify the relevant capabilities. Here, we are back to the problem of not having clear specifications. Still, for many problems we may have a good conceptual understanding of ingredients that are likely needed to make a model perform a task similar to humans. For example, in their ACL 2020 paper, Ribeiro et al. mention, among others, the following capabilities for a commercial sentiment analysis tool:

  • Vocabulary+POS: The model has the necessary vocabulary and can handle words in different parts of the sentence correctly.
  • Sentiment: The model understands the most common words that carry positive and negative sentiment.
  • Taxonomy: The model correctly handles synonyms, antonyms, etc.
  • Robustness to typos: The model is not thrown off by simple typos like swapping two characters, e.g., “no thakns”
  • Irrelevance: The model correctly ignores irrelevant parts of sentences, such as URLs.
  • Named entities: The model correctly identifies and handles named entities; for example, that switching two locations or people shall not affect the sentiment of the overall sentence, e.g., “I miss the #nerdbird in San Jose” vs “I miss the #nerdbird in Denver”
  • Semantic role labeling: The model recognizes the author’s sentiment as more important than that of others, e.g. “Some people think you are excellent, but I think you are nasty”
  • Fairness: The model does not associate sentiment just based on gendered parts of sentences
  • Ordering: The model understands the order of events and its relevance to the sentiment of the sentence; for example, it should detect statements about past and current opinions and report the current one as the sentiment, e.g., “I used to hate this airline, although now I like it.”
  • Negation: The model correctly detects how negation in a sentence affects sentiment, e.g., “It isn’t a lousy customer service”

To identify capabilities, it may be useful to understand key aspects of the task and how humans would perform it. For example, for the list above, a good understanding of linguistics basics is likely helpful. Interviewing physicians about how they identify cancer, especially in cases where models are confused, might reveal important capabilities humans use for the task. In addition, identifying common patterns in faulty predictions can help to understand the kinds of mistakes a model commonly makes (e.g., misunderstanding double negations, shortcut reasoning) and to identify capabilities that a model should learn to avoid these kinds of mistakes.

Comparing capabilities again to traditional unit tests, some capability tests check minimal functionality, like a simple test establishing that the main functionality indeed works, while others test specific challenging parts of the problem, like handling expected corner cases. One can even test anti-patterns of shortcut behavior the model should not exhibit, once they are identified as a common source of wrong predictions. Traditional test creation techniques of analyzing parts of the specification may provide inspiration, even when we do not have a clear specification of the problem (e.g., analyzing theories of how humans approach the problem or analyzing common mistakes). For example, boundary value analysis is a classic testing technique that analyzes the boundaries of the specified behavior and creates test cases specifically at those boundaries (e.g., beginning and end of the month for nextDate) or beyond the boundaries (e.g., 0 or 13 for months in nextDate). These methods might provide inspiration for how to look for corner cases and challenging concepts to test as capabilities in ML.

Generalizing beyond the training distribution? As discussed above, we cannot typically expect a model to generalize beyond the distribution of the training data, even though this is something we often would like to do, to be more robust to data drift and malicious inputs.

An interesting potential benefit of testing capabilities is that it may help us to select models that generalize better to out-of-distribution data. The idea, explored at length in a paper by D’Amour et al., is that among the many possible models that achieve similar accuracy in a traditional evaluation on test data from the same distribution as the training data, those models that have learned the intended capabilities will handle out-of-distribution data more robustly. The intuition is that those capabilities mirror human concepts or important principles, which humans have identified as important to the general problem; hence, these capabilities may generalize better than whatever other rules the machine learning algorithm might infer during training. This justifies curating test data that does not correspond to the training distribution, but instead relates to specific capabilities we expect to be important for predictions beyond the current target population.

Technically, we may not care in most scenarios whether the model’s learned internal reasoning is similar to that of a human. Often machine-learned models may outperform humans because they pick up on signals that humans would not see. In cases like label leakage or biased training and test data (see previous chapter) these shortcuts are not useful in production, but many other detected patterns in the data might provide truly useful insights that a human would miss. Testing for capabilities helps us to distinguish models that learn structures closer to the human concepts that we encode in those tests. Capabilities can be seen as partial specifications of what a model should learn for a problem; they are a way of encoding domain knowledge. Whether focusing on capabilities will indeed help us to find models that generalize better to out-of-distribution data remains to be seen, but it is a promising and plausible direction to shape the structures to be learned by a model.

Automated (Random) Testing and Invariants

Generating random inputs to test a program is easy, and there is much software engineering research on fuzz testing and other test case generation techniques, which have had lots of success in finding certain kinds of bugs in many programs. For example, CSmith has found hundreds of compiler bugs by feeding random programs into compilers. The key challenge with all test-case generation is identifying when a generated input has actually identified faulty behavior.

Generating inputs is easy. For example, for the nextDate function, we could simply generate lots of random inputs by picking random integers for the day, month, and year:

@Test
void testNextDate() {
// randomly generated inputs; note that no expected outputs are specified
nextDate(488867101, 1448338253, -997372169);
nextDate(2105943235, 1952752454, 302127018);
nextDate(1710531330, -127789508, 1325394033);
nextDate(-1512900479, -439066240, 889256112);
nextDate(1853057333, 1794684858, 1709074700);
nextDate(-1421091610, 151976321, 1490975862);
nextDate(-2002947810, 680830113, -1482415172);
nextDate(-1907427993, 1003016151, -2120265967);
}

Randomly generated tests are not necessarily representative of real test data and we might mostly only hit the input validation part of the program rather than its actual functionality. However, we can generate tests randomly from a more realistic distribution of inputs by, for example, randomly drawing year, month, and day separately from smaller ranges of numbers:

@Test
void testNextDate() {
// random but more realistic inputs drawn from plausible ranges
// (may still include invalid dates, such as February 30)
nextDate(2010, 8, 20);
nextDate(2024, 7, 15);
nextDate(2011, 10, 27);
nextDate(2024, 5, 4);
nextDate(2013, 8, 27);
nextDate(2010, 2, 30);
}

Similarly, we can also generate random inputs for a machine learning model. These inputs can be entirely random or are generated to be more realistic for the target distribution:

@Test
void testCancerPrediction() {
// predictions on randomly generated images; expected labels are unknown
cancerModel.predict(generateRandomImage());
cancerModel.predict(generateRandomImage());
cancerModel.predict(generateRandomImage());
}

The challenging step in this approach is not the task of generating the inputs, but knowing what to expect for given test inputs. Is the nextDate computation correct for the generated inputs? In the machine learning case, the problem is knowing what the outcome of the prediction should be, i.e., the label for the generated images. This is known as the oracle problem. Without knowing what to expect, we won’t know whether the test case fails or succeeds or how to measure the accuracy of a model. The oracle problem affects both traditional software and machine learning models.

The Oracle Problem

The oracle problem is the problem of knowing whether a program behaves correctly for a specific input. That is, for software testing and model evaluation, we can easily produce millions of inputs, but how do we know what to expect as outputs? Does the test pass? Was the model’s prediction accurate?

Two solutions to the oracle problem: Comparing against a gold standard and partial specifications/global invariants.

There are a couple of common strategies to deal with the oracle problem in random testing:

  • Manually specify outcome — Humans can often provide the expected outcome based on their understanding or the specification of the problem, but this obviously does not scale when generating thousands of random inputs and cannot be automated. While tedious, humans could potentially provide the expected outcome for the nextDate and cancer detection problems with random inputs. But even when crowdsourcing the labeling task in a machine-learning setting, for many problems it is not clear that humans would be good at providing labels for “random” inputs — especially those that do not look similar to common production data. In machine-learning settings, we again suffer from the lack of a specification that we could use to determine the expected outcome.
@Test
void testNextDate() {
assert nextDate(2010, 8, 20) == (2010, 8, 21);
assert nextDate(2024, 7, 15) == (2024, 7, 16);
assert nextDate(2011, 10, 27) == (2011, 10, 28);
assert nextDate(2024, 5, 4) == (2024, 5, 5);
assert nextDate(2013, 8, 27) == (2013, 8, 28);
assert nextDate(2010, 2, 30) throws InvalidInputException;
}
float evaluateCancerPredictionAccuracy() {
// random inputs
xs = [loadImage("random1.jpg"), loadImage("random2.jpg"), loadImage("random3.jpg")];
// manually provided labels
ys = [true, true, false];
return accuracy(cancerModel, xs, ys)
}
  • Comparing against a gold standard — If we have an alternative implementation (typically a slower but known to be correct implementation that we want to improve upon) or some form of executable specification, we can use those to simply compute the expected outcome. For example, we may have a reference implementation of nextDate in a library against which we could compare our own implementation. Even if we are not perfectly sure about the correctness of the alternative implementation, we can use it to identify and investigate discrepancies or even vote when multiple implementations exist. Such forms of differential testing have been extremely successful, for example, in finding compiler bugs with CSmith. Unfortunately, we usually use machine learning exactly when we do not have a specification and do not have an existing reference solution; for example, it is unlikely that we’ll have a gold standard implementation of a cancer detection algorithm.
@Test
void testNextDate() {
assert nextDate(2010, 8, 20)==referenceLib.nextDate(2010, 8, 20);
assert nextDate(2024, 7, 15)==referenceLib.nextDate(2024, 7, 15);
assert nextDate(2011, 1, 27)==referenceLib.nextDate(2011, 1, 27);
assert nextDate(2024, 5, 4) ==referenceLib.nextDate(2024, 5, 4);
assert nextDate(2013, 8, 27)==referenceLib.nextDate(2013, 8, 27);
assert nextDate(2010, 2, 30)==referenceLib.nextDate(2010, 2, 30)
}
float evaluateCancerPredictionAccuracy() {
imgs = ["random1.jpg", "random2.jpg", "random3.jpg"];
correct = 0
for (img <- imgs)
if (cancerModel.predict(loadImage(img)) ==
otherModel.predict(loadImage(img)))
correct += 1
return correct / imgs.length
}
  • Checking partial specifications — Even when we do not have a complete specification of a problem (e.g., the expected outcome for every input), we sometimes have partial specifications or invariants. A partial specification describes some aspect of the behavior that we want to assure, even if we do not know the expected output of a computation; for example, we might know that we only accept positive inputs or that all outputs are non-empty lists. Some partial specifications specify global invariants about program internals, such as, the program shall never crash, the program shall not access uninitialized memory, all opened file handles shall be closed before the program terminates, or integers shall never overflow. Most fuzz testing approaches look for crashing bugs or violations of other global invariants, such as unsafe memory access. That is, instead of checking that, for a given input, the output matches some expected output, we only check that the computation does not crash or does not violate a given partial specification. Developers can also manually specify partial specifications using assertions, which are typically specified as part of pre- and post-conditions of a function or a loop invariant (though developers are often reluctant to write even those). Assert statements convert runtime violations of these partial specifications into crashes or exceptions. Notice that partial specifications focus only on a certain aspect of correctness that should hold for all executions, and may not say anything about whether the program actually produces a desired output.
    In a machine-learning setting, we are usually not interested in bugs that crash models or throw exceptions, as this is a rare occurrence. Instead, we are mainly concerned about models that return a wrong result for a given input. As mentioned before, most machine learning problems lack a clear specification. However, there are some partial specifications we can define for models that may be worth testing for, including partial specifications for capabilities, fairness, and robustness, which is where automated testing may be useful, as we will discuss below.
@Test
void testNextDate_NoException() {
// tests that no exception is thrown for any input
nextDate(2010, 8, 20)
nextDate(2024, 7, 15)
nextDate(2011, 10, 27)
nextDate(2024, 5, 4)
nextDate(2013, 8, 27)
}
@Test
void testCancerPrediction_NoException() {
// we don't really worry about crashing bugs
cancerModel.predict(generateRandomImage())
cancerModel.predict(generateRandomImage())
cancerModel.predict(generateRandomImage())
}
  • Simulation and inverse computations — In some scenarios, it may be possible to simulate the world to derive input-output pairs or to derive corresponding inputs for randomly selected outputs. For example, when performing prime factorization, it is straightforward to pick a set of prime numbers (the output) and compute the input by multiplying them. In a machine-learning vision setting, one could create a scene in a raytracer (with known locations of objects) and then render the image to create the input for a vision algorithm that is supposed to detect those objects. Cases where deriving inputs from outputs is easier than the other way around do not appear to be very common, but this can be a very powerful strategy when it works, because we can automatically generate input-output pairs.
@Test
void testFactorization_2_3_7_7_52673() {
randomNumbers = [2, 3, 7, 7, 52673];
assert factorPrime(multiply(randomNumbers)) == randomNumbers;
}
@Test
void testScene931() {
scene = compose(street,
(car, car_coordinates),
(pedestrian, ped_coordinates));
assert pedestrianDetector.predict(render(scene)) == true;
}

Invariants and Metamorphic Testing

For evaluating machine-learned models, it seems unclear that we will ever be able to automatically generate input-output pairs outside of some simulation settings (discussed below). However, it is worth thinking about partial specifications of the model’s behavior. In contrast to invariants that specify constraints on the behavior of a single execution, invariants used in machine-learning settings typically describe relations between the predictions for different but related inputs. Here are a couple of examples of invariants for a model f:

  • Credit rating model f should not depend on gender: For all inputs x and y that differ only in gender, f(x)=f(y). This is a rather simplistic fairness property (sometimes called anti-classification; see fairness chapter) that does not account for correlations, but it is easy to state as an invariant.
  • Sentiment analysis f should not be affected by synonyms: For all inputs x: f(x) = f(x.replace(“is not”, “isn’t”))
  • Negation should swap sentiment of a sentence: For all inputs x in the form “X is Y”: f(x) = 1-f(x.replace(“ is “, “ is not “))
  • Small changes to an input should not affect the prediction: For all x in our training set, f(x) = f(mutate(x, δ)). This is known as a robustness property in the literature (see safety chapter for details).
  • Low credit scores should never get a loan: For all loan applicants x: (x.score<645) ⇒ ¬f(x). Here, we state a sufficient condition for classification as an invariant, known as anchor in explainable machine learning (see interpretability lecture).

If we can state them, invariants are useful, because they enable automated testing. While they do not state how inputs relate to outputs generally (as discussed, we usually do not have such specifications in the machine learning context), they provide partial specifications of one desirable aspect of correctness. They will only ever assess one aspect of model quality, but that may be an important aspect.

Note that this view of invariants also aligns well with viewing model quality as a validation rather than a verification problem: In traditional requirements engineering it is common to check whether requirements are compatible with other requirements, often expressed as constraints or invariants. Here, we check whether a learned model is compatible with other partial specifications provided.

Metamorphic relations. For most invariants above, we do not need to know the expected output for the prediction. While some invariants state expected outputs for a subset of all inputs like the last credit score example, most invariants are stated in terms of relations of two inputs and their corresponding predictions. Importantly, for the latter, we do not need to know what the expected correct output is for any input, we just reason about the relations of inputs and observed outputs.

This pattern of stating invariants in terms of relations rather than expected outputs is known as a metamorphic relation. A metamorphic relation generally has the form

∀x. f(g(x)) = h(f(x))

where g is a function that relates the two inputs and h is a function that relates the two outputs of model f.

For example, the above fairness invariant can be stated with a function g that swaps the gender in a given input and h as the identity function, stating that the expected prediction is the same for every pair of related inputs (without knowing what the expected prediction is). Similarly, the negation invariant uses the function g(x)=x.replace(“is “, “is not “) to negate the meaning of the sentence and h(x)=1-x to indicate that the expected sentiments of the two sentences should be opposite (without knowing the specific value of either).
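
A minimal sketch of checking such metamorphic relations over a set of unlabeled inputs, reporting the fraction of violations rather than a pass/fail verdict (the transformations and the single-input model interface are assumptions for illustration):

def metamorphic_violation_rate(model, inputs, transform, relation):
    """Fraction of inputs x for which relation(f(x), f(transform(x))) does not hold."""
    violations = sum(
        not relation(model.predict(x), model.predict(transform(x))) for x in inputs
    )
    return violations / len(inputs)

# Invariant: changing a location name should not change the prediction (h = identity).
swap_location = lambda s: s.replace("San Jose", "Denver")
same_prediction = lambda y1, y2: y1 == y2

# Invariant: inserting a negation should roughly flip a sentiment score in [0, 1] (h(y) = 1 - y).
negate = lambda s: s.replace(" is ", " is not ")
flipped_score = lambda y1, y2: abs((1 - y1) - y2) < 0.1  # tolerance for numeric scores

# metamorphic_violation_rate(sentiment_model, unlabeled_reviews, swap_location, same_prediction)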

Identifying invariants. Identifying invariants is challenging and requires a deep understanding of the problem. Many capabilities (discussed above) can be stated as invariants, for example, understanding negation, typo robustness, and gender fairness in NLP tasks.

For example, in the same ACL paper discussed above, Ribeiro et al. test the following capabilities with invariants in a sentiment detection tool:

  • Replacing neutral words with other neutral words should not affect the sentiment of the text, e.g., “the nightmare continues… “ vs “our nightmare continues…”
  • Adding URLs should not affect the sentiment of the text
  • Typos from swapping two adjacent characters should not affect the sentiment of the text
  • Changing location names should not affect the sentiment of the text
  • Changing person names should not affect the sentiment of the text
  • Adding positive phrases to a text should not make the sentiment more negative by more than 0.1; adding negative phrases to a text should not make the sentiment more positive by more than 0.1, e.g. “Great trip on 2672 yesterday…” vs “Great trip on 2672 yesterday… You are extraordinary”
  • Inserting negation in a sentence with neutral sentiment should not change the sentiment, e.g., “This aircraft is private” vs “This aircraft is not private”

Many more invariants around synonyms, hypernyms, antonyms, distractor phrases, ambiguity, stylistic variations, and others can be encoded for NLP problems.

All of these invariants can be expressed as metamorphic relations. The way they are written already lends itself to generate additional inputs from existing inputs by replacing words or adding sentences.

For image-related tasks, most invariants discussed in the literature relate to robustness properties, which state that the classification outcome should stay the same even when a certain amount of noise is added to the image. We are often interested in particular forms of noise and distortions that mirror problems that may happen with sensors in the real world, like adding noise to simulate blurry cameras, tilted cameras, foggy conditions, and so forth. Noise can also be added to represent potential adversarial attacks (e.g., small tweaks to many pixels or large changes to few pixels; see security lecture).
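
For instance, here is a sketch of such a robustness check that adds mild Gaussian noise to existing images and measures how often the predicted class changes (the noise level, the [0, 1] value range, and the single-image model interface are assumptions):

import numpy as np

def noise_robustness_violation_rate(model, images, sigma=0.02, seed=0):
    """images: arrays with pixel values in [0, 1]; returns the fraction of images
    whose predicted class changes after adding Gaussian noise."""
    rng = np.random.default_rng(seed)
    violations = 0
    for image in images:
        noisy = np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)
        if model.predict(image) != model.predict(noisy):
            violations += 1
    return violations / len(images)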

Another strategy to identify invariants is by searching for invariants the model has already learned, reviewing those, and keeping useful ones as regression tests. Those invariants are already met by definition (otherwise the search would not find them), but they may still be useful for understanding and regression testing. To search for existing invariants, in traditional software engineering research, several techniques have been proposed around specification mining, with Daikon probably being the most well known tool. In a machine learning context, the work on anchors is probably the closest: Anchors are invariants that describe input conditions that are sufficient to imply a given output; to find anchors, the system looks for patterns that are shared by many inputs that yield the same prediction. They describe invariants over input-output relationships rather than metamorphic relations in most cases. Anchors were originally proposed to explain decisions of a model and debug problems (see interpretability chapter).

Examples of invariants detected from models in three standard machine learning problems, from Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. “Anchors: High-precision model-agnostic explanations.” In Thirty-Second AAAI Conference on Artificial Intelligence. 2018.

Generating inputs. Compared to identifying invariants, generating inputs is usually much easier.

Early fuzzing approaches simply generated random inputs, e.g., random bit sequences, but since few of these inputs represent common or valid inputs, many are already rejected in early stages of the program (e.g., nextDate(488867101, 1448338253, -997372169)), so tests rarely reach the deeper logic to test more interesting behavior. Many smarter strategies have been developed to generate better tests. For example, developers can guide the kind of inputs to be generated and restrict generation to valid inputs, for example by providing a grammar or distributions for all inputs. In addition, many techniques like dynamic symbolic execution and coverage-guided fuzzing automatically analyze the implementation to identify inputs that execute more code or specifically drive the search toward inputs that reveal crashing behavior.

To generate inputs for testing machine learning invariants, many different strategies similarly exist:

  • Uniformly random sampling: Generating purely random inputs; that is, uniformly sample from each feature’s domain. This approach is relatively straightforward but will typically not produce many inputs that look anything like typical inputs in the target distribution. For example, generated images will look like white noise and rarely ever anything like cancer scans. Like early fuzzing work, this strategy is likely not very effective for many problems.
  • Modeling input distributions: If we can describe the target distribution, we can sample from that distribution. In the simplest cases, we might specify distributions for every feature separately; for example, we may draw from realistic age, gender, and weight distributions for the cancer patients, but while each feature may be realistic on its own, the combination might not be, like the 80kg one-year-old child. We can also model probabilistic relationships between features to draw from more realistic distributions, for example, using probabilistic programming.
  • Learning input distributions: Instead of manually modeling distributions using domain knowledge, we can also learn them from sample data of the target population. Again, we can learn distributions for individual features or identify correlations among features. We can also use techniques such as generative adversarial networks to learn a generator that mimics the training data — well known for creating “deep fake” images.
  • Modifying existing inputs: Many invariants are described in terms of similarities and differences between two inputs. Here, it is often straightforward to simply take an input from an existing test set and modify it to create one or more additional inputs, following one of many patterns.
  • Adversarial learning: Many adversarial learning techniques are specialized in searching the input space, usually using gradient descent, for inputs that maximize a goal, where specification violation can often be encoded as a goal. For example, classic adversarial learning techniques try to find inputs that violate a robustness property, which states (expressible as metamorphic relation) that a model should make the same prediction when two inputs are very similar. This is conceptually similar to test case generation strategies based on dynamic symbolic execution or coverage guided fuzzing in traditional software testing.
  • Formal verification: For some properties and models, it may be possible to formally prove certain invariants. For certain kinds of models, certain kinds of proofs are actually fairly straightforward (e.g., proving that no applicant with a credit score below 645 will be approved for credit on a decision tree model or that no decision in a decision model depends on the gender of the applicant). For others, such as robustness properties on deep neural networks, automated formal verification techniques are an area of active research. If a property cannot be proven, some formal verification techniques provide counterexamples that can be inspected or used as tests.

The invariants relating to capabilities for NLP models above are all amenable to transformations of existing inputs, but it would also be possible to generate plausible input sentences using statistical language models to imitate common inputs as done in this stress testing paper.

Invariant violations. Most invariants in machine-learning contexts are not intended as strict invariants, where we would reject a model as faulty if the model’s prediction violates the invariant for even a single input or input pair. Instead, one would typically generate realistic inputs from a target distribution and then measure the frequency of invariant violations as an accuracy measure. For invariants representing capabilities, we can interpret such accuracy measures as how well the model has learned the capability, similarly to how we would analyze accuracy on curated datasets sliced from the test set (see above).

When using adversarial learning techniques, we will only find inputs that violate the invariant; formal verification techniques can be used to prove the absence of such inputs. The number of invariant violations found with adversarial learning techniques is not easily interpreted as a quality measure, but analyzing violating inputs might provide additional insights into how to improve the model.
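
For intuition, such a gradient-based search for a robustness violation can be sketched for a simple logistic-regression model, where the gradient of the loss with respect to the input is available in closed form; this is a simplified fast-gradient-sign-style step, not a full adversarial attack, and for deep networks one would compute the gradient with automatic differentiation instead:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def find_robustness_violation(w, b, x, eps=0.05):
    """Try to find an input within an eps-box around x for which a logistic-regression
    model (weights w, bias b) predicts a different class than for x itself."""
    p = sigmoid(w @ x + b)
    y = 1.0 if p >= 0.5 else 0.0        # treat the current prediction as the label
    grad_x = (p - y) * w                # gradient of the log-loss w.r.t. the input
    x_adv = x + eps * np.sign(grad_x)   # step in the direction that increases the loss
    if (sigmoid(w @ x_adv + b) >= 0.5) != (p >= 0.5):
        return x_adv                    # robustness invariant violated
    return None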

Simulation-Based Testing

Another strategy to solve the oracle problem, applicable to some machine learning problems, works for tasks where it is easy to generate inputs for known outputs. As illustrated with the prime factorization example above, where we can just make up a list of prime numbers as the output and multiply them to get a corresponding input, there are problems where computing the output from an input is hard, but creating an input for an expected output is much easier.

Unfortunately, this only works for a limited number of cases. It has been prominently explored for self-driving cars, where it is to some degree possible to test vision algorithms with pictures rendered from an artificial scene. For example, we can generate lots of scenes with any number of pedestrians and then test that a pedestrian detector model finds the pedestrians in those pictures that contain them with the appropriate bounding boxes. We can also test how the model works under challenging conditions, such as simulated fog or camera issues.

In our cancer detection example, we may be able to synthetically insert cancer into non-cancer images and additionally simulate other challenging health conditions. However, creating such a simulator to create realistic and diverse images may be a massive engineering challenge in itself.
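
A deliberately oversimplified sketch of that idea: pasting a synthetic bright blob into healthy scans yields inputs with known positive labels (the blob parameters and the healthy_scans dataset are hypothetical, and a realistic simulator would of course need far more sophisticated lesion models):

import numpy as np

def insert_synthetic_lesion(scan, center, radius, intensity=0.5):
    """Return a copy of a grayscale scan (values in [0, 1]) with a bright circular blob added."""
    result = scan.copy()
    rows, cols = np.ogrid[:scan.shape[0], :scan.shape[1]]
    mask = (rows - center[0]) ** 2 + (cols - center[1]) ** 2 <= radius ** 2
    result[mask] = np.clip(result[mask] + intensity, 0.0, 1.0)
    return result

# Labeled pairs with known outputs:
# positives = [(insert_synthetic_lesion(s, (64, 64), 10), True) for s in healthy_scans]
# negatives = [(s, False) for s in healthy_scans]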

This approach obviously does not work if the input-output relationship is unknown to begin with and cannot be simulated, as when predicting housing prices, or when the reverse computation is just as hard as the computation to be evaluated, e.g., generating captions for images vs generating images based on a caption.

Adequacy Criteria

Since software testing can only show the presence of bugs and never their absence, an important question is always when one can stop testing, i.e., whether the tests are adequate. There are many strategies to approach this question, even though typically one simply pragmatically stops testing when time or money runs out or the tests seem “good enough”. In machine learning, an analysis of statistical power could potentially be used, but in practice simple rules of thumb seem to drive the size of validation sets used.
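
As one illustration of such statistical reasoning (an assumed example, not a prescribed practice), a simple binomial confidence interval shows how precisely a test set of a given size pins down a measured accuracy:

import math

def accuracy_confidence_interval(accuracy, n, z=1.96):
    """Normal-approximation 95% confidence interval for an accuracy measured on n examples."""
    margin = z * math.sqrt(accuracy * (1.0 - accuracy) / n)
    return max(0.0, accuracy - margin), min(1.0, accuracy + margin)

print(accuracy_confidence_interval(0.91, 500))     # roughly (0.885, 0.935)
print(accuracy_confidence_interval(0.91, 10000))   # roughly (0.904, 0.916)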

Line and branch coverage report in software testing

In traditional software testing contexts, different forms of coverage and mutation scores are proposed and used to evaluate the quality of a test suite:

  • Specification coverage analyzes whether all conditions of the specification (boundary conditions, etc) have been tested. Given that we don’t have specifications for machine-learning models, the best we can do seems to be to think about the representativeness of validation data, about subpopulations, about capabilities, and about use cases when curating multiple validation sets.
  • White-box coverage like line coverage and branch coverage gives us an indication of what parts of the program have been executed. There have been attempts to define similar coverage metrics for deep neural networks (e.g., neuron coverage), but it is unclear what this measure really represents. For example, are inputs generated to cover all neuron activations representative of anything meaningful?
    White-box coverage metrics are sometimes used to drive the search for inputs that violate invariants, not unlike their use in test suite minimization or coverage-guided fuzzing, but adversarial learning techniques seem to provide more direct search methods on models than coverage-based methods.
  • Mutation scores indicate how well a test suite detects injected faults (mutants) into the program. For example, we might randomly replace a `+` operation by a `-` operation in the program and see whether a test suite is good enough to catch that injected fault by failing some previously passing tests. A better test suite would detect more injected faults. However, again, a mapping to machine-learned models is unclear. Would better curated validation sets catch more mutations to the parameters of a machine-learning model (or mutations of hyperparameters or mutations in some training data) and would this be useful in practice for evaluating the quality of the validation sets or similar? When would we consider an injected fault to be detected if we do not have a specification and measure accuracy across an entire test dataset?

While there are several research papers on these topics, there does not seem to be a good mapping to problems in evaluating model quality, given the lack of specifications. At this point, the most convincing strategy to determine whether a test data set is adequate for evaluating model quality seems to be to judge the amount of data and whether that data is representative of the data in production (qualitatively or statistically).

Summary

Even though evaluating model fit is substantially different from evaluating correctness of traditional code against a specification, there are significant parallels with traditional software testing that can provide guidance on how to evaluate a model’s quality.

Specifically by analyzing the problem domain and capabilities, it is possible to curate multiple test sets, each corresponding to a traditional unit test for a specific part of the functionality. Similarly, recognizing that not all inputs are equal and that certain inputs or kinds of mistakes may be important (even though they may be rare) encourages us to specifically analyze subsets of the test data, for example, sliced by demographic attributes or problem characteristics — in analogy to equivalence class testing.

In addition, random testing can be useful in cases where there is a plausible solution to the oracle problem. In particular, invariants and simulation provide promising directions. Invariants (often metamorphic relations) can encode partial specifications that can be tested with randomly generated data; many invariants can be related to intended model capabilities.

In summary, there are many more strategies to evaluate a machine-learned model’s quality than just traditional accuracy metrics. While software testing is not a perfect analogy, software engineering provides many lessons about curating multiple validation datasets, about automated testing for certain invariants, and about test automation. Evaluating a model in a production machine learning system should not stop with the first accuracy number on a static dataset. Indeed, as we will see later, testing in production is possibly even more important.

Readings and References

  • A quick introduction to model evaluation, including accuracy measures, slicing, and rules of thumb for adequacy criteria: Hulten, Geoff. “Building Intelligent Systems: A Guide to Machine Learning Engineering.” Apress, 2018, Chapter 19 (Evaluating Intelligence).
  • More detailed discussion of testing capabilities: Christian Kaestner. “Rediscovering Unit Testing: Testing Capabilities of ML Models”, Blog Post 2021
  • Examples of capabilities tested through curated and generated tests in NLP systems: Ribeiro, Marco Tulio, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList.” In Proceedings ACL, p. 4902–4912. (2020).
  • Another paper identifying and testing capabilities in NLP systems using several different test generation techniques: Naik, Aakanksha, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. “Stress test evaluation for natural language inference.” Proceedings of the 27th International Conference on Computational Linguistics, p. 2340–2353 (2018).
  • A discussion of slicing of test data: Barash, Guy, Eitan Farchi, Ilan Jayaraman, Orna Raz, Rachel Tzoref-Brill, and Marcel Zalmanovici. “Bridging the gap between ML solutions and their business requirements using feature interactions.” In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1048–1058. 2019.
  • Examples of invariants to detect problems in NLP models: Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. “Semantically equivalent adversarial rules for debugging NLP models.” In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 856–865. 2018.
  • An overview of much recent research in model testing, with a focus on invariants and test case generation: Ashmore, Rob, Radu Calinescu, and Colin Paterson. “Assuring the machine learning lifecycle: Desiderata, methods, and challenges.” arXiv preprint arXiv:1905.04223. 2019.
  • An extended discussion of how multiple models achieve similar accuracy on a test set but differ in how they generalize to out-of-distribution data, motivating the importance of testing capabilities: D’Amour, Alexander, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen et al. “Underspecification presents challenges for credibility in modern machine learning.” arXiv preprint arXiv:2011.03395 (2020).
  • A survey on metamorphic testing, beyond just machine learning: Segura, Sergio, Gordon Fraser, Ana B. Sanchez, and Antonio Ruiz-Cortés. “A survey on metamorphic testing.” IEEE Transactions on software engineering 42, no. 9 (2016): 805–824.
  • Description of the Overton system at Apple that heavily relies on evaluating slices of the test data: Ré, Christopher, Feng Niu, Pallavi Gudipati, and Charles Srisuwananukorn. “Overton: A Data System for Monitoring and Improving Machine-Learned Products.” arXiv preprint arXiv:1909.05372 (2019).
  • Example of a formal verification approach for robustness invariants: Singh, Gagandeep, Timon Gehr, Markus Püschel, and Martin Vechev. “An abstract domain for certifying neural networks.” Proceedings of the ACM on Programming Languages 3, no. POPL (2019): 1–30.
  • Example of a simulation-based evaluation strategy for autonomous driving: Zhang, Mengshi, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid. “DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems.” In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pp. 132–142. 2018.
  • Invariant mining technique for machine learning models, developed for model explanations: Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. “Anchors: High-precision model-agnostic explanations.” In Thirty-Second AAAI Conference on Artificial Intelligence. 2018.

As with all chapters, this text is released under the Creative Commons 4.0 BY-SA license.
