Model Quality: Measuring Prediction Accuracy

The data scientist’s toolbox and its pitfalls

Mar 22, 2021

This post is the second chapter of three on model quality of our Machine Learning in Production course. The previous chapter discussed what correctness actually means for a machine-learned model. This chapter focuses on traditional measures of prediction accuracy for machine-learned models and pitfalls of such an evaluation strategy. The next chapter will then look more broadly at other testing techniques.

We start with an overview of traditional data science measures of prediction accuracy that are usually well covered in any machine learning course and textbook, before looking into the limitations and pitfalls of such an evaluation strategy. We’ll use the same running example of a cancer predictor from the previous chapter.

Classification Tasks

The standard measure for a classification task like cancer prediction is accuracy, which measures for a given dataset what percent of predictions correspond to the actual expected answers.

accuracy = correct predictions / all predictions

def accuracy(model, xs, ys):
    """Fraction of inputs for which the model's prediction equals the expected label."""
    count = len(xs)
    count_correct = 0
    for x, y in zip(xs, ys):
        predicted = model(x)
        if predicted == y:
            count_correct += 1
    return count_correct / count

Note that this accuracy evaluation assumes that we have labeled data for which we know the expected results. Accuracy is computed only for this dataset and does not directly say anything about predictions for any other data.

Data scientists usually visualize the difference between expected and actual predictions in a classification problem with a table called the error matrix or confusion matrix. For example, the table below shows results from 164 predictions in a three-class classification problem, with an accuracy of (10+24+82)/(10+6+2+3+24+10+5+22+82) = 0.71:

Such a table allows us to inspect more closely what kind of mistakes the model makes; for example, Grade 3 cancer inputs are fairly frequently misclassified as benign, whereas benign inputs are rarely misclassified as Grade 5 cancer.

Multi-class problems can always be represented also as binary classification problems by picking a single class of interest. For example, the above error matrix can also be represented as matrices focusing only on the correctness of Grade 5 or Grade 3 cancer predictions:
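As a rough sketch, the numbers from the example can be put into a small matrix to recompute the accuracy and to collapse the three classes into a binary view of one class. The row and column layout (rows are predictions, columns are actual classes, ordered Grade 5, Grade 3, benign) is an assumption, and the names `MATRIX`, `matrix_accuracy`, and `to_binary` are made up for illustration.

```python
# Error matrix from the example above. The layout is an assumption:
# rows = predicted class, columns = actual class.
CLASSES = ["Grade 5", "Grade 3", "Benign"]
MATRIX = [
    [10, 6, 2],    # predicted Grade 5
    [3, 24, 10],   # predicted Grade 3
    [5, 22, 82],   # predicted Benign
]


def matrix_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def to_binary(matrix, target):
    """Collapse a multi-class matrix into TP/FP/FN/TN for one target class."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[target][target]
    fp = sum(matrix[target]) - tp                 # predicted target, actually another class
    fn = sum(row[target] for row in matrix) - tp  # actually target, predicted another class
    tn = total - tp - fp - fn
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


print(round(matrix_accuracy(MATRIX), 2))
print(to_binary(MATRIX, CLASSES.index("Grade 5")))
```

Under the assumed layout, the binary view of a class keeps its diagonal cell as true positives and folds all other cells into false positives, false negatives, or true negatives.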

Types of mistakes. For binary classification problems, it is common to distinguish two kinds of mistakes: false positives and false negatives.

False positives correspond to false alarms where cancer is predicted by the model although the patient does not actually have cancer. This is also known as a Type I error.

False negatives correspond to missed alarms, where a patient actually has cancer but the model does not detect it. This is also known as a Type II error.

Both false positives and false negatives have costs, but they are not necessarily equal, leading us to prefer models that minimize one or the other, depending on the scenario. For example, in the cancer scenario, a false positive prediction possibly causes stress for the patient and requires the patient to undergo procedures to verify the diagnosis (or even undergo unnecessary treatment) that add additional risks and costs. In contrast, a false negative prediction can lead to cases where cancer remains undetected until it grows to a point where treatment is no longer successful. One could argue that false negatives are much worse in this scenario. Other scenarios may have very different cost-benefit tradeoffs. For example, consider recidivism prediction, loan applications and human trafficking detection, and the cost of false positives and false negatives.

Recall and precision. For binary classification, many more specific measures are defined based on values in the error matrix that help to understand the ratio of false positives and false negatives. The most common ones are recall and precision.

Recall (also known as true positive rate, hit rate, or sensitivity) describes the percentage of all data points in the target class that were correctly identified as belonging to the target class: TP/(TP+FN). That is, it measures how many of all cancer cases we detect.

Precision (also positive predictive value) indicates how many of the cases predicted to belong to the target class actually belong to the target class: TP/(TP+FP). That is, it measures what percentage of cancer warnings are actual cancer cases.

Many other measures have been defined on the error matrix of binary classification problems.

*Visualization of recall and precision, CC BY-SA 4.0 by* *Walber*

Recall and precision are useful measures when it is important to distinguish false positives and false negatives. This is particularly important when costs of mistakes are very different or when baseline probabilities are skewed. For example, consider that only 1 in 2000 patients in a screening sample has cancer (which seems fairly typical). A random classifier would achieve about 50% accuracy and recall, but only about 0.05% precision. In contrast, a predictor that predicts all cases as not cancer would achieve 99.95% accuracy but 0% recall, and its precision is undefined because it never predicts cancer. Looking at recall and precision provides much more insight than just looking at accuracy.
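The skewed-baseline scenario can be checked with a few lines of arithmetic. This is only a sketch: the population of 200,000 patients is a made-up number chosen to give whole counts at a rate of 1 in 2000, the random classifier is assumed to flag half of each group in expectation, and `metrics` is an illustrative helper name.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, recall, and precision from the four cells of a binary error matrix.

    Recall or precision is None when its denominator is zero.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    return accuracy, recall, precision


# Hypothetical screening population: 1 in 2000 patients has cancer.
patients, sick = 200_000, 100
healthy = patients - sick

# Coin-flip classifier: in expectation, half of each group is flagged as cancer.
random_guess = metrics(tp=sick / 2, fp=healthy / 2, fn=sick / 2, tn=healthy / 2)

# Classifier that always answers "no cancer".
always_negative = metrics(tp=0, fp=0, fn=sick, tn=healthy)

print("random guess:   ", random_guess)
print("always negative:", always_negative)
```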

Precision-recall tradeoff and area under the curve. Many classification algorithms do not predict a class but give a score for each class, or a single score for binary classification problems. The score typically is interpreted as how confident a model is in its prediction. It is up to the user of the model to define how to map the score into classes. In multi-class problems, typically the prediction with the highest score is considered. In binary prediction problems, a threshold decides what score is sufficiently confident to predict the target class. For example, our cancer predictor might return scores between 0 and 1:

def hasCancer(image: byte[][], age: int, …): float

By adjusting the threshold in a binary classification problem, one can trade off between false positives and false negatives. With a lower threshold, more cases are considered as positive cases (e.g., cancer), thus typically detecting more true cases, but also reporting more false positives, i.e., increasing recall but reducing precision. In contrast, with a higher threshold, we are more selective of what is considered as a positive case, typically reducing false positives, but also detecting fewer true positives, i.e., increasing precision at the cost of recall.
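A minimal sketch of this tradeoff, using made-up scores and labels for eight scans (none of these values come from a real predictor), shows recall falling and precision rising as the threshold increases:

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when every score >= threshold counts as a positive prediction."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


# Hypothetical model scores; True means the patient actually has cancer.
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.05]
labels = [True, True, False, True, False, True, False, False]

for threshold in (0.25, 0.5, 0.75):
    print(threshold, precision_recall_at(threshold, scores, labels))
```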

How to set a threshold to calibrate between recall and precision depends on the specific problem and the cost of the different kinds of mistakes. For example, in the cancer case, we might accept more false alerts to miss fewer actual cancer cases, as the cost of extra diagnostics on false alerts is much lower than the potential health risk of missed cancer cases.

Recall and precision at different thresholds can be plotted in a graph. We see that extreme classifiers can achieve 100% recall and near 100% precision, but at the cost of the other measure. By setting a suitable threshold we can reach any point on the line.

The plot shows the recall/precision tradeoff at different thresholds (the thresholds are not shown explicitly). Curves closer to the top-right corner are better considering all possible thresholds. Typically, the area under the curve is measured to have a single number for comparison.

Various area-under-the-curve measures evaluate the accuracy of a classifier at all possible thresholds (ROC curves are the most common). They allow a comparison of two classifiers across all different thresholds. Area-under-the-curve measures are popular in machine learning research because they sidestep the problem of finding a suitable threshold for an application of interest, but they are less useful in practice when certain recall or precision ranges are simply not acceptable for practical purposes.
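For the ROC variant, the area under the curve can be computed without tracing the curve: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The following sketch uses that pairwise formulation on made-up scores; it is quadratic in the number of examples and meant for illustration, not for large datasets.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts how often a positive example outscores a negative one;
    ties count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical model scores; True means the patient actually has cancer.
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.05]
labels = [True, True, False, True, False, True, False, False]
print(roc_auc(scores, labels))
```

A value of 0.5 corresponds to random scoring and 1.0 to a scoring that ranks every positive above every negative.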

Other accuracy measures. Finally, lots of other accuracy measures exist for classification problems to focus on different aspects of the prediction accuracy and costs of wrong predictions. Examples include lift, break-even point, F1 measure (harmonic mean of recall and precision), log loss (for class probabilities), Cohen’s kappa, and the Gini coefficient (improvement over random). Ask your data scientist for details.

Other Accuracy Measures (Regression, Ranking, NLP)

Regression tasks. Where classification models predict one of a finite set of outcomes, regression models predict a number. This is suitable for a different class of problems, such as predicting the sales price of a house.

def homeValue(rooms: int, crimeRate: float, …): float

Again, we want to compare predictions against actual results and want an accuracy measure that allows us to distinguish pretty good predictions from far-off predictions. Unless we bucket all results into a finite number of groups, an error matrix cannot be used because it only works with a finite number of classes.

Instead, accuracy measures typically compare the average distance between the predicted and the actual value. For example, a common measure is the Mean Absolute Percentage Error (MAPE) that computes how far off the predictions are compared to the expected value (in percent) for each data point and reports the average across all errors. In our example above, we find MAPE = 1/4*(20/250+32/498+1/211+9/210)=0.048.

Several other measures exist, such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Square Error (RMSE), and R² (percentage of variance explained by model). Which measure to use depends on the problem and the characteristics of the data. For example, mean absolute percentage errors report very large errors on deviations around very small values (like predicting the time it takes to read a short article), but fairly small errors around the same kinds of absolute deviations on very large values (like predicting home values).
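As a sketch, several of these measures can be computed side by side on the four data points from the MAPE example. The actual values and the sizes of the deviations follow that example; whether each prediction was too high or too low is an assumption, which does not affect any of the measures shown.

```python
import math

# Actual values and deviation sizes follow the MAPE example above;
# the direction of each deviation is an assumption.
actual = [250, 498, 211, 210]
predicted = [230, 530, 210, 219]


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percentage Error, as a fraction (0.05 means 5%)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Squared Error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error."""
    return math.sqrt(mse(actual, predicted))


print("MAE: ", mae(actual, predicted))
print("MAPE:", round(mape(actual, predicted), 3))
print("MSE: ", mse(actual, predicted))
print("RMSE:", round(rmse(actual, predicted), 2))
```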

Ranking tasks. For rankings, such as suggesting search results or recommending YouTube videos, we need yet again different accuracy measures. If we just consider each prediction to be correct or wrong, the goal is to rank the correct results at high positions in our model’s prediction. If we care about different degrees of correctness, we may also consider the order of “correct” recommendations.

Again, many different specialized accuracy measures exist. For example, the Mean Average Precision (MAP@K) measure reports how many correct results are found in the top-K results of a prediction, averaged across multiple queries.

For example, in recommending products with the following actual preferences of a user, we find MAP@1 = 1 (the top result is correct), MAP@2 = 0.5 (only one of the top two results is correct), MAP@3 = 0.33, MAP@4 = 0.5 and so forth.
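The measure as described here (fraction of correct results among the top K, averaged across queries) can be sketched as follows. The ranking is hypothetical, chosen only to reproduce the values quoted above, and the function names are illustrative. Note that some libraries define MAP@K differently, by also averaging precision over the positions of correct results, so it is worth checking the definition a given tool uses.

```python
def precision_at_k(ranked_is_correct, k):
    """Fraction of the top-k ranked results that are correct."""
    top_k = ranked_is_correct[:k]
    return sum(top_k) / k


def mean_precision_at_k(rankings, k):
    """Average precision-at-k over several queries or users."""
    return sum(precision_at_k(r, k) for r in rankings) / len(rankings)


# Hypothetical ranking for one user: True marks a recommendation the user liked.
ranking = [True, False, False, True, False]

for k in (1, 2, 3, 4):
    print(k, round(precision_at_k(ranking, k), 2))
```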

There are many other ranking accuracy measures (Moussa Taifi’s blog post provides a good discussion of tradeoffs), including:

Mean Reciprocal Rank (MRR) (average of the reciprocal rank of the first correct prediction)
Average precision (concentration of results in highest ranked predictions)
MAR@K (recall)
Coverage (percentage of items ever recommended)
Personalization (how similar predictions are for different users/queries)
Discounted cumulative gain
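To make one of the listed measures concrete, here is a small sketch of Mean Reciprocal Rank on made-up rankings. Treating a query without any correct result as contributing 0 is a common convention, adopted here as an assumption.

```python
def reciprocal_rank(ranked_is_correct):
    """1 / position of the first correct result, or 0.0 if none is correct."""
    for position, is_correct in enumerate(ranked_is_correct, start=1):
        if is_correct:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings):
    """Average reciprocal rank across queries."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)


# Hypothetical results for four queries; True marks a correct result.
rankings = [
    [True, False, False],         # first correct result at position 1
    [False, True, False],         # first correct result at position 2
    [False, False, False],        # no correct result
    [False, False, False, True],  # first correct result at position 4
]
print(mean_reciprocal_rank(rankings))
```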

Natural language processing. Some NLP tasks like detecting positive or negative sentiment in a text or determining the truth of a natural language statement can be reduced to a classification problem, and hence recall, precision, and similar metrics can be used. However, many NLP tasks such as translation, summarization, or image captioning need to compare how close generated text is to expected correct text. There is a lot of controversy about how to best evaluate such tasks, but BLEU and ROUGE are common metrics to compare the degree of similarity between generated and provided texts. If the goal is to evaluate only the structure and plausibility of text for natural language, likelihood and perplexity are common metrics.

Comparing Against Baselines

Accuracy measures (independent of whether for classification, regression, ranking, or other problems) cannot meaningfully be interpreted in isolation. A 99% accuracy result may be exceptionally good for some problems, but actually mediocre or even terrible for other problems. Even 10% accuracy may be good enough for practical use in some very difficult tasks, where a user may be willing to look through many false positives. As discussed in the previous chapter, we do not evaluate model correctness (where 99% accuracy would indicate a faulty model) but whether the model fits or is useful for a problem; what accuracy is needed to be useful depends entirely on the problem.

To allow any form of meaningful interpretation, it is important to compare accuracy against some baselines and consider the cost of mistakes. If we know that baseline approaches can achieve 98.5% accuracy, 99% accuracy may no longer seem impressive unless the remaining mistakes are really costly, but if a baseline approach only achieves 30% accuracy, 99% is a huge improvement. To measure the improvement over some baseline, reduction in error is a useful metric. Reduction in error is computed as ((1 - baseline-accuracy) - (1 - model-accuracy)) / (1 - baseline-accuracy). Going from 50% to 75% accuracy is a 50% reduction in error, whereas going from 99.9% to 99.99% accuracy is a much higher reduction in error of 90%.
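The formula translates directly into a small helper (the function name is made up), which can be used to check the two examples:

```python
def reduction_in_error(baseline_accuracy, model_accuracy):
    """Share of the baseline's error that the model eliminates."""
    baseline_error = 1 - baseline_accuracy
    model_error = 1 - model_accuracy
    return (baseline_error - model_error) / baseline_error


print(round(reduction_in_error(0.50, 0.75), 3))
print(round(reduction_in_error(0.999, 0.9999), 3))
```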

What baseline is suitable for comparison depends entirely on the problem. There are several obvious baselines that could be considered:

For classification problems, output the most common class as the prediction; for regression problems output the average value. For example, we would always predict “no cancer” in our cancer detection problem.
Randomly predict a result, possibly drawn from some calibrated probability distributions. For example, for predicting temperatures, output 20±5 C in the summer and 5±10 C in the winter.
In time-series data, output the last value as the prediction. For example, tomorrow’s weather is likely to be similar to today’s.
Use simple non-machine-learning heuristics (a few hand-written hard-coded rules) for the problem; for example, predict cancer if there are more than 10 dark pixels surrounded by lighter pixels.
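Two of these baselines take only a few lines each. The following is a sketch with made-up data and illustrative function names; the returned model is a plain function so that it can be passed to an accuracy computation like any other model.

```python
from collections import Counter


def majority_class_baseline(train_ys):
    """Model that ignores its input and always predicts the most common training label."""
    most_common = Counter(train_ys).most_common(1)[0][0]
    return lambda x: most_common


def last_value_predictions(series):
    """Time-series baseline: predict each value as the one observed just before it.

    Returns predictions aligned with series[1:].
    """
    return series[:-1]


# Hypothetical labels: 9 of 10 training patients do not have cancer.
train_ys = ["no cancer"] * 9 + ["cancer"]
baseline = majority_class_baseline(train_ys)
print(baseline("any scan"))

# Hypothetical daily temperatures.
temperatures = [20, 21, 19, 22]
predictions = last_value_predictions(temperatures)
errors = [abs(a - p) for a, p in zip(temperatures[1:], predictions)]
print(sum(errors) / len(errors))
```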

In addition, it is always useful to use existing state-of-the-art models as a baseline if they exist. If they don’t, it is often a good idea to build a simple baseline model, for example using random forests with minimal feature engineering as a baseline when evaluating a deep neural network.

Measuring Generalization

Machine learning models identify rules from data. With sufficient degrees of freedom, they can learn rules that perfectly explain all observations in the training data, but those rules may overfit the training data and work poorly on unseen data, even if it only differs slightly from the training data. For example, our cancer detector might memorize the outcomes for all training data and thus predict those perfectly, without being able to make any meaningful predictions for unseen inputs of other patients. A key goal of measuring a model’s quality is to measure how well it generalizes to unseen data in the target population.

Detecting overfitting

A key insight underlying all machine learning techniques is to balance between finding rules that fit the training data well and rules that are relatively simple, hoping that simpler rules generalize better and are less influenced by noise in the training data. This is known also as the bias-variance tradeoff. Most machine learning algorithms have hyperparameters that can control this tradeoff, such as controlling the degrees of freedom a model has or specifying a penalty for complexity during the learning process (known as regularization).

Change hyperparameter (x-axis) to control the bias-variance tradeoff, revealing effects on accuracy within the training data (blue line) and accuracy within validation data (red line) at different degrees of freedom (CC SA 3.0 by *Dake*)

Overfitting can be detected by evaluating prediction accuracy both on the training data and on validation data that was not seen at training time. Models with more degrees of freedom can fit the training data better and make better predictions within the training data. With more degrees of freedom, the model can learn more useful rules from the data, which typically also improves predictions on the validation data, up to a point. At some point, though, with more degrees of freedom, the model will start to overfit and discover rules that fit noise in the training data but provide no help in understanding concepts in the domain, resulting in reduced accuracy on the validation data.
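The extreme case of a model that memorizes its training data, mentioned earlier, makes this gap easy to see. The sketch below uses a deliberately silly "learner" that stores a lookup table; the data, the patient IDs used as inputs, and the function names are all made up for illustration.

```python
def accuracy(model, xs, ys):
    """Fraction of inputs for which the model's prediction equals the label."""
    correct = sum(1 for x, y in zip(xs, ys) if model(x) == y)
    return correct / len(xs)


def learn_by_memorizing(train_xs, train_ys, default):
    """'Learns' a lookup table of the training data; answers `default` for anything else."""
    table = dict(zip(train_xs, train_ys))
    return lambda x: table.get(x, default)


# Hypothetical data: patient IDs as inputs and diagnoses as labels.
train_xs, train_ys = [1, 2, 3, 4], [True, False, True, False]
valid_xs, valid_ys = [5, 6, 7, 8], [True, True, False, True]

model = learn_by_memorizing(train_xs, train_ys, default=False)
accuracy_train = accuracy(model, train_xs, train_ys)  # every training input is in the table
accuracy_valid = accuracy(model, valid_xs, valid_ys)  # unseen inputs fall back to the default
print(accuracy_train, accuracy_valid)
```

A large gap between the two numbers is the signal to look for; real models overfit less blatantly, but the comparison is the same.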

Separating Training, Validation, and Test Data

In machine learning, evaluations thus always have to be performed on unseen validation data that was not available at training time. This can be production data, but in offline evaluations, the most common approach is to evaluate a model on held-out validation data that were collected together with the training data:

train_xs, train_ys, valid_xs, valid_ys = split(all_xs, all_ys)
model = learn(train_xs, train_ys)
accuracy_train = accuracy(model, train_xs, train_ys)
accuracy_valid = accuracy(model, valid_xs, valid_ys)

Beyond a single train-validation split, it is also common to repeatedly split the data and evaluate separate models trained in each split, in a process called cross-validation. Here, the same dataset is split multiple times into training and validation data, using one of many splitting strategies available (e.g., k-fold, leave-one-out, or Monte Carlo sampling).
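As an illustration of the k-fold strategy, the following sketch generates index sets for each fold. It does not shuffle, so it assumes the data is already in random order; `k_fold_indices` is an illustrative name, not a library function.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, valid_indices) pairs for k-fold cross-validation.

    Each data point lands in the validation set of exactly one fold.
    """
    indices = list(range(n))
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size


folds = list(k_fold_indices(n=10, k=3))
for train, valid in folds:
    print("train:", train, "valid:", valid)
```

One model would then be trained and evaluated per fold, and the validation accuracies averaged.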

*(Graphic CC* *MBanuelos22* *BY-SA 4.0)*

When a model is repeatedly evaluated on validation data or hyperparameters are tuned with validation data, there is a risk of overfitting the model on validation data: Even though the validation data was not directly used for training, it was used for making decisions in the training process. It is therefore common to not only split between training and validation data but additionally also hold out test data that, ideally, is only used once for a final evaluation.

train_xs, train_ys, valid_xs, valid_ys, test_xs, test_ys = split(all_xs, all_ys)

best_model = None
best_model_accuracy = 0
for hyperparameters in candidate_hyperparameters:
    candidate_model = learn(train_xs, train_ys, hyperparameters)
    model_accuracy = accuracy(candidate_model, valid_xs, valid_ys)
    if model_accuracy > best_model_accuracy:
        best_model = candidate_model
        best_model_accuracy = model_accuracy

# final evaluation on the held-out test data, ideally performed only once
accuracy_test = accuracy(best_model, test_xs, test_ys)
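The `split` function is left undefined above. One possible sketch, assuming independent data points and using made-up default fractions and a fixed seed for reproducibility, is:

```python
import random


def split(all_xs, all_ys, valid_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then cut into training, validation, and test portions."""
    indices = list(range(len(all_xs)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    n_valid = int(len(indices) * valid_fraction)
    test_idx = indices[:n_test]
    valid_idx = indices[n_test:n_test + n_valid]
    train_idx = indices[n_test + n_valid:]

    def pick(values, idx):
        return [values[i] for i in idx]

    return (pick(all_xs, train_idx), pick(all_ys, train_idx),
            pick(all_xs, valid_idx), pick(all_ys, valid_idx),
            pick(all_xs, test_idx), pick(all_ys, test_idx))


# Hypothetical data: ten inputs with labels derived from them.
all_xs = list(range(10))
all_ys = [x * x for x in all_xs]
train_xs, train_ys, valid_xs, valid_ys, test_xs, test_ys = split(all_xs, all_ys)
print(len(train_xs), len(valid_xs), len(test_xs))
```

Inputs and labels are selected with the same indices, so each input stays paired with its label in all three sets.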

On terminology: Parameters and Hyperparameters

In the machine learning community, the term parameters is commonly used to describe the decisions (e.g., weights, coefficients) of a model. The values of those parameters are learned from data during the learning process. To a software engineer, these are constants in the algorithm that the model represents, whereas the user inputs would be considered as parameters passed to the model during prediction.

The parameters passed to the learning algorithm that control how the learning algorithm builds the model are called hyperparameters in the machine-learning community. Hyperparameters may include regularization weights, fitness functions, as well as the architecture of neural networks. To a software engineer, these are options for the learning algorithm, similar to compiler options.

// max_depth and min_support are hyperparameters
def learn_decision_tree(data, max_depth, min_support): Model = …

// A, B, C are model parameters of model f
def decision_tree_model(outlook, temperature, humidity, windy) =
  if A==outlook:
    return B*temperature + C*windy > 10
  …

Generalizing beyond the training distribution?

A key assumption in essentially all offline accuracy evaluations of machine-learned models is that training, validation, and test data are drawn from the same distribution and that this distribution is representative of the target distribution. The technical term is “independent and identically distributed (i.i.d.)”. Data that does not come from the same target distribution is typically referred to as out-of-distribution data.

This i.i.d. assumption is well reflected in the typical train-validation-test data split, where first data is sampled from the target population and then this data is split into separate sets, all drawn from the same distribution.

Much of machine-learning theory is deeply rooted in the assumption that we try to generalize to the same distribution from which the training data is drawn. Technically one may not have any reason to expect a system to generalize beyond the distribution from which training and test data is sampled. Many machine-learning experts argue that it is unfair to evaluate a model on out-of-distribution data, just as it would be unfair to quiz an elementary-school child on high-school math — we simply haven’t taught the relevant concepts.

In practice though, we often hope that the model generalizes beyond the distribution of the training data in many settings. We hope that the model is robust to adversarial attacks, where attackers craft inputs that do not naturally occur in practice to trick the model. We hope that the model can deal gracefully with distribution shifts over time (known as data drift, see data quality chapter for details), e.g., using the cancer predictor on an increasingly aging population for which we have little training data. We hope that the model can generalize even if the training and test distribution is not exactly representative of the distribution in production, e.g., when the cancer detector is deployed in a different hospital with different equipment and different typical patient demographics.

While it is possible to specifically test against out-of-distribution data (see more on capability testing and invariants in the next chapter), the standard accuracy evaluation in machine learning does not attempt to make any judgement of the model’s accuracy beyond the training distribution.

Pitfalls of Measuring Prediction Accuracy Offline

Unfortunately, it is quite common to find that the prediction accuracy observed in an offline evaluation on a test dataset is much higher than what can be achieved in an independent evaluation or in production. This problem can be found equally in academic papers that report high accuracy numbers that are challenging to reproduce independently and in real-world data science projects where initially promising results during development do not hold when deploying the system. Here, we list a number of common reasons for this discrepancy, to provide awareness of pitfalls and to encourage participants in model development and evaluation to pay particular attention in order to avoid these pitfalls.

Using test data that are not representative

A (usually implicit) assumption behind the regimen of measuring prediction accuracy with test data is that the test data is representative of the data to be expected in practice, i.e., that it is drawn from the same distribution.

Unfortunately, this is not always true in practice. For example, we may develop and evaluate our cancer detector on scans from equipment in a specific hospital with patients typical for that hospital (e.g., shaping age and ethnic demographics or environmental exposure to cancer-causing toxins) and in a given time window. While the model may work well for predictions in this specific setting, we may see much lower prediction accuracy when trying to deploy at a later time or in a different hospital, because we are trying to generalize beyond the training distribution.

When the distribution of input data in production is different from the distribution during training and evaluation, we are making lots of out-of-distribution predictions, on which the model may be unreliable. Accuracy reported from offline evaluation may be substantially higher than accuracy found in production.

To avoid this problem, it is important to collect representative test data from the actual target distribution and to account for data drift (distributions changing over time, see data quality chapter). Collecting representative data can be a significant challenge that may require substantial attention when building a production machine learning system, because the target distribution may be poorly understood or data may only be partially accessible. Often data samples that are convenient to collect or already publicly shared are not representative of the target distribution of a production system, hence one either has to compensate for that problem (e.g., detect out-of-distribution data as part of a risk mitigation strategy, see risk chapter) or invest in one’s own, more representative (and often more expensive) data collection.

Using misleading accuracy metrics

Accuracy results can be made to look very promising, even when the actual accuracy is not suitable for a task. It is always important to use suitable accuracy measures and compare them against meaningful baselines.

For example, using accuracy with very skewed base probabilities (e.g., only one out of 2000 patients in random screening is expected to have cancer, as discussed above) will often result in very high accuracy numbers for even very bad predictors. Using recall and precision may reveal a clearer picture of how the predictor is doing.

In most settings, the cost of false positives and false negatives is not the same, and one is preferable over the other (e.g., a false cancer warning is likely better than missing a cancer diagnosis). Accuracy and area-under-the-curve measures do not reflect such differences and may lead to misleading interpretations. More targeted measures like recall and precision or even derived utility or cost measures may be more useful.

Similarly, averaging across all data, as virtually all common accuracy measures do, may miss nuances and fairness issues. For example, the cancer classifier may work well on the majority of patients but often miss cancer in minority patients (e.g., black women). When minority patients are underrepresented in the target population and test data, these bad predictions for a subpopulation may barely hurt the overall accuracy measure. In practice, it is often important to look at subpopulations, as we will discuss below.
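A sketch of such a per-subpopulation breakdown, with made-up predictions and group labels, shows how a high overall accuracy can hide a group for which the model fails entirely:

```python
from collections import defaultdict


def accuracy_by_group(predictions, labels, groups):
    """Accuracy computed separately for each subpopulation."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, actual, group in zip(predictions, labels, groups):
        total[group] += 1
        if predicted == actual:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Hypothetical evaluation data: 18 majority patients, 2 minority patients, all with cancer.
groups = ["majority"] * 18 + ["minority"] * 2
labels = [True] * 20
predictions = [True] * 18 + [False] * 2

overall = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print("overall:", overall)
print("by group:", accuracy_by_group(predictions, labels, groups))
```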

Finally, results on very small test sets are naturally noisier and should be trusted less than results on larger test sets, possibly leading to positive accuracy results that are simply due to chance.
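One rough way to quantify this noise is a confidence interval around the measured accuracy. The sketch below uses the normal approximation to the binomial distribution, which assumes independent test points and becomes unreliable for very small test sets or accuracies close to 0 or 1; the example numbers are made up.

```python
import math


def accuracy_confidence_interval(accuracy, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n independent test points."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
    low = max(0.0, accuracy - z * standard_error)
    high = min(1.0, accuracy + z * standard_error)
    return low, high


# The same measured accuracy of 90% on a small and on a large test set.
small = accuracy_confidence_interval(0.9, n=50)
large = accuracy_confidence_interval(0.9, n=5000)
print("n=50:  ", [round(v, 3) for v in small])
print("n=5000:", [round(v, 3) for v in large])
```

With 50 test points the interval spans roughly 82% to 98%, so a difference of a few percentage points between two models would not be meaningful at that size.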

Evaluating on training or validation data

Even though separating data into training, validation, and test sets is a commonly used approach, it is surprising how often mistakes happen in practice where these data sets overlap. For example, a data scientist may randomly split data and copy the separated sets to different files, then later run another split and accidentally update only one of the files, resulting in overlaps between multiple files. Even simpler, one may use the wrong variable when passing data around within a notebook.

Paying particular attention to clean training, validation, and test data separation in the learning code is a good starting point. Reviewing all learning code and having reviewers pay attention to this issue is better. Ideally, one might want to even automatically track data provenance and information flow, to make sure that test data never flows into any learning or hyperparameter tuning steps. Unfortunately, while various data tracking mechanisms are well researched, such provenance tracking tools are not yet broadly available for data science pipelines.
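Short of full provenance tracking, a simple sanity check can at least catch identical rows shared between datasets. This is a sketch with a made-up helper name; it assumes rows are hashable (e.g., tuples) and only detects exact duplicates, not near-duplicates or other dependencies.

```python
def overlapping_rows(*datasets):
    """Return the rows that appear in more than one of the given datasets."""
    first_seen_in = {}
    overlaps = set()
    for index, rows in enumerate(datasets):
        for row in set(rows):  # ignore duplicates within a single dataset
            if row in first_seen_in and first_seen_in[row] != index:
                overlaps.add(row)
            first_seen_in.setdefault(row, index)
    return overlaps


# Hypothetical datasets of (patient_id, scan_id) rows.
train = [(1, "a"), (2, "b"), (2, "b")]
valid = [(3, "c")]
test = [(2, "b"), (4, "d")]

print(overlapping_rows(train, valid, test))
```

Such a check could run as an assertion at the start of every evaluation script.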

Dependence between training and test data

When splitting data into training, validation, and test data, data is typically randomly assigned to the three datasets from a single initial dataset, often using cross-validation. However, random splitting is only valid if all data points are independent. There are many settings where this may not be true.

The most common one is time-series data, where future data depends on past data. For example, predicting stock market prices for a day is so much easier when we know not only past stock prices but also the stock price on the days after. When dealing with time-series data, it is important to avoid peeking into the future during training. Data should be split at one point in time and data older than that point can be used for training and newer data for validation. It is also possible to repeat training for every point in time and each time evaluate for only the next data point.
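Both strategies can be sketched in a few lines. The records here are made-up (day, price) pairs, and the function names are illustrative.

```python
def split_by_time(records, cutoff):
    """Train on records before the cutoff, validate on records at or after it."""
    train = [r for r in records if r[0] < cutoff]
    valid = [r for r in records if r[0] >= cutoff]
    return train, valid


def rolling_origin_splits(records, min_train):
    """Repeated evaluation: train on everything up to a point, validate on the next point."""
    ordered = sorted(records, key=lambda r: r[0])
    for t in range(min_train, len(ordered)):
        yield ordered[:t], ordered[t]


# Hypothetical stock prices as (day, price) pairs.
records = [(1, 10.0), (2, 10.5), (3, 10.2), (4, 10.8), (5, 11.0)]

train, valid = split_by_time(records, cutoff=4)
print(len(train), len(valid))

for history, next_point in rolling_origin_splits(records, min_train=3):
    print("train on days", [day for day, _ in history], "-> validate on day", next_point[0])
```

In both variants, every validation point is strictly later than all training points, so the model never sees the future during training.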

The curve is the real trend, red points are training data, green points are validation data. If validation data is randomly selected, it is much easier to predict, because the trends around it are known, but the resulting learned model may not perform well on real data when deployed.

Another common form of dependency occurs when the original dataset contains multiple data points that belong together in some form. For example, we may have multiple cancer scans from the same patient; now if some of those scans are in the training data and some in the test data, the model may have learned patient-specific information. The model may simply have learned which scans are from the same patient and report whether cancer was present in training images for that patient. Or if we have multiple cancer and non-cancer images from the same patients in the dataset, the model may have learned to distinguish cancer from non-cancer images well for that patient, but not generally for all patients. Rachel Thomas reports a similar dependency from a Kaggle competition to detect distracted drivers where the dataset contains multiple images per driver and multiple solutions learn models that detect distractions well for those drivers but not for others.

If we have data where multiple data points have dependencies, a good strategy is to cluster dependent data (e.g., all cancer scans of one patient or all photos from one driver) and randomly assign clusters to training and validation datasets.
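A minimal sketch of such a group-aware split follows. The patient IDs, the validation fraction, and the `split_by_group` helper are assumptions for illustration; the point is that the random assignment happens per group rather than per row.

```python
import random


def split_by_group(rows, group_of, valid_fraction=0.25, seed=0):
    """Assign whole groups (e.g., all scans of one patient) to training or validation."""
    groups = sorted({group_of(row) for row in rows})
    random.Random(seed).shuffle(groups)
    n_valid = max(1, int(len(groups) * valid_fraction))
    valid_groups = set(groups[:n_valid])
    train = [row for row in rows if group_of(row) not in valid_groups]
    valid = [row for row in rows if group_of(row) in valid_groups]
    return train, valid


# Hypothetical scans as (patient_id, scan_id) pairs; some patients have several scans.
scans = [("p1", "s1"), ("p1", "s2"), ("p2", "s3"), ("p3", "s4"),
         ("p3", "s5"), ("p3", "s6"), ("p4", "s7"), ("p4", "s8")]

train_scans, valid_scans = split_by_group(scans, group_of=lambda row: row[0])
print("train patients:", sorted({p for p, _ in train_scans}))
print("valid patients:", sorted({p for p, _ in valid_scans}))
```

Because groups differ in size, the resulting row counts only approximate the requested fraction.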

As data dependencies can sometimes be subtle, this may require extra attention and some experience. It is worth discussing these kinds of potential issues in the team.

Label leakage

There are lots of examples of shortcut learning where a model does not learn the actual concepts of the tasks (e.g., how does cancer look in a scan) but instead picks up on subtle signals that leak the true label in our dataset.

The most common example is probably the frequently shared urban legend of an early military AI project trying to detect Russian tanks in pictures with neural networks: It worked really well on the held-out test set but utterly failed in production. It turns out that the data (from which both training and test data were drawn) was gathered in a way that pictures with tanks were taken on sunny days whereas the other pictures were taken on cloudy days. The model then simply learned to distinguish weather — a signal that has a spurious correlation with the detection of tanks in practice. This shortcut of learning to detect weather instead of tanks was possible due to artifacts in the dataset, caused by the process of how the data was collected; it obviously is not something we can rely on in production.

Another commonly shared example relates to cancer detection, where the scanning equipment’s ID number is usually embedded in every scan. This equipment number may be useful when tracking down specific problems (e.g., a machine is considered defective and all scans need to be repeated), but should not matter for actually detecting cancer. Instead of learning how to detect cancer, the model may learn to distinguish stationary from mobile scanning equipment and figure out that scans from mobile equipment are much more likely to contain cancer. The reason is that mobile equipment is used more commonly for very sick patients who cannot be moved to the stationary scanner. That is, in this case, the model has learned to pick up on a signal that is correlated with the true label, even though it should be irrelevant to the task.

In extreme cases, we may even accidentally include the expected outcome among the inputs in our training and validation data, in which case the model appears to make perfect predictions because it simply learns to return the outcome that is already contained in the data it is evaluated on.

If the same subtle signal that leaks the true answer is available in production, a model may very well be right in depending on that signal, even though it may not be causally related. For example, if the human judgment that went into deciding what scanning equipment to use is available in production and we are not in fact trying to replace that judgment, then, by all means, we should use it for prediction. However, if the leakage is just an artifact of how the training and evaluation data was collected and the signal is not available in production, as in the tank example, we will get far too optimistic accuracy results that evaporate in production.

Label leakage can be challenging to detect. Beyond being suspicious of “too good to be true” results, the best strategy is likely to try to understand what inputs the model is using to make its decision using explainability techniques (see chapter on interpretability and explainability) to look for suspicious influences.
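As a cheap first check before reaching for explainability tooling, one can ask how well each input feature predicts the label entirely on its own. The following is a minimal sketch, not an established method from this chapter: it assumes categorical features stored as dictionaries, and the feature names, toy data, and the 0.95 threshold are all made up for illustration.

```python
from collections import Counter, defaultdict

def single_feature_accuracy(rows, labels, feature):
    """Accuracy of predicting, for each value of one feature, the majority label seen with it."""
    label_counts_by_value = defaultdict(Counter)
    for row, label in zip(rows, labels):
        label_counts_by_value[row[feature]][label] += 1
    # For every feature value, the majority label is right this many times.
    correct = sum(counts.most_common(1)[0][1]
                  for counts in label_counts_by_value.values())
    return correct / len(labels)

def suspicious_features(rows, labels, threshold=0.95):
    """Features that alone predict the label at or above the (arbitrary) threshold."""
    scores = {f: single_feature_accuracy(rows, labels, f) for f in rows[0]}
    return {f: s for f, s in scores.items() if s >= threshold}

# Hypothetical toy data: scanner type perfectly separates the labels.
rows = [
    {"scanner": "mobile", "age_group": "old"},
    {"scanner": "mobile", "age_group": "young"},
    {"scanner": "stationary", "age_group": "old"},
    {"scanner": "stationary", "age_group": "young"},
]
labels = [1, 1, 0, 0]

print(suspicious_features(rows, labels))
```

A flagged feature is only a prompt for investigation, not proof of leakage: whether depending on it is acceptable still hinges on whether the same signal will be available in production. Note also that features with nearly unique values (such as record IDs) will score highly in this check simply because it is computed on the same data it is evaluated on.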

Overfitting on test data

The danger of overfitting is widely understood, and when hyperparameters are tuned on validation data, it is common practice to hold out separate test data to detect overfitting that may have occurred during hyperparameter tuning. In competitions, organizers often do not release the test data and instead perform the evaluation themselves, to avoid repeated evaluations and overfitting. In non-competition settings, one may deliberately set aside test data and never look at it until modeling is complete. However, overfitting can still occur.

In a competition, the winning model may perform best on the test set due to some random chance without truly generalizing best. In the next competition round, this model may be used as the starting point by many entries, again selecting the model that best fits the hidden test data.

In a research community publishing on a shared benchmark problem, the reviewing process is biased toward publishing only those papers presenting the best results. The collective search by the community for the best model is not unlike hyperparameter tuning on the validation set. In fact, it has been observed that published models for many benchmark problems achieve much higher accuracy results on those benchmarks than in production or on fresh datasets from the same distribution.

*If many researchers publish best results on the same benchmark, collectively they perform “hyperparameter optimization” on the test set — Figure by Andrea Passerini*
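The effect of selecting the best of many results on one shared test set can be illustrated with a small simulation. This is a sketch under strong simplifying assumptions that are not from the text: every candidate model has exactly the same true accuracy, each prediction is an independent coin flip, and all parameter values are arbitrary.

```python
import random

def simulate_selection_bias(num_models=100, test_size=200,
                            true_accuracy=0.7, seed=0):
    """Return (best accuracy measured on the shared test set,
    accuracy of an equally good model on fresh data)."""
    rng = random.Random(seed)

    def measured_accuracy():
        # Each prediction is correct with probability true_accuracy.
        correct = sum(rng.random() < true_accuracy for _ in range(test_size))
        return correct / test_size

    # All models are equally good; they differ only by sampling noise.
    shared_test_results = [measured_accuracy() for _ in range(num_models)]
    best_on_shared = max(shared_test_results)

    # The "winner" is no better than the others, so a fresh sample of
    # the same size is again just a noisy estimate of true_accuracy.
    fresh_result = measured_accuracy()
    return best_on_shared, fresh_result

best, fresh = simulate_selection_bias()
print(f"best on shared test set: {best:.3f}, on fresh data: {fresh:.3f}")
```

Even though no model in this simulation is actually better than any other, the reported winner looks better than the true accuracy of 0.7, purely because it was picked as the maximum of many noisy measurements. The more entries compete on the same test data, the larger this gap tends to become.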

Finally, this effect can happen in a continuous experimentation environment, where many evaluations of experiments (after hyperparameter tuning) are compared on the same test data. Results are prone to overfitting if developers continue tuning their models based on results found with the same test data. To address this, the ease.ml project has developed mechanisms for a continuous-integration setting that use statistical techniques to determine when test data is losing its ability to report accurate results and needs to be refreshed.
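To get a rough feeling for how repeated evaluations erode confidence in a fixed test set, one can combine Hoeffding's inequality with a union bound. This is a textbook back-of-the-envelope calculation, not the mechanism used by ease.ml, and it assumes that the evaluated models were fixed in advance rather than tuned in response to earlier test results (adaptive tuning makes matters worse). The function names and default values are hypothetical.

```python
import math

def accuracy_margin(test_size, num_evaluations=1, delta=0.05):
    """Margin eps such that, with probability at least 1 - delta, all
    measured accuracies are within eps of the true accuracies
    (Hoeffding's inequality plus a union bound over the evaluations)."""
    return math.sqrt(math.log(2 * num_evaluations / delta) / (2 * test_size))

def required_test_size(margin, num_evaluations=1, delta=0.05):
    """Smallest test set size for which accuracy_margin is at most margin."""
    return math.ceil(math.log(2 * num_evaluations / delta) / (2 * margin ** 2))

# The same 1000 test samples support less precise claims as evaluations pile up.
for k in (1, 10, 100, 1000):
    print(k, round(accuracy_margin(1000, num_evaluations=k), 4))
```

The margin grows only logarithmically with the number of evaluations in this non-adaptive setting, but it still shows that small accuracy differences between many experiments on the same test set quickly fall within the noise.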

Summary

Traditionally, machine learning models are evaluated in terms of their average prediction accuracy across a test set that is drawn from the same distribution as the training set. The specific accuracy measure depends on the problem (e.g., binary vs multi-class classification with and without threshold, regression, ranking) and several accuracy measures account for different costs of different kinds of mistakes. To avoid overfitting, it is common to separate data into training, validation, and test data, all drawn from the same target distribution.

Overall, since whether a model fits or is useful for a problem depends on the specific problem, accuracy numbers cannot be interpreted in isolation. Instead, it is important to establish baselines and evaluate whether accuracy is acceptable for the problem at hand.

There are many common pitfalls in evaluating models that may lead to overly optimistic evaluation results, such as label leakage, overfitting on test data, and dependencies between training and test data. Avoiding them requires careful engineering and code review.

Readings and References

Pretty much every data science textbook will cover evaluating accuracy, different accuracy metrics, cross-validation, and other relevant strategies.
A quick introduction to model evaluation, including accuracy measures, slicing, and rules of thumb for adequacy criteria: Hulten, Geoff. “Building Intelligent Systems: A Guide to Machine Learning Engineering.” Apress, 2018, Chapter 19 (Evaluating Intelligence).
A discussion of the danger of overfitting on test data in continuous integration systems and strategies to avoid it: Renggli, Cedric, Bojan Karlaš, Bolin Ding, Feng Liu, Kevin Schawinski, Wentao Wu, and Ce Zhang. “Continuous Integration of Machine Learning Models with ease.ml/ci: Towards a Rigorous Yet Practical Treatment.” SysML 2019.
Examples of data dependencies in test sets: https://www.fast.ai/2017/11/13/validation-sets/

Like all chapters, this text is released under a Creative Commons 4.0 BY-SA license.