Interpretability and explainability

This post covers content from the “Interpretability and Explainability” lecture of our Machine Learning in Production course. For other chapters see the table of contents.

Machine-learned models are often opaque and make decisions that we do not understand. As discussed, we use machine learning precisely when we do not know how to solve a problem with fixed rules and instead try to learn from data; there are many examples of systems that seem to work and outperform humans, even though we have no idea of how they work. So, how can we trust models that we do not understand? How can we be confident that they are fair? Should we accept decisions made by a machine, even if we do not know the reasons? How can one appeal a decision that nobody understands? How can we debug models if something goes wrong?

Without understanding how a model works and why a model makes specific predictions, it can be difficult to trust a model, to audit it, or to debug problems. There are lots of funny and serious examples of mistakes that machine learning systems make, including 3D printed turtles reliably classified as rifles (news story), cows or sheep not recognized because they are in unusual locations (paper, blog post), a voice assistant starting music while nobody is in the apartment (news story), or an automated hiring tool automatically rejecting women (news story). Without understanding the model or individual predictions, we may have a hard time understanding what went wrong and how to improve the model.

In this chapter, we provide an overview of different strategies to explain models and their predictions and use cases where such explanations are useful.

Example: Proprietary opaque models in recidivism prediction

Northpointe’s controversial proprietary COMPAS system takes an individual’s personal data and criminal history to predict whether the person would be likely to commit another crime if released, reported as three risk scores on a 10-point scale. The idea is that a data-driven approach may be more objective and accurate than the often subjective and possibly biased view of a judge when making sentencing or bail decisions.

Without the ability to inspect the model, it is challenging to audit it for fairness concerns, such as whether the model accurately assesses risks for different populations. This has led to extensive controversy in the academic literature and press. The developers and different authors have voiced divergent views about whether the model is fair and by what standard or measure of fairness, but discussions are hampered by a lack of access to the internals of the actual model.

In addition, there is a question of how a judge would interpret and use the risk score without knowing how it is computed. For example, instructions indicate that the model does not consider the severity of the crime and thus the risk score should be combined with other factors assessed by the judge, but without a clear understanding of how the model works a judge may easily miss that instruction and wrongly interpret the meaning of the prediction.

In contrast, consider the models for the same problem represented as a scorecard or if-then-else rules below. The models both use an easy to understand format and are very compact; a human user can just read them and see all inputs and decision boundaries used. It is easy to audit these models for certain notions of fairness, e.g., to see that neither race nor an obviously correlated attribute is used in either model; the second model uses gender, which could inform a policy discussion on whether that is appropriate. Since both are easy to understand, it is also obvious that the severity of the crime is not considered by either model, which makes it more transparent to a judge what information has and has not been considered.

Interpretable model for recidivism prediction as a scorecard from Rudin, Cynthia, and Berk Ustun. “Optimized scoring systems: Toward trust in machine learning for healthcare and criminal justice.” Interfaces 48, no. 5 (2018): 449–466.

Interpretable decision rules for recidivism prediction from Rudin, Cynthia. “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.” Nature Machine Intelligence 1, no. 5 (2019): 206–215.

Defining Interpretability, Explainability, and Transparency

There are many terms used to capture to what degree humans can understand internals of a model or what factors are used in a decision, including interpretability, explainability, and transparency. These and other terms are not used consistently in the field; different authors ascribe different, often contradictory meanings to these terms or use them interchangeably. In this book, we use the following terminology:

Interpretability: We consider a model intrinsically interpretable, if a human can understand the internal workings of the model, either the entire model at once or at least the parts of the model relevant for a given prediction. This may include understanding decision rules and cutoffs and the ability to manually derive the outputs of the model. For example, the scorecard for the recidivism model can be considered interpretable, as it is compact and simple enough to be fully understood. Ideally, we even understand the learning algorithm well enough to understand how the model’s decision boundaries were derived from the training data — that is, we may not only understand a model’s rules, but also why the model has these rules.

Explainability: We consider a model explainable if we find a mechanism to provide (partial) information about the workings of the model, such as identifying influential features. We consider a model’s prediction explainable if a mechanism can provide (partial) information about the prediction, such as identifying which parts of an input were most important for the resulting prediction or which changes to an input would result in a different prediction. For example, for the proprietary COMPAS model for recidivism prediction, an explanation may indicate that the model heavily relies on the age, but not the gender of the accused; for a single prediction made to assess the recidivism risk of a person, an explanation may indicate that the large number of prior arrests is the main reason behind the high risk score. Explanations can come in many different forms, as text, as visualizations, or as examples. Explanations are usually easy to derive from intrinsically interpretable models, but can also be provided for models whose internals humans may not understand. Explanations are usually partial in nature and often approximated. The explanations may be divorced from the actual internals used to make a decision; they are often called post-hoc explanations.

Transparency: We say the use of a model is transparent if users are aware that a model is used in a system, and for what purpose. For example, the use of the recidivism model can be made transparent by informing the accused that a recidivism prediction model was used as part of the bail decision to assess recidivism risk.

Why Explainability?

Interpretable models and explanations of models and predictions are useful in many settings and can be an important building block in responsible engineering of ML-enabled systems in production. There are many different motivations why engineers might seek interpretable models and explanations.

Model debugging: According to a 2020 study among 50 practitioners building ML-enabled systems, by far the most common use case for explainability was debugging models: Engineers want to vet the model as a sanity check to see whether it makes reasonable predictions for the expected reasons given some examples, and they want to understand why models perform poorly on some inputs in order to improve them. Trying to understand model behavior can be useful for analyzing whether a model has learned expected concepts, for detecting shortcut reasoning, and for detecting problematic associations in the model (see also the chapter on capability testing). For example, developers of a recidivism model could debug suspicious predictions and see whether the model has picked up on unexpected features like the weight of the accused. In these cases, explanations are not shown to end users, but only used internally.

Auditing: When assessing a model in the context of fairness, safety, or security it can be very helpful to understand the internals of a model, and even partial explanations may provide insights. For example, it is trivial to identify in the interpretable recidivism models above whether they refer to any sensitive features relating to protected attributes (e.g., race, gender). It can also be useful to understand a model’s decision boundaries when reasoning about robustness in the context of assessing safety of a system using the model, for example, whether a smart insulin pump would be affected by a 10% margin of error in sensor inputs, given the ML model used and the safeguards in the system.

Trust: If we understand how a model makes predictions or receive an explanation for the reasons behind a prediction, we may be more willing to trust the model’s predictions for automated decision making. There are many different components to trust. Similar to debugging and auditing, we may convince ourselves that the model’s decision procedure matches our intuition or that it is suited for the target domain. For example, we may trust the neutrality and accuracy of the recidivism model if it has been audited and we understand how it was trained and how it works. Similarly, we may decide to trust a model learned for identifying important emails if we understand that the signals it uses match well with our own intuition of importance. We may also be better able to judge whether we can transfer the model to a different target distribution, for example, whether the recidivism model learned from data in one state may match the expectations in a different state. We may also identify that the model depends only on robust features that are difficult to game, leading to more trust in the reliability of predictions in adversarial settings, e.g., the recidivism model not depending on whether the accused expressed remorse.

Actionable insights to improve outcomes: In many situations it may be helpful for users to understand why a decision was made so that they can work toward a different outcome in the future. While in recidivism prediction there may only be limited options to change inputs at the time of the sentencing or bail decision (the accused cannot change their arrest history or age), in many other settings providing explanations may encourage behavior changes in a positive way. For example, explaining the reason behind a high insurance quote may offer insights into how to reduce insurance costs in the future when rated by a risk model (e.g., drive a different car, install an alarm system), increase the chance for a loan when using an automated credit scoring model (e.g., have a longer credit history, pay down a larger percentage), or improve grades from an automated grading system (e.g., avoid certain kinds of mistakes). One can also use insights from a machine-learned model to aim to improve outcomes (in positive and abusive ways), for example, by identifying from a model what kind of content keeps readers of a newspaper on their website, what kind of messages foster engagement on Twitter, or how to craft a message that encourages users to buy a product. By understanding factors that drive outcomes, one can design systems or content in a more targeted fashion.

Regulation: While not widely adopted, there are legal requirements to provide explanations about (automated) decisions to users of a system in some contexts. For example, the 1974 US Equal Credit Opportunity Act requires creditors to notify applicants of action taken with specific reasons: “The statement of reasons for adverse action required by paragraph (a)(2)(i) of this section must be specific and indicate the principal reason(s) for the adverse action.” This rule was designed to stop unfair practices of denying credit to some populations based on arbitrary subjective human judgement, but also applies to automated decisions. The European Union’s 2016 General Data Protection Regulation (GDPR) includes a rule framed as Right to Explanation for automated decisions: “processing should be subject to suitable safeguards, which should include specific information to the data subject and the right to obtain human intervention, to express his or her point of view, to obtain an explanation of the decision reached after such assessment and to challenge the decision.” The specifics of that regulation are disputed and at the time of this writing no clear guidance is available. Explainability mechanisms may be helpful to meet such regulatory standards, though it is not clear what kind of explanations are required or sufficient.

Curiosity, learning, discovery, causality, science: Finally, models are often used for discovery and science. Statistical modeling has long been used in science to uncover potential causal relationships, such as identifying various factors that may cause cancer among many (noisy) observations or even understanding factors that may increase the risk of recidivism. In such contexts, we do not simply want to make predictions, but understand underlying rules. If we understand the rules, we have a chance to design societal interventions, such as reducing crime through fighting child poverty or systemic racism. Neither using inherently interpretable models nor finding explanations for black-box models alone is sufficient to establish causality, but discovering correlations from machine-learned models is a great tool for generating hypotheses — with a long history in science.

More powerful and often hard to interpret machine-learning techniques may provide opportunities to discover more complicated patterns that may involve complex interactions among many features and elude simple explanations, as seen in many tasks where machine-learned models vastly outperform human accuracy. Yet, we may be able to learn how those models work to extract actual insights. For example, consider this Vox story on our lack of understanding how smell works: Science does not yet have a good understanding of how humans or animals smell things. We know that dogs can learn to detect the smell of various diseases, but we have no idea how. We know some parts, but cannot put them together to a comprehensive understanding. Yet it seems that, with machine-learning techniques, researchers are able to build robot noses that can detect certain smells, and eventually we may be able to recover explanations of how those predictions work toward a better scientific understanding of smell.

Understanding a Model

We can discuss interpretability and explainability at different levels. We start with strategies to understand the entire model globally, before looking at how we can understand individual predictions or get insights into the data used for training the model.

When trying to understand the entire model, we are usually interested in understanding decision rules and cutoffs it uses or understanding what kind of features the model mostly depends on. Here, we can either use intrinsically interpretable models that can be directly understood by humans or use various mechanisms to provide (partial) explanations for more complicated models.

Intrinsically Interpretable Models

Sparse linear models are widely considered to be inherently interpretable. We can inspect the weights of the model and interpret decisions based on the sum of individual factors. The model coefficients often have an intuitive meaning. Linear models can also be represented like the scorecard for recidivism above (though learning nice models like these that have simple weights, few terms, and simple rules for each term like “Age between 18 and 24” may not be trivial). In spaces with many features, regularization techniques can help to select only the important features for the model (e.g., Lasso).


Not all linear models are easily interpretable though. If linear models have many terms, they may exceed human cognitive capacity for reasoning. If the features in those terms encode complicated relationships (interactions, nonlinear factors, preprocessed features without intuitive meaning), one may read the coefficients but have no intuitive understanding of their meaning.

Shallow decision trees are also natural for humans to understand, since they are just a sequence of binary decisions. At each decision, it is straightforward to identify the decision boundary. As long as decision trees do not grow too much in size, it is usually easy to understand the global behavior of the model and how various features interact. For example, the if-then-else form of the recidivism model above is a textual representation of a simple decision tree with few decisions. Just like linear models, decision trees can become hard to interpret globally once they grow in size. Beyond sparse linear models and shallow decision trees, if-then rules mined from data, for example, with association rule mining techniques, are also usually straightforward to understand. Here each rule can be considered independently.

In contrast, neural networks are usually not considered inherently interpretable, since computations involve many weights and step functions without any intuitive representation, often over large input spaces (e.g., colors of individual pixels) and often without easily interpretable features. Random forests are also usually not easy to interpret because they average the behavior across multiple trees, thus obfuscating the decision boundaries.

It is a broadly shared assumption that machine-learning techniques that produce inherently interpretable models produce less accurate models than non-interpretable techniques do for many problems. For example, sparse linear models are often considered as too limited, since they can only model influences of few features to remain sparse and cannot easily express non-linear relationships; decision trees are often considered unstable and prone to overfitting. Certain vision and natural language problems seem hard to model accurately without deep neural networks. Hence many practitioners may opt to use non-interpretable models in practice.

Global Surrogate Models

For models that are not inherently interpretable, it is often possible to provide (partial) explanations. This can often be done without access to the model internals just by observing many predictions. There is a vast space of possible techniques, but here we provide only a brief overview.

Even if the target model is not interpretable, a simple idea is to learn an interpretable surrogate model as a close approximation to represent the target model. As surrogate models, typically inherently interpretable models like linear models and decision trees are used. To this end, one picks a number of data points from the target distribution (which do not need labels, do not need to be part of the training data, and can be randomly selected or drawn from production data) and then asks the target model for predictions on each of those points. Taking those predictions as labels, the surrogate model is trained on this set of input-output pairs. The resulting surrogate model can be interpreted as a proxy for the target model.

For example, even if we do not have access to the proprietary internals of the COMPAS recidivism model, if we can probe it for many predictions, we can learn risk scores for many (hypothetical or real) people and learn a sparse linear model as a surrogate. Many discussions and external audits of proprietary black-box models use this strategy.

Using decision trees or association rule mining techniques as our surrogate model, we may also identify rules that explain high-confidence predictions for some regions of the input space. For example, we might identify that the model reliably predicts re-arrest if the accused is male and between 18 and 21 years old. Such rules can explain parts of the model.

While surrogate models are flexible, intuitive, and easy to interpret, they are only proxies for the target model and not necessarily faithful. Hence interpretations derived from the surrogate model may not actually hold for the target model. For example, a surrogate model for the COMPAS model may learn to use gender for its predictions even if it was not used in the original model. It is possible to measure how well the surrogate model fits the target model, e.g., through the $R^2$ score, but a high fit still does not provide guarantees about correctness. If it is possible to learn a highly accurate surrogate model, one should ask why one does not use an interpretable machine learning technique to begin with.
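The surrogate procedure described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any real audit was performed: `black_box` is a made-up stand-in for an opaque target model, the input distribution is invented, and the surrogate is a single-split decision rule (a decision stump) rather than a full tree or sparse linear model. Fit is measured here as simple agreement with the target model's predictions.

```python
import random

def black_box(age, priors):
    """Hypothetical opaque target model; we only call it, never inspect it."""
    score = 0.6 * priors - 0.04 * (age - 18) + (0.1 if priors > 3 else 0.0)
    return 1 if score > 1.0 else 0

def fit_stump(rows, labels):
    """Find the single rule 'feature >= threshold -> 1' that agrees most with the labels."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            predictions = [1 if row[feature] >= threshold else 0 for row in rows]
            agreement = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
            if best is None or agreement > best[0]:
                best = (agreement, feature, threshold)
    return best  # (agreement on the sample, feature index, threshold)

rng = random.Random(0)
# Unlabeled inputs drawn from an assumed target distribution.
samples = [(rng.randint(18, 70), rng.randint(0, 10)) for _ in range(500)]
# The labels come from the target model, not from ground truth.
target_predictions = [black_box(age, priors) for age, priors in samples]

fidelity, feature, threshold = fit_stump(samples, target_predictions)
FEATURE_NAMES = ["age", "priors"]
print(f"surrogate rule: predict 1 if {FEATURE_NAMES[feature]} >= {threshold}")
print(f"agreement with target model on the sample: {fidelity:.2f}")
```

Note that the stump ignores age entirely even though the hypothetical target model uses it, which illustrates why a well-fitting surrogate is still only a proxy.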

Feature Importance

While it does not provide deep insights into the inner workings of a model, a simple explanation of feature importance can provide insights about how sensitive the model is to various inputs. Feature importance is the measure of how much a model relies on each feature in making its predictions. These are highly compressed global insights about the model.

In a nutshell, one compares the accuracy of the target model with the accuracy of a model trained on the same training data, except omitting one of the features. For example, we may compare the accuracy of a recidivism model trained on the full training data with the accuracy of a model trained on the same data after removing age as a feature. If accuracy differs between the two models, this suggests that the original model relies on the feature for its predictions. The larger the accuracy difference, the more the model depends on the feature. This is simply repeated for all features of interest and can be plotted as shown below. It can be applied to interactions between sets of features too.

Feature importance in the DC bike sharing data, from Christoph Molnar. “Interpretable Machine Learning.” 2019

To avoid potentially expensive repeated learning, feature importance is typically evaluated directly on the target model by scrambling one feature at a time in the test set. That is, to test the importance of a feature, all values of that feature in the test set are randomly shuffled, so that the model cannot depend on it. Note that if correlations exist, this may create unrealistic input data that does not correspond to the target domain (e.g., a 1.8 meter tall infant when scrambling age). For models with very many features (e.g. vision models) the average importance of individual features may not provide meaningful insights.
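The shuffling approach can be sketched as follows. This is an illustrative sketch only: the `model` function, the feature names, and the test data are all hypothetical, and we assume for simplicity that the model is perfectly accurate on the unshuffled test set so that any accuracy drop is caused by the scrambling.

```python
import random

FEATURES = ["age", "priors", "height_cm"]

def model(row):
    """Hypothetical trained model: relies on priors, a little on age, never on height."""
    age, priors, _height = row
    return 1 if 0.5 * priors - 0.02 * (age - 18) > 1.0 else 0

def accuracy(predict, rows, labels):
    return sum(predict(row) == label for row, label in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, feature, rng):
    """Accuracy drop after shuffling one feature column across the test set."""
    baseline = accuracy(predict, rows, labels)
    column = [row[feature] for row in rows]
    rng.shuffle(column)
    scrambled = [row[:feature] + (value,) + row[feature + 1:]
                 for row, value in zip(rows, column)]
    return baseline - accuracy(predict, scrambled, labels)

rng = random.Random(1)
test_rows = [(rng.randint(18, 70), rng.randint(0, 10), rng.randint(150, 200))
             for _ in range(1000)]
# Assumption for illustration: the model is perfectly accurate on this test set.
test_labels = [model(row) for row in test_rows]

importance = {name: permutation_importance(model, test_rows, test_labels, i, rng)
              for i, name in enumerate(FEATURES)}
for name, drop in sorted(importance.items(), key=lambda item: -item[1]):
    print(f"{name}: accuracy drop {drop:.3f}")
```

Because the hypothetical model never reads the height feature, shuffling it leaves accuracy unchanged, whereas shuffling prior arrests causes the largest drop.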

Partial Dependence Plot (PDP)

While feature importance computes the average explanatory power added by each feature, more visual explanations such as those of partial dependence plots can help to better understand how features (on average) influence predictions. For example, in the plots below, we can observe how the number of bikes rented in DC is affected (on average) by temperature, humidity, and wind speed. These plots allow us to observe whether a feature has a linear influence on predictions, a more complex behavior, or none at all (a flat line). The plots work naturally for regression problems, but can also be adapted for classification problems by plotting class probabilities of predictions.

Example of partial dependence plots for the DC bike sharing data, from Christoph Molnar. “Interpretable Machine Learning.” 2019
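The values behind a partial dependence plot can be computed with a simple loop: for each value on a grid, overwrite the feature of interest in every row and average the model's predictions. The sketch below assumes a made-up regression model `predicted_rentals` and three invented data rows; it is not the model or data behind the figure above.

```python
def predicted_rentals(row):
    """Hypothetical trained regression model for daily bike rentals (ignores wind)."""
    temperature, humidity, _wind = row
    return 1000 + 200 * temperature - 5 * temperature ** 2 - 10 * humidity

def partial_dependence(predict, rows, feature, grid):
    """Average prediction when the given feature is forced to each grid value."""
    curve = []
    for value in grid:
        modified = [row[:feature] + (value,) + row[feature + 1:] for row in rows]
        curve.append(sum(predict(row) for row in modified) / len(modified))
    return curve

# (temperature in C, humidity in %, wind speed in km/h); made-up observations
rows = [(5, 40, 10), (15, 60, 20), (25, 80, 5)]

temperature_curve = partial_dependence(predicted_rentals, rows, 0, [0, 10, 20, 30])
wind_curve = partial_dependence(predicted_rentals, rows, 2, [0, 10, 20])
print(temperature_curve)  # rises, peaks, then falls: a nonlinear influence
print(wind_curve)         # identical values: a flat line, no influence
```

Plotting each curve against its grid yields the kind of chart shown in the figure; the flat wind curve reflects that this hypothetical model does not use wind speed at all.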

Various other visual techniques have been suggested, as surveyed in Molnar’s book Interpretable Machine Learning.

Understanding a Prediction

While the techniques described in the previous section provide explanations for the entire model, in many situations, we are interested in explanations for a specific prediction. For example, we might explain which factors were the most important to reach a specific prediction or we might explain what changes to the inputs would lead to a different prediction.

Below, we sample a number of different strategies to provide explanations for predictions. Many of these are straightforward to derive from inherently interpretable models, but explanations can also be generated for black-box models.

Feature influences

There are many different strategies to identify which features contributed most to a specific prediction.

In a linear model, it is straightforward to identify features used in the prediction and their relative importance by inspecting the model coefficients. For example, when making predictions of a specific person’s recidivism risk with the scorecard shown in the beginning of this chapter, we can identify all factors that contributed to the prediction and list all or the ones with the highest coefficients. Note that we can list both positive and negative factors. For example, based on the scorecard, we might explain to an 18 year old without prior arrest that the prediction “no future arrest” is based primarily on having no prior arrest (three factors with a total of -4), but that the age was a factor that was pushing substantially toward predicting “future arrest” (two factors with a total of +3). It is also always possible to derive only those features that influence the difference between two inputs, for example explaining how a specific person is different from the average person or a specific different person.
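For a scorecard-style linear model, listing the factors behind one prediction amounts to multiplying each coefficient by the corresponding input and sorting by magnitude. The sketch below uses a hypothetical scorecard; the factor names, points, and cutoff are invented for illustration and are not those of the published scorecard.

```python
# Hypothetical scorecard: factors and points are made up for illustration.
POINTS = {
    "age_18_to_24": 2,
    "prior_arrests_2_or_more": 1,
    "prior_arrests_5_or_more": 2,
    "age_40_or_older": -1,
}
THRESHOLD = 2  # assumed cutoff: predict "future arrest" at or above this total

def explain_prediction(person):
    """Return the prediction, the total score, and the nonzero contributions by magnitude."""
    contributions = {name: points * person.get(name, 0) for name, points in POINTS.items()}
    total = sum(contributions.values())
    prediction = "future arrest" if total >= THRESHOLD else "no future arrest"
    ranked = sorted(((name, c) for name, c in contributions.items() if c != 0),
                    key=lambda item: -abs(item[1]))
    return prediction, total, ranked

person = {"age_18_to_24": 1, "prior_arrests_2_or_more": 1}
prediction, total, ranked = explain_prediction(person)
print(prediction, total)
for name, contribution in ranked:
    print(f"  {name}: {contribution:+d}")
```

Factors with negative points would show up in the same list with a minus sign, which is how both positive and negative influences can be reported together.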

Feature influences can be derived from different kinds of models and visualized in different forms. The most common form is a bar chart that shows features and their relative influence; for vision problems it is also common to show the most important pixels for and against a specific prediction.

Example of an explanation describing which features were most influential in predicting a mushroom to be poisonous, from the LIME documentation.
Example of visually illustrating which pixels were most important for and against predicting this image to contain a cat, from the LIME documentation.

Explaining a prediction in terms of the most important feature influences is an intuitive and contrastive explanation. It may be useful for debugging problems. However, unless the models only use very few features, explanations usually only show the most influential features for a given prediction.

When we do not have access to the model internals, feature influences can be approximated through techniques like LIME and SHAP. LIME is a relatively simple and intuitive technique, based on the idea of surrogate models. However, instead of learning a global surrogate model from samples in the entire target space, LIME learns a local surrogate model from samples in the neighborhood of the input that should be explained. The local decision model attempts to explain nearby decision boundaries, for example, with a simple sparse linear model; we can then use the coefficients of that local surrogate model to identify which features contribute most to the prediction (around this nearby decision boundary).

For illustration, in the figure below, a nontrivial model (of which we cannot access internals) distinguishes the grey from the blue area, and we want to explain the prediction for “grey” given the yellow input. We first sample predictions for lots of inputs in the neighborhood of the target yellow input (black dots) and then learn a linear model to best distinguish grey and blue labels among the points in the neighborhood, giving higher weight to inputs nearer to the target. The learned linear model (white line) will not be able to predict grey and blue areas in the entire input space, but will identify a nearby decision boundary. From this model, by looking at coefficients, we can derive that both features x1 and x2 move us away from the decision boundary toward a grey prediction. The full process is automated through various libraries implementing LIME.

Visual explanation of how LIME learns a local decision boundary by Christoph Molnar. “Interpretable Machine Learning: A Guide for Making Black Box Models Explainable.” 2019
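The core idea of a local surrogate can be sketched without any library: sample inputs around the point of interest, query the model, weight samples by proximity, and fit a weighted linear model. This is a simplified sketch in the spirit of LIME, not the LIME library's implementation; the `black_box` function, the Gaussian sampling, and the kernel width are assumptions chosen for illustration.

```python
import math
import random

def black_box(x1, x2):
    """Hypothetical opaque classifier with a curved decision boundary."""
    return 1 if x1 * x1 + x2 > 2.0 else 0

def solve(a, b):
    """Solve the small linear system a @ w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w

def local_surrogate(predict, point, rng, n_samples=2000, spread=1.0, kernel_width=0.75):
    """Fit a proximity-weighted linear model to predictions sampled around `point`."""
    xs, ys, ws = [], [], []
    for _ in range(n_samples):
        x = [v + rng.gauss(0, spread) for v in point]
        dist2 = sum((a - b) ** 2 for a, b in zip(x, point))
        xs.append([1.0] + x)                             # leading 1.0 models the intercept
        ys.append(predict(*x))
        ws.append(math.exp(-dist2 / kernel_width ** 2))  # nearer samples count more
    k = len(point) + 1
    # Weighted normal equations: (X^T W X) beta = X^T W y
    a = [[sum(w * x[i] * x[j] for x, w in zip(xs, ws)) for j in range(k)] for i in range(k)]
    b = [sum(w * x[i] * y for x, y, w in zip(xs, ys, ws)) for i in range(k)]
    return solve(a, b)

rng = random.Random(0)
intercept, coef_x1, coef_x2 = local_surrogate(black_box, (1.5, 0.5), rng)
print(f"local coefficients: x1={coef_x1:.2f}, x2={coef_x2:.2f}")
```

Positive coefficients for both features indicate that, near this input, increasing either one pushes the prediction toward the class predicted for the input. The coefficients describe only the neighborhood of the input, not the model as a whole.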

In addition to LIME, Shapley values and the SHAP method have gained popularity, and are currently the most common method for explaining predictions of black-box models in practice, according to the recent study of practitioners cited above. They provide local explanations of feature influences, based on a solid game-theoretic foundation, describing the average influence of each feature when considered together with other features in a fair allocation (technically, “The Shapley value is the average marginal contribution of a feature value across all possible coalitions”). Similar to LIME, the approach is based on analyzing many sampled predictions of a black-box model. The measure is computationally expensive, but many libraries and approximations exist. We recommend Molnar’s Interpretable Machine Learning book for an explanation of the approach.
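For a model with only a handful of features, Shapley values can be computed exactly by averaging each feature's marginal contribution over all orderings in which features are added. The sketch below makes a simplifying assumption that practical libraries handle differently: a feature that is "absent" from a coalition is set to the value of a single baseline instance. The `risk_score` model, the instance, and the baseline are hypothetical.

```python
from itertools import permutations

def risk_score(person):
    """Hypothetical model with an interaction between youth and prior arrests."""
    young = person["age"] < 25
    score = 2 * person["priors"] + (5 if young else 0)
    if young and person["priors"] > 0:
        score += 3
    return score

def shapley_values(predict, instance, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orders = list(permutations(names))
    for order in orders:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = instance[name]   # add this feature to the coalition
            value = predict(current)
            totals[name] += value - previous  # marginal contribution in this ordering
            previous = value
    return {name: total / len(orders) for name, total in totals.items()}

instance = {"age": 20, "priors": 4, "gender": "male"}
baseline = {"age": 40, "priors": 0, "gender": "female"}
values = shapley_values(risk_score, instance, baseline)
print(values)
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the bonus from the interaction is split evenly between age and prior arrests. Enumerating all orderings grows factorially with the number of features, which is why sampling-based approximations are used in practice.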

Again, black-box explanations are not necessarily faithful to the underlying models and should be considered approximations. In addition, LIME explanations in particular are known to be often unstable.


Anchors

Where feature influences describe how much individual features contribute to a prediction, anchors try to capture a sufficient subset of features that determine a prediction. For example, if a person has 7 prior arrests, the recidivism model will always predict a future arrest independent of any other features; we can even generalize that rule and identify that the model will always predict another arrest for any person with 5 or more prior arrests.

In a nutshell, an anchor describes a region of the input space around the input of interest, where all inputs in that region (likely) yield the same prediction. Ideally, the region is as large as possible and can be described with as few constraints as possible. Anchors are easy to interpret and can be useful for debugging, can help to understand which features are largely irrelevant for a decision, and provide partial explanations about how robust a prediction is (e.g., how much various inputs could change without changing the prediction). They even work when models are complex and nonlinear in the input’s neighborhood.

Anchors are straightforward to derive from decision trees, but techniques have been developed also to search for anchors in predictions of black-box models, by sampling many model predictions in the neighborhood of the target input to find a large but compactly described region. These techniques can be applied to many domains, including tabular data and images.

Example of an anchor detected in an object recognition model for the concept “beagle” from Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. “Anchors: High-precision model-agnostic explanations.” In Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
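Whether a candidate set of features acts as an anchor for a black-box model can be estimated by sampling: keep the candidate features fixed at the values of the input of interest, randomize everything else, and measure how often the prediction stays the same. The sketch below only evaluates given candidates and does not implement the search procedure from the paper; the `model` function and the sampling ranges are hypothetical.

```python
import random

def model(age, priors, employed):
    """Hypothetical model: 5 or more prior arrests always yields prediction 1."""
    if priors >= 5:
        return 1
    return 1 if (age < 25 and not employed) else 0

def anchor_precision(predict, instance, fixed, rng, n=2000):
    """Share of sampled inputs, agreeing with `instance` on `fixed`, that keep its prediction."""
    target = predict(**instance)
    same = 0
    for _ in range(n):
        # Assumed input distribution for the features that are not fixed.
        sample = {"age": rng.randint(18, 70),
                  "priors": rng.randint(0, 10),
                  "employed": rng.random() < 0.5}
        sample.update({name: instance[name] for name in fixed})
        same += predict(**sample) == target
    return same / n

rng = random.Random(0)
instance = {"age": 30, "priors": 7, "employed": True}
precision_priors = anchor_precision(model, instance, {"priors"}, rng)
precision_age = anchor_precision(model, instance, {"age"}, rng)
print(f"fixing priors: {precision_priors:.2f}, fixing age: {precision_age:.2f}")
```

Fixing only the number of prior arrests keeps the prediction stable for every sample, so it is a sufficient explanation for this input, whereas fixing only age is not.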

Counterfactual Explanations

Counterfactual explanations describe conditions under which the prediction would have been different; for example, “if the accused had one fewer prior arrest, the model would have predicted no future arrests” or “if you had $1500 more capital, the loan would have been approved.” Counterfactual explanations are intuitive for humans, providing contrastive and selective explanations for a specific prediction. Counterfactual explanations can often provide suggestions for how to change behavior to achieve a different outcome, though not all features are under a user’s control (e.g., none in the recidivism model, some in loan assessment).

For every prediction, there are many possible changes that would alter the prediction, e.g., “if the accused had one fewer prior arrest”, “if the accused was 15 years older”, “if the accused was female and had up to one more arrest.” This is also known as the Rashomon effect after the famous movie by the same name in which multiple contradictory explanations are offered for the murder of a Samurai from the perspective of different narrators. Typically, we are interested in the example with the smallest change or the change to the fewest features, but there may be many other factors to decide which explanation might be the most useful.

There are many strategies to search for counterfactual explanations. If internals of the model are known, there are often effective search strategies, but search is also possible for black-box models. In the simplest case, one can randomly search in the neighborhood of the input of interest until an example with a different prediction is found. With access to the model gradients or confidence values for predictions, various more tailored search strategies are possible (e.g., hill climbing, Nelder–Mead). Search strategies can use different distance functions, to favor explanations changing fewer features or favor explanations changing only a specific subset of features (e.g., those that can be influenced by users). In a sense, counterfactual explanations are a dual of adversarial examples (see security chapter) and the same kind of search techniques can be used.
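The simplest strategy, random search in the neighborhood of the input, can be sketched in a few lines. The following is only an illustration under stated assumptions: `loan_model`, its scoring weights, the feature names, the step sizes, and the distance function are all hypothetical. The distance function adds a small penalty per changed feature to favor explanations that change few features, and only features listed in `steps` (those the applicant can influence) are ever perturbed.

```python
import random

def loan_model(x):
    """Hypothetical black-box model: 1 = approved, 0 = rejected."""
    score = 2 * x["capital"] + 500 * x["years_employed"] - 1000 * x["defaults"]
    return 1 if score >= 10000 else 0

def distance(a, b, scales):
    """Scaled L1 distance plus a small penalty for each changed feature."""
    d = 0.0
    for key in a:
        if a[key] != b[key]:
            d += abs(a[key] - b[key]) / scales[key] + 0.1
    return d

def random_counterfactual(model, x, steps, scales, n_trials=5000, seed=0):
    """Randomly perturb mutable features; keep the closest input that flips the prediction."""
    rng = random.Random(seed)
    original = model(x)
    best, best_dist = None, float("inf")
    for _ in range(n_trials):
        candidate = dict(x)
        for key, step in steps.items():  # only mutable features appear in steps
            if rng.random() < 0.5:
                candidate[key] = max(0, x[key] + rng.randint(-10, 10) * step)
        if model(candidate) != original:
            d = distance(x, candidate, scales)
            if d < best_dist:
                best, best_dist = candidate, d
    return best

applicant = {"capital": 3000, "years_employed": 4, "defaults": 1}
steps = {"capital": 500, "years_employed": 1}  # past defaults cannot be changed
scales = {"capital": 1000, "years_employed": 1, "defaults": 1}
counterfactual = random_counterfactual(loan_model, applicant, steps, scales)
changed = {k: v for k, v in counterfactual.items() if v != applicant[k]}
```

The `changed` dictionary holds the features that differ from the original input and can be phrased as an explanation of the form “if you had this much capital, the loan would have been approved.” Choosing different `scales` or penalties yields different counterfactuals for the same prediction.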


Predictions based on the k-nearest neighbors are sometimes considered inherently interpretable (assuming an understandable distance function and meaningful instances) because predictions are purely based on similarity with labeled training data and a prediction can be explained by providing the nearest similar data as examples. For example, car prices can be predicted by showing examples of similar past sales. Providing a distance-based explanation for a black-box model by using a k-nearest neighbor approach on the training data as a surrogate may provide insights but is not necessarily faithful.
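As a minimal sketch of such a similarity-based explanation, the snippet below predicts a car price as the average of the k most similar past sales and returns those sales as the explanation. The data, the feature names, and the distance function (one year of age weighted like 10,000 km) are made up for illustration.

```python
def knn_predict_with_examples(past_sales, query, k=3):
    """Predict a price as the mean of the k nearest past sales and return them as evidence."""
    def dist(sale):
        # Assumed distance: one year of age counts as much as 10,000 km.
        return abs(sale["age"] - query["age"]) + abs(sale["km"] - query["km"]) / 10000
    neighbors = sorted(past_sales, key=dist)[:k]
    prediction = sum(sale["price"] for sale in neighbors) / k
    return prediction, neighbors

# Hypothetical past sales used as labeled training data
past_sales = [
    {"age": 2, "km": 30000, "price": 18000},
    {"age": 3, "km": 45000, "price": 16000},
    {"age": 4, "km": 60000, "price": 14000},
    {"age": 10, "km": 150000, "price": 5000},
    {"age": 12, "km": 180000, "price": 4000},
]
query = {"age": 3, "km": 40000}
price, similar_sales = knn_predict_with_examples(past_sales, query)
print(f"Predicted price: {price:.0f}, based on these similar sales:")
for sale in similar_sales:
    print(sale)
```

The explanation is only as understandable as the distance function and the instances themselves, which is why both need to be meaningful to the person reading the explanation.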

Some recent research has started building inherently interpretable image classification models by mapping parts of the image to similar parts in the training data, hence also allowing explanations based on similarity (“this looks like that”).

Understanding the Data

Finally, there are several techniques that help to understand how the training data influences the model, which can be useful for debugging data quality issues. We briefly outline two strategies.

Prototypes are instances in the training data that are representative of data of a certain class, whereas criticisms are instances that are not well represented by prototypes. In a sense, criticisms are outliers in the training data that may indicate data that is incorrectly labeled or data that is unusual (either out of distribution or not well supported by training data). In the recidivism example, we might find clusters of people in past records with similar criminal history and we might find some outliers that get rearrested even though they are very unlike most other instances in the training set that get rearrested. Prototypes and criticisms can be identified with various techniques based on clustering the training data. Finding them might encourage data scientists to inspect and fix training data or collect more training data.
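A deliberately simplified sketch of this idea: treat all instances of one class as a single cluster, pick the medoid (the instance with the smallest total distance to all others) as the prototype, and report the instances farthest from the prototype as criticisms. This is an assumption made for illustration and not the specific technique discussed by Molnar; practical approaches use several prototypes per class and more careful selection criteria.

```python
import math

def medoid(points):
    """Return the point with the smallest total distance to all other points."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

def prototype_and_criticisms(points, n_criticisms=1):
    """Pick one prototype per class and the points least well represented by it."""
    prototype = medoid(points)
    ranked = sorted(points, key=lambda p: math.dist(p, prototype), reverse=True)
    return prototype, ranked[:n_criticisms]

# Hypothetical two-dimensional feature vectors of training instances from one class
instances = [(1, 1), (1, 2), (2, 1), (2, 2), (9, 9)]
prototype, criticisms = prototype_and_criticisms(instances)
```

Here the instance far away from the rest of its class is flagged as a criticism, which is exactly the kind of data point worth inspecting for a wrong label or an underrepresented region.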

Prototypes and criticisms for two types of dog breeds from the ImageNet dataset, from Christoph Molnar. “Interpretable Machine Learning: A Guide for Making Black Box Models Explainable.” 2019

Another strategy to debug training data is to search for influential instances, which are instances in the training data that have an unusually large influence on the decision boundaries of the model. Influential instances are often outliers (possibly mislabeled) in areas of the input space that are not well represented in the training data (e.g., outside the target distribution), as illustrated in the figure below. For example, we may have a single outlier of an 85-year-old serial burglar who strongly influences the age cutoffs in the model. Influential instances can be determined by training the model repeatedly, leaving out one data point at a time, and comparing the parameters of the resulting models.
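The leave-one-out procedure can be sketched for a one-variable linear regression, where the only parameter we compare is the slope. The data points are invented for illustration, and for larger models retraining once per data point is usually too expensive, so this should be read as a conceptual sketch rather than a practical recipe.

```python
def fit_slope(points):
    """Ordinary least squares slope for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

def leave_one_out_influence(points):
    """Retrain without each point in turn and measure how much the slope changes."""
    full_slope = fit_slope(points)
    influences = []
    for i in range(len(points)):
        reduced = points[:i] + points[i + 1:]
        influences.append(abs(fit_slope(reduced) - full_slope))
    return influences

# Hypothetical data: five points on a rough line plus one outlier at x = 10
data = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1), (10, 2.0)]
influences = leave_one_out_influence(data)
most_influential = data[influences.index(max(influences))]
```

Removing the outlier changes the slope far more than removing any other point, so it is reported as the most influential instance.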

Example of how a single data point can influence the slope of the regression model dramatically, from Christoph Molnar. “Interpretable Machine Learning: A Guide for Making Black Box Models Explainable.” 2019

There are lots of other ideas in this space, such as identifying a trusted subset of training data to observe how other less trusted training data influences the model toward wrong predictions on the trusted subset (paper), slicing the model in different ways to identify regions with lower quality (paper), or designing visualizations to inspect possibly mislabeled training data (paper).

Gaming Models with Explanations

Designers are often concerned about providing explanations to end users, especially counterfactual examples, as those users may exploit them to game the system. For example, users may temporarily put money in their account if they know that a credit approval model makes a positive decision with this change, a student may cheat on an assignment when they know how the autograder works, or a spammer might modify their messages if they know what words the spam detection model looks for.

If models use robust, causally related features, explanations may actually encourage intended behavior. For example, in the recidivism model, there are no features that are easy to game. From the internals of the model, the public can learn that avoiding prior arrests is a good strategy for avoiding a negative prediction; this might encourage them to behave like a good citizen. As another example, a model that grades students based on work performed requires students to do the work required; a corresponding explanation would just indicate what work is required. Thus, a student trying to game the system will just have to complete the work and hence do exactly what the instructor wants (see the video “Teaching teaching and understanding understanding” for why it is a good educational strategy to set clear evaluation standards that align with learning goals).

Models become prone to gaming if they use weak proxy features, which many models do. Many machine-learned models pick up on weak correlations and may be influenced by subtle changes, as work on adversarial examples illustrates (see security chapter). For example, we may not have robust features to detect spam messages and just rely on word occurrences, which is easy to circumvent when details of the model are known. Similarly, we likely do not want to provide explanations of how to circumvent a face recognition model used as an authentication mechanism (such as Apple’s FaceID).

Protecting models by not revealing internals and not providing explanations is akin to security by obscurity. It may provide some level of security, but users may still learn a lot about the model by just querying it for predictions, as all black-box explanation techniques in this chapter do. Increasing the cost of each prediction may make attacks and gaming harder, but not impossible. Protection through more reliable features that are not just correlated but causally linked to the outcome is usually a better strategy, but of course this is not always possible.

Designing User Interfaces with Explanations

While explanations are often primarily used for debugging models and systems, there is much interest in integrating explanations into user interfaces and making them available to users.

Molnar provides a detailed discussion of what makes a good explanation. In a nutshell, contrastive explanations that compare the prediction against an alternative, such as counterfactual explanations, tend to be easier to understand for humans. To be useful, most explanations need to be selective and focus on a small number of important factors — it is not feasible to explain the influence of millions of neurons in a deep neural network. Good explanations furthermore understand the social context in which the system is used and are tailored for the target audience; for example, technical and nontechnical users may need very different explanations. Explanations that are consistent with prior beliefs are more likely to be accepted. And of course, explanations are preferably truthful.

For high-stakes decisions that have a rather large impact on users (e.g., recidivism, loan applications, hiring, housing), explanations are more important than for low-stakes decisions (e.g., spell checking, ad selection, music recommendations). For high-stakes decisions, explicit explanations and communicating the level of certainty can help humans verify the decision; fully interpretable models may provide more trust. In contrast, for low-stakes decisions, automation without explanation could be acceptable or explanations could be used to allow users to teach the system where it makes mistakes; for example, a user might try to see why the model changed spelling, identify a wrong pattern learned, and give feedback for how to revise the model. Google’s People + AI Guidebook provides several good examples on deciding when to provide explanations and how to design them.

Furthermore, in many settings explanations of individual predictions alone may not be enough, but much more transparency is needed. For example, a recent study analyzed what information radiologists want to know if they were to trust an automated cancer prognosis system to analyze radiology images. The radiologists voiced many questions that go far beyond local explanations, such as

  • How does it perform compared to human experts?
  • What is difficult for the AI to know? Where is it too sensitive? What criteria is it good at recognizing or not good at recognizing?
  • What data (volume, types, diversity) was the model trained on?
  • Does the AI assistant have access to information that I don’t have? Does it have access to any ancillary studies? Is all used data shown in the user interface?
  • What kind of things is the AI looking for? What is it capable of learning? (“Maybe light and dark? Maybe colors? Maybe shapes, lines?”, “Does it take into consideration the relationship between gland and stroma? Nuclear relationship?”)
  • Does it have a bias a certain way? (compared to colleagues)

Notice how potential users may be curious about how the model or system works, what its capabilities and limitations are, and what goals the designers pursued. Here conveying a mental model or even providing training in AI literacy to users can be crucial.

That is, explanation techniques discussed above are a good start, but to take them from use by skilled data scientists debugging their models or systems to a setting where they convey meaningful information to end users requires significant investment in system and interface design, far beyond the machine-learned model itself (see also human-AI interaction chapter).

The Dark Side of Explanations

Explanations can be powerful mechanisms to establish trust in predictions of a model. Unfortunately, such trust is not always earned or deserved.

First, explanations of black-box models are approximations, and not always faithful to the model. In this sense, they may be misleading or wrong and only provide an illusion of understanding. For high-stakes decisions such as recidivism prediction, approximations may not be acceptable; here, inherently interpretable models that can be fully understood, such as the scorecard and if-then-else rules at the beginning of this chapter, are more suitable and lend themselves to accurate explanations, of the model and of individual predictions.

Cynthia Rudin makes a forceful argument to stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. She argues that in most cases, interpretable models can be just as accurate as black-box models, though possibly at the cost of more needed effort for data analysis and feature engineering. She argues that transparent and interpretable models are needed for trust in high-stakes decisions, where public confidence is important and audits need to be possible. When outside information needs to be combined with the model’s prediction, it is essential to understand how the model works. In contrast, she argues, using black-box models with ex-post explanations leads to complex decision paths that are ripe for human error.

Second, explanations, even those that are faithful to the model, can lead to overconfidence in the ability of a model, as shown in a recent experiment. In situations where users may naturally mistrust a model and use their own judgement to override some of the model’s predictions, users are less likely to correct the model when explanations are provided. Even though the prediction is wrong, the corresponding explanation signals a misleading level of confidence, leading to inappropriately high levels of trust.

Third, most models and their predictions are so complex that explanations need to be designed to be selective and incomplete. In addition, the system usually needs to select between multiple alternative explanations (Rashomon effect). Users may accept explanations that are misleading or capture only part of the truth. Even if a right to explanation was prescribed by policy or law, it is unclear what quality standards for explanations could be enforced. This leaves many opportunities for bad actors to intentionally manipulate users with explanations.


Machine learning can learn incredibly complex rules from data that may be difficult or impossible for humans to understand. Yet some form of understanding is helpful for many tasks, from debugging, to auditing, to encouraging trust. These days most explanations are used internally for debugging, but there is a lot of interest and in some cases even legal requirements to provide explanations to end users.

While some models can be considered inherently interpretable, there are many post-hoc explanation techniques that can be applied to all kinds of models. It is possible to explain aspects of the entire model, such as which features are most predictive, to explain individual predictions, such as explaining which small changes would change the prediction, or to explain aspects of how the training data influences the model.

For designing explanations for end users, these techniques provide solid foundations, but many more design considerations need to be taken into account, understanding the risk of how the predictions are used and the confidence of the predictions, as well as communicating the capabilities and limitations of the model and system more broadly.

Finally, unfortunately explanations can be abused to manipulate users and post-hoc explanations for black-box models are not necessarily faithful. Some researchers strongly argue that black-box models should be avoided in high-stakes situations in favor of inherently interpretable models that can be fully understood and audited.

Further readings

As with all chapters, this text is released under Creative Commons 4.0 BY-SA license.

associate professor @ Carnegie Mellon; software engineering, configurations, open source, SE4AI, juggling
