Interpretability and explainability

Example: Proprietary opaque models in recidivism prediction

Northpointe’s controversial proprietary COMPAS system takes an individual’s personal data and criminal history to predict whether the person is likely to commit another crime if released, reported as three risk scores on a ten-point scale. The idea is that a data-driven approach may be more objective and accurate than the often subjective and possibly biased view of a judge when making sentencing or bail decisions.

Interpretable model for recidivism prediction as a scorecard from Rudin, Cynthia, and Berk Ustun. “Optimized scoring systems: Toward trust in machine learning for healthcare and criminal justice.” Interfaces 48, no. 5 (2018): 449–466.
IF age between 18–20 and sex is male THEN predict arrest
IF age between 21–23 and 2–3 prior offenses THEN predict arrest
IF more than three priors THEN predict arrest
ELSE predict no arrest
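A scorecard like this can be read directly as code. A minimal sketch of the rules above as a plain Python function (the feature names and encoding are illustrative, not Rudin and Ustun’s actual model):

```python
def predict_arrest(age: int, sex: str, priors: int) -> bool:
    """Recidivism scorecard expressed as explicit, inspectable rules."""
    if 18 <= age <= 20 and sex == "male":
        return True
    if 21 <= age <= 23 and 2 <= priors <= 3:
        return True
    if priors > 3:
        return True
    return False
```

Every prediction can be traced to exactly one of these rules, which is what makes such models interpretable.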

Defining Interpretability, Explainability, and Transparency

There are many terms used to capture the degree to which humans can understand a model’s internals or the factors used in a decision, including interpretability, explainability, and transparency. These and other terms are not used consistently in the field: different authors ascribe different, often contradictory, meanings to them or use them interchangeably. In this book, we use the following terminology:

Why Explainability?

Interpretable models and explanations of models and predictions are useful in many settings and can be an important building block in the responsible engineering of ML-enabled systems in production. There are many reasons why engineers might seek interpretable models and explanations.

Understanding a Model

We can discuss interpretability and explainability at different levels. We start with strategies to understand the entire model globally, before looking at how we can understand individual predictions or get insights into the data used for training the model.

Intrinsically Interpretable Models

Sparse linear models are widely considered to be inherently interpretable. We can inspect the weights of the model and interpret decisions based on the sum of individual factors. The model coefficients often have an intuitive meaning. Linear models can also be represented like the scorecard for recidivism above (though learning nice models like these that have simple weights, few terms, and simple rules for each term like “Age between 18 and 24” may not be trivial). In spaces with many features, regularization techniques can help to select only the important features for the model (e.g., Lasso).
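Interpreting a sparse linear model amounts to reading its weights. A small sketch with made-up coefficients, showing how each feature’s contribution to a score can be reported alongside the prediction:

```python
# hypothetical learned coefficients of a sparse linear model
weights = {"age_18_to_24": 2.5, "prior_offenses": 1.3, "employed": -0.8}
bias = -1.0

def score(features):
    """Return the linear score plus the per-feature contributions
    (weight times feature value) that explain it."""
    contributions = {name: weights[name] * value
                     for name, value in features.items()}
    return bias + sum(contributions.values()), contributions

total, parts = score({"age_18_to_24": 1, "prior_offenses": 2, "employed": 0})
```

Each entry in `parts` directly answers “how much did this feature push the score up or down?”, which is the core of the scorecard-style interpretation above.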

Global Surrogate Models

For models that are not inherently interpretable, it is often possible to provide (partial) explanations after the fact, for example, by training an inherently interpretable surrogate model to mimic the original model’s predictions. This can often be done without access to the model internals, just by observing many predictions. There is a vast space of possible techniques, but here we provide only a brief overview.
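A global surrogate can be sketched by querying the black box on many inputs and fitting a simple model to its answers. Here, with a hypothetical black box and a single-threshold decision stump as the surrogate:

```python
import random
random.seed(0)

def black_box(x):
    """Stand-in for an opaque model that we can only query."""
    return 1 if 0.7 * x[0] + 0.3 * x[1] > 0.5 else 0

# query the black box on many inputs to create training data for the surrogate
samples = [(random.random(), random.random()) for _ in range(1000)]
labels = [black_box(x) for x in samples]

# fit the surrogate: the single feature/threshold pair that best mimics the black box
best_fidelity, best_feature, best_threshold = 0.0, None, None
for feature in (0, 1):
    for threshold in (i / 100 for i in range(100)):
        preds = [1 if x[feature] > threshold else 0 for x in samples]
        fidelity = sum(p == y for p, y in zip(preds, labels)) / len(samples)
        if fidelity > best_fidelity:
            best_fidelity, best_feature, best_threshold = fidelity, feature, threshold
```

The surrogate’s fidelity (its agreement with the black box) indicates how much its explanation can be trusted; a low-fidelity surrogate explains little about the original model.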

Feature Importance

Feature importance measures how much a model relies on each feature in making its predictions. While it does not provide deep insights into the inner workings of a model, feature importance indicates how sensitive the model is to various inputs, providing highly compressed, global insights about the model.
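A common model-agnostic way to compute feature importance is permutation importance: shuffle one feature’s column and measure how much the model’s accuracy drops. A minimal sketch with a hypothetical model:

```python
import random
random.seed(1)

def model(x):
    """Stand-in model that relies almost entirely on feature 0."""
    return 1 if 0.9 * x[0] + 0.1 * x[1] > 0.5 else 0

data = [[random.random(), random.random()] for _ in range(500)]
labels = [model(x) for x in data]  # labels come from the model, so baseline accuracy is 1.0

def accuracy(rows):
    return sum(model(x) == y for x, y in zip(rows, labels)) / len(labels)

def permutation_importance(feature):
    """Accuracy drop when one feature's values are shuffled across rows."""
    column = [x[feature] for x in data]
    random.shuffle(column)
    permuted = [row[:feature] + [v] + row[feature + 1:]
                for row, v in zip(data, column)]
    return accuracy(data) - accuracy(permuted)
```

Shuffling the feature the model depends on destroys much of its accuracy, while shuffling an unimportant feature barely changes predictions.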

Feature importance in the DC bike sharing data, from Christoph Molnar. “Interpretable Machine Learning.” 2019

Partial Dependence Plot (PDP)

While feature importance computes the average explanatory power added by each feature, more visual explanations such as partial dependence plots can help us better understand how features (on average) influence predictions. For example, in the plots below, we can observe how the number of bikes rented in DC is affected (on average) by temperature, humidity, and wind speed. These plots allow us to observe whether a feature has a linear influence on predictions, a more complex behavior, or none at all (a flat line). The plots work naturally for regression problems, but can also be adapted for classification problems by plotting class probabilities of predictions.
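Partial dependence is straightforward to compute: fix the feature of interest at a grid value in every row of the data, average the model’s predictions, and repeat along the grid. A minimal sketch with a hypothetical rental-count model:

```python
import random
random.seed(2)

def model(temperature, windspeed):
    """Stand-in regression model: rentals rise with temperature,
    barely change with wind speed."""
    return 100 + 5 * temperature - 0.1 * windspeed

data = [(random.uniform(0, 30), random.uniform(0, 20)) for _ in range(200)]

def partial_dependence(grid):
    """Average prediction with temperature fixed at each grid value,
    keeping every row's other features as observed."""
    curve = []
    for value in grid:
        preds = [model(value, wind) for _, wind in data]
        curve.append(sum(preds) / len(preds))
    return curve

pdp = partial_dependence([0, 10, 20, 30])
```

Plotting `pdp` against the grid yields exactly the kind of curve shown in the figure; here the curve rises linearly because the stand-in model is linear in temperature.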

Example of partial dependence plots for the DC bike sharing data, from Christoph Molnar. “Interpretable Machine Learning.” 2019

Understanding a Prediction

While the techniques described in the previous section provide explanations for the entire model, in many situations, we are interested in explanations for a specific prediction. For example, we might explain which factors were the most important to reach a specific prediction or we might explain what changes to the inputs would lead to a different prediction.

Feature Influences

There are many different strategies to identify which features contributed most to a specific prediction.

Example of an explanation describing which features were most influential in predicting a mushroom to be poisonous, from the LIME documentation.
Example of visually illustrating which pixels were most important for and against predicting this image to contain a cat, from the LIME documentation.
Visual explanation of how LIME learns a local decision boundary by Christoph Molnar. “Interpretable Machine Learning: A Guide for Making Black Box Models Explainable.” 2019
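LIME fits an interpretable local surrogate around a single instance to estimate feature influences; a much cruder sketch of the same per-feature-influence idea is to replace each feature in turn with a baseline value and record how the prediction changes (the model and values here are made up):

```python
def model(x):
    """Stand-in black-box scoring function."""
    return 0.6 * x[0] + 0.3 * x[1] + 0.1 * x[2]

def feature_influences(x, baseline=0.0):
    """Prediction change when each feature in turn is replaced
    by a baseline value (an occlusion-style influence estimate)."""
    full = model(x)
    influences = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] = baseline
        influences.append(full - model(perturbed))
    return influences

influences = feature_influences([1.0, 1.0, 1.0])
```

The same occlusion idea underlies the image example above: masking groups of pixels and observing the prediction change reveals which regions speak for or against a class.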


Anchors

Where feature influences describe how much individual features contribute to a prediction, anchors try to capture a sufficient subset of features that determines a prediction. For example, if a person has 7 prior arrests, the recidivism model will always predict a future arrest independent of any other features; we can even generalize that rule and identify that the model will always predict another arrest for any person with 5 or more prior arrests.
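An anchor can be validated empirically by sampling: hold the anchoring features fixed, vary everything else, and check that the prediction never changes. A minimal sketch of checking the “5 or more prior arrests” rule against a hypothetical model:

```python
import random
random.seed(3)

def model(age, priors):
    """Stand-in recidivism model."""
    return priors >= 5 or (age < 21 and priors >= 2)

# sample instances that satisfy the candidate anchor "priors >= 5"
# while varying the remaining feature (age) freely
samples = [(random.randint(18, 70), random.randint(5, 15)) for _ in range(1000)]
precision = sum(model(age, priors) for age, priors in samples) / len(samples)
```

A precision of 1.0 on the sampled inputs indicates the anchor fully determines the prediction regardless of the other features.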

Example of an anchor detected in an object recognition model for the concept “beagle” from Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. “Anchors: High-precision model-agnostic explanations.” In Thirty-Second AAAI Conference on Artificial Intelligence. 2018.

Counterfactual Explanations

Counterfactual explanations describe conditions under which the prediction would have been different; for example, “if the accused had one fewer prior arrest, the model would have predicted no future arrests” or “if you had $1500 more capital, the loan would have been approved.” Counterfactual explanations are intuitive for humans, providing contrastive and selective explanations for a specific prediction. Counterfactual explanations can often provide suggestions for how to change behavior to achieve a different outcome, though not all features are under a user’s control (e.g., none in the recidivism model, some in loan assessment).
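A simple counterfactual can be found by searching the input space for the smallest change that flips the prediction. A minimal sketch for the loan example, with a hypothetical approval rule:

```python
def approve_loan(capital, income):
    """Stand-in loan-approval model."""
    return capital + 2 * income >= 10000

def counterfactual_capital(capital, income, step=100, limit=1_000_000):
    """Smallest extra capital (in increments of `step`) that flips
    a rejection into an approval, or None if none is found."""
    extra = 0
    while not approve_loan(capital + extra, income):
        extra += step
        if extra > limit:
            return None
    return extra
```

For an applicant with $5,000 capital and $1,750 income under this made-up rule, the search yields $1,500 of extra capital, i.e., exactly the “if you had $1500 more capital” style of explanation. Real counterfactual methods additionally search over multiple features and prefer plausible, minimal changes.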


Similarity

Predictions based on the k-nearest neighbors are sometimes considered inherently interpretable (assuming an understandable distance function and meaningful instances) because predictions are purely based on similarity with labeled training data and a prediction can be explained by providing the nearest similar data as examples. For example, car prices can be predicted by showing examples of similar past sales. Providing a distance-based explanation for a black-box model by using a k-nearest neighbor approach on the training data as a surrogate may provide insights but is not necessarily faithful.
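A sketch of such an example-based explanation for the car-price case, predicting from the k most similar past sales and returning those sales as the explanation (the data and distance function are made up):

```python
# hypothetical past sales: (mileage in 1000 km, age in years, price in $)
past_sales = [
    (50, 3, 18000),
    (120, 8, 7000),
    (30, 2, 21000),
    (90, 6, 10000),
]

def explain_price(mileage, age, k=2):
    """Predict a price as the mean over the k nearest past sales
    and return those sales as the explanation."""
    ranked = sorted(past_sales,
                    key=lambda s: (s[0] - mileage) ** 2 + (s[1] - age) ** 2)
    nearest = ranked[:k]
    return sum(s[2] for s in nearest) / k, nearest

price, examples = explain_price(45, 3)
```

The returned examples double as the explanation: “we predict about this price because these similar cars sold for these amounts.”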

Understanding the Data

Finally, there are several techniques that help to understand how the training data influences the model, which can be useful for debugging data quality issues. We briefly outline two strategies.

Prototypes and criticisms for two types of dog breeds from the ImageNet dataset, from Christoph Molnar. “Interpretable Machine Learning: A Guide for Making Black Box Models Explainable.” 2019
Example of how a single data point can influence the slope of the regression model dramatically, from Christoph Molnar. “Interpretable Machine Learning: A Guide for Making Black Box Models Explainable.” 2019

Gaming Models with Explanations

Designers are often concerned about providing explanations to end users, especially counterfactual examples, as those users may exploit them to game the system. For example, users may temporarily put money in their account if they know that a credit approval model makes a positive decision with this change, a student may cheat on an assignment when they know how the autograder works, or a spammer might modify their messages if they know what words the spam detection model looks for.

Designing User Interfaces with Explanations

While explanations are often primarily used for debugging models and systems, there is much interest in integrating explanations into user interfaces and making them available to users.

  • What is difficult for the AI to know? Where is it too sensitive? What criteria is it good at recognizing or not good at recognizing?
  • What data (volume, types, diversity) was the model trained on?
  • Does the AI assistant have access to information that I don’t have? Does it have access to any ancillary studies? Is all used data shown in the user interface?
  • What kind of things is the AI looking for? What is it capable of learning? (“Maybe light and dark? Maybe colors? Maybe shapes, lines?”, “Does it take into consideration the relationship between gland and stroma? Nuclear relationship?”)
  • Does it have a bias a certain way? (compared to colleagues)

The Dark Side of Explanations

Explanations can be powerful mechanisms to establish trust in predictions of a model. Unfortunately, such trust is not always earned or deserved.


Machine learning can learn incredibly complex rules from data that may be difficult or impossible for humans to understand. Yet some form of understanding is helpful for many tasks, from debugging to auditing to encouraging trust. These days, most explanations are used internally for debugging, but there is a lot of interest and, in some cases, even legal requirements to provide explanations to end users.

Further readings



Christian Kästner

associate professor @ Carnegie Mellon; software engineering, configurations, open source, SE4AI, juggling