Planning for Machine-Learning Mistakes

Christian Kästner
30 min read · Sep 13, 2022


This post covers some content from the “Requirements and Risk” lectures of our Machine Learning in Production course. For other chapters, see the table of contents.

Drivers, whether human or autonomous systems, may make mistakes. Anticipating this, we can design the environment to be safer, for example, installing bollards to protect pedestrians and buildings from driver mistakes (see also World Bollard Association).

Predictions from machine-learned models in a software system may be wrong. Depending on the system design, wrong predictions can cause the system to display misleading information or take problematic actions, both of which can cause various problems and harms, from confusing users, to discrimination, to financial loss, to injury and death. Better models may make fewer mistakes, but mistakes are generally inevitable. Mistakes of machine-learned models are also generally hard to understand or anticipate — it is better to think of the models simply as unreliable components. As a consequence, it is important to consider how to design an overall system that provides value and mitigates risks even in the presence of inevitable mistakes.

While it is difficult to anticipate specific mistakes (and it may often not even be clear what the “correct” prediction would be, as we will discuss at length in chapter Model Quality), it is possible and advisable to plan for the existence of mistakes. In this chapter, we discuss techniques to identify possible mistakes and their consequences in the system, as well as design options to minimize risks and harms from a model.

The Docklands Light Railway system in London has operated trains without a driver since 1987. Many modern public transportation systems use increasingly sophisticated automation, including the Paris Métro Line 14 and the Copenhagen Metro (Picture CC BY 2.0 by Matt Brown).

Throughout this chapter, we use two running examples with different degrees of risks: one of an autonomous light rail system and one of an extension to an email client suggesting answers to emails.

Suggestions for email responses in Gmail augment the user interface rather than automating responses or prompting users.

Mistakes Will Happen

It is a useful abstraction to consider a machine-learned model as an unreliable component within a system that may make mistakes for unknown reasons at unknown times. A system designer should always anticipate that a model’s prediction will eventually be wrong, whether it is detecting obstacles in the path of a train or suggesting answers for an email. With that assumption, the challenge is simply to design a system such that mistakes of the machine-learning components do not trigger problems in the overall system.

Even if we cannot expect to ever train “perfectly correct” models, understanding common problems will help us better appreciate why mistakes happen and help us improve models to make mistakes less frequently.

Why ML models fail

There are many reasons why models may learn wrong concepts or have blind spots without giving any indication that something is wrong. We discuss a few common ones without trying to be comprehensive.

Correlation vs causation. Machine learning largely relies on correlations in data, but cannot generally identify which correlations are due to a causal underlying relationship and which just stem from noise in the data or decisions in the data collection process. For example, there are many reports of object detection models using the background to identify what is in the foreground, such as identifying dogs in snow as wolves or not identifying cows when they are on a beach — the model relies on the correlation with the typical background of images of various animals in the training data rather than learning the abstractions a human would use to recognize the visual appearance of animals. We will explore this further in the context of shortcut learning and adversarial attacks in chapters Model Quality and Security.

Example of an entirely spurious correlation between Maine’s divorce rate and the consumption of margarine based on data from the National Vital Statistics Reports and the U.S. Department of Agriculture (Figure CC BY 4.0, Tyler Vigen).

Humans often have an idea of what causal relationships are plausible or expected, whereas most machine-learning algorithms work purely with correlations in training data and cannot judge which correlations are actually reliable for making predictions.
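
To make this concrete, the following minimal Python sketch (with entirely synthetic, hypothetical data) trains a classifier that latches onto a spurious “background” feature that happens to correlate with the label in the training data; once that correlation disappears at deployment time, accuracy collapses even though training accuracy looked excellent.

```python
# Minimal sketch of shortcut learning on synthetic data; all names and data are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Feature 0: weak but causal signal; feature 1: "background" (e.g., snow)
# that happens to correlate almost perfectly with the label in training data.
y_train = rng.integers(0, 2, n)
causal = y_train + rng.normal(0, 2.0, n)          # noisy causal feature
background = y_train + rng.normal(0, 0.1, n)      # spurious, nearly perfect
X_train = np.column_stack([causal, background])

model = LogisticRegression().fit(X_train, y_train)

# At deployment the background no longer correlates with the label
# (e.g., wolves photographed without snow): accuracy drops dramatically.
y_test = rng.integers(0, 2, n)
X_test = np.column_stack([y_test + rng.normal(0, 2.0, n),
                          rng.normal(0, 0.1, n)])  # background now uninformative
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy on shifted data:", model.score(X_test, y_test))
```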

Confounding variables. Machine learning (and also human researchers) may attribute effects to variables that are correlated but not causally linked, while missing the link to a confounding variable. For example, in 2020 studies analyzing patient data found a correlation between male baldness and symptomatic COVID infections, but the effect is likely explained by age as a confounder, which is associated with both baldness and higher COVID risk. If the confounding variable is contained in the data, the machine-learning algorithm may have a chance to identify the real relationship and compensate for the confounding variable, but often such data is not present.

Spurious correlation in the data may be explained by a causal relationship with a third confounding variable.

Reverse causality. Machine learning might attribute causality in the wrong direction when finding a correlation. For example, early chess programs trained with Grandmaster games are reported to have learned that sacrificing a queen is a winning move, because it occurred frequently in winning games; a machine-learning algorithm may observe that hotel prices are high when there is high demand and thus recommend setting a high price to create demand — in both cases misunderstanding the causality underlying the correlations and making wrong predictions with high confidence.

Missing counterfactuals. Training data often does not indicate what would have happened in different situations or if different actions were taken, which makes attributing causality difficult. For example, we can observe past stock prices to attempt to predict future ones, but we do not actually know what would have happened under different market conditions or different decisions to merge or not merge with a competitor. Machine learning might try to infer causality from differences among many similar observations, but it is often challenging to collect convincing “what if” data.

Out of distribution data. Machine-learned models are generally more confident when predictions are requested about data that is similar to the data seen during training. We speak of out-of-distribution inputs when an input does not fit the distribution of the training data. A model may extrapolate learned rules to new data without realizing that additional rules are needed for this kind of data. For example, a model in an autonomous train may accurately detect adult pedestrians and workers on the track in camera images, but may not detect children or wheelchair users who were never part of the training data. Humans are often much better at handling out-of-distribution data through common-sense reasoning, beyond correlations found in some training data. Confidence scores of various models and specific out-of-distribution detection techniques can often help identify such cases — we can anticipate that the predictions may be of low quality and possibly rely on them less (“known unknowns”).
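
As a rough illustration, the following sketch combines a simple distance-to-training-data check with a confidence threshold to flag inputs the model probably should not be trusted on. The thresholds, the z-score heuristic, and the synthetic data are assumptions for illustration only; production systems typically use dedicated out-of-distribution detectors.

```python
# Minimal sketch of flagging out-of-distribution inputs before trusting a prediction.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_train = rng.normal(0, 1, (500, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

mean, std = X_train.mean(axis=0), X_train.std(axis=0)

def predict_with_ood_check(x, z_threshold=4.0, confidence_threshold=0.8):
    # Flag inputs far from the training distribution ("known unknowns").
    if np.any(np.abs((x - mean) / std) > z_threshold):
        return None  # out of distribution: defer to a human or a safe fallback
    proba = model.predict_proba(x.reshape(1, -1))[0]
    if proba.max() < confidence_threshold:
        return None  # low confidence: also defer
    return int(proba.argmax())

print(predict_with_ood_check(rng.normal(0, 1, 4)))   # typically a 0/1 prediction
print(predict_with_ood_check(rng.normal(10, 1, 4)))  # far from training data -> None
```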

Other causes. Many other problems are caused by insufficient training data, low quality training data, insufficient features in training data, biased training data, overfitting and underfitting of models, and many other issues. We will discuss several of these in chapters Model Quality and Data Quality. Again, we can often identify problems and improve models, but we cannot expect models to be perfect and should always anticipate that mistakes may happen.

Mistakes are usually not random

When reasoning about failures in physical processes, we usually assume an underlying random process about which we can reason probabilistically. For example, statistical models can predict how likely a steel axle of a train is to break over time under different conditions and corrosion levels, and two axles will have the same chance of breaking independently, allowing us to reason stochastically about the reliability of more complex composed structures with redundancies. Software bugs do not usually have this nature: Software does not fail randomly for inputs, but for specific inputs that trigger a bug. Multiple copies of the software will fail for the same inputs; even multiple independent implementations of the same algorithm have often been observed to contain similar bugs and fail for many of the same inputs.

While it is tempting to reason about mistakes of machine-learned models stochastically, given how we report accuracy as percentages, this is a dangerous path. Mistakes of models are not necessarily randomly distributed. Mistakes may be associated in non-obvious ways with specific characteristics of the input or affect certain subpopulations more than others (see also chapters Model Quality: Slicing and Fairness). For example, a train’s obstacle detection system may work well in almost all cases but generally have a hard time recognizing wheelchair users; it may have an average accuracy of 99.99% but still fail consistently in specific kinds of cases that do not occur frequently in the test data. We can monitor mistakes in production to get a more reliable idea of the frequency of mistakes than offline evaluations provide (see chapter Quality Assurance in Production), but today’s mistakes may not be representative of tomorrow’s in terms of frequency or severity.
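
The following sketch illustrates why overall accuracy can be misleading: evaluating accuracy separately per subpopulation (slice) reveals a consistent failure that the aggregate number hides. The data and slice names are made up for illustration.

```python
# Minimal sketch of slice-based evaluation with hypothetical data:
# label 1 means an obstacle is present; the model detects adults but misses wheelchair users.
import pandas as pd
from sklearn.metrics import accuracy_score

results = pd.DataFrame({
    "subpopulation": ["adult"] * 9900 + ["wheelchair user"] * 100,
    "label":      [1] * 9900 + [1] * 100,
    "prediction": [1] * 9900 + [0] * 100,   # model misses every wheelchair user
})

print("overall accuracy:",
      accuracy_score(results["label"], results["prediction"]))
for slice_name, group in results.groupby("subpopulation"):
    print(slice_name, accuracy_score(group["label"], group["prediction"]))
```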

While many machine-learning training algorithms are nondeterministic, they mostly produce models that are deterministic during inference and will consistently make the same mistake for the same input. Also, multiple models trained on the same data may tend to make similar mistakes, for example, if the data contains spurious correlations.

Furthermore, attackers can attempt to deliberately craft or manipulate inputs to induce mistakes (see chapter Security). This way, even a model with 99.99% accuracy in an offline evaluation can produce mostly wrong predictions when attacked. For example, an attacker might trick the obstacle detection model of an autonomous train with a sticker on the platform, which is wrongly recognized as an obstacle and blocks the train’s operation (a previously demonstrated adversarial attack in the context of traffic signs).

Also, the confidence scores that many models produce with their predictions need to be interpreted with care. Even high-confidence predictions can be wrong. This is true even for calibrated models, that is, models whose confidence scores correspond (on average) with the actual correctness of predictions.
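
A quick way to inspect calibration is to compare average predicted confidence with observed accuracy in bins, for example with scikit-learn’s calibration_curve, as in the following sketch on synthetic data. Note that even a perfectly calibrated model is still wrong on some individual high-confidence predictions.

```python
# Minimal sketch: checking whether a model's confidence scores are calibrated (synthetic data).
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(0, 1, (5000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 5000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
confidences = model.predict_proba(X_test)[:, 1]

# Each pair compares the observed fraction of positives with the average predicted confidence per bin.
frac_positive, mean_confidence = calibration_curve(y_test, confidences, n_bins=10)
for conf, frac in zip(mean_confidence, frac_positive):
    print(f"predicted {conf:.2f} -> observed {frac:.2f}")
```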

As a consequence, we simply recommend that system designers consider mistakes as inevitable, without being able to accurately anticipate the distribution or even the frequency of mistakes in future deployments. While we can attempt to learn better models that make fewer mistakes, a good mental model for a system designer is to assume that every single prediction could be wrong.

Designing for Failures

Machine-learned models are components in software systems, and the fact that a model is an unreliable component that may make mistakes does not mean that the entire system is necessarily faulty, unsafe, or unreliable. The way the ML component is used in the system and how it interacts with other components determines the system behavior. For example, how the obstacle detection system is used in the train determines whether the train avoids collisions without unnecessary interruptions, even if the obstacle detection system sometimes mistakenly reports obstacles or misses them. This is why understanding requirements for the entire system is so important for designing systems that meet the users’ needs even if some ML components may make mistakes.

Human-AI interaction design (human in the loop)

Systems typically use machine learning to influence the world in some way, either by acting autonomously or by providing information to users who can then act on it. Reducing human involvement is often a key goal of using machine-learning components in software systems, in order to reduce cost, improve response time or throughput, reduce bias, or remove tedious activities, thus letting humans focus on more interesting tasks. Nonetheless, including humans in the loop to notice and override mistakes is a very common and often natural approach to dealing with mistakes of machine-learned models. Yet, designing human-AI interactions in systems is challenging, and there is a vast design space of possible options. Users can interact with predictions from machine-learning components in different ways. Key design considerations include:

  • How to use the prediction of a model to achieve an objective (see chapter Setting and Measuring Goals)?
  • How to present predictions to a user? As suggestions, or by automatically taking actions? How to effectively influence the user’s behavior toward the system’s goal?
  • How to minimize the consequences of wrong predictions?
  • How to collect data to continue to learn from users and mistakes?

As outlined by Geoff Hulten, typically human-AI design needs to balance at least three goals: (1) achieving system objectives, (2) protecting from mistakes, and (3) collecting telemetry data to continuously improve. Different designs make different tradeoffs and prioritize goals differently.

As already mentioned in chapter From Models to Systems, we can distinguish different common modes of interactions, which provide a good starting point for discussing possible designs:

  • Automate: The system takes an action on the user’s behalf based on a model’s prediction without involving the user in the decision. For example, an autonomous system controls the acceleration and doors of a driverless train, a spam filter deletes emails, a smart thermostat adjusts the home temperature, and an automated trading system buys and sells stocks. Automation takes humans out of the loop and allows for great efficiency and reaction times, but does not allow for users to check predictions or actions, and only some actions are reversible. It is most useful when we have high confidence in correct predictions or the potential costs from mistakes are low or can be mitigated in other ways.
  • Prompt: The system prompts a user to take an action, which the user can follow or decline. For example, an object detection system may alert an operator and ask for confirmation before leaving the station, a tax software’s model may suggest checking certain deductions common with entered expenses before proceeding, a navigation system might suggest that it is time for a break and ask whether to add a nearby roadside attraction as a stop, or a fraud detection system may ask whether a recent credit card transaction was fraudulent. Prompts keep humans in the loop to check decisions and take actions, but prompts can be disruptive, requiring users to invest cognitive effort into a decision right now, and too frequent prompts can be annoying and lead to notification fatigue where users become complacent and start to just ignore or blindly click away prompts. Prompts are useful when a model’s confidence is low or the costs from a mistake are high, and when prompts do not occur too frequently.
  • Organize, annotate, or augment: The system uses a model to decide what information to show to users and in what order. Information may be shown prominently in the user interface or in more subtle ways. In all cases, information is shown, but it is left up to the users to decide whether and how to react. For example, possible answers for an email are shown nonintrusively in the Gmail interface below the email, which users can click or simply ignore; a safety system may highlight people near doors in a train operator’s camera feeds; a music streaming service may offer personalized playlists; a grammar checker might underline detected problems. Alternatively, curated information is provided when requested by the user, such as a model providing search results or a cancer prognosis model highlighting areas in an image for the radiologist to explore more closely. In all these cases, the system simply provides several options but leaves decisions and actions to the user. These approaches are less intrusive and do not demand immediate action. Since humans make the final decisions, such approaches may work well when mistakes are common or there is no single correct prediction in the first place.

Notice how these designs differ significantly in how forceful the interactions are, from full automation to just providing information, and how they keep humans in the loop to different degrees. Of course, hybrid modes are possible; for example, the autonomous system in a train may automate most operations but fall back on prompting a (possibly remote) human operator when a detected obstacle in front of the train does not move away within 20 seconds. Many factors go into deciding on a design, but understanding the expected frequency and cost of interactions, the value of a correct prediction, the cost of a wrong prediction, and the degree to which users have the ability and knowledge to make decisions with the provided information will help guide the design process.
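
As a toy illustration of how confidence and the cost of a mistake might steer a design between these modes, consider the following sketch; the thresholds, cost values, and names are hypothetical and would need to be grounded in actual requirements and user studies.

```python
# Minimal sketch: choosing an interaction mode from model confidence and mistake cost (hypothetical).
from dataclasses import dataclass

@dataclass
class Prediction:
    action: str        # e.g., "archive email", "close train door"
    confidence: float  # model's confidence score in [0, 1]

def choose_interaction(pred: Prediction, cost_of_mistake: float) -> str:
    """Return how the system should use this prediction."""
    if pred.confidence > 0.98 and cost_of_mistake < 1.0:
        return "automate"   # act directly on the user's behalf
    if pred.confidence > 0.8:
        return "prompt"     # ask the user to confirm the action
    return "annotate"       # only show the suggestion, let the user decide

print(choose_interaction(Prediction("archive email", 0.99), cost_of_mistake=0.1))   # automate
print(choose_interaction(Prediction("close train door", 0.99), cost_of_mistake=100))  # prompt
print(choose_interaction(Prediction("suggest reply", 0.6), cost_of_mistake=0.1))    # annotate
```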

Generally, for boring and repetitive tasks with limited potential for harm, or where harms can be mitigated with other means, more automated designs are common, whereas high-stakes tasks that need accountability, or tasks that users enjoy performing, will tend to keep humans in the loop. Furthermore, as we will discuss in chapter Interpretability and Explainability, providing explanations with predictions can strongly influence how humans interact with the system and to what degree they develop trust or are manipulated into overtrusting the system.

In this book, we will not dive deeper into the active and developing field of human-AI interaction design. There are many interesting questions, such as (1) whether users actually have a good mental model of how the system or the model works and how to convey a suitable model, (2) how to set reasonable expectations for what can be expected and what the system can and cannot do and why mistakes may be inevitable, and (3) how to communicate how users can improve or customize the system with additional feedback in case of mistakes.

Undoable actions

If actions taken by a system or its users based on wrong predictions are reversible, harms may be smaller or temporary. For example, if a smart thermostat automatically sets the room temperature, a user can simply override that action and soon return to a comfortable temperature; if a smart slide designer changes the slide layout, users can simply undo the step to restore the previous design. It may also be possible to design the system in such a way that actions taken based on (unreliable) model predictions are explicitly reversible, such as keeping a history of a document or providing free return shipping for a system that curates and sends semi-monthly packages with personalized clothing. Making actions undoable clearly does not work for all actions taken due to wrong predictions, since many have permanent consequences; for example, undoing structural damage from a train collision might be possible at considerable repair cost, but a life lost in the collision cannot be undone.
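
Mechanically, supporting undo often just means keeping a history of previous states or actions, as in this minimal sketch of the (hypothetical) thermostat example:

```python
# Minimal sketch of making automated actions undoable with an action history.
class Thermostat:
    def __init__(self, temperature=20.0):
        self.temperature = temperature
        self._history = []

    def set_temperature(self, value):
        # Remember the previous value so a wrong automated change can be reverted.
        self._history.append(self.temperature)
        self.temperature = value

    def undo(self):
        if self._history:
            self.temperature = self._history.pop()

thermostat = Thermostat()
thermostat.set_temperature(16.0)   # automated change based on a (wrong) prediction
thermostat.undo()                  # user overrides; previous temperature restored
print(thermostat.temperature)      # 20.0
```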

In many cases, users can undo actions themselves as part of their normal interactions with the system, such as resetting the thermostat’s temperature or undoing slide design changes. In other cases, system designers may provide explicit paths to appeal and overturn automated decisions, typically by involving humans or extra steps. For example, a fraud detection system may wrongly block a seller on an online platform as a bot, but the system may provide a mechanism to appeal to human oversight or to use an identity verification mechanism where the user can upload a picture of a government ID.

Guardrails

In many systems, machine-learned models may be used to inform possible actions, but the prediction is not used or passed on directly; instead, it is processed in additional steps.

When suggesting automated responses for emails, guardrails can be used to avoid inappropriate or offensive recommendations, for example, by simply filtering responses using a list of banned words. In contemporary autonomous train systems, guardrails are ubiquitous: These systems usually run on their own separate, fenced-off tracks, which substantially reduces the chance of obstacles; many even add platform screen doors at substantial cost to prevent passengers from falling onto the tracks. Furthermore, a vision model that detects people in the doors of the autonomous train would not be considered sufficient on its own; doors are additionally designed with pressure sensors to detect whether they are blocked.

Metro station Cour Saint-Émilion in Paris with automated platform screen doors that only open when a train is in the station (CC BY-SA 4.0 by Chabe01).

Guardrails often rely on non-ML mechanisms, such as limiting predictions to hardcoded ranges of acceptable values, providing overrides for common mistakes or known inputs, applying hardcoded heuristic rules (such as filtering select words), or using hardware components such as platform screen doors and thermal fuses (from the smart toaster example in chapter From Models to Systems). Guardrails can also be quite sophisticated, such as using a street map in addition to vision-based systems in autonomous vehicles or dynamically adjusting the tolerance for mistakes in obstacle detection based on the current speed. It is also possible to use other machine-learned models as guardrails, for example, using a sentiment analysis model or a toxicity detection model to identify and filter suggested email responses that are problematic. There can still be problems when both the original model and the model used as a safeguard make a mistake, but if these models are largely independent, the risk of both models failing together is much lower than the risk of the original model failing alone.
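
A minimal sketch of such a guardrail for suggested email replies might look as follows; the banned-word list is a placeholder, and the optional toxicity score stands in for a separately trained model whose details are not shown here.

```python
# Minimal sketch of a guardrail that filters model-suggested email replies before showing them.
import re

BANNED_WORDS = {"stupid", "idiot"}   # illustrative only

def passes_guardrails(suggestion: str, toxicity_score=None) -> bool:
    words = set(re.findall(r"[a-z']+", suggestion.lower()))
    if words & BANNED_WORDS:
        return False
    if toxicity_score is not None and toxicity_score > 0.5:
        return False  # a second, largely independent check, e.g., by a toxicity model
    return True

suggestions = ["Sounds good, see you then!", "Don't be stupid."]
safe = [s for s in suggestions if passes_guardrails(s)]
print(safe)   # only the first suggestion is shown to the user
```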

Mistake detection and recovery

System designers often consider multiple redundancies to detect when things go wrong. For example, a monitoring system observes whether a server is still alive and responsive, and multiple independent sensors in an autonomous train sense speed and surroundings. There are a large number of safety design strategies that rely on the ability to detect problems in order to mitigate harms or recover.

Doer-Checker pattern. The Doer-Checker pattern is a classic safety pattern where a component performing actions, the doer, is monitored by a second component, the checker. If the checker can determine the correctness of the doer’s work to some degree, it can intervene with a corrective action when a mistake is detected, even if the doer is untrusted and potentially faulty or unreliable. Corrective actions can include providing new alternative outputs, providing a less desirable but safe fallback response, or shutting down the entire system. This pattern relies heavily on the ability to detect mistakes. As detecting a wrong prediction directly is usually hard (otherwise we might not need the prediction in the first place), detection often relies on indirect signals, such as user reactions or independent observations of outcomes. For example, in an autonomous train, a safety controller acting as a checker might observe the train leaning dangerously far during a curve when speed is controlled by a machine-learned model and intervene with corrective braking commands — here the checker does not directly check the acceleration command outputs or speed but the effect on the train as assessed by independent gyro sensors.
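
The following sketch illustrates the doer-checker structure with made-up numbers: an untrusted, machine-learned controller proposes an acceleration, and a simple checker overrides it based on an independently measured speed.

```python
# Minimal sketch of the doer-checker pattern; all numbers and names are hypothetical.
def ml_speed_controller(sensor_inputs) -> float:
    """Doer: some machine-learned policy proposing an acceleration (untrusted)."""
    return 2.5  # placeholder for a model prediction

def checked_acceleration(sensor_inputs, independent_speed: float,
                         speed_limit: float = 80.0) -> float:
    proposed = ml_speed_controller(sensor_inputs)
    # Checker: override the doer if it would push the train past the speed limit.
    if independent_speed > speed_limit and proposed > 0:
        return -1.0   # corrective braking, a safe fallback action
    return proposed

print(checked_acceleration(sensor_inputs={}, independent_speed=85.0))  # -1.0
print(checked_acceleration(sensor_inputs={}, independent_speed=60.0))  # 2.5
```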

Graceful degradation (fail-safe). When a component failure is detected, graceful degradation is an approach to reduce functionality and performance at the system level to maintain safe operation. That is, rather than continuing to operate with the known faulty component or shutting down the system entirely, the system relies on backup strategies to achieve its goal even if it does so at lower quality and lower speed. For example, when the email response recommendation service goes down, the email system simply continues to operate without that feature. As a more physical example, if the lidar sensor in an autonomous train fails, the train may continue to operate with a vision-based system but only at lower speeds while maintaining larger distances to potential obstacles; if there is no full backup system, a minimal emergency component may be able to stop the train safely and alert a (remote) operator or maintenance team.
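
In code, graceful degradation often amounts to little more than catching the failure and continuing with a reduced feature set, as in this sketch of the (hypothetical) email-client example:

```python
# Minimal sketch of graceful degradation: if the ML-based reply suggestion service
# is unavailable or too slow, the email client simply renders the email without suggestions.
def fetch_reply_suggestions(email_text: str) -> list[str]:
    raise TimeoutError("suggestion service unavailable")  # simulate an outage

def render_email(email_text: str) -> dict:
    try:
        suggestions = fetch_reply_suggestions(email_text)
    except (TimeoutError, ConnectionError):
        suggestions = []   # degrade gracefully: feature missing, system still works
    return {"body": email_text, "suggested_replies": suggestions}

print(render_email("Can we meet tomorrow at 10?"))
```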

Redundancy

Implementing redundancy in a system is a common strategy to defend against random errors. For example, installing two cameras to monitor doors in an autonomous train ensures that the system continues to operate even if the hardware of one camera fails: the system simply switches to the other camera after detecting the failure (known as hot standby). Redundancy can also be used beyond swapping out components that fail entirely, by making decisions based on comparing the outputs of multiple redundant components, for example through voting. For example, we could use three independent components to detect obstacles in an autonomous train and use the median or worst-case value.
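
A minimal sketch of such voting might combine several independent distance estimates by taking the median, or the most conservative value when safety is the priority; the numbers here are purely illustrative.

```python
# Minimal sketch of redundancy through voting over three independent obstacle-distance estimates.
import statistics

def fused_obstacle_distance(estimates_m, conservative=True):
    if conservative:
        return min(estimates_m)             # worst case: assume the closest estimate
    return statistics.median(estimates_m)   # robust to a single faulty estimate

print(fused_obstacle_distance([42.0, 40.5, 3.0]))                      # 3.0
print(fused_obstacle_distance([42.0, 40.5, 3.0], conservative=False))  # 40.5
```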

Redundancy unfortunately does not help if the redundant parts fail under the same conditions: Running multiple instances of the same non-ML algorithm or multiple instances of the same ML model typically produces the same outputs. In practice, even independently implemented algorithms and models independently trained on the same data often exhibit similar problems, making redundancy less effective for software than for hardware. Many machine-learning approaches already use some form of redundancy and voting internally to improve predictions, known as ensemble learning, for example, in random forest classifiers. Generally, redundancy is more effective if the different redundant computations are substantially different, such as combining data from different sensors — for example, the autonomous train may use long-range radar, short-range radar, LiDAR, vision, and thermal cameras, combined with information from maps, GPS, accelerometers, and a compass.

Note, though, that redundancy can be expensive. Additional sensor systems in an autonomous vehicle add substantial hardware cost, and the existing onboard hardware is often already pushed to the limit with existing computations. System designers will need to consider tradeoffs between reducing mistakes and increasing hardware cost and computational effort (see also chapters Thinking like a software architect and Quality Attributes of ML Components).

Containment and isolation

A classic strategy for designing a system with unreliable components is to contain mistakes of components and ensure that they do not spread to other parts of the system. A traditional example is separating the fly-by-wire system that controls a plane from the entertainment system for the passengers, so that the latter (typically built to much lower quality standards) can crash and be rebooted without affecting the safety of the plane — the same holds for separating the control system of the autonomous train from the components for station announcements and on-board advertisement. There are many examples of past incidents caused by poor containment, such as a US Navy ship that had to reboot its entire control system for three hours after a data entry in a database application caused a division-by-zero error that propagated and crashed a central component of the ship, or cars that could be hacked and stopped remotely through their entertainment systems. As a general principle, unreliable and low-criticality components should not impact high-criticality components.

With machine learning, we do not usually worry about inputs crashing the inference service or exploits causing the inference service to manipulate data in other components, so traditional isolation techniques such as sandboxing, firewalls, and zero-trust architectures seem less relevant for containing model-inference components. Rippling effects usually occur through data flows when a model makes a wrong prediction that causes problems in other components. It can therefore be prudent to carefully consider what parts of the system depend on the predictions of a specific model, what happens when the model is not available, and how wrong predictions can cause downstream effects. In addition, we often should worry about timing-related consequences occurring when the model-inference service responds late or is overloaded. For all of this, the hazard analysis and risk analysis tools we discuss next can be helpful.

Hazard Analysis and Risk Analysis

Traditional safety engineering methods can help to anticipate mistakes and their consequences. They help us understand how individual mistakes in components can result in failures and bad outcomes at the system level. Many of these safety engineering methods, including all we will discuss here, have a very long history and are standard tools for safety engineering in traditional safety-critical domains such as avionics, medical devices, and nuclear power plants. With the introduction of machine-learned models as unreliable components, anticipating and mitigating mistakes gains a new urgency, even in systems that are not traditionally safety critical: With machine learning, we tend to tackle ambitious problems with models that we do not fully understand, so even seemingly harmless web applications or mobile apps often have potential for harm, be it from causing stress, financial loss, discrimination, pollution, or bodily harm (see also chapters on responsible AI engineering). Investing some effort in anticipating these problems early on can improve the user experience, avoid bad press, and avoid actual harms.

Note that none of the methods we discuss provide formal guarantees or can assure that all failures are avoided or at least anticipated. These methods are meant to provide structure and a deliberate, best-effort process, improving over the common practice of thinking through mistakes in an ad-hoc manner (unstructured brainstorming), if at all. They all foster deliberate engagement with thinking through possible failures and their consequences and with thinking about how to avoid them before they cause harm. They can guide groups of people, including engineers, domain experts, safety experts, and other stakeholders, through a collaborative process. The resulting documents are broadly understandable and can be updated and referenced later.

Fault Tree Analysis

Fault tree analysis is a method for documenting what conditions can lead to a failure of the system. Fault trees are typically represented as top-down diagrams that display the relationship between a system failure (the root) and its potential causes, where causes can be, among others, component failures or unanticipated environmental conditions. In the presence of existing safety mechanisms, there are typically chains of multiple events that must occur together to cause the system failure. Also, there are often multiple independent conditions that can trigger the same system failure. Fault trees can explore and document these complex relationships.

Fault trees are commonly used to analyze a past failure by identifying the conditions that led to the failure and the alternative conditions that would have prevented it. The tree can then be used to discuss how this and similar system failures can be prevented, usually by making changes to the system design.

Creating fault trees. To create a fault tree, we start with an event describing the system failure. Note that wrong predictions of models are component mistakes that can lead to a system failure, but they are not system failures themselves — a system failure should describe a requirements violation of the entire system in terms of real-world behavior. Typically, system failures are associated with small or severe harms, from stress and pollution to bodily injury. For example, the autonomous train might collide with an obstacle or trap a passenger in the door, and the email response system may offer or even send an inappropriate message. Then, starting with this system failure event, we break down the event into more specific events that are required to trigger the failure, which may include wrong predictions. In the graphical notation, and gates and or gates describe whether all subevents are required to trigger the parent event or whether a single subevent is sufficient. Breaking down events into smaller contributing events continues recursively as far as useful for the analysis (deciding when to stop requires some judgment). Events that are not further decomposed are called basic events in the terminology of fault tree analysis.

Consider the following example in the context of the autonomous train. We are concerned about the system failure of trapping a passenger in the door (a violation of a safety requirement). This failure can only occur when the vision-based model does not detect the passenger in the door and, at the same time, the pressure sensor that detects an obstacle when the door closes fails. Alternatively, the failure can occur if a human operator deactivates the safety systems with a manual override. We can further decompose each of these subevents. For example, the vision-based system may fail due to a wrong prediction of the model, a failure of the camera, or bad lighting conditions near the door. The pressure sensor in the door may fail if the sensor malfunctions or the software processing the sensor signals crashes. Each of these events can be decomposed further, for example, considering possible causes of poor lighting conditions or possible causes for inappropriate manual overrides.

Partial example of a fault tree diagram for the system failure of trapping a person in the door of an autonomous train.

Generally, as discussed, we should consider machine-learned models as unreliable, so they tend to show up prominently as events in a fault tree. While we can speculate about some reasons for failure, those are often neither necessary nor sufficient conditions. Hence, we typically recommend treating the failure of a machine-learned model as a basic event, without further decomposition.

The untangling of requirements into system requirements (REQ), environmental assumptions (ASM), and software specifications (SPEC), as discussed in chapter Gathering Requirements, can be very useful for considering possible failures from different angles. The top-level event corresponds to a requirements violation of the system (in the real world), but events that contribute to this top event can usually be found in wrong assumptions or unmet specifications. While it is often intuitive to list specification violations (software bugs, wrong model predictions), it is also important to question environmental assumptions and consider their violation as events in the fault tree. For example, we may have assumed about the environment that lighting conditions are good enough for the camera or that human operators are very careful when overriding the door safety system — and violations of these assumptions can contribute to the system failure.

Analyzing fault trees and designing mitigations. Fault trees show the various conditions that lead to a system failure and are good at highlighting mitigation mechanisms or the lack thereof. It is straightforward to derive the conditions under which the system fault occurs by listing the set of basic events necessary to trigger the fault — such a set is called a cut set, and in our example “Person in door” + “Camera defective” + “Pressure sensor fails” is one possible cut set among many. Cut sets that cannot be further reduced, because removing any one basic event would prevent the failure, are called minimal cut sets. Now a system designer can inspect these conditions and identify cases where additional mitigations are possible, either by preventing basic events or by adding additional events to the minimal cut set, for example, through design changes such as introducing safeguards, recovery mechanisms, redundancy, or keeping humans in the loop.
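
Cut sets can also be derived mechanically from a fault tree. The following sketch encodes a simplified version of the door example with and/or gates and enumerates its minimal cut sets; the tree structure and event names are illustrative, not a complete model.

```python
# Minimal sketch of deriving minimal cut sets from a fault tree with AND/OR gates.
from itertools import product

fault_tree = ("OR",
              ("AND", "person in door",
                      ("OR", "model misses person", "camera defective"),
                      "pressure sensor fails"),
              "manual override of door safety")

def cut_sets(node):
    if isinstance(node, str):                      # basic event
        return [frozenset([node])]
    gate, *children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":                               # any child triggers the event
        return [cs for sets in child_sets for cs in sets]
    if gate == "AND":                              # all children must occur together
        return [frozenset().union(*combo) for combo in product(*child_sets)]
    raise ValueError(gate)

def minimal_cut_sets(sets):
    return [s for s in sets if not any(other < s for other in sets)]

for cs in minimal_cut_sets(cut_sets(fault_tree)):
    print(sorted(cs))
```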

In our example, it seems problematic that a crash of a software module can lead to the door closing. Hence, through a system redesign, we should change the default action and prevent the door from closing when the software reading the door sensor is not responsive, thereby removing this basic event and with it a failure condition. In the same example, we can also harden the system against the failure of a single pressure sensor by installing two pressure sensors; now two pressure sensor failures need to occur at the same time to cause the system failure, increasing the size of the minimal cut set and making the failure less likely. There is often little we can do about the possibility of mistakes from the machine-learned model, but we can make sure that additional events are needed to trigger the system fault, such as having both a vision-based and a pressure-sensor-based safety control for the door, as we already do in the design of our example.

A typical fault tree analysis process goes through multiple iterations of (1) analyzing requirements, including environmental assumptions and specifications, (2) constructing or updating fault trees for each requirement violation of concern, (3) analyzing the trees to identify cut sets, and (4) considering design changes to eliminate basic events or increase the size of cut sets with additional events.

Note that fault trees are never really complete. Even with careful requirements analysis, we may miss events that contribute to a failure (“unknown unknowns” or “black swan events”). Domain expertise is essential for creating fault trees and for judging how far to decompose events. Practitioners may revise fault trees as they learn more from discussions with domain experts or from analyzing past system failures. Also, our mitigations will rarely be perfect, especially if we still have to deal with the unreliable nature of machine-learning components and the complexities of reasoning about the real world and human behavior from within software. However, even if fault trees are incomplete, they are a valuable tool for thinking through failure scenarios and deliberating about mitigations as part of the requirements and design process of the system, to reduce the chance of failures occurring in practice.

Failure Modes and Effects Analysis (FMEA)

Where fault trees reason from a system failure backward to events that cause or contribute to the failure, the method Failure Modes and Effects Analysis (FMEA) reasons forward from component failures to system failures and corresponding hazards. Where backward search in fault tree analysis is particularly useful to analyze accidents and anticipated failures in order to improve systems, forward search with FMEA is useful to identify previously unanticipated problems.

Rather than starting with requirements, FMEA starts by identifying the components of the system, then enumerates the potential failure modes of each component, and finally identifies what consequences each component failure can have on the rest of the system and how this could be detected or mitigated. Understanding the consequences of a component failure typically requires a good understanding of how different components in the system interact and how a component contributes to the overall system behavior. FMEA is a natural fit for systems with machine-learning components: Since we can always assume that the model may make mistakes, FMEA guides us in thinking through the consequences of these mistakes for each model.

In the autonomous train’s door example, the vision-based system can fail by not detecting a person in the door or by detecting a person where there is none. Thinking through the consequences of these mistakes, we find that the former can lead to harming a person when closing the door and the latter can prevent the train from departing, causing delays. From here, we can either directly consider mitigations, such as adding a pressure sensor at the door or adding the ability for human (remote) operators to override the system, or we can use fault tree analysis to understand the identified failure and its conditions and mitigations in more detail. In our other scenario of suggested email responses, it may be worth thinking through failure modes in more detail than just “provides a wrong prediction” and analyzing the ways in which the prediction may be wrong: it may be off topic, incomprehensible, misspelled, impolite, offensive, gender biased, slow to compute, or wrong in other ways — the resulting failures and harms may differ. For many machine-learning tasks, classifications of common mistakes already exist that can guide the analysis, such as common mistakes in object detection, common mistakes in pedestrian detection, and common mistakes in natural language inference.

FMEA is commonly documented in tabular form, with one row per failure mode of each component. The row describes the component, the component’s failure mode, the resulting effects on the system, potential causes, potential strategies to detect the component failure (if any), and recommended actions (or mitigations). Typically, the severity of the issue is also judged numerically to prioritize problems and mitigations.

Excerpt of an FMEA table for analyzing components in an autonomous vehicle, from 🗎 David Robert Beachum. Methods for assessing the safety of autonomous vehicles. University of Texas Theses and Dissertations (2019).
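
In practice, such a table is easy to keep as structured data. One common convention (not required by FMEA itself) is to rate severity, occurrence, and detectability on a 1 to 10 scale and rank failure modes by their product, the risk priority number; the following sketch uses entirely made-up ratings for the train example.

```python
# Minimal sketch of an FMEA table as data, ranked by risk priority number (hypothetical ratings).
import pandas as pd

fmea = pd.DataFrame([
    {"component": "vision obstacle detection", "failure_mode": "misses person in door",
     "effect": "person trapped when door closes", "severity": 9, "occurrence": 3,
     "detectability": 6, "mitigation": "add pressure sensor in door"},
    {"component": "vision obstacle detection", "failure_mode": "detects non-existent person",
     "effect": "train delayed in station", "severity": 3, "occurrence": 4,
     "detectability": 3, "mitigation": "remote operator can override"},
])
fmea["rpn"] = fmea["severity"] * fmea["occurrence"] * fmea["detectability"]
print(fmea.sort_values("rpn", ascending=False)[["failure_mode", "rpn", "mitigation"]])
```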

Like fault tree analysis, FMEA does not provide any guarantees, but it offers structured guidance for systematically thinking through a problem, explicitly considering the many ways each component can fail. While this may not anticipate all possible system failures, it helps to anticipate many.

Hazard and Operability Study (HAZOP)

Hazard and Operability Study (HAZOP) is another classic method that, similarly to FMEA, performs a forward analysis from component mistakes to system failures. HAZOP is fairly simple and can be thought of as a guided creativity technique to identify possible failure modes in components or intermediate results: It guides analysts to think about each component with specific guide words to come up with failure modes.

While guide words may be customized in a domain-specific way, common guide words include:

  • NO OR NOT: Complete negation of the design intent
  • MORE: Quantitative increase
  • LESS: Quantitative decrease
  • AS WELL AS: Qualitative modification/increase
  • PART OF: Qualitative modification/decrease
  • REVERSE: Logical opposite of the design intent
  • OTHER THAN / INSTEAD: Complete substitution
  • EARLY: Relative to the clock time
  • LATE: Relative to the clock time
  • BEFORE: Relating to order or sequence
  • AFTER: Relating to order or sequence

Some researchers have suggested machine-learning specific guide words, such as WRONG, INVALID, INCOMPLETE, PERTURBED, and INCAPABLE.

An analysis with HAZOP now considers each component or component output in combination with each of the guide words. For example, what might it mean if the obstacle detection component in the autonomous train does not detect an obstacle, detects more than an obstacle or only part of an obstacle, detects the obstacle late, or increasingly makes more mistakes over time as data distributions drift? Not every guide word makes sense with every component (reverse of detecting an obstacle?) and some combinations might require creative interpretation (after?), but they can lead to meaningful failure modes that had not been considered before. The guide words can also be applied to other parts of the system, including training data, runtime data, the training pipeline, and the monitoring system, for example guiding us to think about the consequences of wrong labels, not enough training data, delayed camera inputs, perturbed camera inputs, drifting distributions, and missing monitoring. From there, corresponding system failures can be identified as with FMEA.
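
Because HAZOP is essentially a structured checklist, it is easy to generate the prompts to discuss systematically, as in this sketch; the components and the selection of guide words are illustrative.

```python
# Minimal sketch of using HAZOP guide words as a checklist generator (illustrative components).
from itertools import product

components = ["obstacle detection output", "camera input", "training labels"]
guide_words = ["NO OR NOT", "MORE", "LESS", "LATE", "OTHER THAN", "WRONG", "PERTURBED"]

for component, guide_word in product(components, guide_words):
    # In a workshop, each prompt would be discussed and either dismissed or
    # elaborated into a concrete failure mode and its system-level consequences.
    print(f"What if '{component}' is/has {guide_word}?")
```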

Summary

Machine-learned models will always make mistakes; there is no way around it. We may have some intuition for some mistakes, but others are completely surprising, weird, or stem from deliberate attacks. Improving models is a good path toward improving systems, but it will not eliminate all mistakes. Understanding the source of some mistakes can help us improve the models, but it is not sufficient to assure their correctness. Hence, it is important to consider system design and mitigation strategies that ensure that mistakes from machine-learned models do not result in serious faults of the system that may cause harm in the real world, small or severe. Responsible engineers will explicitly consider the consequences of model mistakes in their system to anticipate problems and design mitigations.

Classic safety engineering techniques such as fault tree analysis, FMEA, and HAZOP can help to analyze the causes of (potential) system failures and the consequences of component failures. While not providing guarantees, these techniques help to anticipate many problems and help to design a system to avoid problems or make them less likely to occur.

Once problems are anticipated, there are often many design strategies to compensate for model mistakes in the rest of the system, with safeguards, recovery mechanisms, redundancy, or isolation, or by designing the interactions between the system and humans with different degrees of forcefulness. For example, with suitable user interaction design, we can ensure that humans retain agency and can override model mistakes, for example, by offering suggestions rather than fully automating actions or by allowing humans to undo automated actions.

Further Reading

As with all chapters, this text is released under the Creative Commons BY-SA 4.0 license.
