Quality Attributes of ML Components

Christian Kästner
Feb 15, 2022

This chapter corresponds to the “Tradeoffs among Modeling Techniques” lecture of our Machine Learning in Production course. For other chapters see the table of contents.

Traditionally, data science education and data scientists in practice focus on prediction accuracy as the primary goal during model development. However, when used in a production setting, as part of a system, there are many other quality attributes of interest for models and the machine-learning algorithms used to learn them. In this chapter, we survey common qualities of interest and how to identify constraints and negotiate tradeoffs, both for the ML components and the system as a whole.

Running Example: Detecting Credit Card Fraud

While it is useful to illustrate the range of different qualities with different use cases throughout this chapter, we will use one running example of automated credit card fraud detection offered as a service to banks. Fraudulent credit card transactions come in different shapes, but there are common patterns that help with detection. Nonetheless, criminals tend to explore new strategies constantly, especially as banks detect some patterns more easily.

Consider a company that develops a new way of detecting credit card fraud with high accuracy, using a combination of classic anomaly detection with many handwritten features and a novel deep neural network model that considers customer profiles built on third party data from advertising networks beyond just the customer’s past credit-card transactions. The company offers its services to banks, who pay a (very small) fee per analyzed transaction. In addition to automated detection of fraud, the company also employs a significant number of humans to manually review transactions and follow up with the bank’s customers. Banks provide access to a real-time data feed of transactions and fraud claims.

From System Quality to Model and Pipeline Quality

A key part of requirements engineering in any software project is typically to identify the relevant quality requirements for the system, in addition to functional requirements. In traditional software projects, quality requirements may include scalability, response time, cost of operation, usability, maintainability, safety, security, and time to release. In our credit card scenario, we want to detect fraud accurately and quickly, move with the evolving fraud schemes, and make a profit on sheer volume of transactions and low human involvement.

As discussed in chapter From Model to System, machine learning components — including the learned models, the training pipeline, and the monitoring infrastructure — are part of a larger system and those components need to support the system’s quality goals:

  • A credit card fraud detection system that needs to react quickly to changing fraud patterns will not be well supported by a model that takes weeks to (re-)train and deploy. The system’s quality requirement of modifiability conflicts with the machine-learning pipeline’s training latency.
  • A recommendation algorithm that needs minutes to provide a ranking will not be usable in most interactive settings in response to user queries, say on a shopping website. The system’s quality requirement of fast average response times conflicts with the model’s inference latency.
  • A voice activated smart home device is not well served by a monitoring infrastructure that sends all audio recordings to a cloud server for analysis, transferring multiple terabytes of audio data per day. The system’s quality requirements regarding privacy and operating cost conflict with the monitoring infrastructure’s goals of comprehensive monitoring.

The main observation is that there are many possible quality goals for machine-learning components in a system, not just prediction accuracy. A key step in designing the system, when decomposing the system into components, is to identify which qualities of machine-learning components are important to achieve the system’s quality requirements. Understanding the system goals can influence what design decisions are feasible within machine-learning components and how to negotiate tradeoffs between different qualities. For example, analyzing the system quality requirements of our fraud-detection system, we may realize that throughput without massive recurring infrastructure cost is essential, so we may need to make compromises when using deep learning to avoid excessive inference costs from large models, or design a two-stage process with a faster screening model first and a more expensive model analyzing only a subset of transactions (we will come back to this in chapter Deploying a Model). Conversely, understanding limitations of machine-learning components can also inform design decisions for other parts of the system and whether the system as a whole is feasible at all with the desired qualities. For example, establishing accuracy estimates and per-transaction inference costs for fraud detection can inform how much human oversight is needed and whether a business model that relies only on very low per-transaction fees is feasible.

On terminology: Software engineers tend to speak of quality attributes, quality requirements, or non-functional requirements to describe desirable qualities of a system beyond functional requirements that describe what outputs should be produced for given inputs. Typical quality requirements include latency, safety, and usability, but the boundary between functional requirements and quality attributes is hard to define precisely. Data scientists sometimes speak of model properties when referring to accuracy, inference latency, fairness, and other qualities of a learned model, though the same term is also used for capabilities of the learning algorithm (e.g., whether the algorithm can learn non-linear relationships or whether learning is monotonic or incremental). We will use the software-engineering term quality attribute to refer to qualities of the system and its components, including machine-learning components.

Common Quality Attributes

In general, the concept of quality can be hard to pin down. Product quality can be viewed in many different ways. For example, transcendent quality refers to an experiential view where quality can be recognized but not defined or measured, as arguably with many forms of art. Product-based quality refers to levels of concrete measurable attributes where one product may have more than another, say speed or fuel efficiency of a car. User-based quality refers to how well a product fits a purpose, for example, whether a bicycle is practical for my everyday needs, whereas value-based quality incorporates the costs, say whether the bicycle meets my needs at the given costs. Manufacturing quality, in turn, refers simply to whether a product meets a given specification and follows a prescribed process during manufacturing. There are equivalents for all of these views of quality in the context of machine learning: a model creating enjoyable experiences when controlling opponents in a game (transcendent quality), a model smaller in memory than another (product-based quality), a model helping a user to avoid hassles from credit card fraud (user-based quality), a model saving the bank more money from preventing credit card fraud than invested in the cost for its development and operation (value-based quality), and a model audited for fairness (manufacturing quality).

In the following, we provide an overview of common quality attributes that may be relevant to consider for machine-learning algorithms and the models they learn. We leave qualities of machine-learning pipelines and monitoring infrastructure to the corresponding later chapters.

Excursion: A very very brief introduction to decision trees and deep learning

To illustrate some quality differences, we refer to some concepts from decision trees and deep learning. For readers not familiar with how those machine-learning algorithms work, we provide a very brief introduction.

Decision trees. In a nutshell, decision tree learning algorithms build models that consist of nested if decisions, each of which decides based on a single feature input (in the form of “f<x” or “f=x” where f is a feature and x is a learned constant from the domain of that feature). At the end of each decision path is the predicted label, another learned constant. In machine-learning terminology, the conditions at the decision points and the resulting labels in the leaves are the parameters of the model. During inference, the decisions along a single path are evaluated with the provided feature vector as input until a leaf node is reached, whose label is returned as the prediction.

Decision tree learned from credit card transactions, identifying that high-value transactions are more likely to be fraud based on the amount and time of the transaction as well as the average amount the customer spends and a risk score of the terminal used. Example based on data and model from Machine Learning for Credit Card Fraud Detection — Practical Handbook.

The key strategy of learning decision trees is fairly straightforward: Given a dataset of feature vectors and corresponding labels, the algorithm enumerates all possible decisions (i.e., all combinations of features and thresholds). For each decision, it measures how well the decision splits the dataset into two datasets that each contain mostly the same label (typically using entropy or a similar measure). It takes the best split, creates a decision node with the corresponding decision, and then recursively applies the same strategy to the new datasets in both branches. Recursion ends when all rows in the dataset of a leaf have the same label or when a maximum depth has been reached (a hyperparameter). Each leaf gets assigned a label simply from the majority label of the corresponding dataset.
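The learning strategy described above can be sketched in a few lines of Python. This is a minimal illustration under several assumptions, not a production algorithm: the function names (entropy, best_split, learn_tree, predict), the tuple-based tree representation, and the tiny dataset are all made up for this sketch, and only numeric decisions of the form “f<x” are considered.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of labels (zero for a pure set)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(rows, labels):
    """Enumerate all (feature, threshold) pairs; return the one with the lowest weighted entropy."""
    best = None  # (score, feature_index, threshold)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] < threshold]
            right = [l for r, l in zip(rows, labels) if r[f] >= threshold]
            if not left or not right:
                continue  # this decision does not separate any rows
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

def learn_tree(rows, labels, max_depth=3):
    """Return a leaf label, or a node (feature, threshold, left_subtree, right_subtree)."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority  # recursion ends: pure dataset or maximum depth reached
    split = best_split(rows, labels)
    if split is None:
        return majority
    _, f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] < t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] >= t]
    return (f, t,
            learn_tree([r for r, _ in left], [l for _, l in left], max_depth - 1),
            learn_tree([r for r, _ in right], [l for _, l in right], max_depth - 1))

def predict(tree, row):
    """Follow one path through the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] < t else right
    return tree

# Hypothetical toy data: (amount, hour of day) with a label per transaction
rows = [(20, 14), (35, 10), (900, 3), (1200, 2), (50, 23), (1100, 15)]
labels = ["ok", "ok", "fraud", "fraud", "ok", "ok"]
tree = learn_tree(rows, labels)
print(tree, predict(tree, (500, 4)))
```

Real implementations avoid the repeated scans over the dataset and support categorical features, pruning, and other stopping criteria; the sketch only mirrors the enumerate, score, and recurse structure of the description.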

The learned decision tree corresponds to a function with a nested structure of if-then-else statements based on the learned decisions, returning the learned labels when a leaf node is reached. That is, during inference, we simply need to evaluate the decisions along one path through the decision tree.
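To make the correspondence concrete, a learned tree could be written out as an ordinary function. The following is a hand-written sketch, not the tree from the figure: the feature names follow the figure caption, but all thresholds and the tree shape are invented for illustration.

```python
def predict_fraud(amount, hour, customer_avg_amount, terminal_risk):
    """A hypothetical learned decision tree written as nested if-then-else.

    Every threshold and every returned label is a learned constant
    (a model parameter); the values here are made up.
    """
    if amount < 220.0:
        if terminal_risk < 0.8:
            return "not fraud"
        else:
            return "fraud"
    else:
        if customer_avg_amount < 150.0:
            return "fraud"  # high amount for a customer who usually spends little
        else:
            if hour < 6:
                return "fraud"
            else:
                return "not fraud"

# Inference evaluates only the decisions along one path.
print(predict_fraud(amount=900.0, hour=3, customer_avg_amount=400.0, terminal_risk=0.1))
```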

Random forests are an extension of the decision tree algorithm where multiple trees are learned, each on a random subset of features and a random subset of rows from the training data. During inference, all trees are evaluated and the average label (or, for classification, the majority label) is returned as the prediction. The idea is to avoid overfitting by introducing some randomness into the learning process.

Neural networks and deep learning. In machine learning, neural networks simulate biological networks of neurons and synapses found in human brains. A basic building block is an artificial neuron with n inputs and one output, where the neuron’s output depends on the neuron’s inputs in a learned way. Typically, a weighted sum of all inputs is computed (based on learned weights for that neuron), added to a learned constant for the neuron called bias, and returned as output after passing through an activation function ϕ (e.g., ϕ(z) = if (z<0) 0 else 1). A neural network typically consists of many interconnected neurons. A typical network architecture is layered, where input nodes (nodes with output values determined by the feature vector during inference, say one input for each pixel of a picture) are each connected to neurons of a first layer, then the outputs of those neurons are connected to neurons of a second layer, and so forth, until the outputs of the last layer are interpreted as predictions (e.g., in the last layer, each neuron can be associated with a label and the label of the neuron with highest output returned as prediction). The weights of all edges and the neurons’ biases are the learned constants (“parameters” in machine-learning terminology) of a neural network.

Illustration of a neural network with 2 inputs, 5 neurons (grey circle) in two layers, and 2 outputs. Each solid edge is associated with a “weight”. The neuron’s “bias” is shown as dotted edges from activated nodes (white circle). Here neurons are grouped in layers and fully connected. Inputs could relate directly to columns of the fraud prediction data and outputs could relate to the labels fraud and not fraud.

Each edge of a neural network has a weight, and the output of a neuron is computed as ϕ(b+w1·x1+w2·x2+…+wn·xn) for a node with n inputs x1 to xn with corresponding weights w1 to wn and the neuron’s bias constant b. Typically, the outputs of multiple neurons that use the same inputs but different weights are computed with a single matrix computation W·X+b for a vector of inputs X, a vector of neuron biases b, and a matrix of weights W, before the result is processed with the activation function. A network with multiple layers is then a sequence of interleaved matrix multiplications and activation function applications, e.g. ϕ(W2·ϕ(W1·X+b1)+b2). The heavy use of matrix multiplication of floating point numbers is why neural networks benefit from GPU hardware, which performs such matrix computations much more efficiently.
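The computation ϕ(W2·ϕ(W1·X+b1)+b2) can be spelled out in plain Python as a sketch. The network shape loosely follows the illustration (2 inputs, 3 plus 2 neurons), but all weights and biases below are made-up values and the helper names step, layer, and predict are assumptions of this example; real implementations would use an optimized matrix library.

```python
def step(z):
    """Activation function from the text: 0 for negative input, 1 otherwise."""
    return 0.0 if z < 0 else 1.0

def layer(W, b, x, phi=step):
    """Compute phi(W·x + b) for one fully connected layer.

    Each row of W holds the weights of one neuron; b holds one bias per neuron.
    """
    return [phi(sum(w * xi for w, xi in zip(row, x)) + bias)
            for row, bias in zip(W, b)]

# Hypothetical learned parameters: 2 inputs -> 3 neurons -> 2 output neurons
W1 = [[0.5, -1.0], [1.5, 0.5], [-1.0, 1.0]]
b1 = [0.0, -1.0, 0.5]
W2 = [[1.0, 1.0, -0.5], [-1.0, -0.5, 1.0]]
b2 = [-1.5, 0.5]

def predict(x):
    """Inference: phi(W2·phi(W1·x + b1) + b2)."""
    hidden = layer(W1, b1, x)
    return layer(W2, b2, hidden)

out = predict([1.0, 0.2])
print(out)
```

With these made-up parameters, the first output neuron fires for the input [1.0, 0.2] and the second one for [0.0, 1.0]; which output corresponds to which label is a convention of the model.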

Neural networks are commonly trained with the backpropagation algorithm based on labeled training data. In a nutshell, this is an incremental algorithm starting with random initial values for all weights and biases. For each training input, all neurons are evaluated as during inference. If the predicted output is not the expected label (measuring error with a loss function), the algorithm then computes how much each input to the last layer has contributed to the mistake, tweaking weights a little toward the correct output (in a gradient descent process); this process is then repeated for neurons in lower layers. Training continues until weights stabilize. In short, training requires repeated inference and computation of the influence of individual neurons, which amounts to a lot of floating point computations.
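The core update step can be illustrated with a deliberately reduced sketch: gradient descent for a single neuron, so nothing is propagated back to lower layers. Several assumptions are made here that are not in the text: a sigmoid activation instead of a step function (so that a gradient exists), cross-entropy as the loss function, a fixed number of epochs instead of waiting for weights to stabilize, and a toy dataset encoding a logical OR.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Inference for one neuron: sigmoid of the weighted sum plus bias."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def mean_loss(weights, bias, data):
    """Average cross-entropy loss over the dataset."""
    total = 0.0
    for x, y in data:
        p = predict(weights, bias, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def train(data, epochs=500, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    # Start from random initial values for all weights and the bias.
    weights = [rng.uniform(-1, 1) for _ in data[0][0]]
    bias = rng.uniform(-1, 1)
    for _ in range(epochs):
        for x, y in data:
            # For sigmoid plus cross-entropy, the gradient of the loss with
            # respect to the weighted sum is simply (prediction - label).
            error = predict(weights, bias, x) - y
            # Tweak each weight a little against its contribution to the error.
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

# Hypothetical toy dataset: logical OR of two inputs
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train(data)
print(mean_loss(weights, bias, data))
```

Even in this tiny setting, every update requires one inference plus additional arithmetic, repeated over many passes through the data, which hints at why training large networks is so much more expensive than serving them.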

Deep learning simply refers to neural networks with multiple layers. The way neurons are arranged and connected is called the architecture of the neural network. The network shown above is fully connected, where each output at one level connects to every neuron on the next level. Innovations in neural network architectures include layers that are not fully connected, connections that skip layers, and networks that have cyclic connections, each encoding different ideas of how neurons work and specialized for different tasks that exceed the scope of this book.

Common quality attributes of machine-learned models

The primary quality usually considered by data scientists when building models is prediction accuracy, that is, how well the model learns the intended concepts for predictions. There are many ways to measure accuracy and break it down by different kinds of mistakes or subgroups, as we will discuss in chapter Model Quality: Measuring Prediction Accuracy.

In many production settings, time-related qualities are important. Inference latency is the measure of how long it takes to make a single prediction. Some models make near-instant predictions, like the roughly log n decision evaluations in a balanced decision tree with n internal nodes; other predictions require significant computational resources, such as evaluating a deep neural network with millions or billions of floating point multiplications. Some models have very consistent and predictable inference latency; for others, latency depends on the specific input. Common measures hence report the median latency or the 90th percentile latency. Inference throughput is a related measure of how many predictions can be made in a given amount of time, for example, when applied during batch processing. Scalability typically refers to how throughput can be increased with increasing demands, typically by distributing the work across multiple machines, say in a cloud infrastructure. In our credit card fraud scenario, latency is likely not critical as long as it is under a few seconds, but throughput is important given the large number of transactions to be processed.
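As a rough sketch of how these measures could be collected, the following Python snippet times individual predictions and summarizes them. The helper names measure_latencies and summarize and the stand-in toy_model are hypothetical; the throughput figure assumes strictly sequential predictions on a single machine, which understates what batching or parallel serving could achieve.

```python
import statistics
import time

def measure_latencies(predict, inputs):
    """Return the latency of each individual prediction in milliseconds."""
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies

def summarize(latencies_ms):
    # quantiles(n=10) returns 9 cut points: the 10th, 20th, ..., 90th percentile
    cut_points = statistics.quantiles(latencies_ms, n=10)
    total_seconds = sum(latencies_ms) / 1000
    return {
        "median_ms": statistics.median(latencies_ms),
        "p90_ms": cut_points[8],
        # sequential throughput: predictions per second of pure inference time
        "throughput_per_s": len(latencies_ms) / total_seconds if total_seconds > 0 else float("inf"),
    }

def toy_model(amount):
    """Stand-in for a real model; does a bit of work so timings are not zero."""
    return sum(range(1000)) > amount

summary = summarize(measure_latencies(toy_model, range(1000)))
print(summary)
```

Reporting a high percentile next to the median matters when latency depends on the input: a model can look fast on average while a tenth of all requests is much slower.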

Several model quality attributes inform hardware needs during inference, such as model size and memory footprint. Model size, typically simply measured as file size of the serialized model, influences how much data must be shipped with every model update (e.g., to end users as an app update on a phone). Larger models also require more storage for versioning. Again, decision trees in practice tend to be comparably small in size, whereas even small deep neural networks can be of substantial size. For example, a typical introductory example to classify images in the Fashion-MNIST dataset (28 by 28 pixel grayscale images, 10 output classes) with a three-layer feed-forward network of 300, 100, and 10 neurons has 266,610 parameters; if each parameter is stored as a 4-byte float, this would be about 1 megabyte just for storing serialized model parameters. State-of-the-art deep neural network models are much bigger: for example, OpenAI’s GPT-3 model from 2020 has 96 layers, about 175 billion weights, and needs about 700 gigabytes in memory (one order of magnitude more memory than even high-end desktop computers in 2020 usually had).
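The parameter count and size estimates above can be reproduced with a few lines of arithmetic. The helper name dense_parameters is made up for this sketch, and the calculation assumes fully connected layers with one bias per neuron and 4 bytes per parameter, ignoring any file format overhead.

```python
def dense_parameters(layer_sizes):
    """Weights plus biases of a fully connected feed-forward network.

    layer_sizes lists the number of inputs followed by the neurons per layer.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

BYTES_PER_PARAMETER = 4  # assumption: 32-bit floats

# 28x28 pixel inputs, then layers of 300, 100, and 10 neurons
params = dense_parameters([28 * 28, 300, 100, 10])
size_bytes = params * BYTES_PER_PARAMETER
print(params, "parameters,", size_bytes / 1e6, "MB")

# Same arithmetic for a model with about 175 billion weights
large_model_bytes = 175_000_000_000 * BYTES_PER_PARAMETER
print(large_model_bytes / 1e9, "GB")
```

Most parameters sit in the first layer (784 · 300 + 300 = 235,500), so input dimensionality and the width of the first layer dominate the size of such small networks.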

In some settings, the energy consumption per prediction (on given hardware) is very relevant. Limits to available energy can seriously constrain what kind of models can be deployed in mobile phones and battery-powered smart devices.

In some settings it can be useful to know that predictions are deterministic, that is, a model always makes the same prediction for the same input, as it can simplify monitoring and debugging. While many learning algorithms are nondeterministic (see below), almost all models learned with machine-learning techniques are deterministic during inference; for example, a decision tree will always take the same path for the same inputs, a neural network will always compute the same floating point numbers for the same inputs, reaching the same prediction.

Many of these model qualities directly and indirectly influence the cost of predictions through hardware needs and operating costs. Larger models require more expensive hardware, deep learning models relying heavily on floating point operations benefit from GPUs, and higher throughput demand can be served with more computational resources even for slower models. Some companies like Tesla even develop specialized hardware to support the necessary computing power for the vast amount of floating point computations needed by deep neural models, while meeting latency or throughput requirements in the real-time setting of an automated system in a car that receives a constant stream of sensor inputs. An often useful measure that captures operating cost for a specific model is the cost per prediction (which also factors in cost for training the model, discussed below). If the benefits of the model in a production system (e.g., more sales, detected fraud, ad revenue) or the cost a client is willing to pay does not outweigh the cost per prediction of the model, it is simply not economically viable. In our credit card fraud detection scenario, cost per prediction is an important measure, because revenue is directly related to the volume of transactions.

Beyond measures relevant for serving the model, engineers are often interested in other properties of a model that influence how the model can be used in production as part of a system. Interpretability and explainability are often important qualities for a model in a system, which describe to what degree a human can understand the internals of a model (e.g., for debugging and auditing) and to what degree the model can provide useful explanations for why it predicts a certain output for a given input. Explainability may help developers to gain confidence in the correctness of the model if they can inspect that the decisions and thresholds of a decision tree seem plausible. Explainability may also be essential for a user of the system to develop trust in a model’s predictions, say for a bank clerk to process a predicted fraudulent credit card transaction. Small decision trees are usually easy to understand for humans, whereas parameters of a deep neural network do not convey any intuitive meaning. We will discuss this much more in the corresponding chapter Interpretability and Explainability.

Several further qualities are sometimes discussed in the context of safety, security, and privacy. A model’s robustness characterizes to what degree a model’s predictions are stable when the input is changed in minor ways. A model is called calibrated when the confidence scores the model produces reflect actual probability that the prediction is correct. More robust and better calibrated models may be used with more confidence by other parts of the system, especially when safety is a concern. Knowing that a prediction is robust to certain perturbations of the input may also be a useful insight when assessing whether the model is attacked with malicious inputs. Some researchers have also suggested assessing privacy as a quality, for example, measuring to what degree individual training data can be recovered from a model. Again, we will discuss these in more detail in the Safety and Security chapters.
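One common way to quantify calibration is the expected calibration error: predictions are grouped into bins by confidence score, and within each bin the average confidence is compared to the observed accuracy. The sketch below is a simplified version with equal-width bins; the function names, the number of bins, and the example data are assumptions of this illustration.

```python
def calibration_table(confidences, correct, n_bins=5):
    """Group predictions into equal-width confidence bins.

    Returns one (average confidence, observed accuracy, count) tuple per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes into the last bin
        bins[index].append((conf, ok))
    table = []
    for items in bins:
        if items:
            avg_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(1 for _, ok in items if ok) / len(items)
            table.append((avg_conf, accuracy, len(items)))
    return table

def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between confidence and accuracy across bins."""
    total = len(confidences)
    return sum(n / total * abs(avg_conf - acc)
               for avg_conf, acc, n in calibration_table(confidences, correct, n_bins))

# Hypothetical overconfident model: claims 90% confidence but is right 1 out of 4 times
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
print(ece)
```

A value near zero indicates that confidence scores can be read as probabilities, which is what downstream components need when they decide, for example, which transactions to forward to human review.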

Model fairness, which also deserves its own chapter, characterizes various notions of how a model relies on protected attributes for its predictions or to what degree it has different qualities (e.g., in terms of predicted outcomes or accuracy of those predictions) for different regions of the input, typically split by gender, race, or other protected attributes. Again it is important to understand the model qualities and how the model interacts with other parts of the system to achieve system goals. A system may still be fair if bias in models is compensated by other parts of the system design, but it is important to understand model qualities to design the rest of the system accordingly.

Most quality attributes of models can be strongly influenced by the choice of machine-learning algorithm (and hyperparameters) used to train the model. Many, including accuracy, robustness, and fairness, also depend heavily on the training data used to train the model.

Common quality attributes of machine-learning algorithms

In addition to quality attributes for the learned model, engineers also often need to make decisions that consider quality attributes of the learning process, especially if the model is to be retrained regularly in production.

A key concern about the training process is training latency, that is, how long it takes to train or retrain a model. Some machine-learning algorithms can be distributed more or less easily (learning is usually not as embarrassingly parallel as serving, so distributing learning usually involves more complexity, which we briefly discuss in chapter Scaling the System). Some learning algorithms scale better to (1) larger datasets or (2) more features than others. Hardware requirements regarding memory, CPU, and GPU also vary widely between learning algorithms, such as deep learning benefitting substantially from GPUs for all the floating point arithmetic involved in learning. All this influences training cost, which in turn informs the amount of experimentation that is feasible for an organization and the frequency with which models can feasibly be retrained in production. For example, the GPT-3 model is estimated to have cost between 4 and 12 million US dollars for necessary computing resources alone (about 355 GPU-years) for a single training run. In our fraud detection scenario, high training costs may be acceptable as they are amortized across many predictions; regular retraining to account for new fraud patterns will be necessary, but will probably not occur much more often than daily or weekly.

The resources that organizations are able or willing to invest for (re-)training and experimentation vary widely between organizations. Especially when it comes to expensive training of large deep neural networks with very large datasets, large organizations with plenty of funding have a significant advantage over smaller organizations, raising concerns that a few large companies may dominate the market for certain kinds of models and hinder competition. In addition to costs, energy consumption during training has also received attention. In contrast to concerns about battery life during inference, concerns during training relate to high energy consumption and the corresponding CO2 emissions: it has been pointed out that training some large deep neural networks (transformer with neural architecture search or GPT-3) requires electricity that corresponds to an estimated 284 tons of CO2 emission, which approximately corresponds to the lifetime CO2 emission of an average human.

Some machine-learning algorithms allow incremental training, which can significantly reduce training costs in some settings, especially if more training data is added over time, say from telemetry of the production system. For example, deep neural networks are trained incrementally to begin with and can be continuously trained with new data, whereas the standard decision tree algorithm needs to look at all training data at once and hence needs to start over from scratch when the training data changes. If we have access to live fraud claims in the credit card fraud scenario, incremental training on live data might surface new fraud patterns very quickly.

In a production system, the cost per prediction may be dominated by training costs, which can in some settings dwarf the costs for serving the model in production, especially if extensive experimentation was involved in building the model or when models are regularly retrained at high cost or the volume of predictions is low.
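A back-of-the-envelope calculation illustrates how prediction volume decides whether training or serving dominates. The function name and all dollar amounts below are hypothetical; the sketch simply amortizes all training runs of a period over the predictions served in that period and ignores experimentation, staff, and data costs.

```python
def cost_per_prediction(training_cost_per_run, training_runs,
                        serving_cost_per_prediction, predictions):
    """Serving cost plus training cost amortized over all predictions of the period."""
    return serving_cost_per_prediction + training_cost_per_run * training_runs / predictions

# Hypothetical month: daily retraining at $500 per run, $0.0002 serving cost per prediction
high_volume = cost_per_prediction(500.0, 30, 0.0002, 100_000_000)
low_volume = cost_per_prediction(500.0, 30, 0.0002, 1_000_000)

print(high_volume)  # training adds $0.00015 per prediction, less than serving
print(low_volume)   # training adds $0.015 per prediction and dwarfs serving
```

With the same training schedule, the per-prediction cost differs by more than an order of magnitude between the two volumes, which is why retraining frequency and expected volume need to be considered together.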

Beyond qualities related to cost and scalability, there are also a number of other considerations for training algorithms that are relevant in some settings. For example, some algorithms may work better on smaller datasets, some require less investment in feature engineering, some only learn linear relationships among features, some are more robust to noisy training data, some are more stable and reproducible, possibly even deterministic in training. For example, deep neural networks are highly nondeterministic in training (e.g., random initial seeds, timing and batching differences during distributed learning) and may produce models with substantial differences in accuracy even when using the exact same hyperparameters; in contrast, basic decision tree algorithms are entirely deterministic and will reproduce the same model given the same data and hyperparameters.

Other qualities in production ML projects

Beyond qualities of models and machine-learning algorithms, there are also many qualities that will influence other considerations for machine-learning components. For example, if models are to be retrained regularly, automation, stability, reproducibility, and especially observability of the entire machine-learning pipeline become more important, and we may want to push new models into production quickly or enable continuous learning and continuous experimentation in practice. In our fraud detection scenario, we likely want to plan for regular model updates from the start and hence will value automation and observability. Similarly, we might care about qualities of our monitoring infrastructure, such as how much data is produced, whether private data is anonymized, how sensitive our monitoring instruments are, and how quickly anomalies can be reported. We will discuss these properties in chapters Automating the ML Pipeline and Planning for Operations.

Beyond qualities of ML components, of course, we also care about qualities of the system as a whole, including response time, safety, security, and usability. That is, traditional requirements engineering for the entire system and its non-ML components is just as important.

Constraints and Tradeoffs

Understanding quality requirements for machine-learning components will help data scientists to make more informed decisions as they select machine-learning algorithms and develop models. A model will never excel at all qualities equally, so data scientists and software engineers will need to make decisions that trade off different qualities, for example, how much loss of accuracy is acceptable to cut the inference latency in half. Many tradeoff decisions will be nonlocal and involve stakeholders of different teams, for example, (1) can the frontend team sacrifice some explainability of the model for improved accuracy, (2) can software engineers better assure safety with a model that is less accurate but calibrated, and (3) can the operations team provide the infrastructure to retrain the model daily and still have capacity for experimentation? Being explicit about quality requirements facilitates such discussions within and across teams.

A useful way of thinking about possible decisions is as a form of design space exploration. The key idea is that there are many possible design decisions (e.g., what machine-learning algorithm to use, what hyperparameters, what hardware resources, how much investment in feature engineering or additional data collection) that interact and together form the design space of all possible designs. Identifying hard constraints that are not negotiable reduces the design space to those designs that are not obviously infeasible. For example, in our credit fraud detection scenario we may know that the cost per prediction cannot exceed $0.001 to make a profit; in the real-time setting of a self-driving car analyzing camera images with 25 images per second, inference latency cannot exceed 1/25 seconds (40 ms). Any solution that does not meet these hard constraints can be discarded; hard constraints can possibly exclude many designs.

Design space exploration: The space of all possible designs (dotted rectangle) is reduced by several constraints on qualities of the system, leaving only a subset of designs for further consideration (highlighted center area).
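Filtering by hard constraints can be sketched as a simple predicate over candidate designs. The cost limit of $0.001 comes from the fraud detection example; everything else is assumed for illustration, including the candidate designs, their quality estimates, and the 2-second latency limit standing in for “a few seconds”.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Design:
    name: str
    cost_per_prediction: float  # US dollars
    latency_ms: float
    accuracy: float

# Hypothetical candidate designs with made-up quality estimates
candidates = [
    Design("small decision tree", 0.00001, 1, 0.91),
    Design("random forest", 0.0001, 15, 0.94),
    Design("deep network on CPU", 0.0008, 2500, 0.96),
    Design("deep network on GPU", 0.003, 40, 0.96),
    Design("two-stage: tree, then deep network", 0.0006, 60, 0.955),
]

MAX_COST = 0.001       # hard constraint from the fraud detection example
MAX_LATENCY_MS = 2000  # assumption: "a few seconds" interpreted as 2 seconds

def is_feasible(design):
    """A design is feasible only if it satisfies every hard constraint."""
    return (design.cost_per_prediction <= MAX_COST
            and design.latency_ms <= MAX_LATENCY_MS)

feasible = [d for d in candidates if is_feasible(d)]
for d in feasible:
    print(d.name)
```

Accuracy plays no role in this step: it is a quality to trade off among the feasible designs, not a reason to keep a design that violates a hard constraint.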

The remaining feasible solutions in the design space meet the hard constraints but are not usually all equally desirable. Some may correspond to more accurate models, some to more explainable ones, some to lower training costs and so forth. If a design is worse on all qualities than another design, it does not need to be considered further (it is dominated by the other design). The remaining designs are on the Pareto front, where each design is better than each other design on at least one quality and worse on at least another quality.

Tradeoffs among multiple design solutions along two dimensions (cost and error). Gray solutions are all dominated by others that are better both in terms of cost and error (e.g., solution D has worse error and worse cost than solution A). The remaining black solutions are each better than another solution on one dimension but worse on another. They are all Pareto optimal, and which solution to pick depends on the relative importance of the dimensions.
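Dominance and the Pareto front can be computed directly for small design spaces. In the sketch below, the solution names echo the figure but the cost and error values are invented, lower is better on both dimensions, and the dominance test uses the common definition (no worse on every dimension and strictly better on at least one), which is slightly broader than being worse on all qualities.

```python
def dominates(a, b):
    """True if a is no worse than b on every dimension and strictly better on at least one.

    Both arguments are tuples of qualities where lower values are better.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(designs):
    """Keep only designs that are not dominated by any other design."""
    return {name: q for name, q in designs.items()
            if not any(dominates(other, q) for other in designs.values())}

# Hypothetical solutions as (cost, error)
designs = {
    "A": (1.0, 0.10),
    "B": (2.0, 0.05),
    "C": (4.0, 0.02),
    "D": (1.5, 0.12),  # dominated by A: higher cost and higher error
    "E": (3.0, 0.06),  # dominated by B
}

front = pareto_front(designs)
print(sorted(front))
```

The quadratic comparison of all pairs is fine for a handful of candidate designs; the interesting work in practice is obtaining trustworthy estimates for each quality in the first place.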

Which design to choose on a Pareto front depends on the relative importance of the involved qualities: a designer must now find a compromise. This could be optimizing for a single quality (e.g., prediction accuracy) or balancing between multiple qualities. If the relative importance or a utility function for the different qualities is known, one could identify the sweet spot mathematically, though in practice making such tradeoff decisions typically involves negotiating between different stakeholders and engineering judgment. For example, is explainability or a marginal improvement in accuracy in the fraud detection scenario worth a small increase in the cost per prediction? Different team members may have different opinions and need to negotiate an agreement. Again, making the tradeoffs apparent will help to foster such negotiation and help identify which qualities are in conflict and which qualities are more important for the product overall.
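If stakeholders could agree on weights, picking a design would reduce to minimizing a weighted sum. The following sketch assumes a linear utility over two qualities with hypothetical designs, units, and weights; real utility functions are rarely this simple, and agreeing on the weights is usually the hard part.

```python
# Hypothetical Pareto-optimal designs: (cost per 1000 predictions in dollars, error rate in percent)
designs = {
    "A": (1.0, 10.0),
    "B": (2.0, 5.0),
    "C": (4.0, 2.0),
}

def weighted_penalty(qualities, cost_weight, error_weight):
    """Linear penalty over cost and error; lower is better."""
    cost, error = qualities
    return cost_weight * cost + error_weight * error

def choose(designs, cost_weight, error_weight):
    """Return the name of the design with the lowest weighted penalty."""
    return min(designs, key=lambda name: weighted_penalty(designs[name], cost_weight, error_weight))

# Different stakeholders, different weights, different "best" design
print(choose(designs, cost_weight=1.0, error_weight=0.1))  # cost matters most
print(choose(designs, cost_weight=1.0, error_weight=0.3))  # balanced
print(choose(designs, cost_weight=1.0, error_weight=1.0))  # errors matter most
```

Each of the three Pareto-optimal designs wins under one of the weightings, which makes explicit that the disagreement is about priorities rather than about the designs themselves.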

While the general idea of tradeoffs is straightforward and may seem somewhat simplistic, these are real concerns in production machine-learning projects. For example, many submissions to Netflix’s famous competition for the best movie recommendation algorithms produced excellent results, but Netflix engineers stated: “We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.”

Negotiating Model Quality Requirements

Even if system requirements have been elicited carefully, identifying quality requirements for ML components during the design process is challenging and typically requires negotiation between different stakeholders regarding responsibilities of ML and non-ML components. Even worse, in practice, system requirements are often not fully known and are themselves negotiable to some degree, depending on the capabilities provided by ML components (see also chapter Data Science and Software Engineering Process Models).

Model requirements checklist. The following checklist can help to make sure the most important qualities are covered when considering ML components in a system:

  • Set minimum accuracy expectations, if possible. In many cases there is an existing baseline below which the model does not make useful contributions to the system. Needs for robustness and calibration of a model in the context of the system should also be considered.
  • Identify runtime needs at inference time for the model. This may involve estimating the number of predictions needed in the production system, latency requirements, and a cost budget for operating the inference service. The deployment architecture (see next chapter) will influence needs here significantly and will conversely be informed by achievable qualities of a model, hence may require negotiation between multiple teams.
  • Identify evolution needs for the model, that is, how often the model will need to be updated and what latency is needed for those updates. Again, different system designs can impose different model quality requirements and there are opportunities to negotiate different system designs. Understanding the budget for training and experimentation will help to make informed design decisions at the model level. Evolution needs and the amount of drift to be expected also strongly influence observability needs for the model, the ML pipeline, and the system as a whole.
  • Identify explainability needs for the model in the system. The system design and user interface may impose specific requirements on explainability or explainability may provide opportunities for different designs.
  • Identify protected characteristics and fairness concerns in the system, how the model relates to them, and what level of assurance or auditing will be needed. This may impose restrictions for model design and selecting machine-learning techniques.
  • Identify how security and privacy concerns in the system relate to the model, including both legal and ethical concerns. This may impose constraints on what data can be used and how it can be used, or what machine-learning techniques can be applied to not leak characteristics of private training data.
  • Understand what data is available (quantity, quality, formats, provenance), which may inform what machine-learning techniques are feasible. This may also conversely inform other stakeholders whether more data needs to be collected to build a desired model.
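
The runtime-needs item in this checklist often starts as back-of-the-envelope arithmetic. The sketch below estimates a monthly inference cost and a rough number of serving instances for the fraud detection scenario. All numbers, function names, and parameters are invented for illustration, and the instance estimate relies on Little's law (requests in flight = arrival rate × latency) under the simplifying assumption of a steady load.

```python
import math


def monthly_inference_cost(avg_transactions_per_second: float,
                           cost_per_1000_predictions: float,
                           days: int = 30) -> float:
    """Estimated serving cost for one month at a steady average load."""
    predictions = avg_transactions_per_second * 60 * 60 * 24 * days
    return predictions / 1000 * cost_per_1000_predictions


def instances_needed(peak_transactions_per_second: float,
                     latency_ms: float,
                     concurrent_requests_per_instance: int) -> int:
    """Rough instance count via Little's law: requests in flight = arrival rate x latency."""
    in_flight = peak_transactions_per_second * latency_ms / 1000
    return max(1, math.ceil(in_flight / concurrent_requests_per_instance))


# All numbers below are invented for illustration.
cost = monthly_inference_cost(avg_transactions_per_second=500,
                              cost_per_1000_predictions=0.02)
servers = instances_needed(peak_transactions_per_second=2000,
                           latency_ms=50,
                           concurrent_requests_per_instance=4)
print(f"Estimated monthly inference cost: ${cost:,.0f}")
print(f"Instances needed at peak: {servers}")
```

Estimates like these give the teams concrete numbers to negotiate over, for example whether a slower but more accurate model still fits within the fee charged per analyzed transaction.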

All these concerns involve not only the model as a single component, but require understanding how the model is situated within a system, interacts with other ML and non-ML components, and contributes to the overall system goals. When decomposing the system into components, the system requirements should be mapped to component requirements. In almost all cases, there is space for negotiation in tradeoff decisions where certain design decisions in the model influence other design decisions in the system or vice versa.

Views on system and model quality. A different way to approach identifying qualities is to explicitly consider alternative views of the system (see Siebert et al. for details). As discussed in the previous chapter, by adopting different views, stakeholders force themselves to consider different perspectives and abstractions. This strategy can help to zoom in on specific concerns and in aggregate look at the system and its components more holistically:

  • In a model view, consider direct quality expectations on the model, such as accuracy and explainability.
  • In a data view, consider the availability, quantity and quality of data.
  • In a system view, consider goals and requirements for the entire system from the perspective of the provider and end users; specifically consider how the software interacts with the environment (see the world and machine chapter).
  • In an infrastructure view, consider the infrastructure needed to operate the system, including training cost, reproducibility needs, infrastructure for model serving, and monitoring needs.
  • In an environment/societal view, consider how the system interacts with users and the society at large, including considering possible safety and fairness concerns.

As with all requirements engineering and system design, this process is not easy and will go through multiple iterations. Typically this involves interviewing stakeholders (e.g., customers, operators, developers, data scientists, business experts, legal experts) to understand the problems and the needs for the system and the model. The needs should be collected, documented, and, as conflicts emerge, discussed and prioritized.

As usual in software engineering, if qualities are not discussed early and design decisions are not made deliberately, there is a high risk of making fundamental early design decisions that are difficult to correct later. As discussed in the previous two chapters, deliberate design can reduce the cost for rework and enable system qualities.

Ideally (though currently rarely happening in practice), identified and negotiated requirements are explicitly documented to serve as a contract between teams. When teams cannot deliver components according to those contracts (or when system requirements change), those contracts may need to be renegotiated.
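
One lightweight way to make such a contract checkable is to record the negotiated thresholds as data and compare them against measured values. The following is only a sketch under stated assumptions: the `ModelContract` fields, the chosen qualities, and all thresholds and measurements are hypothetical, and many requirements (such as explainability or fairness) do not reduce to a single number this easily.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelContract:
    """Negotiated requirements for a model; fields and thresholds are hypothetical."""
    min_recall: float
    max_latency_ms: float
    max_cost_per_1000_predictions: float


def violations(contract: ModelContract, measured: dict[str, float]) -> list[str]:
    """List every requirement that the measured values fail to meet."""
    problems = []
    if measured["recall"] < contract.min_recall:
        problems.append(
            f"recall {measured['recall']} is below the agreed {contract.min_recall}")
    if measured["latency_ms"] > contract.max_latency_ms:
        problems.append(
            f"latency {measured['latency_ms']} ms exceeds the agreed {contract.max_latency_ms} ms")
    if measured["cost_per_1000_predictions"] > contract.max_cost_per_1000_predictions:
        problems.append(
            f"cost {measured['cost_per_1000_predictions']} exceeds the agreed "
            f"{contract.max_cost_per_1000_predictions} per 1000 predictions")
    return problems


# Invented thresholds and measurements for illustration.
contract = ModelContract(min_recall=0.90, max_latency_ms=100,
                         max_cost_per_1000_predictions=0.05)
measured = {"recall": 0.93, "latency_ms": 140, "cost_per_1000_predictions": 0.04}

for problem in violations(contract, measured):
    print("Needs renegotiation:", problem)
```

A non-empty list is a prompt for the teams to talk, either to improve the component or to renegotiate the contract.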

Summary

When designing a system with ML components, desired system qualities and functionalities inform quality requirements for the individual components of the system, including machine-learned models, machine-learning pipelines, and monitoring infrastructure.

Data scientists face a large number of design decisions when training a model for a specific prediction problem, and these decisions influence various qualities, such as prediction accuracy, inference latency, model size, cost per prediction, explainability, and training cost. When designing a production system, it is usually necessary to pay attention to many qualities, not just the accuracy of the model. Various stakeholders, including software engineers and data scientists, typically have flexibility in negotiating requirements, and the language of design space exploration (constraints, tradeoffs) can help to identify requirements and facilitate negotiation between different stakeholders about component responsibilities.

Further reading

  • Classic text on the meaning of quality generally: Garvin, David A. “What Does Product Quality Really Mean?” Sloan Management Review 25 (1984).
  • A discussion of the role of requirements engineering in identifying relevant qualities for the system and model: Vogelsang, Andreas, and Markus Borg. “Requirements Engineering for Machine Learning: Perspectives from Data Scientists.” In Proc. of the 6th International Workshop on Artificial Intelligence for Requirements Engineering (AIRE), 2019.
  • A discussion of qualities from different views on a production machine learning system: Siebert, Julien, Lisa Joeckel, Jens Heidrich, Koji Nakamichi, Kyoko Ohashi, Isao Namba, Rieko Yamamoto, and Mikio Aoyama. “Towards Guidelines for Assessing Qualities of Machine Learning Systems.” In International Conference on the Quality of Information and Communications Technology, pp. 17–31. Springer, Cham, 2020.
  • An argument to consider architectural quality requirements and to particularly focus on observability as a much more important quality in ML-enabled systems: Lewis, Grace A., Ipek Ozkaya, and Xiwei Xu. “Software Architecture Challenges for ML Systems.” In 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 634–638. IEEE, 2021.
  • Discussion of energy use in deep learning and its implications: Strubell, Emma, Ananya Ganesh, and Andrew McCallum. “Energy and Policy Considerations for Deep Learning in NLP.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3645–3650. 2019.
  • Many books describe how various machine learning algorithms work, the properties they have, and the qualities they prioritize. For example, Géron, Aurélien. “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow.” 2nd Edition, O’Reilly, 2019, has several practical and hands-on discussions.

As with all chapters, this text is released under Creative Commons 4.0 BY-SA license.
