Setting and Measuring Goals for Machine Learning Projects

Christian Kästner
22 min read · Sep 2, 2022

This post covers some content from the “Goals and Success Measures” lectures of our Machine Learning in Production course. For other chapters, see the table of contents.

Projects with a strong emphasis on machine learning often focus on optimizing ML models for accuracy. However, when building production systems, the machine-learning components contribute to the overall goals of the system. To build products successfully, it is important to understand both the goals of the entire system and the goals of the model within the system. Ideally, we define goals in measurable terms, so that we can assess whether we achieve them, or at least whether we make progress toward them. In this chapter, we discuss how to set goals at different levels and how to define and evaluate measures.

Running Example: Self-help legal chatbot

Consider that you are building a business offering marketing services to lawyers and law firms. Specifically, you offer tools that your customers can integrate into their websites to attract clients. Aside from more traditional mechanisms like social-media campaigns, question-and-answer sites, and traffic analysis, you plan to provide a chatbot where potential clients can ask questions over a text chat and receive initial pointers for their problems, such as finding forms for filing for a divorce or answering questions about child-custody rules. The chatbot may answer some questions directly; otherwise, it asks for contact information to connect the potential client with a lawyer and may also schedule a meeting. You already have a version of this with human operators, but operators are expensive and you want to automate the chatbot. Rather than an old-fashioned structured chatbot that follows a flowchart of preconfigured text choices, you want to offer a modern chatbot that uses natural-language technology to understand the user’s question and provides answers from a knowledge database.

You plan to build this on top of one of the many available commercial frameworks for chatbots, using training data from the past version with human operators. Tasks include understanding what users talk about and guiding conversations with follow-up questions and answers.

Example of a chatbot trying to engage with a user.

Setting Goals

Setting clear and understandable goals can help to frame the direction of a project and bring all team members working on different parts of the project together under a common vision. In many projects, goals can be implicit or not clear to all team members, so that some members might focus only on their local subproblem, such as optimizing accuracy of a model, without considering the broader context. When ideas for products emerge from new machine-learning innovations, such as looking for new applications of smart chatbots, there is a risk that the team gets carried away by the excitement about the technology and never steps back to think about the goals of the product they are building around the model.

Technically, goals are prescriptive statements about intent. Usually achieving goals requires the cooperation of multiple agents, where agents could be humans, various hardware components, and existing and new software components. Goals are usually general enough to be understood by a wide range of stakeholders, including the team members responsible for different components, but also customers, regulators, and other interested parties. It is this interconnected nature of goals that makes understanding goals important to achieve the right outcome and coordinate the various actors in a meaningful way.

Establishing high-level project goals is usually one of the first steps in eliciting the requirements for the system, and they may be revisited regularly as requirements are collected, solutions are designed, or the system is observed in production. Goals are also useful for the design process: when decomposing a system and assigning responsibilities to components, we can identify component goals and ensure they align with the overall system goals. Goals often provide a rationale for more specific technical requirements and for design decisions. Goals also provide initial guidance on how to evaluate the success of the system, in terms of measuring to what degree we achieve them. For example, communicating clear goals of the self-help legal chatbot to the data scientist working on a model provides context about what model capabilities and qualities are important and how they support the system’s users and the organization developing the system.

Goals at different levels

Goals can be discussed at many levels and untangling different goals can help to understand the purpose of a system better. When asked what the goal of a software system is, developers often give answers in terms of services their software offers to users, usually helping users with some task or automating some tasks — for example, our legal chatbot tries to answer legal questions. When zooming out though, much software is built with the business goal of making money, now or later, whether through licenses, subscriptions, or advertisement — the legal chatbot is licensed to lawyers and the lawyers hope to attract clients.

To untangle different goals, it is useful to question goals at different levels and to discuss how different goals relate to each other. We find Hulten’s classification of goals useful (and recommend the full discussion in Chapter 4 of Building Intelligent Systems):

Organizational goals: The most general goals are usually at the level of the organization building the software system. Outside of nonprofits, organizational goals almost always relate to money: revenue, profit, or stock price. For nonprofit organizations, they may relate to the specific mission of the organization, such as increasing animal welfare, reducing CO2 emissions, improving social justice, or curing diseases. The company in our chatbot scenario is a for-profit enterprise pursuing short-term or long-term profits by licensing marketing services.

Since organizational objectives are high-level and long-term, and therefore difficult to measure in the short run, leading indicators are often used as more readily observable proxy measures that are expected to correlate with future organizational success. For example, in the chatbot scenario, the number of lawyers licensing the chatbot is a good proxy for expected quarterly profits, the ratio of new to canceled licenses provides insights into revenue trends, and referrals and customer satisfaction are potential indicators of future trends in license sales. In many organizations, leading indicators are also framed as key performance indicators.
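To make this concrete, such leading indicators could be computed routinely from a license database. The following is a minimal sketch under assumed data; the record structure, field names, and the specific indicators are illustrative assumptions, not part of any real system.

```python
from datetime import date

# Hypothetical license records; fields and structure are assumptions for illustration.
licenses = [
    {"lawyer": "Smith & Partner", "start": date(2022, 1, 5), "canceled": None, "referred_by": None},
    {"lawyer": "Jones Law", "start": date(2022, 6, 1), "canceled": date(2022, 8, 1), "referred_by": "Smith & Partner"},
]

def leading_indicators(licenses, period_start, period_end):
    new = [l for l in licenses if period_start <= l["start"] <= period_end]
    canceled = [l for l in licenses if l["canceled"] and period_start <= l["canceled"] <= period_end]
    active = [l for l in licenses if l["start"] <= period_end and (l["canceled"] is None or l["canceled"] > period_end)]
    return {
        "active_licenses": len(active),                      # proxy for current revenue
        "new_minus_canceled": len(new) - len(canceled),      # trend in license sales
        "referral_share": sum(1 for l in new if l["referred_by"]) / max(len(new), 1),  # rough indicator of customer satisfaction
    }

print(leading_indicators(licenses, date(2022, 6, 1), date(2022, 9, 30)))
```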

System goals: When building a system, we can usually articulate the goals of the system in terms of concrete outcomes the system should produce. For example, the self-help legal chatbot has the goals of promoting individual lawyers through modern and helpful web pages and helping lawyers connect with potential clients, but also the practical goal of quickly and easily helping potential clients with simple legal questions. System goals describe what the system tries to achieve in terms of behavior or quality.

User goals: Users typically use a software system with a specific goal. In some cases, like the chatbot example, we have different kinds of users: On the one hand, lawyers are users who license the chatbot to attract new clients. On the other hand, clients asking legal questions are also users of the system, hoping to get legal advice. We can attempt to measure how well the system serves its users, such as the number of leads generated or the number of clients who indicate that the bot answered their question sufficiently. We can also explore users’ goals with regard to specific features of the system, for example, to what degree the system is effective at automating simple tasks such as preparing a filing for a neighborhood dispute in small-claims court.

In addition to direct users, there are often also people who are indirectly affected by the system or who have expectations of the system. In our chatbot example, this might include judges and defendants who may face arguments supported by increased legal chatbot use, but also regulators who might be concerned about who can give legal advice. Understanding their goals will also help us when eliciting more detailed requirements for the system, as we will discuss in the next chapter.

Model goals: From the perspective of a machine-learned model, the goal is almost always to optimize the accuracy of predictions. Model quality can be measured offline with test data (as we will discuss in chapter Model quality: Measuring prediction accuracy), but we can also typically approximate accuracy in a production setting by observing telemetry (see chapter Quality assurance in production). In our chatbot scenario, we may try to measure to what degree our natural-language-processing components correctly understand a client’s question and answer it sensibly.
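For instance, offline accuracy of the chatbot’s intent-classification component could be measured on held-out test data roughly as in the following sketch; the tiny training and test sets, the intent labels, and the deliberately simple stand-in model are all illustrative assumptions, not part of any specific chatbot framework.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Tiny made-up training and test sets for the chatbot's intent classifier (all messages and labels are illustrative).
train_messages = ["I want a divorce", "divorce paperwork", "custody of my kids", "child custody schedule"]
train_intents = ["divorce_filing", "divorce_filing", "child_custody", "child_custody"]
test_messages = ["how do I file for divorce", "what are the custody rules"]
expected_intents = ["divorce_filing", "child_custody"]

# A deliberately simple stand-in model; a real chatbot would use a much more capable NLP component.
intent_model = make_pipeline(CountVectorizer(), LogisticRegression())
intent_model.fit(train_messages, train_intents)

# The model goal is operationalized as prediction accuracy on held-out test data.
predicted_intents = intent_model.predict(test_messages)
print("offline accuracy:", accuracy_score(expected_intents, predicted_intents))
```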

Relationships between goals

Goals at the various levels are usually not independent. Satisfied users tend to be returning customers and might recommend our product to others, thus helping with profits. If system and user goals align, a system that better meets its goals may make users happier, and users may be more willing to cooperate with the system (e.g., react to prompts). Better models hopefully make our users happier or contribute in various ways to the system achieving its goals. In our chatbot scenario, we hope that better natural-language models lead to a better chat experience, leading more potential clients to interact with the system, leading to more client connections for lawyers, making the lawyers happy, who then renew their licenses, and so on.

Different kinds of goals often support each other, but they do not always align.

Notice that user goals, model goals, system goals, and organizational goals do not always align. It is unlikely that a 5 percent improvement in model accuracy translates directly into a 5 percent improvement in user satisfaction and a 5 percent improvement in profits. In chapter From Models to Systems, we have already seen an example of such a conflict from booking.com, where improved models in many experiments did not translate into improved hotel bookings (the leading indicator for the organization’s goal). In the chatbot example, this potential conflict is even more obvious: More advanced natural-language capabilities and legal knowledge of the model may allow more legal questions to be answered without involving a lawyer, making clients seeking legal advice happy, but potentially reducing the lawyers’ satisfaction with the chatbot as fewer clients contract their services. It may be perfectly satisfactory to provide basic capabilities without accurately handling corner cases, since it is acceptable to fail, indicate that the question is too complicated for self-help, and connect the client to a lawyer. In many cases like this, a good enough model may just be good enough.

It is usually a good idea to clearly identify goals at all levels and understand how they relate to each other. It is particularly important to contextualize the model goals in the context of the goals of the users and the organization. Balancing conflicting goals may require some deliberation and negotiation that are normal during requirements engineering, as we will explore in the next chapter. Identifying these conflicts in the first place is valuable because it allows explicit discussions and design toward their resolution.

Requirements engineers have pushed analysis of goals far beyond what we can describe here. For example, there are several notations for goal modeling, to describe goals (at different levels and of different importance) and their relationships (various forms of support and conflict and alternatives), and there are formal processes of goal refinement that explicitly relate goals to each other, down to fine-grained requirements. We recommend exploring the relevant literature and working with a requirements-engineering expert for projects that want to explore goals and requirements in more depth.

Measurement in a Nutshell

Goals can be effective controls to steer a project if they are measurable, allowing us to assess to what extent we have achieved them or to what extent new functionality, such as that offered by a machine-learning component, contributes to our organizational goals.

Measurement is important not only for goals, but for all kinds of activities throughout the entire development process. We will discuss measurement in the context of many topics throughout this book, including establishing and evaluating quality requirements and discussing design alternatives (chapter Quality Attributes of ML Components), evaluating model accuracy (chapter Model Quality), monitoring system quality (chapters Planning for Operations and Quality Assurance in Production), assessing fairness (chapter Fairness), and monitoring development progress (chapter Data Science and Software Engineering Process Models). Hence, in the remainder of this chapter, we briefly dive a little deeper into measurement: how to design measures, common pitfalls, and how to evaluate measures.

Everything is measurable

In its simplest form, measurement is simply the assignment of numbers to attributes of objects or events by some rule. More practically, we perform measurements to learn something about objects or events with the intention of making some decision. Hence Douglas Hubbard defines measurement as “a quantitatively expressed reduction of uncertainty based on one or more observations.”

In his book “How to Measure Anything,” Hubbard makes the argument that everything we care about enough to consider in decisions is measurable in some form, even if it is generally considered “intangible,” though not every measurement is economical or precise. The argument essentially goes as follows: (1) If we care about a property, then it must be detectable. This includes properties like quality, risk, and security, because we care about achieving some outcomes over others. (2) If it is detectable at all, even if just partially, then there must be some way of distinguishing better from worse, hence we can assign numbers. These numbers are not always precise, but they give us additional information that reduces our uncertainty, which helps us make better decisions.

Typically, we can invest more effort into getting better measures. For example, when deciding which candidate to hire to develop the chatbot, we can rely on easy-to-collect information such as college grades or a list of past jobs, but we can also invest more effort by asking experts to judge examples of their past work, by asking candidates to solve nontrivial sample tasks, possibly over extended observation periods, or even by hiring them for an extended try-out period. With more investment into measurement, we can typically improve our measures, which reduces uncertainty in decisions, which allows us to make better decisions. In the end, how much to invest in measurement depends on the payoff expected from making better decisions. For example, making better hiring decisions can have substantial benefits, hence we might invest more in evaluating candidates than we would in measuring restaurant quality when deciding where to have dinner tonight.

In software engineering and data science, measurement is pervasive to support decision making. For example, when deciding which project to fund, we might measure each project’s risk and potential; when deciding when to stop testing, we might measure how many bugs we have found or how much code we have covered already; when deciding which model is better, we measure prediction accuracy on test data or in production.

On terminology. Quantification is the process of turning observations into numbers — it underlies all measurement. A measure and a metric both refer to a method or standard format for measuring something, such as the percentage of false positive predictions of a classifier or the number of lines of code written per week. The terms measure and metric are often used interchangeably, and we will just use the term measure throughout this book, though some authors make distinctions, such as metrics being derived from multiple measures or metrics being standardized measures. Finally, operationalization refers to identifying and implementing a method to measure some factor, for example, identifying false positive predictions from log files or identifying changed and added lines per developer from a version control system.

Defining measures

Throughout the entire development lifecycle, we routinely use lots of measures. For many tasks, well-accepted measures already exist, such as measuring the precision of a classifier, measuring network latency, or measuring company profits. However, it is equally common to define custom measures or custom ways of operationalizing measures for a project, such as measuring the number of client requests answered satisfactorily by the chatbot or measuring the satisfaction of lawyers using the chatbot as observed through a survey. Beyond goal setting, we will particularly see the need to become creative in designing measures when evaluating models in production, as we will discuss in chapter Quality Assurance in Production.

Stating measures precisely. In general, it is a good practice to describe measures precisely to avoid ambiguity. This is important for goal setting and especially for communicating assumptions and guarantees across teams, such as communicating the quality of a model to the team that integrates the model into the product. As a rule of thumb, imagine a dispute where you need to argue in front of a judge that you achieve a certain goal for the measure and where somebody else is tasked to independently reimplement the measure — you want the description of the measure to be precise enough to have reasonable confidence in these settings. Consider the following examples:

  • Instead of “measure accuracy” specify “measure accuracy with MAPE,” which refers to a well-defined existing measure (see also chapter Model quality: Measuring prediction accuracy).
  • Instead of “evaluate test quality” specify “measure branch coverage with Jacoco,” which uses a well-defined existing measure and even includes a specific measurement instrument (tool) to be used for the measurement.
  • Instead of “measure execution time” specify “average and 90%-quantile response time for the chatbot’s REST API under normal load,” which describes the conditions or experimental protocols under which the measure is collected.
  • Instead of “measure customer satisfaction” specify “report the response rate and average customer rating on a 5-point rating scale shown to 10% of all customers (randomly selected) after interacting with the chatbot for 10 messages,” which describes a reproducible measurement procedure.

Descriptions of measures will rarely be perfect and ambiguity free, but better descriptions are more precise. Relying on well-defined and commonly accepted standard measures where available is a good idea.
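As an illustration of the first bullet above, naming a well-defined measure like MAPE leaves little room for reinterpretation, because it maps to a simple, unambiguous computation. The following minimal sketch shows one way to implement it; the example numbers are made up.

```python
def mape(expected, predicted):
    """Mean absolute percentage error: the average of |expected - predicted| / |expected|.
    Assumes that no expected value is zero."""
    return sum(abs(e - p) / abs(e) for e, p in zip(expected, predicted)) / len(expected)

print(mape([100, 200, 50], [110, 190, 60]))  # about 0.117, i.e., roughly 12 percent average error
```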

Composing measures. Measures are often composed of other measures. Especially higher-level measures such as product quality, user satisfaction, or developer productivity are often multi-faceted and may consider many different observations that may be weighed in different ways. For example, “software maintainability” is notoriously difficult to measure, but it can be broken down into concepts such as “correctability,” “testability,” and “expandability,” for which it is then easier to find more concrete ways of defining measures, such as measuring testability as the amount of effort needed to achieve statement coverage.

Example of developing a measure for code maintainability from lower-level measures.
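As a sketch of such a composition, a maintainability score could be computed as a weighted combination of lower-level measures, mirroring the figure above. The sub-measures, normalizations, and weights below are illustrative assumptions, not an established standard.

```python
# Hypothetical lower-level measurements for one module, each normalized to [0, 1] (values are made up).
sub_measures = {
    "correctability": 0.7,  # e.g., derived from the average time needed to fix a defect
    "testability": 0.5,     # e.g., derived from the effort needed to achieve statement coverage
    "expandability": 0.8,   # e.g., derived from the effort of past feature additions
}

# Weights expressing the relative importance of each facet (an explicit, but arbitrary, design decision).
weights = {"correctability": 0.4, "testability": 0.3, "expandability": 0.3}

maintainability = sum(weights[name] * value for name, value in sub_measures.items())
print(f"composed maintainability score: {maintainability:.2f}")  # 0.67 for these example values
```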

When developing new measures, especially more complex composed ones, many researchers have found the Goal-Question-Metric approach useful: First, clearly identify the goal of the measure, then identify questions that can help answer whether the goal is achieved, and finally identify concrete measures that help answer the questions. The approach additionally encourages making stakeholders and context factors explicit. The key benefit of such a structured approach is that it avoids ad-hoc measures and a focus on what is easy to quantify; instead, it follows a top-down design that starts with a clear definition of the goal of the measure and then maintains a clear mapping of how specific measurement activities gather information that is actually meaningful toward that goal.
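For example, a GQM breakdown for our chatbot could look roughly like the following sketch; the goal, questions, and metrics are illustrative assumptions, and in practice they would be negotiated with stakeholders rather than copied from a template.

```python
# Illustrative Goal-Question-Metric breakdown for the legal chatbot (all entries are assumptions for illustration).
gqm = {
    "goal": "Understand whether the chatbot generates client leads for the lawyers licensing it",
    "questions": {
        "Do conversations end with a client-lawyer connection?": [
            "share of conversations ending with submitted contact information",
            "number of meetings scheduled through the chatbot per week",
        ],
        "Do clients get useful answers to simple legal questions?": [
            "share of conversations rated as helpful in an exit survey",
            "share of conversations abandoned before any answer is given",
        ],
    },
}
```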

Measuring. Once we have defined a measure, we can actually go out and measure things. To do this, we need to collect data and derive a value according to our measure from that data. Typically, actual measurement requires three ingredients:

  1. Measure: First, the measure itself, describing what we try to capture, as discussed above.
  2. Data collection: A way to collect data from the system, possibly changing the system to collect additional data.
  3. Operationalization: A mechanism of computing the measure from the data.

In some cases, data collection and operationalization are straightforward, because it is obvious from the measure what data needs to be collected and how the data is interpreted — for example, the number of lawyers currently licensing our software can be determined with a simple lookup in our license database, and to measure test quality in terms of branch coverage, standard tools like Jacoco exist and may even be mentioned in the description of the measure itself. In other cases, it is clear what should be measured, but we need to identify how to actually operationalize the measure with the data we have — for example, how to determine the response time of the chatbot from the server logs it produces, possibly changing the chatbot software to create more logs that we can analyze. Yet in other cases, we may need to get creative about what data we could collect and how we could operationalize it for a measure — for example, to measure customer satisfaction we may need to develop infrastructure to show a survey to customers, or we could approximate it from whether they abort interacting with the chatbot. We will discuss many examples of creative operationalization of measures when it comes to measuring model accuracy in production environments in chapter Quality Assurance in Production.
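As an example of such an operationalization, response times could be derived from server logs roughly as follows; the log format, with one request line and one response line per request id, is a hypothetical assumption about what the chatbot writes to its logs.

```python
import statistics
from datetime import datetime

# Hypothetical log lines in the format "<request-id> <request|response> <ISO timestamp>" (format is assumed).
log_lines = [
    "42 request 2022-09-02T10:00:00.000",
    "42 response 2022-09-02T10:00:00.450",
    "43 request 2022-09-02T10:00:02.000",
    "43 response 2022-09-02T10:00:03.100",
]

requests, responses = {}, {}
for line in log_lines:
    request_id, event, timestamp = line.split()
    (requests if event == "request" else responses)[request_id] = datetime.fromisoformat(timestamp)

# Operationalization: response time is the time between matching request and response log entries.
latencies = [(responses[rid] - requests[rid]).total_seconds() for rid in responses if rid in requests]
print("average response time (s):", statistics.mean(latencies))
print("90%-quantile response time (s):", statistics.quantiles(latencies, n=10)[-1])
```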

Evaluating the quality of a measure

It is easy to create new measures and operationalize them, but it is not always clear that the measure really expresses what we intend to measure and whether we get meaningful numbers.

Accuracy and precision. A useful distinction for reasoning about any measurement process is distinguishing between accuracy and precision (not to be confused with recall and precision in the context of evaluating model quality).

Similar to the accuracy of machine-learning predictions, the accuracy of a measurement process is concerned with how closely the measured values (on average) represent the real value we want to capture, often some quantity in the real world. For example, the accuracy of our measured chatbot subscriptions is evaluated in terms of how closely it represents the actual number of subscriptions, and the accuracy of a user-satisfaction measure is evaluated in terms of how well the measured values represent the actual satisfaction of our users.

In contrast, precision refers to how reliably a measurement process produces the same result (whether correct or not). That is, precision is a representation of measurement noise. For example, if we repeatedly count the number of subscriptions in a database, we will always get the same result, but if we repeatedly ask our users about their satisfaction we might observe some variations in the measured satisfaction.

Ideally, we want measures that are both accurate and precise, but that is not always the case. We can have measures that are imprecise but accurate: If we just repeat the measurement multiple times and average the observations, we get a pretty accurate result (assuming random noise). For example, most measures of execution time are influenced by random noise and background processes, but they average out across multiple measurements. In contrast, there can also be measures that produce very precise but inaccurate data, such as a clock that is running behind: It will consistently tell the wrong time, no matter how many measurements we average over. Precise but inaccurate measurements are dangerous, because they seem very confident (no noise) but are entirely wrong.

Visualization of the difference between accuracy and precision, showing for example how multiple results can be very close to each other (precise) but far away from the target (inaccurate). CC-BY-4.0 by Arbeck.
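To make the distinction tangible, the following small simulation contrasts an imprecise-but-accurate measure with a precise-but-inaccurate one; all numbers are made up for illustration.

```python
import random
import statistics

random.seed(1)
true_response_time = 0.50  # hypothetical true value, in seconds

# Imprecise but accurate: noisy measurements scattered around the true value; averaging recovers it.
noisy_measurements = [true_response_time + random.gauss(0, 0.05) for _ in range(1000)]

# Precise but inaccurate: perfectly repeatable measurements with a systematic offset (like a clock running behind).
biased_measurements = [true_response_time + 0.2 for _ in range(1000)]

print("mean of noisy measurements:", round(statistics.mean(noisy_measurements), 3))    # close to 0.5
print("mean of biased measurements:", round(statistics.mean(biased_measurements), 3))  # stays at 0.7
```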

In measurement, we are generally concerned about both inaccuracy and imprecision (noise), and we need to address these concerns quite differently:

  • Imprecision is usually easier to identify and handle, because we can see noise in measurements and can use statistics to handle it. Even when we do not have multiple observations for a single data point, noise will often average out over time — for example, if random measurement noise makes some answers to chat messages appear a bit faster and others a bit slower, the noise will not affect our overall observation of the system’s response time much.
  • Inaccuracy, in contrast, is much more challenging to detect and handle, because it represents a systematic problem in our measures that cannot be detected by statistical means. For example, if we mark expired chatbot subscriptions as expired in the database but still accidentally count them when computing the number of lawyers subscribing to the chatbot, we will get a perfectly repeatable (precise) measure that reports too many subscriptions and too much income, until the discrepancy is eventually noticed, if ever. To detect inaccuracy in a data-generation process, we need to systematically look for problems and biases in the measurement process or somehow have access to the true value to be represented (see the small sketch after this list).
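The following minimal sketch mimics the subscription-counting bug from the example above; the record structure and the "expired" field are assumptions for illustration. The buggy count is perfectly repeatable, so no amount of averaging would reveal the problem.

```python
from datetime import date

# Hypothetical subscription records (structure is assumed for illustration).
subscriptions = [
    {"lawyer": "Smith & Partner", "expired": None},
    {"lawyer": "Jones Law", "expired": date(2022, 7, 1)},
]

# Inaccurate but precise operationalization: counts every record, including expired subscriptions.
subscriber_count_buggy = len(subscriptions)

# More accurate operationalization: counts only subscriptions that have not expired.
subscriber_count = sum(1 for s in subscriptions if s["expired"] is None)

print(subscriber_count_buggy, subscriber_count)  # 2 vs. 1: a systematic error invisible to statistics
```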

In a software engineering context, measurement conducted through automated computation, without relying on human or sensor inputs, is almost always highly precise (repeatedly running the same algorithm on the same inputs produces the same outputs), but it may be inaccurate if there are bugs in the data collection or operationalization.

Validity. Finally, for new measures, it is worth evaluating measurement validity. As an absolute minimum, the developers of the measure should plot the distribution of observations and sample and manually inspect some results to ensure that they make sense. If a measure is important, validity evaluations can go much further and follow structured evaluation procedures (e.g., as suggested in the excellent paper “Software engineering metrics: What do they measure and how do we know”). Typically, validity evaluations ask at least three kinds of validity questions:

  • Construct validity: Are we measuring what we intend to measure? Does the abstract concept match the specific scales and operationalizations used?
  • Predictive validity: Does the measure have the ability to (partially) explain a quality that we care about? Does it actually provide meaningful information to reduce uncertainty in the decision we want to make?
  • External validity: Do the measure and its operationalization generalize beyond the specific observations with which they were initially developed?
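A minimal version of the sanity check described above could look roughly like the following sketch; the file name and column names are assumptions about what the collected measurements might look like.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical per-conversation measurements for a new measure (file and column names are assumptions).
conversations = pd.read_csv("conversation_measures.csv")  # assumed columns: conversation_id, helpfulness_score

# Sanity check 1: plot the distribution of the measure and check that it looks plausible.
conversations["helpfulness_score"].hist(bins=20)
plt.xlabel("helpfulness score")
plt.ylabel("number of conversations")
plt.show()

# Sanity check 2: manually inspect a random sample of measured conversations.
print(conversations.sample(10))
```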

Pitfall: The streetlight effect

For many decisions and goals, it can be difficult to come up with good measures that are economical. When creating new measures and setting goals, it is tempting to focus on characteristics that are easy to measure, for example, because measures already exist. As a consequence, we often use cheap-to-collect proxy measures that correlate only poorly with a hard-to-measure goal, such as using the number of messages clients exchange with the chatbot (easy to measure from logs) as a proxy for the clients’ satisfaction with the chatbot.

The temptation to use convenient proxies for difficult-to-measure concepts is known as the streetlight effect, a common observational bias. The name originates from the anecdote of a drunkard looking for his keys under a streetlight rather than where he lost them a block away, because “this is where the light is.” Creating good measures can require effort and cost, but it may be worth it to enable better decisions than are possible with ad-hoc measures based on data we already have.

Comic illustrating the streetlight effect: Focusing attention on aspects that are easy to observe.

Pitfall: Measures used as incentives

Measures can be powerful for making better decisions, observing improvements in a system, and setting goals, but there is always a danger of optimizing for a measure that only partially represents a goal. In practice, most measures only approximate the actual goal and can be gamed in ways that optimize the measure without necessarily meeting the goal. For example, it may be a reasonable approximation to measure the number of bugs fixed in software as an indicator of good testing practices, but if developers were evaluated by the number of bugs fixed, they might decide to game the measure by intentionally introducing bugs that they can subsequently fix. Humans and machines are generally good at finding loopholes and optimizing for measures if they set their mind to it. This is often referred to as Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure.”

In the context of machine learning, this problem is often discussed as the alignment problem, where the system optimizes for a specific fitness function (the measure) that may not fully align with the goals of the system designer. For example, a common science fiction trope, as in the Terminator movies, is a machine that decides to achieve its goal of guarding peace by trying to kill all humans. We will return to this issue in the later chapter Safety. The issue also comes up when discussing whether explanations provided for predictions from machine-learned models just invite users to game the system, as we will discuss in chapter Interpretability and explainability.

Beyond intentional attempts to game measures, many studies have shown that providing incentives based on measures as extrinsic motivation (e.g., bonus payments based on some measurement outcome) is problematic for creative jobs: it reduces intrinsic motivation, can become addictive, and encourages cheating and short-term thinking. Setting goals and defining measures is useful to set a team on a joint path and foster communication, but avoid using measures as incentives.

Summary

To design a software product and its machine learning components, it is a good idea to start with understanding the goals of the system, including the goals of the organization building the system, the goals of users and other stakeholders, and the goals of the ML and non-ML components that contribute to the system goals. In setting goals, providing measures that help us evaluate to what degree goals are met or whether we are making progress toward those goals is important, but can be challenging.

In general, measurement is important for many activities when building software systems. Even seemingly intangible properties can be measured (to some degree) with the right measure. Some measures are standard and broadly accepted, but in many cases we may define, operationalize, and validate our own. Good measures are concrete, accurate, and precise and fit the purpose for which they are designed.

Further Reading

  • In-depth analysis of the chatbot scenario, which comes from a real project observed in this excellent paper. It discusses the various negotiations of goals and requirements that go into building a product around a nontrivial machine-learning problem: 🗎 Passi, S., & Sengers, P. (2020). Making data science systems work. Big Data & Society, 7(2).
  • Book chapter discussing goal setting for machine-learning components, including the distinction between organizational objectives, leading indicators, user goals, and model properties: 🕮 Hulten, Geoff. “Building Intelligent Systems: A Guide to Machine Learning Engineering.” (2018), Chapter 4 (Defining the Intelligent System’s Goals)
  • Textbook on requirements engineering with good coverage of goal-oriented requirements engineering and goal modeling: 🕮 Van Lamsweerde, Axel. Requirements Engineering: From System Goals to UML Models to Software. John Wiley & Sons, 2009.
  • A concrete example of using goal modeling for developing ML solutions, with extensions to capture uncertainty: 🗎 Ishikawa, Fuyuki, and Yutaka Matsuno. “Evidence-driven Requirements Engineering for Uncertainty of Machine Learning-based Systems.” In 2020 IEEE 28th International Requirements Engineering Conference (RE), pp. 346–351. IEEE, 2020.
  • Example of a real project where model quality and leading indicators for organizational objectives often surprisingly do not align: 🗎 Bernardi, Lucas, Themistoklis Mavridis, and Pablo Estevez. “150 successful machine learning models: 6 lessons learned at Booking.com.” In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1743–1751. 2019.
  • Classic text on how measurement is uncertainty reduction for decision making and how to design measures for seemingly intangible qualities: 🕮 Hubbard, Douglas W. How to measure anything: Finding the value of intangibles in business. John Wiley & Sons, 2014.
  • A brief introduction to the goal-question-metric approach: 🗎 Basili, Victor R., Gianluigi Caldiera, and H. Dieter Rombach. “The goal question metric approach.” Encyclopedia of software engineering (1994): 528–532.
  • An in-depth discussion with a running example of validating (software) measures: 🗎 Kaner, Cem and Walter Bond. “Software engineering metrics: What do they measure and how do we know.” In International Software Metrics Symposium. 2004.
  • Popular book covering software metrics in depth: 🕮 Fenton, Norman, and James Bieman. Software metrics: a rigorous and practical approach. CRC Press, 2014.
  • Two popular science books with excellent discussions of the problematic effects of designing incentives based on measures as extrinsic motivators: 🕮 Pink, Daniel H. Drive: The surprising truth about what motivates us. Penguin, 2011. and 🕮 Kohn, Alfie. Punished by rewards: The trouble with gold stars, incentive plans, A’s, praise, and other bribes. HarperOne, 1993.

As with all chapters, this text is released under the Creative Commons 4.0 BY-SA license.
