Quality Assurance Basics

Christian Kästner
Nov 3, 2022


This post covers some content from the “Infrastructure Quality” lectures of our Machine Learning in Production course. For other chapters see the table of contents.

Quality assurance broadly refers to all activities that evaluate whether a (software) system meets the quality requirements. Typical activities include testing and code reviews, but there is a broad portfolio of different techniques that can be used. Typical qualities to be assured include functional correctness according to some specification, operational efficiency (e.g., execution time), and usefulness of the software for an end user. Quality can be assured for the system as a whole but also for individual parts of the system. Different approaches can assure quality with different levels of confidence; some can even make formal guarantees. In the following, we provide a very brief overview of the different quality assurance approaches common in software engineering, before subsequent chapters explore specifics of how these techniques can be used or need to be adjusted for machine-learning components in ML-enabled systems.

Quality assurance approaches for software can be roughly divided into three large groups:

  • Dynamic approaches that execute the software, including traditional testing and dynamic program analyses
  • Static approaches that analyze code, documentation, designs, or other software artifacts without executing them, including code review, code linting, and static data-flow analysis
  • Process evaluation approaches that analyze the process of how the software was produced rather than the software artifacts

Parts of quality assurance can often be automated, as we will discuss. For example, it is common to automatically execute large test suites and static analysis tools after every change to a code base. At the same time, some quality assurance approaches rely on manual steps or are entirely manual, such as asking experts to review design documents to find potential performance bottlenecks or security weaknesses.

Quality assurance often focuses on functional correctness, evaluating whether the software produces the expected outputs according to some specification. Beyond functional correctness, there are many other qualities of interest for components or the system as a whole, as discussed throughout the requirements engineering and architecture chapters, such as execution latency, scalability, robustness, safety, security, maintainability, and usability. Some assurance approaches are a better fit for some qualities than others: execution latency of individual units can be measured with automated tests, but usability may be hard to evaluate with automated tests and is often better evaluated in a lab study with users; maintainability may be best judged by humans reading the source code, whereas static analysis tools may formally prove the absence of SQL-injection vulnerabilities in the source code.

Testing

Testing is the most common quality assurance strategy for software. Testing executes the software with selected inputs in a controlled environment to check that the software behaves as expected — testing is a dynamic approach. Tests are commonly automated, but testing can also be performed entirely manually, for example testing a shopping cart function by instructing human testers to click through a website in a certain sequence to order a product.

What it means for software to “behave as expected” is flexible toward various qualities: Commonly, a test checks that the computed output corresponds to the expected output for a given input (testing functional correctness, according to some specification), but we can also test whether the software responds as quickly as expected (performance testing) or whether end users find it easy to navigate the user interface (usability testing), among many other qualities.

Test automation

To automate testing, tests must be written as code. Such a test will (a) call the software under test to execute it, (b) provide inputs for the execution, (c) control the environment if needed (e.g., initialize dependencies, provision an empty database), and (d) check whether the execution matches expectations, commonly by comparing the output against some expected values. In traditional software testing, this is often done with unit testing frameworks like JUnit or pytest. For example, the following code has three independent tests that check different aspects of a next_date function:

from system_under_test import next_date

def test_end_of_month():
    assert next_date(2011, 1, 31) == (2011, 2, 1)
    assert next_date(2011, 2, 28) == (2011, 3, 1)

def test_leap_year():
    assert next_date(2020, 2, 28) == (2020, 2, 29)
    assert next_date(2020, 2, 29) == (2020, 3, 1)
    assert next_date(2024, 2, 28) == (2024, 2, 29)
    assert next_date(2028, 2, 28) == (2028, 2, 29)

def test_not_leap_year():
    assert next_date(2021, 2, 28) == (2021, 3, 1)
    assert next_date(1900, 2, 28) == (1900, 3, 1)

The testing framework provides a common style for writing and executing tests. Tests could also have been written in the main function of a test application, but the testing framework enables developers to structure the tests and provides a lot of additional infrastructure for tracking and reporting results. Tests written in this style are executed by a test runner, a program that is part of the test framework and executes all tests and reports results (e.g., the tests above can be executed with pytest using python -m pytest test.py). Testing frameworks exist for pretty much every programming language and environment, often with many convenient libraries to write readable tests.

As mentioned, tests can check expected behavior in more ways than just comparing an output against hard-coded expected values. For example, the following (intentionally very simple) tests check that a function answers within a time limit and that a machine-learned model achieves prediction accuracy above a threshold (more on this later).

import time
# expensive_search_query, load_model, load_test_data, and evaluate are
# assumed to be provided by the system under test
def test_search_performance():
    start_time = time.time()
    expensive_search_query()
    end_time = time.time()
    assert (end_time - start_time) < 1  # second

def test_model_accuracy():
    model = load_model("model.tflite")
    (X, Y) = load_test_data("labeled_data.dat")
    accuracy = evaluate(model, X, Y)
    assert accuracy > 0.9

Continuous integration

Even if tests can be executed automatically, they provide little value if they are not actually executed regularly after changes. If tests are executed only after days or weeks of new development, problems are detected late and are harder to fix than if they had been addressed in a timely fashion.

Continuous integration is a very popular technique to automatically run tests whenever code is integrated or shared with other team members, typically whenever code is checked into a version control system. Continuous integration services can be set up locally, with tools like Jenkins, or using cloud infrastructure, such as CircleCI, GitHub Actions, and many others. Whenever code changes happen in the version control system, the continuous integration service will download that code, build and test it, and report the test results, for example, sending an email to the author when something breaks. In addition, the recent test results can be shown in an online dashboard and in the version control system, e.g., as badges in GitHub projects or build status flags in pull requests.

The key benefit of continuous integration is that developers cannot “forget” to run tests. If somebody checks in code that breaks tests, this is quickly noticed. Moreover, it is also common practice now to merge a code contribution only after the continuous integration service has reported passing tests.

In addition, with continuous integration, tests are always executed by an independent service, which ensures that the tests do not just pass on the developer’s machine (“works for me”) but also in an independent build. The continuous integration service helps to control the environment in which tests are executed. It also ensures that builds and tests are portable and that all dependencies are managed as part of the build.

Unit testing, integration testing, system testing

Testing is commonly distinguished by the granularity of the artifact to be tested. Unit tests focus on small units of functionality, such as a single function or a class. Integration tests analyze the interactions of multiple units, whether multiple functions together, components, or subsystems. Finally, system tests evaluate the entire system and test the system behavior from the perspective of an end user.

A unit test executes a single unit of code, an integration test executes the combination of multiple units, and a system test executes the entire system end to end.

Most developers focus most of their testing efforts at the level of unit tests, testing the inputs and outputs of individual functions, such as in the next_date tests above. Also in machine learning, most testing focuses on testing a single model in isolation. Unit tests are most easily automated.

Integration tests evaluate whether multiple units work together correctly. At the implementation level, automated integration tests look just like unit tests and use the same test frameworks; however, they may execute multiple pieces of functionality, passing results from one to the next and asserting that the eventual result is correct. Integration testing can also focus on correct error handling in the communication between two components. We will show further examples of integration tests in the context of testing that multiple steps of a machine learning pipeline work together in chapter ML Pipeline Quality.
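
For illustration, the following is a minimal sketch of an automated integration test in the style of the unit tests above; the search_engine module and its index_documents and search functions are hypothetical:

# Minimal sketch of an integration test (hypothetical module and functions):
# two components are exercised together, passing the result of one to the other.
from search_engine import index_documents, search  # hypothetical module

def test_index_and_search_together():
    index = index_documents(["the quick brown fox", "lazy dogs sleep"])
    results = search(index, "fox")
    assert results == [0]  # the first document (index 0) matches the query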

Finally, system tests tend to make sure that requirements are met from an end user’s perspective. They commonly follow use cases or user stories, with a user stepping through multiple steps of interacting with the system to achieve a goal, such as signing up for an account and buying a product recommended by the system, expecting to see a specific message and to receive a confirmation email (or an immediate reaction for certain inputs). In systems with a graphical user interface, libraries for GUI testing can be used to automate system-level tests, but such tests tend to be tedious to write and brittle. Many organizations perform system testing mostly manually or in production.
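
As an illustration of automating a system-level test through the GUI, the following sketch drives a web signup flow with the Selenium browser-automation library; the URL and element IDs are hypothetical placeholders:

# Minimal sketch of an automated system test driving a web UI with Selenium.
# The URL and element IDs are hypothetical placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

def test_signup_flow():
    driver = webdriver.Chrome()
    try:
        driver.get("https://staging.example.com/signup")
        driver.find_element(By.ID, "email").send_keys("test@example.com")
        driver.find_element(By.ID, "password").send_keys("not-a-real-password")
        driver.find_element(By.ID, "submit").click()
        # after signup, expect a confirmation message on the resulting page
        assert "Welcome" in driver.page_source
    finally:
        driver.quit()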

Test quality

While tests evaluate the quality of some software, we can also ask whether the tests are actually good at evaluating the software, that is, ask about the quality of the tests themselves. Tests are never perfect and only probe some of the infinitely many possible executions; tests can reveal defects but never guarantee their absence. However, with better testing we explore more behavior of the software and can be more confident that we have found most defects. So how can we assess the quality of our tests?

The most common approach to evaluate test quality is to measure coverage achieved by the tests. In a nutshell, coverage measures how much of the code has been executed by tests. If we have more tests or more diverse tests, we will execute more parts of the code and are more likely to discover bugs than if we focus all our testing on a single location. Importantly, coverage can highlight which parts of the implementation are not covered by tests and guide us to possibly write additional tests to cover them. In the simplest case, coverage is measured in terms of which lines of code are executed, and coverage tools will highlight covered and uncovered areas in reports and development environments, but it is also possible to assess coverage in terms of branches taken through if statements or how well our requirements are covered by tests. Tools to produce coverage reports are broadly available with all testing frameworks and development environments, for example, JaCoCo in Java and Coverage.py for Python.

Example of a coverage report, highlighting which lines have not been executed by any tests, which may inform what additional tests to add, for example tests about year changes and tests about the 400-year leap year rule.
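
For example, if a coverage report for the next_date tests above showed that code handling year changes and the 400-year leap-year rule was never executed, we might add tests such as the following (a sketch assuming the same next_date function):

# Additional tests suggested by a coverage report (assuming the next_date
# function from above): exercising year changes and the 400-year rule.
def test_year_change():
    assert next_date(2021, 12, 31) == (2022, 1, 1)

def test_400_year_leap_rule():
    assert next_date(2000, 2, 28) == (2000, 2, 29)  # 2000 is divisible by 400
    assert next_date(1900, 2, 28) == (1900, 3, 1)   # 1900 is not a leap year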

Beyond coverage, the idea of mutation testing has gained some popularity for evaluating test quality. Mutation testing essentially asks whether the tests are sensitive enough to detect typical bugs that might be expected in the implementation, such as off-by-one mistakes in conditions. To this end, a mutation testing framework will inject artificial bugs into the software, for example, by flipping comparison operators in expressions (e.g., replacing < by >) or adjusting decision boundaries (e.g., replacing a < b.length by a < b.length + 1). For each injected change, the tool will then run the tests to see whether the previously passing tests now fail. Test suites that fail on more injected changes (said to have a higher “mutation score”) are more sensitive to such changes and are hence considered to be of higher quality, because they are more likely to detect a real mistake of a similar kind in the program. The downside of mutation testing is that it can be computationally expensive, because the tests must be executed over and over again for each injected change.
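
The following sketch illustrates the idea with a hypothetical is_leap_year helper: a mutation testing tool would automatically inject a change like the flipped comparison below and check whether any existing test fails on the mutant (said to “kill” the mutant):

# Sketch of mutation testing for a hypothetical is_leap_year helper: a tool
# would inject changes like the one below and rerun the existing tests.
def is_leap_year(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

# mutant: the tool flips != to == in the second condition
def is_leap_year_mutant(year):
    return year % 4 == 0 and (year % 100 == 0 or year % 400 == 0)

def test_common_leap_year():
    # this test kills the mutant: it passes on the original implementation but
    # fails on the mutant, because 2024 is divisible by neither 100 nor 400
    assert is_leap_year(2024)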

Testing in production

Most testing is done in a controlled environment, such as on a continuous integration server with a test database that is reinitialized before every test. Controlling the environment helps to make tests reproducible and reliable and prevents buggy code from breaking a production system if it is used as part of a test. However, some behavior can be tricky to reproduce in an artificial setting. It may be difficult to make test inputs realistic in terms of content and frequency of production data, and some kinds of problems may only occur at scale or from complex interactions within production systems beyond what can be easily tested during integration testing.

Testing in production systems has become increasingly popular in recent years to complement traditional offline testing in controlled environments. The idea of testing in production is to observe the entire production system while introducing a change, such as the release of a new version. We then observe how the system or its users react to the change, for example, whether the change is noticeable in any engagement or quality metrics we measure for the system. The same mechanisms can be used to inject artificial faults into the production environment, such as rebooting a server to see whether the production system recovers gracefully, a process known as chaos testing. They can also be used to conduct an experiment to see which of two alternative implementations works better in production, known as an A/B test. To avoid disaster from bad changes and failed experiments, testing in production is typically limited to a subset of all users, with safety mechanisms such as the ability to roll back the change quickly.
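
As a minimal sketch of the A/B testing idea (the ranking functions are hypothetical), a system might deterministically assign a small fraction of users to the new implementation and later compare engagement or quality metrics between the two groups:

import hashlib

# Minimal sketch of A/B test assignment: users are split deterministically into
# a small treatment group and a control group; new_ranking_model and
# old_ranking_model are hypothetical implementations being compared.
def assigned_group(user_id, experiment="new-ranker", treatment_fraction=0.1):
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 100 < treatment_fraction * 100 else "control"

def handle_search(user_id, query):
    if assigned_group(user_id) == "treatment":
        return new_ranking_model(query)  # hypothetical new implementation
    return old_ranking_model(query)      # hypothetical current implementation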

Since testing in production is useful for experimentation and particularly attractive when incrementally developing and testing machine learning models with production data, we will dedicate an entire chapter Quality Assurance in Production to the idea, concepts, and tools.

Code Review

Code review, traditionally also known as inspection, is a process of manually reading and analyzing code or other artifacts to look for problems. This is a static approach, in that reviewers analyze the code without executing it.

Today, the most common form of code review is lightweight incremental code review of changes as they are committed into a version control system: Essentially all big tech companies have adopted some code review process and tooling where each set of changes needs to be reviewed and approved by multiple others before it will be integrated into the shared code base. In open source, this is commonly achieved through reviewing pull requests before they are merged into the main development branch. Generally this style of code review happens early, quickly, and frequently as code is developed incrementally.

Example excerpt of code review of suggested changes in a pull request for the open source project Apache Kafka on GitHub. Here, the change’s author constructively discusses with other contributors possible ways to make the change clearer.

Studies show that this form of lightweight incremental code review is only moderately effective at finding serious defects, but it can surface issues that are difficult to test for, such as poor code readability, poor documentation, bad coding patterns, poor extensibility, and inefficient code. In addition, code review can be performed on artifacts that are not executable, such as requirements and design documents, incomplete code, and documentation. Importantly, code review has several other benefits beyond finding defects, particularly for information sharing and creating awareness (i.e., multiple developers know about each piece of code) and for teaching and promoting good practices.

In a machine learning context, code reviews can be extended to all code and artifacts of the machine-learning pipeline, including data wrangling code, training code, model evaluation reports, model requirements, model documentation, deployment infrastructure, configuration code, and monitoring plans.

There are also older, more formal, and heavyweight approaches to code review. Fagan-style inspection, popular and well studied in the 1980s, is an approach where multiple reviewers first independently read a fragment of code and then discuss it as a group. This style of focused review analyzes select, important pieces of code when they are ready to be released, rather than analyzing incremental changes as code is developed. Often checklists are used to focus the reviewers’ attention on specific issues, such as possible security problems. Fagan-style inspections have been shown to be incredibly effective at finding bugs in software (more than any other quality assurance strategy), but they are also very slow and expensive. They are usually reserved for mission-critical core pieces of code in high-risk projects, for important design documents, and for security audits. In a machine learning context, they may be used, for example, for reviewing the training code or for fairness audits. In today’s practice, though, heavyweight inspections have almost entirely been replaced by lightweight incremental code reviews, which are much cheaper even if less effective at finding defects.

Static Analysis

Static analysis describes a family of approaches to automatically reason about code without executing it. There are many different static analyses, typically categorized and named after the technical flavor of the underlying technique, such as type checking, code linting, data-flow analysis, or information-flow analysis. At a high level, static analysis can be seen as a narrow form of code review automatically performed by an algorithm, each focused on specific kinds of issues.

Static analysis tools examine the structure of source code and how data flows through source code, often looking for patterns of common problems, such as possibly confusing assignment (`=`) with comparison (`==`), violating style guidelines, calling functions that do not exist, including dead code that cannot be reached, and dereferencing a null pointer.

// suspicious assignment in if expression & type error
if (user.jobLevel = 3) ...

int fn() {
    int X=1 ; // violation of style guidelines with missing
              // whitespace and uppercase variable names
    return Math.abss(X); // function is not defined
    X = 3; // dead code
}

// possible null pointer exception if detailedLogging && !anyLogging
PrintWriter log = null;
if (anyLogging) log = new PrintWriter(...);
if (detailedLogging) log.println("Log started");

Static analyses can issue warnings about simple issues such as violations of code style conventions, about suspicious but not technically illegal code such as an assignment within an if expression, and about issues that are clear mistakes such as dead code or calls to undefined functions. Each static analysis is typically specialized for a specific kind of problem, but common analysis tools bundle many different analyses. Some static analyses can guarantee that certain narrow kinds of problems do not occur if the analysis does not report any issues (at the cost of possible false warnings); for example, static type checkers in Java and C# prevent developers from calling methods that do not exist, and some security analyses can guarantee the absence of cross-site-scripting vulnerabilities. However, most common static analyses rely on heuristics that point out common problems but may miss others.

For dynamically typed and highly flexible languages such as Python and JavaScript, static analysis tools are usually weaker than those for Java or C; they tend to focus on style issues and simple problem patterns. For example, Flake8 is a popular style checker for Python. Static analyses for specific problems related to machine learning code and data transformations have been proposed in academia, but they are not commonly adopted.
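
For illustration, the following Python snippet contains the kind of issues a linter like Flake8 would flag; the specific warning codes (from the bundled pyflakes and pycodestyle checks) may vary with configuration:

import os  # flagged: "os" imported but unused (F401)

def total_price(items):
    total=0  # flagged: missing whitespace around operator (E225)
    for item in items:
        total += item.price
    return totl  # flagged: undefined name "totl" (F821)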

Other Quality Assurance Approaches

Several other quality assurance approaches exist. While a detailed discussion would exceed the scope of this book, we provide a quick overview since some come up in discussions around machine-learning code too.

Dynamic analysis is a form of program analysis where additional information is tracked while the program is executed by a test or in production. Dynamic type checking is a typical example of a dynamic analysis built into most programming languages, where each runtime object is associated with a runtime type, preventing the interpretation of strings as numbers without an appropriate conversion and enabling more informative error messages when a problem occurs during an execution. Profiling is a dynamic analysis that measures the execution time of functions to observe performance bottlenecks. Provenance tracking is common in research on databases and big data systems, where the origin of values in a computation is tracked at runtime, for example, to identify which rows of which dataset were involved in computing a result. Dynamic analyses are typically implemented either (a) directly in the runtime system (e.g., in the Python interpreter) or (b) by instrumenting the code before execution to track the desired information.
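
For example, profiling in Python can be done with the built-in cProfile module; the following is a minimal sketch, where expensive_search_query stands in for any function under analysis:

import cProfile
import pstats

# Minimal profiling sketch using Python's built-in cProfile module;
# expensive_search_query is a placeholder for the code being analyzed.
profiler = cProfile.Profile()
profiler.enable()
expensive_search_query()
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)  # top 10 functions by cumulative time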

Formal verification mathematically proves that an implementation matches a specification. Assuming that the specification is correct, formal verification provides the strongest form of assurance: if we can prove a program correct with regard to a specification, we can be sure that it performs exactly as specified. Formal verification is expensive though, requires substantial expertise and effort, and cannot be fully automated. In practice, it is rarely used for quality assurance in real-world projects outside of mission-critical core fragments of a system.

Finally, some assurance strategies investigate process quality rather than (or in addition to) the quality of the actual software. That is, we assure that the software was developed with a well-defined and monitored process. This may include assuring that the team follows good development practices, documents and approves requirements, performs architectural planning, uses state-of-the-art development and quality assurance tools, checks for certain kinds of problems, tracks known defects, uses a mechanism to identify and eliminate recurring problems (root cause analysis), analyzes and tracks risks, uses version control, uses change management, monitors developer progress, and much more. Many process improvement methods, such as Six Sigma, Kaizen, and Lean Manufacturing, stem from engineering and manufacturing but have also been adapted to software processes. Organizations’ attention to process quality is often characterized by maturity models, where organizations that define their processes, follow them, and monitor them to look for improvement opportunities are considered more mature. The best-known example of a maturity model is likely the Capability Maturity Model for traditional software, but maturity models have also been suggested for testing machine-learning systems and for machine-learning process maturity more generally. Developing software with a high-quality process does not guarantee that the resulting software artifacts are of high quality themselves, but it can avoid problems stemming from sloppy practices. Also, a strong focus on process quality is often disliked by developers and associated with overwhelming bureaucracy.

Planning and Process Integration

Quality assurance methods only improve software quality if they are actually used. Developers left on their own are notorious for focusing on completing new functionality rather than assuring the quality of their work. Many developers actively dislike testing, do quick and shallow code reviews when forced to, and ignore warnings from static analysis tools, if they even run them in the first place. A key question, hence, is how to get team members to invest in quality assurance and take it seriously.

First of all, managers and teams should include time and resources for quality assurance when making plans. When under constant deadline pressure, developers may feel forced to cut corners, often creating unpleasant work environments where developers are dissatisfied with their own work and working conditions.

To this end, quality assurance activities should be planned, just like other development activities. This includes developing a test plan as well as setting quality goals when identifying requirements and designing components. For example, when identifying requirements, also discuss what system tests could be used to evaluate whether the requirements were met. When assigning a milestone for completing a component to a developer or team, the milestone should not only require the completion of all coding activities, but also require having demonstrated that the code meets agreed-upon quality goals (e.g., meeting testing requirements, finishing code review and resolving all comments, passing static analysis, meeting the customer’s expectations). This style of pairing design and development activities with test activities is commonly described as the V-model.

The V-Model of system development describes how quality assurance strategies are matched to requirements and design activities in the system development process. This figure shows one of many common variations of this diagram linking software development stages (as discussed in the process chapter) to quality assurance activities.

Some organizations have dedicated quality assurance or testing teams, separate from development teams, who need to sign off on passing quality goals. In today’s practice, though, most developers perform most quality assurance activities themselves. Still, dedicated testing teams with specialists can be useful to provide an independent perspective, particularly in later stages, focused on system testing, security audits, penetration testing, and fairness audits.

Screenshot of static analysis warning shown within Google’s in-house code review tool, from Sadowski, Caitlin, Edward Aftandilian, Alex Eagle, Liam Miller-Cushon, and Ciera Jaspan. “Lessons from building static analysis tools at Google.” Communications of the ACM 61, no. 4 (2018): 58–66.

Finally, by integrating quality assurance practices with tooling into the development process, it is often possible to require developers to perform certain steps. There are many successful examples of such process integrations:

  • Code reviews are required in many organizations such that code cannot be merged unless approved by at least two other developers (technically enforced).
  • Continuous integration tools run all tests and refuse to merge changes that break tests (technically enforced).
  • The version control system refuses to merge code changes if they are not accompanied by unit tests or if they decrease overall code coverage (can be technically enforced).
  • Google has been successful in getting developers to pay attention to static analysis warnings with a bit of social engineering: warnings are surfaced during code review, where developers are already open to feedback and the presence of other reviewers creates social pressure to address them; similar tool integration is now also possible for pull requests on GitHub (not technically enforced, but creating social pressure).
  • Project managers do not let developers move on to the next task before their previous milestone is complete with tests and documentation (enforced by management).

Establishing a quality culture, in which the organization values software quality enough to make it a priority, takes time, effort, and money. Most developers do not want to deliver low-quality software, but many feel pressured by deadlines and management demands to take shortcuts. As with all culture change, establishing a quality culture requires buy-in from management and the entire team, supported with sufficient resources, not just wordy ambitions or compliance with rules imposed from above. Defining processes, observing outcomes, avoiding blame, performing retrospective analyses, and celebrating successes can all help to move to a culture where quality is valued.

Summary

There are many strategies to assure quality of software, all of which also apply to software with machine learning components. Classic quality assurance techniques include testing, code review, and static analysis. All of these are well established in software engineering. To be effective, these activities need to be planned, automated where possible, and integrated into the development process.

Further Reading

  • Quality assurance is prominently covered in most textbooks on software engineering and dedicated books on testing and other quality assurance strategies exist, such as Copeland, Lee. A practitioner’s guide to software test design. Artech House, 2004; Aniche, Mauricio. Effective Software Testing: A Developer’s Guide. Simon and Schuster, 2022; and Roman, Adam. Thinking-driven testing. Springer International Publishing, 2018.
  • Great coverage of many different forms of code review and inspection, including advice for more heavyweight strategies: Wiegers, Karl Eugene. Peer reviews in software: A practical guide. Boston: Addison-Wesley, 2002.
  • An excellent study of modern code review (incremental review of changes) at Microsoft highlighting the mismatch between expectations and actually observed benefits: Bacchelli, Alberto, and Christian Bird. “Expectations, outcomes, and challenges of modern code review.” In Proceedings of the 35th International Conference on Software Engineering (ICSE), pp. 712–721. IEEE, 2013.
  • Article describing the multiple attempts to introduce static analysis at Google and the final solution of integrating lightweight checks during code review: Sadowski, Caitlin, Edward Aftandilian, Alex Eagle, Liam Miller-Cushon, and Ciera Jaspan. “Lessons from building static analysis tools at Google.” Communications of the ACM 61, no. 4 (2018): 58–66.
