Data Quality for Building Production ML Systems

“Data cleaning and repairing account for about 60% of the work of data scientists.” — Gil Press. “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says.” Forbes Magazine, 2016.

Machine learning approaches rely on data to train models. In production, user data, possibly enhanced by data from other sources, is fed into the model to make predictions. Production machine learning systems additionally collect telemetry data in production and make decisions based on the collected data. With all that reliance on data, system quality can suffer massively if the data is of low quality: models may perform poorly, be biased, or become outdated; user decisions based on predictions may be unreliable; and so may operating decisions based on telemetry. A system that does not attend to data quality may become brittle and will likely degrade over time, if it works in the first place.

This chapter provides an overview of different problems and solutions regarding data quality. It corresponds to the “data quality” lecture of our Machine Learning in Production course; for other chapters of our upcoming book, see the table of contents.

Case study: Inventory Management

With thousands of products from different vendors with different shelf lives, different delivery times and modalities, changing consumer behavior, the influence of discounts and other marketing campaigns (both the supermarket’s own and the vendors’), and much more, predicting sales and needed purchases for a supermarket is a challenging and constantly changing problem with significant impact on the supermarket’s profits and its customers’ experiences. It is a problem that could benefit from multiple machine-learned models.

For now, let’s assume the supermarket has several databases tracking products (e.g., id, name, weight, size, description, vendor), current stock (e.g., mapping products to store locations, tracking quantity and expiration dates), and sales (e.g., id, user id if available, date and time, products, weights, and prices).

Data quality challenges

Data quality is commonly characterized along several dimensions:

  • Accuracy: The data was recorded correctly.
  • Completeness: All relevant data was recorded.
  • Uniqueness: All entries are recorded once.
  • Consistency: The data agrees with each other.
  • Timeliness: The data is kept up to date.

Achieving good data quality can be challenging in all these dimensions. For example, in our inventory system, somebody could easily enter the wrong product or the wrong number of items when a shipment is received (accuracy), could simply forget an item or an entire shipment (completeness), could enter the shipment twice (uniqueness), could enter different prices for the same item in two different tables (consistency), or simply could forget to enter the data until a day later (timeliness).

In pretty much any real-world system, including certainly almost all production machine-learning systems, data is noisy, rarely ever meeting all quality criteria above. Noise can have many different sources:

  • When manual data entry is involved, such as a supermarket employee entering received shipments, mistakes are inevitable. Automated data transfer, such as electronic sales records exchanged between vendor and receiver, can reduce the amount of manual data entry and associated noise.
  • Many systems receive data through sensors that interact with the real world and are rarely fully reliable. This can range from mistakes when scanning documents in the inventory system (e.g., poor writing/printing, crumpled paper, a smudged camera) to poor-quality data from lidar and video sensors in self-driving cars (e.g., dealing with fog, raindrops on the lens, tilted cameras).
  • Data created from computations can suffer quality problems when the computations are simply wrong or they crash (e.g., leading to missing or duplicate values). For example, the supermarket cashier might enter a sale twice if the first attempt crashes with an error message after the data has already been written to the database, or a network outage may cause the loss of sales records during the outage. Nondeterminism in computer applications, especially in parallel and distributed systems, can lead to all kinds of faults such as data being processed out of order.

Data changes

Among others, the quality of the supplied data may change, for example, as users become better trained at manual data entry or as the sensor of the scanner ages. Similarly, assumptions about how to interpret data from the environment may no longer hold, as the meaning of certain data may change, possibly because the process for collecting the data has changed; for example, the measured weight of delivered produce may no longer account for the weight of the box it is stored in (previously we would always deduct some weight to calibrate).

Software components that produce data may change during software updates and change the way the data is stored, such as storing 2-for-1 deals only as selling one at full price and one for free rather than as selling two items for half price; such changes can throw off downstream components that read and analyze such data, for example, if it had previously filtered all items with a 0.00 price. Similarly, updates to machine-learning models may improve or degrade the quality of the produced data, such as the accuracy of the predicted supermarket sales.

Finally, system objectives may also change, and this influences the requirements on data quality for certain parts of the system. For example, if a theft-prediction component is added to the inventory management system, the timeliness of recording missing items may matter much more than it did without such a component.

User behavior also changes over time, changing what data we consider common or an outlier, such as unusually high sales of certain supplies when a snowstorm is predicted, or increased purchases of certain items as a diet fad takes hold. In addition to such natural changes, users can also deliberately change their behavior. This can happen as an intended reaction to model predictions, such as a model suggesting to lower the price to increase sales of certain overstocked or soon-to-expire items, but it can also happen as an attempt to game or deceive the model, such as a vendor buying their own items to simulate demand (see the security lecture for more detailed coverage).

A quick detour: Accuracy and precision

Similar to the accuracy of machine-learning predictions, the accuracy of a measurement process describes how closely the data (on average) represents the real value we want to capture (often some quantity in the real world). For example, the accuracy of our inventory data is measured in terms of how closely it represents the actual inventory of our warehouse.

In contrast, precision refers to how reliably a measurement process produces the same result (whether correct or not). That is, precision is a measure of measurement noise. For example, a scale for weighing incoming produce typically produces a fairly good measure, but if we weigh the same produce multiple times the value always differs by a few grams, and sometimes it might even be off by quite a bit.

Ideally, we want measures that are both accurate and precise, but that is not always the case. We can have measures that are imprecise but accurate, like the scale above: if we just repeat the measurement multiple times and average the observations, we get a fairly accurate result (assuming random noise). In contrast, we can also have measures that produce very precise but inaccurate data, such as a clock that is running behind: it will consistently report a wrong time, no matter how many measurements we average over.

Visualizing accuracy and precision. CC-BY-4.0 by Arbeck

In the context of data quality, we need to deal both with inaccuracy and imprecision (noise) in data. Imprecision is usually easier to handle, because we can use statistics to deal with noise in measurements. While we may not have multiple observations for the same data point, noise tends to cancel out over time. For example, if we log too many bananas today due to random measurement noise, we may log too few tomorrow, and overall our system won’t be affected too much. In contrast, inaccuracy is much more challenging, because it represents a systematic problem in our measurements that cannot be detected by statistical means alone. For example, if we consistently sold more bananas than we recorded with a precise but miscalibrated scale, we might run out of bananas early. To detect inaccuracy in a data generation process, one needs to systematically look for problems and biases in the measurement process or somehow have access to the true value to be represented.
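
To make the distinction concrete, here is a small Python sketch with made-up numbers (not from any real scale): random noise averages out over many measurements, whereas a systematic miscalibration does not.

import random

TRUE_WEIGHT = 1000  # true weight of a crate of bananas in grams (hypothetical)

# imprecise but accurate scale: random noise of a few grams around the true value
noisy = [TRUE_WEIGHT + random.gauss(0, 5) for _ in range(1000)]

# precise but inaccurate scale: consistently reads 50 grams too low
biased = [TRUE_WEIGHT - 50 + random.gauss(0, 0.1) for _ in range(1000)]

print(sum(noisy) / len(noisy))    # close to 1000: averaging cancels out the noise
print(sum(biased) / len(biased))  # about 950: averaging cannot reveal or fix the bias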

Data produced through automated computations rather than human or sensor inputs is almost always highly precise (repeatedly running the same algorithm with the same inputs will produce the same outputs), but not necessarily accurate if there are bugs in the computation. Data produced from other data through machine-learned models, such as predicting tomorrow’s sales based on today’s sales, is also typically perfectly repeatable (most machine-learning models are deterministic for a given input; see the reproducibility lecture), but not necessarily accurate. Across multiple data points, model predictions can be noisy (imprecise) or systematically biased (inaccurate), depending on the model and problem.

Data quality and machine learning

In general, more data leads to better models, up to a point: there are typically diminishing returns from adding more and more data.

Noisy data (imprecision) may lead to less confident models and occasionally spurious predictions if the model overfits to random measurement noise. However, noise in training data, especially random noise, is something that most machine-learning techniques can handle quite well, especially if a lot of data is available.

In contrast, inaccurate data is much more dangerous, because it leads to misleading models that make similarly inaccurate predictions. For example, a model trained on sales data from a miscalibrated scale that systematically underrepresents the sales of bananas will predict too low a demand when ordering bananas. Also note that inaccurate data with systematic biases, rather than noisy data, is the source of most fairness issues in machine learning (see the fairness lecture), such as recidivism models trained on data from past judicial decisions that were systematically biased against minorities (i.e., a systematic rather than a random measurement problem).

For the machine-learning practitioner, this means that both data quantity and data quality are important. Given a typically limited budget, one needs to carefully trade off the cost of acquiring more data against the cost of acquiring better data or cleaning data. It is sometimes possible to gain great insights even from low-quality data, but low quality can also lead to completely wrong and biased models, especially if the data is systematically biased rather than randomly noisy.

Exploratory Data Analysis

Exploratory data analysis is the process of exploring the data, typically with a goal of understanding certain aspects of it. This often starts with understanding the shape of the data: What data types are used (e.g., images, tables with numeric and string columns)? What are the ranges and distributions of data (e.g., what are typical prices for sales, what are typical expiration dates, how many different kinds of sales taxes are recorded)? What are relationships in the data (e.g., are prices fixed for certain items)? To understand distributions it is often a good idea to plot distributions of individual columns (or features) with boxplots, histograms, or density plots or to plot multiple columns against each other in scatter plots; computational notebooks are a well-suited environment for this kind of visual exploration.
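
As a minimal illustration, the following pandas sketch (assuming a hypothetical sales.csv file with weight and price columns) inspects types and summary statistics and plots a distribution and a relationship; a real analysis would explore many more views of the data.

import pandas as pd
import matplotlib.pyplot as plt

sales = pd.read_csv("sales.csv")           # hypothetical sales table
print(sales.dtypes)                        # data types of each column
print(sales.describe())                    # ranges and summary statistics
sales["price"].plot.hist(bins=50)          # distribution of prices
sales.plot.scatter(x="weight", y="price")  # relationship between two columns
plt.show()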

It is often a good idea to explore trends and outliers. This might sometimes give a sense of precision and typical kinds of mistakes. Sometimes it is possible to go back to the raw data or people with domain knowledge to check correctness; for example, does this vendor really charge $10/kg for bananas, or did we really sell 4 times the normal amount of toilet paper last June in this region?

Beyond visual techniques, a number of statistical analyses and data mining techniques can help to understand relationships in the data, such as correlations and dependencies between columns. For example, the date an item is delivered hopefully correlates with its expiration date, or the sale of certain products might correlate with the predominant ethnicities in the supermarkets’ neighborhoods. Common techniques here are association rule mining and principal component analysis.
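
As a sketch of how to start looking for such relationships with the same hypothetical sales table, pairwise correlations and cross-tabulations are quick first checks; dedicated association rule mining libraries go much further.

import pandas as pd

sales = pd.read_csv("sales.csv")                       # hypothetical sales table
print(sales.corr(numeric_only=True))                   # pairwise correlations of numeric columns
print(pd.crosstab(sales["product"], sales["vendor"]))  # co-occurrence of two categorical columns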

When building production machine learning systems, it is important to understand data, its shape, and its quality in a lot of places in the system, and exploratory data analysis is a useful first step for all of these. This includes understanding input and output data of machine-learned components, both training data and live data in production, but also telemetry data. It is a good practice to document all assumptions made about the data for modeling and expectations for data at runtime. Many such assumptions can then be encoded as data quality checks at runtime.

Data Cleaning — A Quick Overview

Data quality problems are commonly grouped into three kinds (see Rahm and Do’s survey in the readings at the end):

  • Single-source schema-level problems where integrity rules are violated or uniqueness constraints are not observed within data from a single source, that is, the data does not match certain formatting constraints. For example, in the dataset below, two customers share an ID (uniqueness violation), the customer table contains zip codes not included in the address table (cross-table integrity violation), the date field contains entries that are not actually dates (integrity/format violation), and the birthday does not consistently relate to the recorded age (cross-column integrity violation).
  • Single-source instance-level problems where individual entries in a dataset are incorrect (accuracy problems), such as the misspelled city name Sam Francisco, the invalid phone number, or a birthdate that does not actually belong to the person.
  • Multi-source problems refer to inconsistencies between multiple data sources that can occur both at the schema level (consistency, uniqueness), say multiple sources tracking information about the same persons but using inconsistent column names or IDs, and at the instance level (accuracy), say when times are recorded inconsistently in multiple sources.
Example of some problematic dataset.

Cleaning data usually requires two steps: first, errors need to be detected; afterward, they can be repaired.

Regarding detection, beyond manual detection, which tends to be tedious, there are many different automated detection strategies. The most common ones involve checking data against schemas and constraints/rules or checking for duplicate values (discussed in more depth below); more sophisticated techniques compare data against expected distributions to detect outliers or detect issues probabilistically, such as likely misspellings. It is also quite common to use machine learning for this step, as, for example, in HoloClean.
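
As a small illustration of rule-based detection, the following pandas sketch (with hypothetical file and column names) flags duplicate rows, missing values, and violations of a simple rule; in practice, such rules are better encoded in a schema or a library like Great Expectations so they are checked consistently.

import pandas as pd

inventory = pd.read_csv("inventory.csv")              # hypothetical inventory table
duplicates = inventory[inventory.duplicated()]        # repeated entries (uniqueness)
missing = inventory[inventory["product_id"].isna()]   # missing product ids (completeness)
negative = inventory[inventory["quantity"] < 0]       # negative quantities violate a simple rule (accuracy)
print(len(duplicates), len(missing), len(negative))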

Once detected, there are again many different strategies for cleaning. For schema and rule violations it can be common to reject the data (either delete it or require human intervention to fix it). Sometimes exploring the sources of errors can reveal more permanent fixes, such as ensuring that missing values are detected at the source and collected again for temporary sensor failures. For outliers, removal and normalization are again common in many contexts; one may also employ probabilistic repair where a likely expected value (that aligns with other data in the dataset) is substituted for a suspicious value.
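
The sketch below illustrates two simple repair strategies on the same hypothetical inventory table: rejecting rows that violate a rule and replacing extreme outliers with the median. Whether such automated repair is appropriate depends on the context; silently “fixing” data can hide real problems.

import pandas as pd

inventory = pd.read_csv("inventory.csv")              # hypothetical inventory table
inventory = inventory.drop_duplicates()               # repair uniqueness violations
inventory = inventory.dropna(subset=["product_id"])   # reject rows with missing product ids

# crude outlier repair: replace quantities far from the median with the median
median_qty = inventory["quantity"].median()
outliers = (inventory["quantity"] - median_qty).abs() > 3 * inventory["quantity"].std()
inventory.loc[outliers, "quantity"] = median_qty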

Many commercial, academic, and open-source data cleaning tools exist targeted at different communities, such as OpenRefine (formerly Google Refine), Trifacta Wrangler, Drake, and HoloClean.

Data Schema

Relational database schemas. Schemas are likely most familiar to users of relational databases. For example, the table definitions below require that the employees table has three columns with the names emp_no, birth_date, and name, and that those columns accept integers, dates, and strings respectively, but no null values. In addition, the employee number must be unique across all entries in the table. Entries in the dept_manager table have extra constraints in that they may only refer to entries existing in the employees and departments tables.

CREATE TABLE employees (
    emp_no INT NOT NULL,
    birth_date DATE NOT NULL,
    name VARCHAR(30) NOT NULL,
    PRIMARY KEY (emp_no)
);
CREATE TABLE departments (
    dept_no CHAR(4) NOT NULL,
    dept_name VARCHAR(40) NOT NULL,
    PRIMARY KEY (dept_no),
    UNIQUE KEY (dept_name)
);
CREATE TABLE dept_manager (
    dept_no CHAR(4) NOT NULL,
    emp_no INT NOT NULL,
    FOREIGN KEY (emp_no) REFERENCES employees (emp_no),
    FOREIGN KEY (dept_no) REFERENCES departments (dept_no),
    PRIMARY KEY (emp_no, dept_no)
);

Traditional relational databases typically require a basic schema for all data and they refuse to accept new data or operations on existing data that violate the constraints in the schema. For example, a relational database would return an error message when trying to insert an employee with an ill-formatted date or an employee number that is already in the database.

Schemas can enforce all kinds of different constraints on allowed data, and different systems support different mechanisms. Requiring some notion of completeness (e.g., nonnull values), uniqueness (e.g., unique credit card numbers in user accounts), and conformance to types and value ranges are the most common constraints, as are integrity constraints that relate data in multiple columns or tables (e.g., zip code and city name should match), but many other forms of constraints could be considered. More advanced schema constraints may include requirements for the data to follow certain distributions, where schema violations could be used to reject data or trigger alerts. For example, if the supermarket’s amount of delivered bananas strongly exceeds the normal range, the entry could be flagged for review or the row could be removed.

Schemaless data in ML systems. In machine-learning contexts, data is often exchanged in schemaless formats, such as text entries in log files, tabular data in CSV files, tree-structured documents (JSON, XML), key-value pairs, and so on. Such data is regularly communicated through plain text files, through REST APIs, through message brokers, or stored in various (often non-relational) databases. For example, the MovieLens dataset is stored in a custom text format listing an ID, the movie’s title and year, and a list of genres.

1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance

Exchanging data without any schema enforcement can be dangerous, as changes to the data format may not be detected easily and there is no way of communicating expectations on the data format and what constitutes valid and complete data between a data provider and a consumer. In practice, this may mean that the consumer needs to carefully parse the data and prepare for missing or invalid data. Some monitoring may be needed to detect when all or most incoming data can no longer be recognized, which may indicate format changes. Even worse, without documenting expectations between the producer and the consumer, subtle format changes may go unnoticed. For example, if a movie rating was previously communicated on a 1-to-5 star scale, but is now changed simply to a positive/negative rating encoded as 0 and 1, the consumer can still parse the data but may not notice that the meaning has changed.
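
For illustration, a consumer of the MovieLens-style text format above might parse and validate each line defensively rather than trusting the format; the file name, encoding, and error handling below are assumptions for the sketch, not part of the dataset's documentation.

def parse_movie(line):
    # expected format: id::title (year)::genre|genre|...
    parts = line.strip().split("::")
    if len(parts) != 3:
        raise ValueError(f"unexpected number of fields: {line!r}")
    movie_id, title, genres = parts
    if not movie_id.isdigit():
        raise ValueError(f"invalid movie id: {movie_id!r}")
    return int(movie_id), title, genres.split("|")

movies = []
with open("movies.dat", encoding="latin-1") as f:      # hypothetical file name and encoding
    for line in f:
        try:
            movies.append(parse_movie(line))
        except ValueError as e:
            print("skipping malformed row:", e)        # log and monitor rather than crash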

Enforcing schemas beyond relational databases. Practitioners have realized that data schema enforcement is important when exchanging data, even for textual data exchange outside of relational databases.

Many tools have been developed to define schemas and check whether data conforms to schemas for key-value, tabular, and tree-structured data. For example, XML Schema is a well-established approach to limit the kind of structures, values, and relationships in XML documents; similar schema languages have been developed for JSON and CSV files. Great Expectations is a Python-based library that allows adding schema-style checks for data frames in many data-science workflows.

expect_column_values_to_be_between(
    column="passenger_count",
    min_value=1, max_value=6
)

A schema language additionally provides the opportunity to encode data more efficiently in binary formats (similarly to what happens internally in relational databases) than when stored textually. For example, numbers can be serialized in binary format, rather than written as text; for structured data, it is not necessary to repeat the field names for every entry. Many modern libraries combine data schema enforcement and efficient data serialization, such as Google’s Protocol Buffers, Apache Avro and Apache Parquet. The following is an example of a schema for Avro defining a record with two named and typed fields:

{
  "type": "record",
  "namespace": "com.example",
  "name": "Customer",
  "fields": [
    {
      "name": "first_name",
      "type": "string",
      "doc": "First Name of Customer"
    },
    {
      "name": "age",
      "type": "int",
      "doc": "Age at the time of registration"
    }
  ]
}

Many of these libraries provide bindings for multiple programming languages to easily read and write structured data and to exchange it between programs written in different languages, all while assuring conformance to a schema. Most of them also provide explicit versioning of schemas, which ensures that changes to the schema on which producers and consumers rely are detected automatically.
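
For example, with the fastavro library (one of several Python options for Avro), the Customer schema above can be used to write and read records while enforcing conformance; a minimal sketch:

from fastavro import parse_schema, writer, reader

schema = parse_schema({
    "type": "record", "namespace": "com.example", "name": "Customer",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"first_name": "Ada", "age": 36}]
with open("customers.avro", "wb") as out:
    writer(out, schema, records, validator=True)   # non-conforming records raise an error

with open("customers.avro", "rb") as f:
    for record in reader(f):                       # the schema travels with the file
        print(record)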

Several more advanced tools, such as those published by Google and Amazon (see readings at the end), also allow monitoring and checking distributions of data, to detect outliers and drift over time. The Great Expectations library supports some distribution assertions like expect_column_kl_divergence_to_be_less_than.

Inferring the schema. Developers usually do not like to write schemas and often enjoy the flexibility of schemaless formats and NoSQL databases, where data is flexibly stored in arbitrary tabular, key-value, or tree structures (e.g., JSON).

To encourage the adoption of schemas, it is possible to infer likely schemas from sample data. For example, one can detect that all entries in the first column of a data frame are unique numbers, all entries in the second column are text, all entries in the third column are either 1, 2, or 3 (suggesting an encoding of a finite set of values), and that values in the last column are uniformly distributed floating point numbers between 0 and 1. More complicated constraints across multiple columns can be detected with specification mining, rule mining, or schema mining techniques, for example, techniques based on association rule mining or invariant detection (see Ilyas and Chu’s Data Cleaning book for a good overview). For example, TFX’s data validation mechanism provides a schema definition language and a simple inference strategy. While inference may not be precise (especially if the data is already noisy and contains mistakes), it may be much easier to convince developers to look over an inferred schema than to ask them to write one from scratch.
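
A very crude version of such inference can be sketched in a few lines of pandas (real tools such as TFX's data validation are far more sophisticated): for each column, guess a type, whether it looks unique, whether it looks like an enumeration of few values, and its observed range.

import pandas as pd

def infer_schema(df, max_enum_values=10):
    schema = {}
    for col in df.columns:
        values = df[col].dropna()
        schema[col] = {
            "dtype": str(df[col].dtype),
            "unique": values.is_unique,          # candidate uniqueness constraint
            "enum": (sorted(values.unique().tolist())
                     if values.nunique() <= max_enum_values else None),
            "range": ((values.min(), values.max())
                      if pd.api.types.is_numeric_dtype(df[col]) else None),
        }
    return schema

print(infer_schema(pd.read_csv("sales.csv")))    # hypothetical sales table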

Distributions over data can also be learned from past data and recorded to compare distributions in future data, either as part of a monitoring scheme or as part of rejecting or even automatically repairing data entries.

Inconsistencies detected in distributions and across rows and rules by automatically mining patterns in the data with HoloClean. From: Theo Rekatsinas, Ihab Ilyas, and Chris Ré, “HoloClean — Weakly Supervised Data Repairing.” Blog, 2017.

Detecting Drift

There are several different forms of drift that are worth distinguishing (though these distinctions are not always explicit in the literature). While the effect of degrading model accuracy is similar for all of them, there are different causes behind them.

Data drift (also known as covariate shift or population drift) refers to changes in the distribution of the input data for the model. The typical problem with data drift is that over time input data differs from training data, leading our model to attempt to make predictions for out-of-distribution data. The model simply has not learned the relevant concepts yet.

For example, in the supermarket’s self-checkout system, a camera detects the kind of produce a customer is trying to weigh. While cucumbers usually look fairly similar and we have collected lots of training data to detect them reliably, a bad harvest or a local farmer producing an heirloom variety may lead to the supermarket selling cucumbers that actually look quite different from the cucumbers in the training data and that may be confused with squash.

Data drift can be detected by monitoring the input distributions and how well they align with the distribution of the training data. If the model attempts many out-of-distribution predictions, it may be time to gather additional training data to more accurately represent the new distribution. It is usually not necessary to throw away or re-label old training data, as the relationship between inputs and outputs has not changed (the old cucumbers are still cucumbers even if not many look like them during this season). Old training data may be discarded if the old corresponding input distributions are no longer relevant, e.g., we decide not to sell any cucumbers anymore or only the new heirloom variety.

Concept drift (or concept shift) refers to changes in the problem or its context over time that lead to changes in the relationship between inputs and expected predictions. Concept drift may be best thought of as cases where the labels in a training dataset would need to be updated since the last training run. It usually occurs because hidden factors that influence the expected label for a data point are not included in the input data.

For example, let’s say we want to predict how many cucumbers we will sell, based on last month’s sales and the current month to capture seasonal effects (i.e., f(lastsales, month) = predicted sales). However, a new popular fad diet now suggests that it is important to eat lots of cucumbers every day in May and August and to avoid cucumbers entirely in the fall. The old model will no longer predict the demand accurately, because the underlying relationship between inputs (last month’s sales and the current month) and outputs (sales this month) has changed due to changes in the context that have not been modeled (i.e., diet trends). We still make predictions for inputs from similar distributions (usual ranges of sales from a prior month and the same range of 12 months), but we now expect different outputs for the same inputs than we would have expected a year or two ago. The underlying concepts have changed.

Other common examples of concept drift are in security settings where the concept of what kind of transaction is considered as likely credit card fraud or what kind of activity on a server is considered as a likely intrusion may change over time. That is, for some inputs, the exact same transaction or behavior would be classified differently today than it would have been a week ago.

To detect concept drift, we cannot simply monitor input or output distributions. Instead, we need to reason about how frequently predictions were wrong, typically by analyzing data from telemetry. If the system’s prediction accuracy degrades without changes in the input distributions, concept drift is a likely culprit. In this case, the model needs to be updated with more recent training data that captures the changed concept. We need to either relabel old data or collect new training data, throwing out old training data that reflects the outdated input-output relationship.
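
A minimal sketch of such monitoring, assuming telemetry that eventually tells us whether each prediction turned out to be correct (e.g., by comparing predicted and actual sales), tracks accuracy over a sliding window and alerts when it drops well below a baseline; the window size and thresholds here are arbitrary placeholders.

from collections import deque

WINDOW = 1000                 # number of recent predictions to consider
BASELINE_ACCURACY = 0.85      # hypothetical accuracy measured at deployment time
recent_outcomes = deque(maxlen=WINDOW)

def record_outcome(prediction_was_correct):
    recent_outcomes.append(bool(prediction_was_correct))
    if len(recent_outcomes) == WINDOW:
        accuracy = sum(recent_outcomes) / WINDOW
        if accuracy < BASELINE_ACCURACY - 0.10:            # arbitrary threshold
            alert(f"possible concept drift: accuracy dropped to {accuracy:.2f}")

def alert(message):
    print("ALERT:", message)   # in practice: notify on-call staff or trigger retraining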

Schema drift refers to changes in the data encoding or assumptions about the meaning of the data. Schema drift typically emerges from technical changes in a different component of a system. If not detected, the data is interpreted incorrectly by the model when making predictions or by the training code.

For example, a point-of-sale terminal in a supermarket may start recording all weights of sold produce in kilogram rather than pounds after a software update. In this case, all future data will look very different, even though it may not be detected easily by the system (it’s just numbers before and after); it will probably just result in systematically wrong predictions after the change. Another example would be a change where recorded transactions now contain the employee name rather than their id; in this case we might detect that a data type changes and possibly trigger a crash in the existing feature engineering code trying to convert a name into a number. As a third example, consider a movie streaming service where user ratings are changed from a 1-to-5 star rating system (encoded numerically as 1..5) to a thumbs-up thumbs-down system (encoded as 1 or 2). If the change is not carefully communicated, the consumer of the ratings may not notice the difference and perceive that suddenly all movies are rated very poorly.

While data drift and concept drift come from the environment, over which we have little control, schema drift is a technical problem that can be addressed with technical means. Ideally, data exchanged between producers and consumers follows a strict schema, and that schema is versioned and enforced, revealing mismatching versions as a technical error at compile time or runtime. Semantic changes (e.g., what the numbers in an enumeration mean, or the star-rating change above) can also be described in the schema, though human analysis may be needed to assess the impact of a change. Distributions of values can be monitored to detect suspicious changes that may stem from any form of drift, including schema drift.

Detecting and coping with drift. To detect drift, it is necessary to monitor the model in production. Ideally, we can monitor the model’s prediction accuracy in production with telemetry, even if it is just a rough proxy for a traditional accuracy measure (see the testing in production lecture). To detect data drift, it is technically sufficient to observe the distribution of input data at runtime and compare it to the distribution of the training data, without having to reason about prediction accuracy; more formal models for detecting out-of-distribution inputs exist, and a simple technique is to just track the distance between inputs and the closest training data (e.g., Wasserstein or energy distance). To detect concept drift, it is also possible to periodically re-label some of the training data (e.g., with crowd workers) to detect whether labels change for the same data over time. If the learned model is interpretable (see the interpretability and explainability lecture), human experts may also identify where learned rules no longer capture changed concepts. One can find many discussions of drift detection, typically with statistical models, in academic machine learning papers.
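
For example, a simple data-drift check for a single numeric feature might compare its distribution in recent production inputs against the training data using scipy's Wasserstein distance; the file names and threshold below are arbitrary placeholders.

import numpy as np
from scipy.stats import wasserstein_distance

train_weights = np.load("train_weights.npy")    # hypothetical feature values from training data
recent_weights = np.load("recent_weights.npy")  # the same feature observed recently in production

distance = wasserstein_distance(train_weights, recent_weights)
if distance > 25.0:                              # arbitrary, problem-specific threshold
    print("possible data drift: input distribution differs from training data")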

Image source: Joel Thomas and Clemens Mewald. Productionizing Machine Learning: From Deployment to Drift Detection. Databricks Blog, 2019

The most common strategy to cope with drift is to periodically retrain the model on new or re-labeled training data. The point for retraining can potentially be determined by observing when the model’s prediction accuracy falls below a certain threshold.

If drift can be anticipated, it may be possible to prepare for it within the machine-learning pipeline. For example, if we can anticipate data drift we may be able to proactively gather training data for inputs that are less common now but expected to be more common in the future, such as training on heirloom varieties of different vegetables now even though we do not sell them yet. We may be able to simulate certain anticipated changes, such as degrading camera quality when taking pictures of produce at checkout. If we can anticipate concept drift, we may be able to encode more concepts as features for the model; for example, a feature whether the (anticipated) cucumber fad diet is currently popular or whether we are expecting strong variability in sales due to a winter storm warning. Ask your data scientist for details.

Excursion: Detecting Suspicious Data Encoding (“Data Linting”)

In software engineering, static analysis tools can automatically point out suspicious code patterns that are likely mistakes, without executing the code, as in the following examples:

// using = instead of == in an if condition is likely unintended
if (user.jobTitle = "manager") {
    ...
}

// dead code after the return statement:
function fn() {
    x = 1;
    return x;
    x = 3;
}

// some code paths (with detailedLogging on but anyLogging off)
// could result in a null pointer exception:
PrintWriter log = null;
if (anyLogging) log = new PrintWriter(...);
if (detailedLogging) log.println("Log started");
These kinds of tools are typically called code linters (usually simpler heuristic analysis and pattern matching) or static analysis tools (usually more sophisticated analyses reasoning about all possible executions). Some tools focus on specific classes of problems, such as security problems or null pointer exceptions. Many such tools are also marketed as a mechanism to quantify technical debt or code quality. The term code smell is also sometimes used to refer to code fragments that raise suspicion for defects or design quality issues. Many open source and commercial tools are widely available, such as ESLint, CheckStyle, SpotBugs, SonarQube, and Coverity.

A linter for data problems. An interesting direction to explore is whether such kinds of analyses can be extended to data quality problems and data encoding problems in machine learning contexts. Tools could inspect the data cleaning or feature engineering code to look for bad patterns, or could look directly at the shape of the data used.

The Data Linter project from Google does exactly that: It inspects datasets and looks for patterns that raise suspicion for common problems. Problems are detected with heuristics (developed based on experience from past problems). Not every reported problem is an actual problem, but the tool points out suspicious data shapes that may be worth exploring further.

The tool looks for many different kinds of problems, including

  • Data coding problems, e.g., numbers, dates, and times encoded as strings, enums encoded as reals, long strings that are not tokenized, zip codes encoded as numbers
  • Outlier and scaling problems, e.g., feature columns with widely varying value ranges that probably should be normalized, tailed distributions, uncommon signs
  • Packaging problems, e.g., duplicate rows and missing data

Many of these potential problems relate to how data is used by subsequent machine-learning steps, which, for example, work better with normalized feature values or have a hard time understanding times when encoded as strings.
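
The following sketch shows how a few such heuristics might look in plain pandas; this is only an illustration of the idea, not the Data Linter's actual implementation.

import pandas as pd

def lint_dataframe(df):
    warnings = []
    if df.duplicated().any():
        warnings.append("duplicate rows (packaging problem?)")
    for col in df.columns:
        values = df[col].dropna()
        if values.empty:
            warnings.append(f"{col}: column contains no data")
            continue
        # numbers stored as strings (data coding problem)
        if values.dtype == object and pd.to_numeric(values, errors="coerce").notna().all():
            warnings.append(f"{col}: numeric values encoded as strings")
        # very wide value ranges that may need normalization (scaling problem)
        if pd.api.types.is_numeric_dtype(values) and values.std() > 1000 * max(abs(values.mean()), 1):
            warnings.append(f"{col}: suspiciously wide value range, consider normalizing")
    return warnings

print(lint_dataframe(pd.read_csv("sales.csv")))   # hypothetical sales table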

While many assumptions could be codified in a data schema (e.g., types of values in columns, expected distributions for columns, uniqueness of rows), the key point of the data linter is to identify suspicious use of data where a schema either has not been defined or the schema is wrong or too weak for the intended use of the data.

Quality Assurance for the Data Processing Pipelines

Beyond the quality of the data itself, the code that acquires, transforms, and monitors data is infrastructure code that deserves the same quality assurance as any other part of the system, for example (a test sketch follows this list):

  • Write testable data acquisition and feature extraction code
  • Test all data transformation and feature extraction code (unit test, positive and negative tests)
  • Test retry mechanism for acquisition, test error reporting so that acquisition problems are noticed in production
  • Test correct detection and handling of invalid inputs
  • Catch and report errors in feature extraction
  • Test correct detection of data drift, for example, run “fire drills”
  • Test correct triggering of the monitoring system, again use “fire drills”
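
For example, a unit test for a hypothetical feature-encoding function could cover both regular and invalid inputs; a minimal sketch with pytest, where the module and function under test are assumptions:

import pytest
from feature_extraction import encode_price   # hypothetical module and function under test

def test_encode_price_valid():
    assert encode_price("4.99") == 4.99

def test_encode_price_invalid_input():
    # invalid inputs should be detected and reported, not silently encoded as 0
    with pytest.raises(ValueError):
        encode_price("four dollars")

def test_encode_price_missing_value():
    assert encode_price(None) is None          # missing values are handled explicitly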

Summary

In production machine learning systems, data typically comes from many sources and is often inaccurate, imprecise, inconsistent, incomplete, or has other quality problems. Hence, deliberately thinking about and assuring data quality in production machine-learning systems is important. A first step is usually understanding the data and possible problems; here, classic exploratory data analysis techniques are a good starting point.

To detect data quality problems and clean data, there are many different techniques. Most importantly, data schemas ensure format consistency and can encode additional rules to ensure invariants across columns and rows, including constraints on distributions and data dependencies. Many tools and libraries can help with defining, inferring, and enforcing schemas, with other data quality checks, and with corresponding repairs. We also expect to see more tools that point out suspicious code and data, such as the data linter.

Beyond the quality of the existing set of data, it is also important to anticipate changes. Different forms of drift (data drift, concept drift, and schema drift) can all reflect changes in the environment over time that result in degraded prediction accuracy of a model. Planning for drift, preparing to regularly retrain models, and monitoring for indicators of drift in production are important to confidently operate a machine-learning project in production.

Finally, of course, all the code used to assure and monitor data quality is itself infrastructure code that should be tested.

Additional resources

  • Similar efforts at Google, including infrastructure for schema inference: Polyzotis, Neoklis, Martin Zinkevich, Sudip Roy, Eric Breck, and Steven Whang. “Data validation for machine learning.” Proceedings of Machine Learning and Systems 1 (2019): 334–347.
  • Short tutorial notes providing a good overview of work in the database community: Polyzotis, Neoklis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. 2017. “Data Management Challenges in Production Machine Learning.” In Proceedings of the 2017 ACM International Conference on Management of Data, 1723–26. ACM.
  • The idea of a data linter and how it helped find various problems in machine learning pipelines: Nick Hynes, D. Sculley, Michael Terry. “The Data Linter: Lightweight Automated Sanity Checking for ML Data Sets.” NIPS Workshop on ML Systems (2017)
  • Using machine learning ideas to identify likely data quality problems and repair them: Theo Rekatsinas, Ihab Ilyas, and Chris Ré, “HoloClean — Weakly Supervised Data Repairing.” Blog, 2017.
  • Overview of different kinds of data cleaning problems: Rahm, Erhard, and Hong Hai Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 23.4 (2000): 3–13.
  • Book with detailed discussion of many different data cleaning approaches: Ilyas, Ihab F., and Xu Chu. Data cleaning. Morgan & Claypool, 2019.
  • Formal statistical characterization of different notions of drift: Moreno-Torres, Jose G., Troy Raeder, Rocío Alaiz-Rodríguez, Nitesh V. Chawla, and Francisco Herrera. “A unifying view on dataset shift in classification.” Pattern recognition 45, no. 1 (2012): 521–530.
  • Discussion of data quality requirements for machine learning projects: Vogelsang, Andreas, and Markus Borg. “Requirements Engineering for Machine Learning: Perspectives from Data Scientists.” In Proc. of the 6th International Workshop on Artificial Intelligence for Requirements Engineering (AIRE), 2019.
  • Empirical study of deep learning problems highlighting the important role of data quality: Humbatova, Nargiz, Gunel Jahangirova, Gabriele Bavota, Vincenzo Riccio, Andrea Stocco, and Paolo Tonella. “Taxonomy of real faults in deep learning systems.” In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pp. 1110–1121. 2020.

As all chapters, this text is released under Creative Commons 4.0 BY-SA license.
