Automating the ML Pipeline
This chapter discusses automating some or all steps of the machine-learning pipeline in production projects, making it easier to update models and to experiment. For other chapters see the table of contents.
After discussing how and where to deploy a model, we now focus our attention on the infrastructure for training that model in the first place. This infrastructure is typically described as a pipeline consisting of multiple stages, such as data cleaning, feature engineering, and model training. A set-and-forget approach, where a model is trained once and never updated, is usually considered an antipattern. Investing in the automation of the machine-learning pipeline eases model updates and facilitates experimentation. Poorly written machine-learning pipelines are commonly derided with the pipeline jungle or big-ass script architecture antipatterns and criticized for poor code quality and dead experimental code paths. While the training process appears on the surface to be a fairly local and modular activity of data scientists, it interacts with many other components of an ML-enabled system, as we will show. This makes careful planning of the machine-learning pipeline in the context of the entire system important.
Supporting Evolution and Experimentation by Designing for Change
A key design principle in software engineering is to anticipate change and design software such that it is easy to change. The key idea is to encapsulate the parts of the system that are likely to change within modules and to design module interfaces such that anticipated changes do not affect the interface; changes can then be performed locally within a module, without affecting other modules of the system.
In ML-enabled systems we can anticipate that the model will change regularly and should design the system accordingly. We may want to update the model as data scientists continue to innovate, for example, with better feature engineering or better selection of hyperparameters. We may want to update the model as new versions of the used machine-learning library are released. We may want to update the model once we acquire new training data, possibly from production data. Especially when data distributions or labels change over time (drift, discussed in more detail in chapter Data Quality) we expect to regularly retrain the model with new data in a form of continuous learning or closed-loop intelligence. Finally, data science tends to follow a highly exploratory process where we might simply want to experiment with lots of different models to see how they are doing, either offline or online in production. Automation of the pipeline can facilitate easy change and experimentation by avoiding error-prone manual steps and ensuring robustness of the infrastructure.
Traditional software engineering strategies for design for change rely on modularity and strong abstractions at module interfaces and are a good match also for ML-enabled systems:
- The output of machine-learning pipelines, the machine-learned models, make perfect modules that encapsulate the model internals behind a simple and stable interface. The expected input and output formats of machine-learned models rarely change with updates or experiments. For example, a sentiment-analysis model will always expect a text fragment and return a numeric score. Typical deployments as microservices (see previous chapter) make it particularly easy to replace a model with an updated or alternative version in production without changing any other part of the deployed system. Stable automated machine-learning pipelines make it easy to update those models regularly.
- The typical design of machine-learning pipelines (whether automated or not) as sequential stages where each stage receives inputs from the previous stage easily maps to modular implementations with one or more modules per stage. This style of passing data between subsequent modules corresponds to the traditional pipe-and-filter architectural style in software architecture. Different data scientists can experiment with changes in different stages largely independently without requiring rework in other stages; for example, data cleaning can be improved without changing how data is exchanged with later stages of feature engineering or model training.
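To make this concrete, the following minimal sketch shows pipeline stages implemented as separate functions with narrow interfaces, passing data from one stage to the next in pipe-and-filter style. All function names, file names, and column names are made up for illustration.

```python
# Minimal sketch of a pipe-and-filter style pipeline: each stage is a
# function with a narrow input/output interface, so stages can be
# changed or replaced independently. Names are illustrative only.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def clean_data(raw: pd.DataFrame) -> pd.DataFrame:
    # drop rows with missing labels; further cleaning would go here
    return raw.dropna(subset=["label"])


def encode_features(cleaned: pd.DataFrame) -> pd.DataFrame:
    # one-hot encode a categorical column; returns data ready for training
    return pd.get_dummies(cleaned, columns=["category"])


def train_model(training_data: pd.DataFrame) -> DecisionTreeClassifier:
    features = training_data.drop(columns=["label"])
    labels = training_data["label"]
    return DecisionTreeClassifier().fit(features, labels)


# the pipeline simply chains the stages; intermediate results could also
# be written to files or a database between stages
model = train_model(encode_features(clean_data(pd.read_csv("snapshot.csv"))))
```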
Yet in practice, while the default machine-learning setup is already amenable to modular implementations supporting change and experimentation, the devil is in the details. Pipeline stages may be separated by stable interfaces, but still interact; it is quite common that experimentation affects multiple stages, for example, combining changes to data cleaning with different feature encoding and different hyperparameters during model training. More importantly, decisions in the machine-learning pipeline interact with decisions in other (non-ML) parts of the system, such as how the data is collected or what infrastructure is available to scale training. None of the pipeline stages can be truly considered entirely in isolation without considering other stages and other parts of the larger system, as we will explore.
Pipeline Thinking
Data scientists tend to work in a very exploratory and iterative mode (see also chapter Data Science and Software Engineering Process Models), often initially on local machines with static snapshots of the data. Computational notebooks are popular tools to support this form of exploratory data analysis and programming, allowing data scientists to easily organize computations in cells, visualize data, and incrementally recompute cells without rerunning the entire pipeline. This is a well-supported environment for exploring data and experimenting on the way to producing the first model that can be deployed.
After the first model, most organizations will want to improve and update their models regularly. Many experience reports show that organizations often find it difficult to transition from exploratory data science code to robust pipelines that can be re-executed, that can ingest new data without modification, that integrate well with other parts of the system, and that may even run fully automated in production. As mentioned in chapter From Models to Systems, this transition requires adopting an engineering stance (potentially bringing in other team members with different skills), considering interactions with many non-ML parts of the system, and often adopting substantial additional infrastructure, which has a nontrivial learning curve and may not pay off immediately (but promises long-term benefits, see also the chapter on Technical Debt).
While modularity within the pipeline allows a lot of flexibility for experimentation and change, most pipeline stages interact with other parts of the system and should be considered as part of a larger system-wide design process. This includes infrastructure shared with non-ML parts of the system (e.g., data processing, distributed computing, scheduling, monitoring) but also the specific design of how data arrives into the pipeline and how the produced model is deployed and observed in production.
When developing robust pipelines, it is worth examining every stage of the machine-learning pipeline to consider how it relates to other parts of the system, what automation is possible, and how to implement the code to be robust to noise in the data and to various error conditions, such as retrying when a data download is interrupted. Documenting and testing the infrastructure code and the interfaces between pipeline components and other components in the system becomes important to avoid silent mistakes, where nobody may notice for weeks that model updates do not happen because a pipeline stage failed silently (e.g., the model became too big for the memory allocated to its container).
Stages of Machine-Learning Pipelines
Each stage of the pipeline has a different focus, different common tools and infrastructure, and different interfaces to other components in the system. In the following, we survey typical considerations for each stage and approaches to automating it.
Data Collection
In production systems, data is rarely static. Often we can acquire more data over time to learn better models. If drift occurs, newer data better represents what should be learned. Observing users of the running system often produces valuable data that can be used to improve the system’s models, known as the machine-learning flywheel (see chapter Introduction). The idea of continuously learning from how users react to the system is also described as the closed-loop intelligence pattern. Therefore, most systems should consider data as dynamic, should ensure that pipelines work on an evolving dataset, and should consider data collection as part of the system.
In many systems, there is lots of potential for automation in the data collection stage, moving from a working snapshot to a continuously evolving dataset. The dataset would typically be stored in a database and fed into the pipeline through a query. The actual data collection code could be considered part of the machine-learning pipeline or a separate non-ML component in the system, but explicitly designing data collection is important for building an ML-enabled system that can evolve with new data over time.
When data is entered manually, whether by specialized individuals, by paid crowd workers, or crowdsourced to the public, it is worth considering the data entry system, the incentives and process integration for data entry, and data quality control as part of the system design. For example, if nurses in a hospital setting are supposed to conduct additional blood tests and enter the test results into the patient’s medical record for later training, we need to design the medical records system with extra fields for this data entry, but also think about how to remind or require nurses to enter this data without creating a heavy additional burden; for example, we might consider automatically scanning test results or integrating with the system analyzing the blood sample to lower data entry effort. In a crowdsourcing setting, often some gamification mechanism (e.g., a leaderboard) is used to incentivize participation. For example, we might ask the general population to record pictures of bird nests in a dedicated app as part of a citizen science initiative and incentivize participants with badges, achievements, and rewards. It is also common to simply pay professionals and crowd workers for creating data, for example, to take pictures of certain objects in their household when training object detection models.
If data is collected from external APIs, such as weather data or employment statistics, we may plan for a robust system to continuously scrape the data. Monitoring the system to ensure that scraping actually succeeds and looking for discontinuities or anomalies in the data helps to identify problems quickly, before silent mistakes affect later stages of the machine-learning pipeline. This way, we can detect if the API is down, our scraping system is down, or the API has evolved and now publishes rain data using a different measurement unit or a different field in the exported JSON data.
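For example, a data-collection step for such an API might be hardened roughly as follows; the endpoint URL, JSON fields, and plausibility ranges are made up for illustration, but the general pattern is retrying transient failures and sanity-checking the received data before it enters the pipeline.

```python
# Sketch of a hardened data-collection step: retry transient failures and
# run basic sanity checks before the data enters the pipeline.
# The endpoint URL and JSON fields are hypothetical.
import time
import requests


def fetch_weather(url: str = "https://example.com/api/weather",
                  max_retries: int = 3) -> list[dict]:
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            records = response.json()["observations"]
            # basic sanity checks to catch silent API changes early
            if not records:
                raise ValueError("API returned no observations")
            for r in records:
                if not -80 <= r["temperature_celsius"] <= 60:
                    raise ValueError(f"implausible temperature: {r}")
            return records
        except (requests.RequestException, KeyError, ValueError):
            if attempt == max_retries - 1:
                raise  # surface the failure loudly, e.g., to monitoring
            time.sleep(2 ** attempt)  # exponential backoff before retrying
```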
In many cases, data is collected as telemetry from a running system, observing how users interact with the system and possibly how they respond to predictions (see chapter Quality Assurance in Production for a longer discussion of telemetry design). For example, a search engine might log what users search for and which of the search results (and ads) they click on, or even where they usually travel. Telemetry data is often collected in the form of raw log files, often in vast quantities. Purposeful sampling and substantial data transformation are often needed to convert raw telemetry data into useful training data.
In summary, data collection can take many forms, and most systems collect data continuously. The way data is collected differs widely from system to system, but it always involves designing non-ML parts of the system to bring data into the pipeline. System designers who anticipate the need for continuous data can build the system to support data collection, to integrate humans into the process as needed, and to automate and monitor the process as much as possible.
Data Labeling
If labels are needed for training, but are not directly provided as part of the data collection process, often human annotators provide labels. For example, humans may label sentences as toxic or not when building a toxicity detector for an automated moderation tool in social media. If new data is collected continuously, labeling cannot be a one-time ad-hoc process, but needs to be considered as a continuous process too. Hence the pipeline should be designed to support continuous high-quality labeling of new data.
Manual labeling tends to be laborious and expensive. Depending on the task, labeling needs to be done by experts (e.g., detecting cancer in radiology images) or can be crowdsourced more widely (e.g., detecting stop signs in photos). Most organizations will pay for labeling, either by hiring labelers in their company or by paying for individual labels on crowd-working platforms. In some settings, it may be possible to encourage people to label data without payment, for example, in the context of citizen science, gamification, or as part of performing another task; a classic example is how ReCAPTCHA asked humans to transcribe difficult-to-read text from images as part of logging into a web page.
To support manual labeling, it is common to provide a user interface for labelers that (1) provides labeling instructions, (2) shows a data point, and (3) lets the labeler enter the label. Depending on the task, labelers may select a value from a list of options, enter text, draw bounding boxes on images, or perform many other actions in this interface. Machine-learned models can be used to propose labels that the human labelers then confirm or refine. To assure the quality of labels, a system will often ask multiple labelers to label a data point and assign the label only when labelers agree with sufficient confidence. It may also use statistics to identify the reliability of individual labelers. If labelers often disagree, this could be a sign that the task is hard or that the provided instructions are unclear. For all these tasks, commercial and open-source infrastructure exists upon which a labeling interface and process can be built, such as Label Studio.
Programmatic labeling is an emerging paradigm in which labeling is largely automated in the context of weakly supervised learning. In a nutshell, weak supervision is the idea of merging partial and often low-confidence labels about data from multiple sources to automatically assign labels to large amounts of data. Sources can include manual labels and labels produced by other models, but most commonly sources are labeling functions, written as code based on domain expertise, that provide partial labels for some of the data. For example, when labeling text for a sentiment analysis model, a domain expert might write several small labeling functions indicating that text with certain keywords is most likely negative. Each source is expected to make mistakes, but a weakly-supervised learning system, such as Snorkel, will statistically identify which sources are more reliable and merge labels from the different sources accordingly. Overall, the produced labels may be of lower quality than those provided by experts, but also much cheaper. Once the system is set up and the labeling functions are written, labels can be produced for very large amounts of data at marginal cost. Experience has shown that, for many tasks, more data with lower-quality labels is better than less data with high-quality labels. Weak supervision shifts effort from a repetitive manual labeling process to an investment in domain experts and engineers writing code for automated labeling at scale.
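To illustrate the mechanism, the following sketch shows Snorkel-style labeling functions on a tiny toy dataset, loosely following the style of Snorkel’s own tutorials; the keyword lists are deliberately simplistic and the data is made up.

```python
# Sketch of programmatic labeling with Snorkel-style labeling functions.
# Each function votes NEGATIVE, POSITIVE, or abstains; a label model then
# merges the noisy votes. Keyword lists and data are toy examples.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


@labeling_function()
def lf_negative_words(x):
    return NEGATIVE if any(w in x.text.lower() for w in ("awful", "terrible")) else ABSTAIN


@labeling_function()
def lf_positive_words(x):
    return POSITIVE if any(w in x.text.lower() for w in ("great", "love")) else ABSTAIN


df = pd.DataFrame({"text": ["I love this product", "awful experience", "okay I guess"]})
votes = PandasLFApplier(lfs=[lf_negative_words, lf_positive_words]).apply(df)

# the label model estimates how reliable each labeling function is and
# produces labels for the (potentially very large) dataset
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=votes)
df["label"] = label_model.predict(L=votes)
```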
Again notice the potential for automation and the need to integrate with non-ML parts of the system, such as a user interface for labeling data. In a system that is continuously updated and is regularly re-trained with new data, having a robust labeling process is essential. Building and monitoring a robust infrastructure will provide confidence as the system evolves.
Data Cleaning and Feature Engineering
The data cleaning and feature engineering stages (sometimes collectively described as data wrangling) typically involve various transformations of the data, such as removing outliers, filling missing values, and normalizing values. Many transformations follow common patterns routinely performed with standard libraries, such as normalizing continuous variables between 0 and 1, one-hot encoding categorical variables, or using word embeddings. Other transformations may be complex and domain-specific, such as extracting a person’s average nighttime heart rate from sensor data or computing similarity with other training data. In addition, features are often created based on other machine-learned models, for example, one predicting a person’s parental status from their shopping history. A lot of the creative work of data scientists is invested in these stages, often requiring substantial domain knowledge. Data scientists often need to work closely with domain experts, who may have very little machine-learning knowledge.
In early exploratory stages of a project, these transformations are often performed with short code snippets or even by manually editing data. Transformation code in exploratory stages needs to work for the data considered, but is rarely tested for robustness to potential future data. For example, if all current training data has well-formatted time entries, the code is likely not tested for whether it would crash or silently produce wrong outputs if a row contained a time entry in an unexpected format (e.g., with timezone information). Researchers have also found that such transformation code often contains subtle bugs even for transformations covered by the initial dataset, which are hard to notice because machine-learning code often does not crash as a result of incorrect transformations but simply learns less accurate models. When the pipeline is to be automated, it is worth seriously considering testing, error handling, and monitoring of transformation steps (see chapter Infrastructure Quality).
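The following sketch contrasts a typical exploratory transformation with a more robust variant for a hypothetical timestamp column: the robust version tolerates unexpected formats but logs how often parsing fails, so that monitoring can catch a sudden increase rather than letting the problem pass silently.

```python
# Sketch contrasting exploratory and hardened transformation code for a
# hypothetical "timestamp" column; the column name is made up.
import logging
import pandas as pd


# exploratory version: works for the current snapshot, but crashes or
# misbehaves when rows arrive in an unexpected time format
def parse_times_naive(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(
        timestamp=pd.to_datetime(df["timestamp"], format="%Y-%m-%d %H:%M"))


# hardened version: tolerates unexpected formats, but logs and counts them
# so that monitoring can flag a sudden increase in unparseable rows
def parse_times_robust(df: pd.DataFrame) -> pd.DataFrame:
    parsed = pd.to_datetime(df["timestamp"], errors="coerce", utc=True)
    failures = parsed.isna() & df["timestamp"].notna()
    if failures.any():
        logging.warning("could not parse %d of %d timestamps",
                        failures.sum(), len(df))
    return df.assign(timestamp=parsed)
```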
Some projects need to process very large amounts of data and transform data in nontrivial and sometimes computationally expensive ways. In such settings, infrastructure for distributed data storage and distributed batch or stream processing, as we will discuss in chapter Scaling the System, will become essential. Data engineers can provide relevant expertise for all non-trivial data management and transformation efforts.
In an organization it is also not uncommon that multiple models are based (partially) on the same datasets. For example, a hotel booking service may train multiple models for predicting different aspects about their customers (e.g., preferred times, preferred locations, preferred room size, preferred price range) based on their metadata and booking history. Sharing and reusing transformation code across machine-learning pipelines can help multiple teams to avoid redundant work and benefit from the same improvements. Moreover, if features are expensive to compute but reused across multiple predictions or multiple models, it is worth caching them.
As discussed in the previous chapter on Deploying a Model, the machine-learning pipeline that trains the model and the model inference service that uses the model need to perform the same transformations for feature encoding. Hence sharing the same code during training and inference avoids inconsistencies. Feature stores, discussed in the same chapter, provide dedicated infrastructure for sharing data transformation code, facilitating reuse, discovery, and caching.
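For example, a single feature-encoding function can live in a shared module that both the training pipeline and the model inference service import, so that both apply exactly the same transformation; the module and feature names below are illustrative.

```python
# features.py -- shared feature-encoding code, imported by both the
# training pipeline and the model inference service so that both apply
# exactly the same transformation. Feature names are illustrative.
import pandas as pd

ROOM_SIZES = ["single", "double", "suite"]  # fixed category order


def encode_booking_features(df: pd.DataFrame) -> pd.DataFrame:
    features = pd.DataFrame()
    # number of nights derived from two datetime columns
    features["nights"] = df["checkout"].sub(df["checkin"]).dt.days
    # one-hot encode with a fixed set of categories so that the columns
    # are identical at training and inference time
    for size in ROOM_SIZES:
        features[f"room_{size}"] = (df["room_size"] == size).astype(int)
    return features
```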
In summary, this stage also interacts with other parts of the system, be it other machine-learned models, the model inference code, data processing infrastructure, or feature stores. Code for data cleaning and feature engineering, often written in early exploratory stages, usually benefits from being rewritten, documented, tested, and monitored when used in an automated pipeline.
Model Training
Code for model training may be as simple as two lines of code to train a decision tree with scikit-learn or a dozen lines of code to configure a deep-learning job with Keras. This code is naturally modular (receiving training data, producing a model) and can be scheduled as part of a pipeline. It may be frequently modified during experimentation to identify the right learning algorithm and hyperparameters.
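As an illustration, a training stage might look roughly like the following sketch, with hyperparameters exposed as function parameters so that experiments can vary them easily; the architecture and the random placeholder data are not meant to be realistic.

```python
# Minimal sketch of a configurable Keras training stage: it receives
# training data and hyperparameters and returns a model. Architecture
# and data are placeholders for illustration only.
import numpy as np
from tensorflow import keras


def train(train_x: np.ndarray, train_y: np.ndarray,
          hidden_units: int = 64, learning_rate: float = 0.001,
          epochs: int = 10) -> keras.Model:
    model = keras.Sequential([
        keras.layers.Dense(hidden_units, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate),
                  loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(train_x, train_y, epochs=epochs, verbose=0)
    return model


# example invocation with random data, just to show the stage's interface
model = train(np.random.rand(100, 10), np.random.randint(0, 2, 100))
```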
As with other stages, model training is influenced by other parts of the system. Decisions are often shaped by system requirements and by requirements from other components, including requirements regarding prediction accuracy, training latency, inference latency, or explainability, which we discussed in chapter Quality Attributes. Experimentation in model training, including trying variations of machine-learning algorithms and hyperparameters, is often explored in concert with experimentation in feature engineering.
For machine-learning algorithms that require substantial resources during training, training jobs need to be integrated into the processing infrastructure of the organization. Distributed training and training on specialized hardware are common in practice (briefly discussed in chapter Scaling the System) and introduce new failure scenarios that need to be anticipated. To achieve reliable pipelines in such contexts that still enable updates and experimentation with reasonably frequent iteration cycles, investment in nontrivial infrastructure is necessary, including distributed computing, failure recovery and debugging, and monitoring. For example, in the paper “Applied machine learning at Facebook: A datacenter infrastructure perspective,” Facebook reports on the substantial data center infrastructure it provisioned and on how it balances qualities such as cost, training latency, model-update frequency needs from a business perspective, and the ability to recover from outages. To a large degree, this kind of infrastructure has been commoditized and is supported through open-source projects and available as cloud service offerings.
Model Evaluation
We will discuss the evaluation of machine-learned models in detail in later chapters. Evaluations of prediction accuracy are typically fully automated based on some withheld labeled data. However, model evaluation may also include many additional steps, including (1) evaluation on multiple specifically curated datasets for certain populations or capabilities (see chapter Slicing, Capabilities, Invariants, and other Testing Strategies), (2) integration tests checking the interaction of the model with the rest of the system (see chapter Integration and System Testing), and (3) testing in production (A/B testing, see chapter Quality Assurance in Production).
Again, automation is key. Evaluation activities may need to be integrated with other parts of the system, including data collection activities and production systems. It can be worth considering working with an external quality assurance team.
Model Deployment
As discussed in the previous chapter Deploying a Model, many design decisions go into where and how to deploy a model. The deployment process can be fully automated, such that the pipeline automatically uploads the model into production once training is complete and the model has passed the model evaluation stage. To avoid degrading system quality with poor models, fully automated systems usually ensure that models can be rolled back to a previous revision after release, and they often rely on a canary release infrastructure, where the model is initially released only to a subset of users while monitoring whether degradations in model or system quality can be observed for those users (we will discuss this further in chapter Quality Assurance in Production).
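Conceptually, such an automated deployment step is a gate at the end of the pipeline, sketched below with stub functions standing in for real evaluation, model registry, and canary-release infrastructure; all names and the threshold are hypothetical.

```python
# Sketch of an automated deployment gate at the end of a pipeline. The
# helper functions are stubs standing in for real evaluation, model
# registry, and canary-release infrastructure; all names are hypothetical.
import logging

ACCURACY_THRESHOLD = 0.85


def evaluate(model, eval_data) -> float:
    return model.score(*eval_data)  # e.g., scikit-learn's accuracy score


def upload_to_registry(model) -> str:
    logging.info("uploading model to registry")
    return "model-v42"  # placeholder version identifier


def start_canary_release(version: str, traffic: float) -> None:
    logging.info("releasing %s to %.0f%% of users", version, traffic * 100)


def deploy_if_good(model, eval_data) -> bool:
    accuracy = evaluate(model, eval_data)
    if accuracy < ACCURACY_THRESHOLD:
        # reject the update; the previously deployed model stays in place
        logging.warning("model rejected: accuracy %.3f", accuracy)
        return False
    version = upload_to_registry(model)          # old versions remain available for rollback
    start_canary_release(version, traffic=0.05)  # observe 5% of users first
    return True
```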
Investing in a robust and automated infrastructure for deployment can enable fully automated regular model updates and provides opportunities for data scientists to experiment with model versions more rapidly.
Model Monitoring
Monitoring is not part of the process of producing a model, but it is often depicted as a final step in machine-learning pipelines. As discussed previously in the introductory chapter Architectural Components in ML-Enabled Systems, model monitoring can technically be considered a separate component from the pipeline component that produces the model. The monitoring component observes the model in production and may trigger an update by the pipeline component if it notices problems in production.
Investing in system and model monitoring helps to understand how the system is doing in production and generally facilitates experimentation and operations; see chapters Planning for Operations and Quality Assurance in Production.
Automation and Infrastructure Design
As discussed, each stage of the machine-learning pipeline can be carefully planned and engineered, and most stages will interact with other components of the system. Typically, all stages together can be automated to run end to end, triggered either manually, at regular intervals by a scheduler, or when a monitoring system reaches some threshold.
Modularity, Dependencies, and Interfaces
In the literature, a pipeline consisting of one or multiple ad-hoc scripts is referred to as a big-ass script architecture antipattern. Except for the most trivial pipelines, dividing the pipeline into individual modules with clear inputs and outputs helps to structure, test, and evolve the system. In many cases, each stage can be represented by one or multiple modules.
Such decomposition helps to separate concerns, where different team members can work on different parts of the system, but still understand how the different parts interface with each other. Smaller components with clear inputs and outputs are also easier to test than code snippets deep within a larger block of code.
Documenting interfaces and dependencies between different modules, as well as the qualities of each component, is important but often neglected in practice. Most modules of machine-learning pipelines have relatively simple interfaces, for example, receiving tabular data and producing a model. While data formats at interfaces between components are easy to describe with traditional data types and schemas, describing distributions of data or assumptions about data quality is more challenging. For example, what assumptions can the feature engineering code make about how the data was collected and subsequently cleaned? Does the data collection process sample fairly from different distributions? What confidence is there in the labels, and has it changed recently?
As with documenting models, discussed in the previous chapter, practices for documenting datasets and data quality more broadly are not well developed and are still emerging. Several recent proposals suggest means to document datasets, such as Datasheets for Datasets and Dataset Nutrition Labels, and methods to test and monitor data quality are emerging (see chapter Data Quality). In robust pipelines, proactive monitoring to notice shifts in the distribution of data exchanged between pipeline components can be an effective means to detect problems early.
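A simple form of such monitoring compares the distribution of a feature in the current batch of data against a reference sample, for example with a two-sample Kolmogorov-Smirnov test, as in the following sketch; the feature values and alert threshold are placeholders.

```python
# Sketch of a simple distribution-shift check at a pipeline interface:
# compare a feature in the incoming batch against a reference sample.
# The alert threshold and the example data are placeholders.
import logging
import numpy as np
from scipy.stats import ks_2samp


def check_feature_drift(reference: np.ndarray, current: np.ndarray,
                        p_threshold: float = 0.01) -> bool:
    statistic, p_value = ks_2samp(reference, current)
    if p_value < p_threshold:
        # distributions differ more than expected; alert rather than fail
        logging.warning("possible drift detected (KS=%.3f, p=%.4f)",
                        statistic, p_value)
        return True
    return False


# example: current data shifted relative to the reference sample
drifted = check_feature_drift(np.random.normal(0, 1, 1000),
                              np.random.normal(0.5, 1, 1000))
```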
In addition to components emerging directly from the machine-learning pipeline, it is equally important to document interfaces and expectations on other components in the system, such as the data logging in some user interface component relied upon by the data collection stage, the crowd-sourcing platform used for labeling data, or the cloud infrastructure into which the trained model is uploaded for deployment. Most of these can again be described using traditional interface documentation. Documentation at these interfaces is particularly important when different teams are responsible for the components at either side of the interface, such as when the web frontend team needs to work with the data engineering team to collect the right telemetry data for future training.
Code Quality and Observability
With a move to engineering a robust machine-learning pipeline and automation, code quality becomes much more important than for initial exploration and model prototyping. As the pipeline will be executed regularly in production with limited human oversight, monitoring becomes more important to notice problems early.
Code of pipeline components should undergo the same quality assurance steps as all other software components in a system. This may include unit testing, integration testing, static analysis, and code reviews. For example, the implementation of each stage can be tested in isolation, making sure that it is robust to missing or ill-formatted data and does not just silently fail if problems occur (we will discuss how to test for robustness in chapter Infrastructure Quality). Code reviews can help adopt consistent coding conventions across teams, catch subtle bugs, and foster collective code ownership, where multiple team members understand the code. Static analysis tools can catch common kinds of bugs and enforce conventions such as code style.
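For example, the robustness of a small cleaning step can be probed with ordinary unit tests that feed it ill-formatted and missing values; the transformation below is a made-up example, and the tests would be run with a standard test runner such as pytest.

```python
# Minimal unit-test sketch: check that a cleaning step handles
# ill-formatted and missing values instead of failing silently.
# clean_ages is a small example transformation defined here for illustration.
import pandas as pd


def clean_ages(df: pd.DataFrame) -> pd.DataFrame:
    ages = pd.to_numeric(df["age"], errors="coerce")
    # implausible or unparseable ages become missing values
    return df.assign(age=ages.where((ages >= 0) & (ages <= 120)))


def test_clean_ages_handles_bad_values():
    df = pd.DataFrame({"age": ["34", "unknown", "-3", "250", None]})
    cleaned = clean_ages(df)
    assert cleaned["age"].iloc[0] == 34
    assert cleaned["age"].iloc[1:].isna().all()


def test_clean_ages_does_not_drop_rows():
    df = pd.DataFrame({"age": ["34", "garbage"]})
    assert len(clean_ages(df)) == 2
```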
As code frequently varies with experimentation, it is worth adopting suitable strategies to track experimental code changes, such as developing in branches or using feature flags (rather than simply copy-pasting cells in a notebook). It is usually a good idea to explicitly track experiments and to merge or remove the code once the experiment concludes.
To foster reproducibility, explicitly declaring and managing library dependencies can ease installation and avoid compatibility issues. Containerizing components is increasingly common to ease deployment for operators, as we will discuss in chapter Planning for Operations.
Finally, monitoring should be extended to the machine-learning pipeline and its components, allowing data scientists and operators to observe how the model is trained (e.g., duration and success of each step, current use of cloud resources) and to detect failures and anomalies.
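Even a lightweight mechanism helps, for example, a decorator that records the duration and outcome of each pipeline step, as sketched below; in practice the measurements would be forwarded to a metrics or monitoring system rather than just logged.

```python
# Sketch of lightweight pipeline-step monitoring: a decorator that logs
# duration and success of each step. In practice the measurements would
# be sent to a metrics system rather than only written to a log.
import functools
import logging
import time


def monitored_step(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            result = func(*args, **kwargs)
            logging.info("step %s succeeded in %.1fs",
                         func.__name__, time.time() - start)
            return result
        except Exception:
            logging.exception("step %s failed after %.1fs",
                              func.__name__, time.time() - start)
            raise
    return wrapper


@monitored_step
def train_model(training_data):
    ...  # actual training code would go here
```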
Workflow Orchestration
Finally, for full automation, the various components of the pipeline need to be executed within the system. Execution can be triggered by various events, such as a scheduler for regular retraining, a monitor noticing a decline in model quality, a new commit changing the pipeline code, or a developer manually triggering a run.
Executions of pipelines can be nontrivial, with multiple stages executed in sequence or in parallel, often storing large intermediate results at the interface between components in files or databases, and sometimes executing different components on different hardware in large distributed systems. When stages fail, there are several error handling strategies, such as simply retrying the execution or informing a developer or operator.
The antipatterns pipeline jungles, undeclared consumers, and big-ass script architecture all describe situations where the flow of data between components is not apparent. A developer may need to execute a number of scripts in a specific order and make sure that the right data is written into the right files. Some scripts may depend on specific files that nobody remembers how to recreate. Undocumented flows can lead to problematic and wasteful situations, for example, when the feature engineering stage relies on user data in a specific file, but that file is updated by another script, so that running these two steps out of order encodes features on outdated data. Without documentation, it may be difficult to reconstruct which steps need to be run in what order to produce certain files. This shows how infrastructure for describing flows and automating the execution of entire pipelines is a valuable asset.
It is common to adopt dedicated infrastructure to orchestrate more complex pipeline implementations. In such infrastructure, configuration or code typically describes how multiple components are connected and in which order their computations are performed. The infrastructure can then identify which components need to be executed and schedule the execution on the various available machines. Many open-source and commercial infrastructure tools are available in the MLOps community, such as Apache Airflow, Netflix’s Metaflow, TensorFlow Extended (TFX), and Kubeflow Pipelines. Commercial cloud-based machine-learning platforms such as AWS SageMaker or Databricks tend to provide substantial infrastructure for all parts of the pipeline and their coordinated execution.
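For example, a pipeline could be declared as an Airflow DAG roughly as follows (assuming a recent Airflow 2.x release; the stage functions are placeholders); the orchestrator then takes care of scheduling, retries, and distributing the execution.

```python
# Sketch of pipeline orchestration as an Apache Airflow DAG: the stages
# are declared as tasks with explicit dependencies, and Airflow handles
# scheduling, retries, and distribution. Stage functions are placeholders.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator


def collect_data(): ...
def clean_data(): ...
def train_model(): ...
def evaluate_model(): ...


with DAG(
    dag_id="model_update_pipeline",
    schedule_interval="@weekly",          # retrain regularly
    start_date=datetime(2022, 1, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    collect = PythonOperator(task_id="collect_data", python_callable=collect_data)
    clean = PythonOperator(task_id="clean_data", python_callable=clean_data)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    evaluate = PythonOperator(task_id="evaluate_model", python_callable=evaluate_model)

    collect >> clean >> train >> evaluate  # execution order of the stages
```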
In addition to automating and scheduling the various pipeline stages, a common goal is to ensure that the entire pipeline is versioned and reproducible, including source data, derived data, and resulting model. This also helps to keep different model versions apart and perform computations incrementally only as needed, while data scientists explore alternatives and experiment in production. We discuss this separately in chapter Versioning, Provenance, and Reproducibility, in the context of responsible machine learning practices.
Design for Experimentation
While building automated, robust pipelines requires a substantial engineering investment, it can pay off for data scientists in terms of an improved ability to experiment. Workflow orchestration systems can not only support releasing models into production, but also schedule multiple offline experiments in which different versions of a model are trained and evaluated. Automated and robust pipelines remove manual steps and much of the error handling and debugging that otherwise slow down experiments. Investments in telemetry, system monitoring, and A/B testing infrastructure allow data scientists to evaluate models in production (as we will discuss in chapter Quality Assurance in Production), providing fast feedback from real production data when exploring ideas for model improvements.
Summary
In this chapter, we provided an overview of the design and infrastructure considerations of different stages of a typical machine-learning pipeline and how they interact with other ML and non-ML components of a typical ML-enabled system. When transitioning from an initial prototype to a production system, building automated, robust machine-learning pipelines is often essential to allow flexibility for updates and experimentation. The necessary infrastructure may require a substantial upfront engineering investment, but promises long-term payoffs through faster experimentation and fewer problems in operations.
We are not aware of many codified design patterns in this space beyond the closed-loop intelligence pattern, the feature store pattern, and occasional mentions of workflow pipelines for orchestration and model versioning as a pattern. However, poorly engineered machine-learning pipelines are often characterized by antipatterns, such as pipeline jungle, dead experimental paths, big-ass script architecture, undeclared consumers, and set and forget.
Further Reading
- Short experience report describing the challenges of adopting a pipeline-oriented engineering mindset in many organizations: O’Leary, Katie, and Makoto Uchida. “Common problems with Creating Machine Learning Pipelines from Existing Code.” Proc. Third Conference on Machine Learning and Systems (MLSys) (2020).
- Extensive discussions of different deployment strategies and telemetry strategies: Hulten, Geoff. “Building Intelligent Systems: A Guide to Machine Learning Engineering.” Apress, 2018
- Interview study on data quality challenges in ML-enabled systems, including design discussions related to planning for data collection: Sambasivan, Nithya, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M. Aroyo. “‘Everyone wants to do the model work, not the data work’: Data Cascades in High-Stakes AI.” In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–15. 2021.
- Study of public data science code in notebooks, finding many subtle bugs in data transformation code: Yang, Chenyang, Shurui Zhou, Jin L.C. Guo, and Christian Kästner. “Subtle Bugs Everywhere: Generating Documentation for Data Wrangling Code.” In Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2021.
- Short paper motivating the use of automated code quality tools for computational notebooks after observing common style issues in public data science code: Wang, Jiawei, Li Li, and Andreas Zeller. “Better code, better sharing: on the need of analyzing Jupyter notebooks.” In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: New Ideas and Emerging Results, pp. 53–56. 2020.
- For programmatic labeling and weakly supervised learning, the Snorkel system is a good entry point, starting with the tutorials on the Snorkel web page or this overview paper: Ratner, Alexander, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. “Snorkel: Rapid training data creation with weak supervision.” Proceedings of the VLDB Endowment, 11(3), 269–282, 2017.
- Discussion of the software and hardware infrastructure for learning and serving ML models at Facebook, including discussions of quality attributes and constraints that are relevant in operation, including cost, latency, model-updated frequency needs, large amounts of data, and ability to recover from outages: Hazelwood, Kim, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy et al. “Applied machine learning at facebook: A datacenter infrastructure perspective.” In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 620–629. IEEE, 2018.
- The MLOps community has developed a vast amount of tools and infrastructure for managing machine-learning pipelines. For a good introduction see https://ml-ops.org/ and https://github.com/visenger/awesome-mlops and https://github.com/kelvins/awesome-mlops
- Book discussing several common design solutions for problems in and across various stages of the machine-learning pipeline, including feature engineering, model training, and achieving reproducibility: Lakshmanan, Valliappa, Sara Robinson, and Michael Munn. Machine learning design patterns. O’Reilly Media, 2020.