Versioning, Provenance, and Reproducibility in Production Machine Learning

This post is a summary of the “versioning, provenance, reproducibility” lecture of our Machine Learning in Production course; for other chapters, see the table of contents.

In production systems with machine learning components, updates and experiments are frequent. New updates to models may be released every day or every few minutes, and different users may see the results of different models as part of A/B experiments or canary releases (see QA in Production lecture). It may be hard to later figure out which model made a certain prediction or how that model was produced.

Motivating Example: Debugging a Prediction Problem

Let’s imagine we are responsible for investigating a potential problem with our credit-scoring prediction algorithm for credit card applications, possibly under pressure while we are publicly shamed in viral tweets about how biased our system is.

Hopefully we had already tested the model for bias (see fairness chapter) and had considered strategies to mitigate wrong predictions (see planning for mistakes chapter), but we probably still should try to figure out what went wrong in this specific case.

Now, to investigate the issue, we need to reproduce it first. Hopefully, we have stored the inputs provided by the customer in some database or log file, but worst case, the customer could provide us the input data (the credit card application data) again.

Now the first question is, if we run the same input again, do we get the same result? That is, are the results reproducible?

If they are not, why not? Is there nondeterminism in the feature extraction, the ML model, or some other parts of the service? Or maybe the model has been updated since?

Can we identify what version of the model was used back at the time when the customer tried first? If we are running experiments, can we reproduce whether the customer was part of any specific experiment and may have seen a different model? Do we still have that model? Did we record how we tested it and who approved releasing it?

Assuming we find that model and we can reproduce the prediction and it is indeed wrong, what now? We might ask about model provenance: What training data was used? What was the exact feature extraction code? What version of the machine-learning library was used? What hyperparameters were used? If we repeat the training, do we get the same model?

If we identify the exact dataset used for training, what do we know about data provenance: Do we know where that data came from? Who was involved in cleaning or manually modifying the data? What processing steps did the data go through, with what version of what code?

If we manage all of that for one model and one input, can we do the same for systems that combine inputs from multiple sources and combine results from multiple models? For example, let’s assume the system combines three models, updated at different frequencies and running different A/B tests, with input data from multiple sources, some of which are volatile and may change minute to minute.

Multiple data sources and ML models are involved in making a single decision about a credit card offer.

Overall, we may want to be able to reproduce any prediction, recreate any model, and trace back the origin and lineage of all data involved in training any of these models.

That is, the typical debugging and accountability questions, beyond interpretability, that we need to answer before we can even locate the problem, let alone fix it, are:

  • Can we reproduce the problem?
  • What were the inputs to the model?
  • Which exact model version was used?
  • What data was the model trained with?
  • What learning code (cleaning, feature extraction, ML algorithm) was the model trained with?
  • Where does the data come from? How was it processed and extracted?
  • Can we reproduce the model?
  • Were other models involved? Which version? Based on which data?

Mature organizations will invest in infrastructure to collect and store all this data, sometimes at significant engineering cost. In contrast, deploying a model by simply overwriting an older model on some server in production is a sure way to lose track of versions and to run into severe problems when trying to debug or audit parts of the system after something goes wrong.

Version Everything

Nowadays, versioning source code is a standard practice for software developers, who have grown accustomed to committing their work incrementally, usually with meaningful commit messages, to a version control system like Git. In many settings individual changes are even reviewed.

Data scientists are often criticized for being less disciplined with versioning their experiments, especially when using computational notebooks. This can often be explained by their more exploratory working style (see process lecture), where there are often no obvious milestones and lots of code is not intended to be permanent (which is also criticized by some as losing much history of what goes into decisions for a specific model).

When building machine learning models, we need to worry about versioning of data, pipeline code, and models.

A note on terminology: Revisions refer to versions over time, where one revision succeeds another, traditionally expressed through increasing version numbers in a version control system. Variants refer to versions that exist in parallel, for example, models for different countries or models still to be evaluated as part of an A/B test; traditionally they are stored in branches or different files in a repository. Version is the general term that refers to both revisions and variants. Releases are select revisions, typically a subset of all versions, that are often given a special name. Here we care about all forms of versioning.

Versioning large datasets

Whereas current version control systems work well for typical source code repositories with many small files, they face challenges with large datasets. There are a number of different versioning strategies that can be considered:

  • Storing copies of entire datasets: Each time a dataset is modified a new copy of the entire dataset is stored. Small changes to large files cause a lot of additional storage demand, but it’s easy to access individual versions of a file. This strategy can be easily implemented by having a naming convention for files in a directory and it is used internally in the Git version control system.
  • Storing deltas between datasets: Instead of storing copies of the entire file, it is possible to store only changes between versions of files. Standard tools like diff and patch can be used to identify changes between files and record and replay changes. If changes are small compared to the size of the file, this strategy is much more space efficient, but restoring a specific past version may require applying a lot of changes to the original dataset. This strategy has long been used internally by many version control systems, including RCS and Mercurial.
  • Offsets in append-only datasets: If data is only ever changed by appending, it is sufficient to remember the file size (offset) to reconstruct the file at any past moment in time (see the sketch after this list). This is the simplest form of versioning, but it requires append-only data. It works for naturally append-only data, such as log files and event streams, or when storing data in an event-sourcing style (aka immutable data; see big data lecture). The version could also be an offset in a streaming system like Apache Kafka, assuming all history is kept. Internally, many databases store a log in this form, though they usually do not persist such a log permanently.
  • Version individual records: Instead of versioning an entire dataset at once, it is possible to version individual records. Especially in key-value databases, it is possible to version the values for each key (storing either copies or deltas). For example, Amazon S3 can version individual objects in a bucket. Locally, this can also be achieved by creating a separate file for each record and using Git to version a directory with all records as individual files, resulting in an individual history per record (with some overhead for versioning the directory structure). A number of databases can similarly track the edit history and even attach additional provenance information to each change, such as which user was responsible for the change. This strategy of versioning individual parts can be applied to many forms of structured data.
  • Version the pipeline to recreate derived datasets: When datasets are derived from other datasets through transformation steps, for example, data cleaning and feature-extraction stages in a machine-learning pipeline, it may be more efficient to recreate that data on demand rather than to store it. Conceptually, the derived dataset can be considered a view on the original data that may be cached for efficiency now, but can be recreated if needed. This strategy requires versioning the original data and all transformation steps involved in producing the derived dataset, and it requires that the involved transformation steps are deterministic.
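
As a small illustration of the offsets strategy above, here is a minimal Python sketch (the helper functions and file name are made up for this example, not taken from any specific tool): the current size of an append-only file serves as the version identifier, and reading up to that offset reconstructs the dataset as of that version.

import os

def snapshot_version(path):
    # For an append-only file, the current file size is a sufficient version marker.
    return os.path.getsize(path)

def read_at_version(path, offset):
    # Reconstruct the dataset as it looked when the offset was recorded.
    with open(path, "rb") as f:
        return f.read(offset)

# record a version, then append more events
v1 = snapshot_version("events.log")
with open("events.log", "ab") as f:
    f.write(b"2020-11-01,customer-42,purchase,99.90\n")
old_view = read_at_version("events.log", v1)  # content exactly as of version v1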

Which strategy is used will depend on the size and structure of the data, how and how often it is updated, and how often old versions are accessed. Engineers will primarily weigh the tradeoffs among these strategies in terms of required storage space and the computational effort to store and retrieve versions.

For example, in the credit-scoring scenario (see flow chart above), we may adopt different strategies for different datasets. The purchase history is probably an append-only data structure, where versioning with offsets is cheap and efficient. In contrast, market conditions (let’s assume we track stock and index prices) will vary from day to day with little hope for small deltas, so we may just store copies of the entire dataset. Customer data (let’s assume this is information they provide in their profile, such as name, income, and marital status) does not change frequently, does not change for all customers at once, and the per-customer data is fairly small; here we might just version individual customer records separately.

As datasets become large, they are often no longer stored locally but stored in a distributed fashion, often in cloud infrastructure. The versioning strategies above are largely independent of the used storage mechanism in the backend.

Usually it is a good idea to version available metadata together with the data. This could include schema information (which may evolve with future versions of the data), but also license, provenance, owner, and other forms of metadata.

Versioning models

Models are usually versioned as large binary data files. Most models are essentially serialized as large sets of model parameters (e.g., coefficients in linear models, matrices of weights in neural networks, decisions in random forests). With most machine-learning approaches, small changes to the data or hyperparameters can lead to changes in many model parameters, so the learned models are usually very different; that is, different models do not share structures or parameter values that would allow for meaningful storage or analysis as deltas between model versions. Any system that tracks versions of binary objects will suffice, in line with the “storing copies of the entire dataset” strategy above.
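
One simple convention, sketched below under the assumption that models are serialized to files (the helper function and directory name are hypothetical), is to store each serialized model under a name derived from its content hash, which yields a stable, unique identifier for every distinct model version:

import hashlib
import os
import shutil

def store_model(model_path, store_dir="model_store"):
    # Hash the serialized model file to obtain a stable identifier for this exact version.
    with open(model_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    os.makedirs(store_dir, exist_ok=True)
    shutil.copyfile(model_path, os.path.join(store_dir, digest + ".model"))
    return digest  # record this identifier in metadata and inference logs

model_version = store_model("credit_model.pkl")  # hypothetical model file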

Versioning pipelines

A model is typically created by feeding data and hyperparameters into some learning code, which we call “pipeline” because it may contain many stages for cleaning data, extracting features, training models, and hyperparameter optimization. Ideally, we want to version all steps that go into producing a model, so that we can track which versions of the individual parts went into creating the model (model provenance).

Pipeline code and hyperparameters can be expressed in normal code and configuration files and they can be versioned like traditional code. When versioning pipeline code, it is important to also track versions of involved frameworks or libraries to ensure reproducible executions.
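
For example, the training pipeline could record the Python version and the versions of key libraries next to the model; a minimal sketch (the listed libraries and the output file name are just examples of what a pipeline might use):

import json
import sys
from importlib.metadata import version

environment = {
    "python": sys.version,
    "scikit-learn": version("scikit-learn"),  # example dependency
    "pandas": version("pandas"),              # example dependency
}
# store next to the model so the training environment can be reconstructed later
with open("model_environment.json", "w") as f:
    json.dump(environment, f, indent=2)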

To version library dependencies, the common strategies are to (1) use a package manager and declare versioned dependencies on libraries from a stable repository (e.g., requirements.txt with pip or package.json with npm) or (2) commit all dependencies to the code repository (called “vendoring” in the Go community).

Optionally, it is possible to package all learning code and dependencies into virtual execution environments to ensure that environment changes are also tracked. Docker containers are a common solution, where both the Dockerfile manifest and the resulting container image can be versioned.

When versioning dependencies with a package manager, avoid floating versions (e.g., 2.*) and instead pin specific releases (e.g., 2.3.1). Also test the pipeline code on an independent machine, for example, in a container or on a continuous integration server, to ensure that no dependencies are missing.

Putting the pieces together

When all pieces can be versioned independently, we can connect them with metadata. For example, we can track that model version v124 was built from data version v1553 and pipeline version v4.1 and hyperparameter configuration file v10. This could all be stored as part of the metadata of the model.
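
For example, the metadata could be as simple as a small record stored next to the model that points to the exact versions involved; a minimal sketch using the hypothetical version identifiers from above:

import json

model_metadata = {
    "model": "credit-scoring",
    "model_version": "v124",
    "data_version": "v1553",
    "pipeline_version": "v4.1",        # e.g., a Git tag or commit hash
    "hyperparameter_config": "v10",
}
with open("model_v124.metadata.json", "w") as f:
    json.dump(model_metadata, f, indent=2)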

Once models are deployed, it is useful to record, as part of log files or audit traces, which model version handled which inputs during inference. This way, we can track the model version responsible for every output, over time and in the presence of A/B tests. So we might have log files that look like the following, from which we can trace back to the model and from there to the corresponding training data, hyperparameters, and pipeline code:

<date>,<model>,<model version>,<feature inputs>,<output>
<date>,<model>,<model version>,<feature inputs>,<output>
<date>,<model>,<model version>,<feature inputs>,<output>
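
A minimal sketch of writing such a log entry at inference time with Python’s standard logging module (function and field names are illustrative, and the model is assumed to offer a scikit-learn-style predict interface; the date is added by the logging configuration):

import logging

logging.basicConfig(filename="predictions.log", level=logging.INFO,
                    format="%(asctime)s,%(message)s", datefmt="%Y-%m-%d %H:%M:%S")

def predict_and_log(model, model_name, model_version, features):
    output = model.predict([features])[0]
    # one comma-separated entry per prediction: date, model, model version, inputs, output
    logging.info("%s,%s,%s,%s", model_name, model_version, features, output)
    return output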

It is usually a good idea to create such logs for every single model in the system, to be able to also debug interactions among multiple models or interactions between ML and non-ML components.

Notice that training and inference pipelines often share a lot of code around data cleaning and feature engineering. It is a very good idea to associate the model with the corresponding code version, to ensure that the same code is used at inference time that was used at training time.

Recently, many tools have been developed to help with these steps, often under the label of MLOps. They can help track versions of data, pipelines, and models, often managing pipelines explicitly as a sequence of processing steps. They also often help record further metadata, such as training time or evaluation results (e.g., prediction accuracy, robustness); see the discussions of DVC and MLflow below for examples.

Bonus: Versioning in computational notebooks

Data scientists using computational notebooks do not commonly use Git for versioning, as the exploratory workflow does not align well with the traditional version-control practice of committing cohesive incremental steps. In addition, the file format of popular notebooks, which stores output data and images together with the cells’ code in a JSON file, does not lend itself well to textual diffing or to many traditional version control and code review processes.

Researchers have recently investigated more fine-grained history-recording mechanisms and mechanisms to search those histories, for example, with the Verdant tool.

Track Data Provenance

Tracking the origin of data and the processing or editing steps applied to it has a long history, mostly in the database community, under labels like data provenance and data lineage. The key idea is to track every step from the origin of the data, to how it was edited and by whom. This applies to all processing steps in a machine learning pipeline, including data collection/acquisition, data merging, data cleaning, feature extraction, learning, or deployment. Missing visibility into data dependencies and histories is sometimes called “visibility debt” in the context of technical debt discussions (see process lecture).

When learning models with machine-learning techniques, the input datasets used are often quite removed from their original source: collected in the field, supplemented with data from other sources, cleaned in various manual and automated steps, and prepared with some feature extraction. By the stage where data scientists experiment with different network architectures, all that history, all the decisions that went into the dataset, and any biases introduced along the way may be lost.

Analogous to data provenance, also the terms feature provenance and model provenance are used. Feature provenance refers to tracing how features in the training and inference data were extracted, that is, mapping data columns to the code version that was used to create them. Model provenance refers to tracking all inputs to a model, including data, hyperparameters, and pipeline code with its dependencies, as discussed in model versioning above.

Documenting the origin of data and the applied transformation steps can be done manually, for example, in metadata notes or hand-drawn architectural data-flow diagrams. Data transformation steps can be cleaning steps, feature-engineering steps, or even running predictions on some data with another model. As with all documentation, there is a high risk that it becomes outdated, and it will likely not capture the involved versions of datasets and processing steps.

Simplified data flow explanation manually written for one of our own research projects to detect toxic issues on GitHub and report them with a bot, which would continuously monitor and analyze incoming issues and comments with a stream-processing platform. Note the nontrivial data flow through many different components and data sets. Purple boxes are ML models that annotate or classify data, yellow boxes are non-ML components to process data, blue boxes represent intermediate results in Kafka streams.

It is worth investing in automating some of this lineage tracking. This could be a full-blown system-wide solution, where every step is recorded, or it could be ad-hoc tools where each transformation step reads and writes metadata from their inputs and outputs (e.g., each file has a version number and a processing history log attached to it, and each processing step copies that metadata to the new file and adds information to the processing history).
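
A minimal sketch of the ad-hoc style (the file-naming convention and helper functions are made up for illustration): every dataset file gets a sidecar file with its processing history, and each transformation step copies the history of its input and appends its own entry:

import json
import os

def read_history(data_path):
    meta_path = data_path + ".history.json"
    if os.path.exists(meta_path):
        with open(meta_path) as f:
            return json.load(f)
    return []

def write_with_history(output_path, content, input_path, step_name, code_version):
    with open(output_path, "w") as f:
        f.write(content)
    # copy the input's processing history and append an entry for this step
    history = read_history(input_path)
    history.append({"step": step_name, "code_version": code_version, "input": input_path})
    with open(output_path + ".history.json", "w") as f:
        json.dump(history, f, indent=2)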

Regarding system-wide solutions, a large number of tools for modeling and executing data-processing pipelines (“task orchestration”) automatically track metadata for all processing steps. They require developers to write their pipelines and transformations in a specific format that can be executed by those tools, but the tools then take care of scheduling, provenance tracking, and versioning (think of this as similar to how a map-reduce framework requires processing steps to be written in a specific form, but then automates scheduling and error handling). A good example of such a tool is DVC (Data Version Control).

DVC

In DVC, all processing steps are modeled as separate steps (“stages”) of a pipeline. Each stage is implemented by an executable (often a Python script) that is called with some input data and produces some output data. The stages and their data dependencies are described in a YAML-based pipeline file (dvc.yaml), such as this example from the DVC documentation:

stages:
  features:
    cmd: jupyter nbconvert --execute featurize.ipynb
    deps:
      - data/clean
    params:
      - levels.no
    outs:
      - features
    metrics:
      - performance.json
  training:
    desc: Train model with Python
    cmd:
      - pip install -r requirements.txt
      - python train.py --out ${model_file}
    deps:
      - requirements.txt
      - train.py
      - features
    outs:
      - ${model_file}:
          desc: My model description
    plots:
      - logs.csv:
          x: epoch
          x_label: Epoch
    meta: 'For deployment'
    # User metadata and comments are supported

Both the implementations of the steps and the pipeline description are versioned in a Git repository. The data can be stored and versioned in Git as well, but typically an external versioned data source is used, stored on HDFS, Amazon S3, Google Cloud Storage, or a local or remote file system; DVC then just keeps pointers to the external data and their versions in the Git repository.

DVC then provides command-line tools to execute individual pipeline stages or the entire pipeline at once. It runs the executable of each stage in the order of their dependencies and tracks the versions of all inputs and outputs as part of metadata stored in the Git repository.

As long as all processing steps are executed within the pipeline with the DVC frontend, all steps are versioned and metadata of all data transformations is tracked.

Google’s Goods

In their Goods paper, a team of Google engineers describes another mechanism to track the provenance of data as it is processed by many different steps, without requiring developers to explicitly model their pipelines in some tool. This strategy recovers provenance information from log data produced within Google’s infrastructure.

While Google provides many different ways of storing and processing data, essentially all of them are executed in a cloud environment that produces log files for all activities. The Goods project looks at the log files of processes that have been executed (from simple scripts to large MapReduce or TensorFlow jobs), identifies which files were read and produced by each job, and then reconstructs a graph like the one above between the different datasets, stored as metadata with the datasets. As executable code is essentially always versioned when executed in Google’s infrastructure, the versions of the executables can also be tracked; with common patterns for storing and versioning data, data versions can often be tracked as well, as long as data is not directly deleted or overwritten.

Similar provenance tracking can be done on a single machine with Linux system logs, but the key here is to scale this task to large distributed systems, which requires visibility into file accesses at many different locations.

Notice the key design tradeoff between the DVC and Goods approaches: DVC requires explicit modeling of all steps, whereas Goods recovers (most of) what code has been executed. Goods is transparent to developers, who do not need to take extra steps or even know about provenance tracking, but it requires an investment into a uniform infrastructure.

Tracking Data Science Experiments

Data scientists routinely experiment with different versions of extracting features, different modeling techniques, different hyperparameters and so forth (see process lecture). Some experiments are ad-hoc within a notebook and for some, more explicit tracking may be useful, especially if exploring a larger space of modeling options and their interactions.

Here it is useful to track information about specific experiments and their results. This is similar to versioning different models, but additionally focuses on keeping and comparing results, often visualized in some dashboard.

For example, the MLflow screenshot below shows results from a large number of training runs, filtered to those with an R2 score above 0.24, each listed with the corresponding hyperparameters of that run, the source version of the pipeline, and the training time. This dashboard is useful for comparing the results of training with different hyperparameters, but it could also be used to compare different versions of the pipeline or training with different datasets.

From Matei Zaharia. Introducing MLflow: an Open Source Machine Learning Platform, 2018

The MLOps community has developed a large number of platforms, open source and commercial, that make it easy to log and track experiments and show results in dashboards; for example, MLflow, ModelDB, Neptune, TensorBoard, Weights & Biases, Comet.ml, and many more.

DVC, discussed above, can also run and track experiments if evaluation results are written as metrics or reports to some output file. DVC specifically provides tools to run pipelines with different parameters, tracked through Git branches, and tools to query for branches that meet certain metric thresholds, similar to the visual dashboard above. An external dashboard can then visualize all results, each traced back to an exactly versioned execution of the pipeline. MLflow pursues a quite different design.

MLflow, ModelDB, and Neptune

MLflow is a popular framework for tracking experiment runs (it also has more features for model registration and deployment). ModelDB and Neptune have many similar features and concepts.

They are all less opinionated than DVC and more flexible, but require developers to be more disciplined about how they version data and pipelines. Instead of enforcing a pipeline or dataflow framework, they provide commands to report results and artifacts that can be integrated into all kinds of execution environments, including computational notebooks, Python scripts, and any other training and experimentation environment.

It is the developers’ responsibility to report versions, metrics, and files to the framework through API calls in their code. The frameworks then record those results and show them in a dashboard, as shown above. For example, here is a hello-world example, adapted from the ModelDB documentation, training a model with two different hyperparameters. Note how the developer adds library calls to explicitly state when an experiment run starts, then explicitly logs the hyperparameters, the version of the dataset, and the corresponding accuracy result, and even uploads the model:

from verta import Client
client = Client("http://localhost:3000")

proj = client.set_project("My first ModelDB project")
expt = client.set_experiment("Default Experiment")

# log the first run
run = client.set_experiment_run("First Run")
run.log_hyperparameters({"regularization" : 0.5})
run.log_dataset_version("training_and_testing_data", dataset_version)
model1 = ...  # model training code goes here
run.log_metric('accuracy', accuracy(model1, validationData))
run.log_model(model1)

# log the second run
run = client.set_experiment_run("Second Run")
run.log_hyperparameters({"regularization" : 0.8})
run.log_dataset_version("training_and_testing_data", dataset_version)
model2 = ...  # model training code goes here
run.log_metric('accuracy', accuracy(model2, validationData))
run.log_model(model2)

Notice how it’s the user’s responsibility to instrument their own code and report relevant hyperparameters and versions of data or libraries from training code with logging statements, whereas DVC would automate such tasks but would require developers to work entirely within the DVC infrastructure.

While the core reporting functions and style are similar across these tools, most come with extra infrastructure for dashboards, running and scheduling experiments, and packaging and deploying models, as well as integrations with many other MLOps tools.

Reproducibility

A final question that is often relevant for debugging, audits, and also science more broadly is to what degree data, models, and decisions can be reproduced. At a technical level, it may be useful to be able to recreate a specific model for some debugging task, and for the research community it is often important (but challenging) to reproduce the results published in papers.

There are multiple separate questions here that can technically be distinguished:

  • Classically, reproducibility in science is the ability to repeat an experiment with minor differences from the original experiment while still achieving the same qualitative result. That is, specifics can and often should differ, but the overall result should be consistent. For example, if a psychology experiment finds that left-handed people are better at learning to juggle than right-handed people, then other researchers should be able to repeat the experiment with a different population and a similar but slightly different experimental setup (e.g., different instructor, different teaching method, different teaching duration). The result does not need to be exactly the same, but it should be consistent. By establishing consistent findings through repeated executions of an experiment in slightly different settings, we ensure that the result is stable and not due to a random effect or a problem in the original experiment. To enable others to reproduce an experiment, it is important to describe it in sufficient detail.
  • In contrast, replicability is the ability to exactly reproduce results by running exactly the same experiment. For example, given the specific dataset from the juggling experiment, do we come to exactly the same conclusions about which factors are statistically significant and how big the effects are? Replicability ensures that the specific steps are described (or the analysis scripts published) and that there is no nondeterminism in the analysis procedure.

In science, reproducibility is more important than replicability, because one wants to gain confidence in finding consistent results across different setups, not just assurance that one can download and run the analysis scripts. In production machine-learning settings (beyond publishing papers), replicability is arguably more important, in that we want to be able to exactly recreate specific artifacts for debugging.

To achieve exact replications, it is important to version all steps (as discussed above) so that models can be recreated from the data and yield the same prediction results on the same inputs. In addition, it is important that there is no nondeterminism in the pipelines.

Side note: In open-source settings, exact replication also protects against a specific kind of security attack known from the reproducible-builds community (pioneered by the Debian project): if an attacker has access to a training server, they might be able to inject wrong behavior or back doors into a model. However, if multiple parties can independently produce the exact same model, we have confidence that nobody has injected such behavior.

Nondeterminism

The inference step of a machine-learning model, making a prediction for a given input, is deterministic for most machine-learned models. That is, given the same input, a given decision tree, linear model, or neural network always produces the same result. There is no randomness or environment input in the prediction task.

There is nondeterminism in many machine-learning algorithms, though. For example, the parameters of neural networks are initialized with random values, so training repeatedly on the same data will not yield exactly the same model. In distributed learning (see big data lecture), timing differences in distributed systems can affect the learned parameters due to the involved merging strategies. Similarly, random forests make use of randomness for their ensemble techniques, producing different forests when trained repeatedly. Nondeterminism from random numbers can be controlled by explicitly seeding the random number generators used, so that repeated runs use the same sequence of random numbers.
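
For example, in Python one would seed every random number generator the pipeline uses before training (the seed value is arbitrary but must be fixed; deep-learning frameworks provide their own seeding functions):

import random
import numpy as np

SEED = 42
random.seed(SEED)      # Python's built-in random number generator
np.random.seed(SEED)   # NumPy's global random number generator
# frameworks have their own generators, e.g., torch.manual_seed(SEED) or tf.random.set_seed(SEED)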

Beyond random numbers during learning, there are other sources of nondeterminism when interacting with the environment. For example, some data processing and feature engineering steps may depend on external data sources that do not produce stable results, such as querying for the current weather or stock prices at the time of prediction or training. Here it is useful to record the exact training data after data processing at training time and to log all inputs to the model (including those gathered from external sources) at inference time.

Further nondeterminism in pipelines can come from insufficient versioning. For example, if data is not versioned, a model might be trained on different data when retrained later. A model may also be trained with a different version of the machine-learning library, a different version of the feature-engineering code, or different hyperparameters. Again, comprehensive versioning is key.

Determinism in machine-learning pipelines can be tested by learning the same model repeatedly and on different machines and observing that the produced models are exactly identical.
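
Such a test could look like the following sketch, which assumes a hypothetical train() entry point that writes the serialized model to a given path:

import hashlib

def file_hash(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def test_pipeline_is_deterministic():
    # train twice on the same data; train() is a hypothetical pipeline entry point
    train(data="data/train.csv", out="model_a.bin")
    train(data="data/train.csv", out="model_b.bin")
    assert file_hash("model_a.bin") == file_hash("model_b.bin")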

Recommendations for Reproducibility and Replicability

Good coding and versioning practices are just as important as good documentation for reproducibility and replicability in machine learning.

It is important to version data, pipeline, and models as discussed above.

Determinism of all pipeline code is useful and can be tested.

For science and reproducibility more broadly, it is important to document each step, including intentions and assumptions, not just the results. For example, explain why data was cleaned in a certain way and why certain hyperparameters were chosen. This allows others to replicate results, including intentionally varying aspects of the setup.

In general, infrastructure quality helps to foster reproducibility, as it becomes easier to inspect and understand code. This may involve modularizing code in pipelines and testing it (see infrastructure quality lecture).

Finally, containers are a powerful tool to share artifacts including all dependencies and environment configurations. Containers make it much easier to replicate exact results in different execution environments.

Summary

Versioning and provenance tracking are important for debugging and accountability in responsible machine learning. This involves tracking the origins and changes to data, pipelines, and models. It fosters reproducibility and replicability. Beyond provenance tracking, a key is to version everything, including versioning data, versioning entire pipelines and their dependencies, and versioning the resulting models.

Further reading:

  • Technical debt and visibility debt: Sculley, David, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. “Hidden technical debt in machine learning systems.” In Advances in neural information processing systems, pp. 2503–2511. 2015.
  • Automated provenance tracking within the Google infrastructure: Halevy, Alon, Flip Korn, Natalya F. Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. “Goods: Organizing google’s datasets.” In Proceedings of the 2016 International Conference on Management of Data, pp. 795–806. ACM, 2016.
  • Internals of how Git stores data versions: Scott Chacon and Ben Straub. Pro Git, Chapter 10. Apress 2014.
  • Data lineage tracking at fine granularity in Apache Spark: Gulzar, Muhammad Ali, Matteo Interlandi, Tyson Condie, and Miryung Kim. “Debugging big data analytics in spark with bigdebug.” In Proceedings of the 2017 ACM International Conference on Management of Data, pp. 1627–1630. ACM, 2017.
  • Overview of MLOps: https://ml-ops.org/
  • More advice on provenance tracking: Sugimura, Peter, and Florian Hartl. “Building a Reproducible Machine Learning Pipeline.” arXiv preprint arXiv:1810.04570 (2018).
  • Discussion of challenges of using computational notebooks, including versioning challenges: Chattopadhyay, Souti, Ishita Prasad, Austin Z. Henley, Anita Sarma, and Titus Barik. “What’s Wrong with Computational Notebooks? Pain Points, Needs, and Design Opportunities.” In Proceedings of the CHI Conference on Human Factors in Computing Systems, 2020.
  • Versioning challenges in computational notebooks: Kery, Mary Beth, Bonnie E. John, Patrick O’Flaherty, Amber Horvath, and Brad A. Myers. “Towards effective foraging by data scientists to find past analysis choices.” In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–13. 2019.
  • A discussion of the different notions of replicability and when they are useful: Juristo, Natalia, and Omar S. Gómez. “Replication of software engineering experiments.” In Empirical software engineering and verification, pp. 60–88. Springer, Berlin, Heidelberg, 2010.
