Data Science and Software Engineering Process Models

Christian Kästner
Mar 4, 2022 · 24 min read

This post covers most content from the “Process and Technical Debt” lecture of our Machine Learning in Production course. It expands on discussions from an earlier blog post. For other chapters see the table of contents.

Waterfall in Robert H. Treman State Park (Ithaca, NY)

Production systems with machine-learned components are usually built by interdisciplinary teams involving data scientists, software engineers, and many other roles. Projects building such systems must integrate work on both ML and non-ML components. However, the processes used by data scientists and software engineers differ quite significantly, potentially leading to conflicts in interdisciplinary teams if not understood and managed accordingly.

Data Science Process

The typical machine-learning project builds a model to make predictions for a specific task from data and goes through many steps in the process. In the educational setting of a machine-learning course, the prediction problem and the data are often given, and the goal is to learn a model with high accuracy on held-out data, focusing mostly on feature engineering and modeling. In a production setting, data scientists are often involved in many additional tasks: shaping the problem, cleaning data, and deploying the model, and possibly also shaping data collection and monitoring the final model once it is integrated into a larger system.

This large number of necessary steps is captured in many process models for data scientists, such as the typical view of a machine-learning pipeline shown in earlier chapters. Although such process models may initially suggest that model development follows a linear sequence, the process is actually highly iterative and exploratory, usually visible through the backward arrows in such pipeline visualizations.

A typical view of the machine-learning pipeline as a sequence of consecutive steps with feedback loops.

Typically, a data scientist starts with an initial understanding of the problem, tries an initial model with some features, evaluates that model, and then goes back to improve the results by exploring alternatives. This may involve collecting more or different data, cleaning data, engineering different features, learning different kinds of models, or trying different hyperparameters. Such a process is exploratory in that a data scientist tries many different alternatives, discarding those that do not improve the model.

Model development uses a science mindset: start with a goal and iteratively explore whether a solution is possible and what it may look like. The iterative and exploratory nature of data science work mirrors that of many traditional scientists, say in physics or biology, where scientists study phenomena with experiments, analyze data, look for explanations, plan follow-up experiments, and so forth, until patterns and theories emerge. Exploration may follow intuition or heuristics, or even a more structured, hypothesis-driven approach, but it is often difficult to plan upfront how long it will take to create a model because it is unknown which solutions will work. It is very much a trial-and-error approach of iteratively exploring ideas and testing hypotheses. The exploration may include going back to data collection and cleaning as necessary, and even adjusting the system or model goals as data scientists develop a better understanding of what is possible with the data.
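
To make this exploratory loop a little more concrete, below is a minimal sketch (not taken from any particular project or study) of how such trial-and-error exploration might look in code, here using scikit-learn on its small bundled digits dataset: several alternatives are tried on held-out data, and only those that improve accuracy are kept.

```python
# A minimal sketch of trial-and-error model exploration: try alternatives,
# evaluate each on held-out data, and keep only what improves the result.
# (Illustrative only; real exploration also revisits data and features.)
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = [  # alternatives a data scientist might try; many get discarded
    LogisticRegression(max_iter=2000),
    RandomForestClassifier(n_estimators=50, random_state=0),
    RandomForestClassifier(n_estimators=200, max_depth=10, random_state=0),
]

best_score, best_model = 0.0, None
for model in candidates:
    model.fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    if score > best_score:  # keep only changes that improve the model
        best_score, best_model = score, model

print(f"Best so far: {type(best_model).__name__} with accuracy {best_score:.3f}")
```

In practice, such exploration is rarely this linear; the list of candidates itself evolves as intermediate results come in, and many attempts end in dead ends.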

This iterative and exploratory process has been repeatedly observed and confirmed in studies of data scientists. For example, the plot below shows how 10 participants in a study built a model for recognizing handwritten digits over a period of four hours: they start with a simple model, typically yielding low accuracy, and incrementally improve it over the course of the study, trying alternatives and keeping those that improve the model. Progress is not predictable, often with long stretches of time spent on exploration without any accuracy improvements; the rate of improvement and the end results also differ substantially between participants.

Development of a model for handwritten-digit recognition, plotting the accuracy of each participant’s model over the course of two 2-hour sessions of the study; from Patel, Kayur, James Fogarty, James A. Landay, and Beverly Harrison. “Investigating statistical machine learning as a tool for software development.” In Proc. CHI, 2008.

The iterative nature is captured through cycles and feedback loops in essentially all process models that capture the activities of data scientists, including the pipeline visualization above, the CRISP-DM process model (a cross-industry standard process for data mining), and Microsoft’s Team Data Science Process. Again, all these models emphasize that exploration and iteration are not limited to the modeling part but extend to goal setting, data gathering, data understanding, data preparation, and other activities.

Process diagram showing the relationship between the different phases of CRISP-DM, by Kenneth Jensen (CC BY-SA 3.0)
Process model of Microsoft’s Team Data Science Process, from Microsoft Azure Team, “What is the Team Data Science Process?” Microsoft Documentation, Jan 2020

Computational Notebooks for Exploration

Computational notebooks like Jupyter are standard tools for data scientists. Though they are sometimes criticized by software engineers for encouraging poor coding practices (e.g., lack of abstraction, global variables, lack of testing) and for technical problems (e.g., inconsistencies from out-of-order execution), they effectively support the exploratory activities of data scientists.

Computational notebooks have a long history and are rooted in ideas of literate programming, where code and text are interleaved to form a common narrative. A notebook consists of a sequence of text and code cells; the code in each cell can be executed on its own, and the output of a cell’s execution is shown below it. Computational notebooks have been available since 1988 with Wolfram Mathematica, and implementations exist for many languages and platforms, but they have exploded in popularity with Jupyter notebooks.

Example of a notebook interleaving code and text cells and showing the output of a code cell below the cell.

Notebooks support iteration and exploration well in many ways (see the sketch after this list):

  • They provide quick feedback, allowing users to write and execute short code snippets incrementally, similar to REPL interfaces for quick prototyping in programming languages.
  • They enable incremental computation, where users can change and re-execute cells without executing the entire notebook, allowing users to quickly explore alternatives.
  • They are quick and easy to use: cells can be copied to compare alternatives without having to invest in upfront abstraction.
  • They integrate visual feedback by showing resulting figures and tables directly within the notebook during development.
  • They integrate results with code and documentation, making it easy to share both.

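As a rough illustration of this working style (with synthetic data invented for the example), the following three notebook-style cells show how a data scientist might load data, get quick visual feedback, and copy and tweak a cell to explore an alternative, re-executing each cell on its own:

```python
# Cell 1: create (or load) a small dataset; in a notebook, the value of the
# last expression is rendered directly below the cell
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.lognormal(4, 0.5, 500),
                   "nights": rng.integers(1, 10, 500)})
df.describe()

# Cell 2: quick inline visual feedback, no separate plotting script needed
df["price"].hist(bins=30)

# Cell 3: a copied and tweaked variant of cell 2, re-executed on its own
# without rerunning the cells above (incremental computation)
df[df["nights"] > 5]["price"].hist(bins=30)
```
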
While notebooks are usually not a good tool for developing and maintaining production pipelines (though some companies report running notebooks in production), they are an excellent tool for exploration and prototyping. Once exploration has concluded and the model is to be moved into production, it is advisable to clean the code and migrate it to a more robust, modular, tested, and versioned pipeline implementation (see chapters Automating the ML Pipeline, Infrastructure Quality, and Versioning, Provenance, and Reproducibility).
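
As a sketch of what such a migration might look like (the dataset and feature here are hypothetical, chosen only for illustration), exploratory cell code can be turned into small, documented, and unit-tested functions that a pipeline can call:

```python
# Hypothetical example of exploratory notebook code restructured into a small,
# documented, and testable pipeline step once the approach has stabilized.
import pandas as pd


def engineer_features(bookings: pd.DataFrame) -> pd.DataFrame:
    """Derive model features from raw booking records."""
    features = bookings.copy()
    features["long_stay"] = features["nights"] > 5
    return features


def test_engineer_features():  # e.g., run with pytest as part of the pipeline's CI
    raw = pd.DataFrame({"nights": [2, 7]})
    assert list(engineer_features(raw)["long_stay"]) == [False, True]
```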

Data Science Trajectories

While data science projects tend to go through common steps such as goal setting, data acquisition, data cleaning, modeling, and evaluation, they do not necessarily do so in the same sequence. Some projects start with clear goals, for example, to transcribe audio or to detect cancer in radiology images. Other projects start with available data and explore opportunities for what could be done with it, for example, when a travel broker has collected large amounts of credit card transactions and hotel bookings from its customers and explores how it can improve its products, without knowing upfront what kind of insights might be gained from the data.

Overall, data scientists may try to identify how data can be used to achieve an existing goal, they may explore data to identify possible goals, they may search for new data to acquire that might open new opportunities, they may search for ways to turn insights extracted from data into products that deliver value to customers, or they may search for (academic) insights in the data.

Data can play different roles, too, and can come from different sources: it may already be available or be easy to acquire, it may need to be integrated from multiple sources, or it may even come from simulation. Some projects may also decide to release their data.

Martínez-Plumed and coauthors use the term trajectory in a recent paper to describe the different paths taken in a data science process and give examples of different projects taking very different paths. For example:

  • A product to recommend trips connecting tourist attractions in a town may be based on location-tracking data collected by navigation and mapping apps. To build such a product, one might start with a concrete goal in mind and explore whether enough user location history is available or can be acquired. One would then go through traditional data preparation and modeling stages before exploring how to best present the results to users.
  • An insurance company tries to improve its model for scoring the risk of drivers based on their behavior and sensors in their cars. Here, an existing product is to be refined, and a better understanding of the business case is needed before diving into data exploration and modeling. The team might spend significant time exploring new data sources that may provide new insights and may debate the costs and benefits of such data or of a data-gathering strategy (e.g., installing sensors in customers’ cars).
  • A credit card company may want to sell data about what kind of products different people (nationalities) tend to buy at different times and days in different locations to other companies (retailers, restaurants). They may explore existing data without yet knowing what kind of data may be of interest to what kind of customers. They may actively search for interesting narratives in the data, posing questions such as “Ever wondered when the French buy their food?” or “Which places the Germans flock to on their holidays?” in promotional material.
Three different trajectories for the tourist, insurance, and credit card examples; taken from Martínez-Plumed, Fernando, Lidia Contreras-Ochando, Cesar Ferri, Jose Hernandez Orallo, Meelis Kull, Nicolas Lachiche, María José Ramírez Quintana, and Peter A. Flach. “CRISP-DM twenty years later: From data mining processes to data science trajectories.” IEEE Transactions on Knowledge and Data Engineering (2019).

It can be useful to understand the possible and appropriate trajectories of a data science project, which may have more or less clearly defined upfront goals and more or less influence on what data is to be collected.

Software Engineering Process

The software engineering community has long studied and discussed different processes to develop software, often under the term software development lifecycle. Software development involves a number of different steps that focus on distinct activities, such as identifying requirements, designing and implementing the system, and testing, deploying, and maintaining it.

Paying attention to process comes from the insight that jumping straight into writing code is often dangerous and ineffective: one might solve the wrong problem, start the implementation in a way that never scales or meets other important quality goals, or build code that later needs significant corrections. Indeed, plenty of empirical evidence supports that fixing requirements and design problems late in a software project is exceedingly expensive. As a consequence, it is a good idea to do some planning before coding, by paying attention to requirements and design.

The larger the distance between introducing and detecting a defect, the higher the relative cost of repair. Data from Bennett, Ted L., and Paul W. Wennberg. “Eliminating embedded software defects prior to integration test.” Crosstalk, The Journal of Defense Software Engineering (2005): 13–18.

Similarly, process includes many other non-coding activities in a project that help ensure timely and on-budget delivery of a quality product, including (1) reviews of designs and code, (2) keeping a list of known problems in an issue tracker, (3) use of version control, (4) scheduling and monitoring progress of different parts of the project, and (5) daily status and progress meetings. While many of these activities require some upfront investment, using them promises to avoid later costs, such as design problems that go undetected for a long time, releasing software with forgotten bugs, being unable to recover past versions when changes must be rolled back, noticing project delays only late, and team members accidentally doing redundant or conflicting work. Lack of a process may lead developers to slip into survival mode when things go wrong, where they focus on their own deliverables but ignore integration work and interactions with others, leading to even more problems down the line.

Process Models: Between Planning and Iteration

Waterfall model. The earliest attempt to structure development and encourage an early emphasis on requirements and design was the waterfall model, which describes development as a seemingly linear sequence of steps (not unlike the pipeline models in data science). The key insight was that some rigor in the process is useful, for example, ensuring that requirements are established before coding, that the design respects the requirements, and that testing is done before release. Even the first version of the waterfall model in Royce’s original 1970 paper “Managing the development of large software systems” suggested some feedback loops and iterations, where one would go back from design to requirements if needed, and one might go through the waterfall twice, once for a prototype and once for the real system. While the waterfall model is often considered dated and too inflexible to deal with the changing requirements of many projects these days, its key message of “think and plan before implementing” has endured.

Simple visual representation of the waterfall process model.

Spiral model. Subsequent process models placed a much stronger emphasis on iteration, recognizing that it is often not possible or advisable to perform all requirements engineering or design upfront. For example, if a project contains a risky component, performing full requirements analysis and design upfront may waste significant resources if it later turns out that a core part of the project is not feasible. The spiral model hence emphasizes iterating through the waterfall steps repeatedly, building increasingly complete prototypes, starting with the riskiest parts of the project.

The spiral model suggests building a series of prototypes, interleaving requirements, design, implementation, and testing activities, starting with the most risky components. Figure based on Boehm, B. W. (1988). A spiral model of software development and enhancement. Computer, 21(5), 61–72.

Agile methods. Another factor pushing for more iteration is the realization that requirements are hardly ever fixed and fully known upfront. Customers may not really know what they want until they see a first prototype, and by the time the project is complete, they may already have different needs. Agile methods push for frequent iteration and replanning by self-organizing teams that work closely with customers. Note that agile methods do not abandon requirements engineering and design, but they integrate them into an iterative development cycle with continuous planning and re-planning. For example, a typical pace is to develop incremental software prototypes in 2-week or 30-day sprints, synchronizing daily among all team members in standup meetings, and cycling back to planning after every sprint. In the process, work is extended and changed as needed as more functionality is added or requirements change.

Agile development with constant iteration, for example, through sprints and daily standup meetings. (CC BY-SA 4.0, Lakeworks)

All process models in software engineering make tradeoffs between the amount of upfront planning (requirements, design, project management) and the frequency of iteration reacting to changes. Large projects with stable requirements or important safety or scalability qualities (e.g., aerospace software) often adopt more waterfall-like processes, whereas projects with quickly evolving needs (e.g., social media) and startups focused on early prototypes tend to adopt more agile-like strategies. Many, but not all, ML-enabled systems fall into the latter category.

Tensions between Data Science and Software Engineering Processes

Production systems with ML components require both data science and software engineering contributions and hence need to integrate processes for both. Teams often clash when team members with different backgrounds have different assumptions about what makes a good process for a project and do not necessarily understand the needs of other members with different specialties.

While the iterative data science process may at first glance look similar to the iteration in spiral and agile development processes, the two are fundamentally different:

  • Iteration in the spiral process model focuses on establishing feasibility early by developing the most risky part first, before tackling less risky parts. In machine-learning model development, work is usually not easily split into risky and less risky parts. One can try to establish feasibility early on, but often it is not clear whether a desired level of accuracy is achievable for a prediction task, even with a limited-scope pilot project.
  • Iteration in agile development aims, in part, to clarify initially vague requirements and to react flexibly to changing requirements. In machine-learning model development, the general goal of a model is usually clear and does not change much, but iteration is needed to identify how to fulfill that goal. If the initial goal cannot be met or new opportunities are discovered, the requirements of the system and the model may change over time.
  • In software engineering, a problem is usually decomposed to enable a divide-and-conquer strategy, making progress on different parts of the problem independently and often enabling parallel development by multiple team members. Machine-learning problems are usually harder to decompose into subproblems that can be solved independently.
  • Iteration in machine-learning model development is exploratory in nature, with many iterations resulting in dead ends. Outside of highly experimental projects, dead ends are less common in traditional software engineering. Experimentation toward a solution is not unusual, but mostly not the key reason for iteration.

Overall, traditional software projects have much more of an engineering than a science flavor, with clearer decomposition and a more predictable path toward completion. While engineering projects are usually also innovative and not routine, they are usually easier to plan and to address with established methods in a systematic way. In this context, iteration helps in planning, monitoring progress, and reacting to changes. While developers also run into surprises, delays, and dead ends, traditional software engineering usually involves much less open-ended experimentation on problems for which we may not know whether there are feasible solutions at all.

Conflicts about engineering practices. The differences in processes and working styles can lead to conflicts. For example, countless articles and blog posts have been written by software engineers about poor engineering practices in data science code, especially in notebooks. Software engineers complain about a lack of abstraction, heavy copying of code for reuse, missing documentation, pervasiveness of global state, lack of testing, poor version control, and poor tool support for developers (compared to common software development tools, including integrated development environments with autocompletion, refactoring, debuggers, static analysis, version control integration, and continuous integration). However, given the exploratory nature of data science, upfront abstraction and documentation of exploratory code is likely of little value, and copy-paste reuse and global state are often convenient and sufficient. Documentation and testing are simply not a priority in an exploratory phase.

Conflicts in effort estimation. While software engineers are notoriously bad at estimating the effort needed to build a system, they usually have some idea of how long it will take and can get better with experience, practice, and dedicated estimation methods. Given their exploratory, science-style way of working, data scientists may have little chance of estimating how much time is needed to achieve an acceptable level of accuracy for a machine-learning problem, or whether it is even feasible to reach that level at all. This difficulty of estimating time and effort can make planning projects with ML and non-ML components challenging and can frustrate software engineers and managers who do not understand the data science process.

Conflicts from misunderstanding data science work. Studies show that many team members in software engineering and management roles lack an understanding of data science and often hold misleading views. For example, they may naively request that data scientists develop models that do not produce false positives. Studies of software engineers who build machine-learning models without data science training show that those engineers have a strong bias toward coding, little appreciation for data exploration and feature engineering, and are often naive about overfitting and about evaluating model quality. For example, in their paper “Grounding Interactive Machine Learning Tool Design in How Non-Experts Actually Build Models,” Yang and collaborators observed that when an initial model is less accurate than needed, developers without data science training tend not to engage in an iterative data exploration and feature engineering process but rather jump straight to collecting more data or trying deep learning as the next step. That is, software engineers and managers with only a superficial understanding of data science will often misunderstand how data scientists work and are likely to impose their own biases and preferences in a collaboration.

Integrated Processes for AI-Enabled Systems

Production machine-learning projects need to integrate software engineering and data science work and need to respect the processes of both fields. In addition, as emphasized repeatedly throughout this book, it is important to adopt a system-wide perspective and understand how the ML components fit into the system. To the best of our knowledge, no accepted “best practices” or integrated workflow models exist at this point.

Let’s consider an idealized process that considers both the system and its ML and non-ML components. To develop a system, we need to understand its requirements and can then decompose the work into components, with component requirements derived from the system requirements. The process to develop each individual component can be tailored to the needs of that component, for example, using established software engineering or data science processes as appropriate. Ideally, components can be largely developed and tested independently from other components before they are all integrated into the overall system. Notice how ML and non-ML components can naturally be developed with different processes and tools, but how these are integrated into an overall process for developing the system. Iteration happens both at the level of components and at the level of the entire system.

Iterative development of a system and iterative development of the components of the system for both ML and non-ML components.

Following a process that explicitly acknowledges the relationship between the system and its components and the relationships among components helps to align and integrate all parts of the project. Starting with some system-level requirements and system-level architecture work before diving immediately into implementation and machine learning helps to frame component requirements, such as accuracy, latency, privacy, explainability, or fairness needs of ML components. It also helps to anticipate interactions among components, such as how user interface design influences data collection and how model accuracy influences user interface design and safety guardrails.
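
To make the idea of derived component requirements a little more tangible, here is a minimal, hypothetical sketch (the names and thresholds are invented for illustration, not taken from this book): system-level goals are written down as an explicit, checkable contract for an ML component, which the team can test a candidate model against before integration.

```python
# Hypothetical sketch: component requirements derived from system requirements,
# recorded as an explicit contract that a candidate model is checked against.
from dataclasses import dataclass


@dataclass
class ModelRequirements:
    min_accuracy: float    # derived from the system's overall quality goal
    max_latency_ms: float  # derived from the user interface's response budget


# Example thresholds (invented) for an audit-risk prediction component
AUDIT_RISK_MODEL_REQS = ModelRequirements(min_accuracy=0.90, max_latency_ms=200)


def accept_model(measured_accuracy: float, measured_latency_ms: float,
                 reqs: ModelRequirements) -> bool:
    """Check a candidate model against its component requirements."""
    return (measured_accuracy >= reqs.min_accuracy
            and measured_latency_ms <= reqs.max_latency_ms)


print(accept_model(0.93, 150, AUDIT_RISK_MODEL_REQS))  # True
```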

Model-First versus System-First Development

In practice, few projects follow such an idealized top-down process, deriving component requirements only after establishing system requirements. In line with the different data science trajectories discussed above, projects for ML-enabled systems differ in how they sequence activities and where they focus their attention. A key difference we observed across many projects in our own research is the order and priority given to the system versus the model.

Model-first development. In many projects, the model is developed first, and a system is built around the model once its feasibility and capability are established. In many projects, data scientists explore opportunities in existing data before any non-ML components are built or the overall system is planned. Only if they find interesting patterns (e.g., the ability to predict whether somebody is likely to book a hotel for business or leisure) or can establish new model capabilities (e.g., new NLP models that can identify and summarize opposing views in a conversation) does the team explore how to turn this into a new product or enhance existing products. In such settings, system development is commonly an afterthought, not given many resources.

Model-first development is particularly common in projects where the machine-learning components form the core of the system and the non-ML parts are considered small or less important. The transcription service example from the introduction fits this pattern: it contains substantial non-ML code, but the system’s functionality depends entirely on the machine-learned model. A demand forecasting system that sends daily emails is another example, this one with minimal non-ML code around a model. Often such projects are started by data scientists, who bring in other team members only later.

Model-first development has several benefits. In many cases, it is not clear upfront what is achievable with machine learning, and without that knowledge it may not be possible to plan the system at all. Given that the model is often the most uncertain part of ML-enabled systems, focusing early on the model avoids heavy investments in projects that eventually turn out to be entirely infeasible. Furthermore, since model-first development is not constrained by system requirements and derived model requirements, data scientists have much more flexibility as they creatively explore ideas. Data scientists may feel more comfortable performing exploratory work (e.g., in notebooks) with little emphasis on building, testing, and documenting stable ML pipelines.

However, model-first development also has severe drawbacks. Turning a model prototype into a product without any earlier system-level planning can cause a lot of friction, because model qualities and goals may not match system needs and goals. Data scientists often complain that they receive little guidance and no concrete requirements for their work beyond a vague goal. There are plenty of reports where the model had to be abandoned or redeveloped entirely for a production system, for example, when model latency was unacceptable for production use, when users would not trust an automated system, or when important explainability requirements were neglected. Many reports also argue that user experience suffers because broader requirements (e.g., fairness, usability, feedback loops, safety) were ignored due to the project’s model-centric focus.

Even when pursuing a model-first approach, it seems prudent to perform at least some minimum exploration of system-level requirements and some minimum planning early on. In addition, once model feasibility is established, the team should take system-level design seriously and include a broader group of stakeholders in planning the system around the model.

System-first development. Far from all systems are developed model-first. In many projects (about half in our own observations) there is a clear upfront vision of a product and the machine-learned model is planned as a component in this product — closer to our idealized integrated process discussed above. Here system goals are largely established upfront and many non-ML components may already exist before the model is developed. Data scientists may receive concrete goals and requirements for their model development, derived from the system requirements.

System-first development is particularly common when machine-learning components are added to support a new feature in an existing system, such as the audit risk prediction in the tax software example from chapter From Models to Systems. System-first development may also be advisable for new ML-enabled systems in regulated domains (e.g., banking, hiring) or domains with strong safety requirements (e.g., automotive systems, drones).

Benefits and drawbacks of system-first development largely mirror those of model-first development: system-first development fosters system-wide planning, helping with designing better user experiences and safer systems. It leads to clearer model requirements (including latency, fairness, and explainability requirements), ensuring that models are designed to be compatible with the production environment. At the same time, it may be unclear whether the intended model is feasible at all, and it may constrain the creativity with which data scientists explore new opportunities in the data. Data scientists in projects following a system-first approach often complain about receiving unrealistic model requirements, for example, when system designers have made unrealistic promises to clients.

Given the uncertainty of building ML components, a system-first approach will also likely need to involve data scientists early and allow at least some experimentation and prototyping to identify which system-level requirements are realistic.

Balancing system and model work. In the end, neither a fully model-first approach nor a fully system-first approach seems practical. Projects that prioritize the model benefit from some consideration of how the model will be used in a system, and projects that focus on the system first benefit from early involvement of data scientists and some experimentation during system planning to avoid unrealistic expectations. However, completely parallel development of the system and its ML and non-ML components is likely only feasible in smaller projects with few enough teams to support the necessary frequent iteration and coordination, constantly renegotiating requirements and component interfaces. In larger projects, more top-down planning may become necessary.

This tension once more emphasizes the importance of deliberately adopting a process of iteration and continuous (re-)planning — as recognized in the long history of process models developed in the software engineering and data science communities.

Examples

To wrap up this section, let us consider a few examples of systems and speculate what kind of processes might be appropriate for them.

Enhancing presentation software with an automated smart slide layout feature (as in the PowerPoint Designer project) adds a machine-learning component to an existing, traditional non-ML software product. The addition is a useful feature that helps users design nicer presentations and distinguishes the software from its competitors, but it is not essential to the product. The feature may have emerged from giving a team of data scientists a vague charter to identify how they can “make everyone’s presentation designs more impactful and effortless.” The data scientists may explore different ideas and establish feasibility before considering how the feature integrates into the user interface and the user experience at large (e.g., whether users need to press a button to request suggestions or whether the system automatically makes suggestions or changes).

A startup developing an automated transcription service (as in the introduction) is likely focused primarily on the model and builds a system around the model later. If the team seeks venture capital, they need to identify a plausible business model early on, but otherwise the team will likely focus almost exclusively on the model to establish feasibility of accurate predictions before thinking about operating costs and the rest of the system. At some point, the team needs to look beyond the ML component and understand the requirements for the whole system and what adjustments need to be made to the prototype ML model, for example, in terms of operating cost, latency, explainability, or robustness requirements.

Fine-tuning a recommendation feature of a hotel booking site may start with opportunistically logging user data and then exploring that data to develop ideas about what features could be predicted about potential customers (e.g., whether they are flexible about travel times, whether they travel for business or leisure). The non-ML part of the system might be changed to collect better telemetry, e.g., by encouraging users to provide their age or information about their travel plans after they booked a hotel with the current system. Data scientists could then iterate on data exploration, model development, and experimenting with integrating prototype models with the existing system without user-visible product features as milestones.

When designing a new system for recidivism prediction to be used by judges in sentencing decisions, regulatory requirements may demand a top-down design: it is first necessary to understand the design constraints on the system, such as fairness and explainability requirements and how humans would interact with the system, which then influence subsequent decisions about data collection and modeling, as well as user interaction design. A waterfall-like approach may be appropriate in such a high-stakes setting with substantial scrutiny and possible regulation.

Summary

Software engineers have long discussed different process models for organizing their work, emphasizing the need to plan before coding and the need for iteration. For data science projects, process models have also been developed that capture the iterative and exploratory nature of machine-learning work. Despite some similarities in the need for iteration, the working styles differ significantly in the degree to which they are exploratory and plannable, potentially leading to conflicts and planning challenges that need to be managed in software projects with machine-learned components.

Further readings

As with all chapters, this text is released under the Creative Commons BY-SA 4.0 license.
