Technical Debt in Machine-Learning Systems

Christian Kästner
Sep 10, 2021

This post covers content from the “process and technical debt” lecture of our Machine Learning in Production course. For other chapters, see the table of contents.

Technical debt is a powerful metaphor to think about trading off short-term benefits with later or long-term costs. In monetary terms, going into debt by taking out a loan provides the borrower with money that can be used immediately but needs to be repaid later with accumulated interest. In the software development analogy, certain actions provide short-term benefits, such as faster progress and earlier releases, but come at the cost of lower productivity, needed rework, or additional operating costs later. Just as with financial debt, technical debt can accumulate and suffocate a project when it becomes no longer possible to productively continue to develop and maintain a system due to old decisions that would have to be fixed first.

The power of the technical debt metaphor comes from its simplicity and from how easily it communicates implications to non-developers, including managers and customers. For example, the debt metaphor makes it easier to explain why developers want to spend some time on infrastructure and maintenance work now rather than spending all their time on developing new features: it conveys both the danger of delaying infrastructure and maintenance work and the delayed costs of taking shortcuts in development. In short, “technical debt” is management-compatible language for software developers.

Technical debt comic by Cornel

Scenario: Automated Delivery Robots

As a running example in this chapter, let’s consider autonomous delivery robots that are currently deployed on sidewalks in many cities. Typically the size of a small suitcase, these robots bring food deliveries from restaurants to homes within a neighborhood. Based on camera inputs and GPS location, they navigate largely autonomously on sidewalks, which they share with pedestrians; typically human operators can take over and navigate the robot remotely if problems occur. Sidewalk robots can have much larger capacities and range than aerial drones.

(Kiwibot, CC-BY-SA-4.0 by Ganbaruby)

Deliberate and prudent technical debt

In casual conversations, technical debt is often used to refer to bad decisions in the past that now slow down the project or require rework. However, Fowler provides a more nuanced view of different forms of technical debt along two dimensions:

Deliberate vs inadvertent: As with most financial debt, developers may intentionally decide to take a shortcut, knowing that it will cost them later. For example, the developers of the delivery robot may decide to build and deploy an obstacle detection model in their robots without investing in an automated pipeline (e.g., just copying data, running a notebook, and then copying the model to the robot), knowing that it will slow down future updates, make it harder to compare different models or run experiments in production (see Testing in Production chapter), and make it more likely that subtle faults are introduced (see Infrastructure Quality chapter). They know that they will have to build an automated pipeline eventually and will have to deal with more issues that may arise in the meantime, but decide to prioritize getting a first prototype out now over maintainable infrastructure. That is, they deliberately trade off short-term benefits against later costs.

In contrast, developers may inadvertently make decisions that will incur later costs without thinking through the benefit-cost tradeoff. For example, the robot’s data science team may not even be aware of the benefits of pipeline automation and only later realize that their past decisions slow down their current development as they spend hours copying files and remembering which scripts to call every time they want to test a new release. Inadvertent technical debt often comes from not understanding the consequences of certain engineering decisions or simply not knowing better practices. Only later when they face a problem and explore possible solutions, developers may learn about how they could have built the system better at an earlier stage.
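
To make the contrast with manual copying concrete, the following is a minimal sketch of what a scripted pipeline entry point could look like. The step names (`collect_data`, `train_model`, `package_model`), the toy data, and the stand-in "model" are all hypothetical placeholders, not the robot's actual pipeline; the point is only that every step runs in a fixed order from one command and leaves a record of what ran.

```python
import hashlib
import json
from typing import Callable

# Hypothetical pipeline steps. Real ones would load camera data, train the
# obstacle detection model, and push the result to the robots.
def collect_data(state: dict) -> dict:
    state["data"] = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.6]]  # toy data
    return state

def train_model(state: dict) -> dict:
    # Stand-in "model": the mean of each feature column.
    columns = list(zip(*state["data"]))
    state["model"] = [sum(c) / len(c) for c in columns]
    return state

def package_model(state: dict) -> dict:
    # Fingerprint the model so deployed copies can be compared later.
    payload = json.dumps(state["model"]).encode("utf-8")
    state["model_hash"] = hashlib.sha256(payload).hexdigest()
    return state

STEPS: list[Callable[[dict], dict]] = [collect_data, train_model, package_model]

def run_pipeline(steps: list[Callable[[dict], dict]] = STEPS) -> dict:
    """Run every step in order and record which steps completed."""
    state: dict = {"log": []}
    for step in steps:
        state = step(state)
        state["log"].append(step.__name__)
    return state

if __name__ == "__main__":
    result = run_pipeline()
    print(result["log"])
    print(result["model_hash"][:12])
```

Even a script this small removes the need to remember which scripts to call and in which order; whether it is worth writing on day one is exactly the deliberate tradeoff discussed above.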

Reckless vs prudent: In the financial setting, a bank will often assess the risk of a loan to evaluate whether the money is used for an investment (e.g., a new factory) that likely leads to sustained benefits such that the loan can be repaid. In contrast, loans for direct consumer spending may not create new sustained value; for example, living beyond one’s means with credit cards does not provide an obvious path to higher income to ensure that the debt can be paid back.

The same way, certain engineering decisions that favor short-term benefits over long-term maintenance costs may be a prudent, good investment. For example, while delivery robots are a new hot market with many competitors, it may be very valuable to have a working prototype and deploy them in many cities to attract investors and customers, so it may be worth deploying many robots early even when the navigation models are still low quality, paying instead heavily for human oversight that will remotely operate the robots in many situations. Similarly, many companies may decide to take big shortcuts that incur significant operating and maintenance costs to acquire lots of users quickly in a market with strong network effects where value is made through the size of the user base (e.g., social networks, video conferencing, dating sites, app stores).

In contrast, technical decisions can also be reckless, where shortcuts are taken without investing in future success or cause future costs that suffocate the project. For example, forgoing safety testing of the delivery robots’ perception models may speed up development, but may cause bad publicity and serious financial liability in case of accidents. Similarly, not having team meetings may make development marginally faster, but introduces comparatively large dangers of incompatibilities between components developed by different teams and ensuing integration issues that far outweigh the benefits.

The ideal view of technical debt is that it is deliberate and prudent. Ideally, developers consider their options and deliberately decide that certain benefits, such as faster releases, are worth the costs. Hopefully, deliberate but reckless decisions are rare in practice. Of course, inadvertent technical debt can also turn out to be prudent, for example, when realizing later that operating the robots with lots of human interventions early on (which happened by necessity due to the inaccuracy of the initial models) was actually a good idea in retrospect, because a competitor would have beaten the company to market if it had spent time perfecting the models first. In practice though, many forms of technical debt will be inadvertent and reckless. Not knowing better or under immense stress to deliver now, developers may take massive shortcuts that turn out to be very expensive later.

Technical Debt in Machine Learning Projects

Machine learning makes it very easy to inadvertently accumulate massive amounts of technical debt, especially when inexperienced teams try to deploy machine-learning products in production.

In a very influential article, originally titled “Machine learning: The high interest credit card of technical debt,” Sculley and other Google engineers describe many ways in which machine learning projects incur additional forms of maintenance costs beyond those of traditional software projects, costs that easily amount to technical debt.

First of all, if used inappropriately, the use of machine learning itself can be seen as technical debt: Using machine learning can seem like a quick and easy solution to a problem, especially when machine learning is hyped in the media and by consultants, but it can come with long-term costs that the initial developers may not anticipate. As we discuss throughout this book, engineering production-quality machine-learning components can be expensive and induce long-term maintenance costs. Developers may decide to use machine learning as a gimmick or to solve a problem where established solutions exist, not anticipating the costs later; this could be inadvertent, reckless technical debt. In contrast, using machine learning where appropriate, for example, adopting machine learning to avoid human operation of the delivery robots, may be expensive but necessary for the system, and hence would not be technical debt at all. The key point is that adopting machine learning should be a deliberate decision, clearly anticipating future costs.

In addition, there are many good engineering practices that reduce maintenance costs and risks in machine learning projects, many of which are described throughout this book (e.g., risk analysis and planning for mistakes, system architecture design, robust pipelines, testing in production, fairness evaluations, threat modeling, provenance tracking). All of these practices require an investment to perform some analysis or develop some infrastructure. They do not eliminate all maintenance costs, but skipping these steps can cause significant long-term costs. In this way, while some operating costs are unavoidable, skipping good engineering practices can often be seen as leading to avoidable technical debt. Among others, Sculley’s paper discusses:

  • Data dependencies: The quality and quantity of data on which a machine-learning project depends may change over time, either through changes at the data source or in different data processing steps along the way, resulting in the need for continuous data quality checking and continuous monitoring of the systems in production (see Testing in Production lecture). For example, ideally we should create a schema for the data, monitor distribution shift, and track provenance of data gathered from the robot, such that software updates changing some data stream or a smudged lens on the robot’s camera do not result in crashes. If cleaning and quality assurance for data is not automated and monitored, subtle shifts can break the system or let it degrade slowly over time (see Data Quality lecture).
  • Monitoring debt: Evaluating a model on a static dataset is easy, but building a monitoring infrastructure to evaluate model quality in production (see Testing in Production chapter) requires substantial investment early on. Forgoing a robust monitoring infrastructure can cause high costs later when nobody notices that the model quality degrades or that the model serves certain populations poorly. For example, we may not notice that the robots have an unusual number of accidents in rainy weather conditions.
  • Low code quality, lack of pipeline automation: It is easy to develop models with powerful libraries on a snapshot of data in a notebook and to copy the resulting model into a production system. Not building a robust pipeline, e.g., automating all steps, removing dead code, monitoring all pipeline steps (see Infrastructure Quality chapter) and not writing usable and maintainable code (e.g., using abstraction, documenting code, avoiding code clones) might save some effort initially but can significantly increase effort for future releases and debugging. For example, we may run the robots on old and inconsistent versions of the models because we did not notice that the update process failed on most of them.
  • Visibility debt: Models are often learned from data from various sources and model predictions may be used by various other components in a system. Without clear data versioning and provenance tracking (see Provenance chapter) it is easy to lose track of dependencies, making it challenging to evolve the system. For example, we may not remember that training data for the obstacle detection module in the robot was augmented to increase model robustness and forget this step in future releases, resulting in much weaker models.
  • Entanglement: It is easy to build products that use multiple machine-learned models and have some models consume outputs of other models. However, it is difficult to separate the effects of these models in a project and to test them modularly. Significant effort may be needed for isolating models in the system design and for system-level testing and monitoring (see Quality Assurance chapters).
  • Feedback loops: Careful requirements analysis can help to detect potential feedback loops and design interventions as part of the system design, before they become harmful; skipping such analysis may lead to feedback loops manifesting in production that cause damage and are harder to fix later (including serious ethical and safety implications with long-term harms). For example, a delivery robot optimized for speedy delivery times may develop risky driving behavior, leading to pedestrians who mostly evade the robot, leading to even higher speeds and positive reinforcement for faster deliveries, but less safe neighborhoods.
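
The data-dependency checks from the first item (a schema for the data and a check for distribution shift) can be sketched with the standard library alone. The field names in `SCHEMA`, the brightness readings, and the threshold of three standard deviations are illustrative assumptions, not values from a real robot; production systems typically use dedicated data validation tooling and more robust drift statistics.

```python
from statistics import mean, stdev

# Hypothetical schema for one telemetry record sent by a robot.
# Simplification: an int would be rejected where a float is expected.
SCHEMA = {"robot_id": str, "speed_mps": float, "brightness": float}

def schema_errors(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    errors = []
    for field, expected in SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"wrong type for {field}")
    return errors

def mean_shift(reference: list[float], current: list[float]) -> float:
    """Shift of the current mean, measured in reference standard deviations."""
    spread = stdev(reference)
    difference = abs(mean(current) - mean(reference))
    if spread == 0:
        return 0.0 if difference == 0 else float("inf")
    return difference / spread

def drifted(reference: list[float], current: list[float],
            threshold: float = 3.0) -> bool:
    """Flag a feature whose mean moved more than `threshold` deviations."""
    return mean_shift(reference, current) > threshold

if __name__ == "__main__":
    print(schema_errors({"robot_id": "r1", "speed_mps": "fast"}))
    clean_lens = [0.50, 0.52, 0.48, 0.51, 0.49]
    smudged_lens = [0.20, 0.22, 0.19, 0.21, 0.20]
    print(drifted(clean_lens, smudged_lens))
```

Comparing only means is a deliberately crude drift signal; it would catch a smudged lens that darkens every image but not a change in the shape of the distribution.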

The key point is that machine learning brings with it many complexities that, if not controlled, can cause significant long-term costs. Describing those as technical debt may help to convince engineers and managers why they should invest in provenance tracking, monitoring, and other good engineering practices, for example, by hiring more team members or prioritizing infrastructure improvements over developing new features.
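
As one illustration of what such a monitoring investment might involve, the sketch below computes incident rates per slice of production data, in the spirit of the rainy-weather example from the monitoring-debt item. The record layout (`weather` and `incident` keys) and the rule of flagging slices at more than twice the overall rate are assumptions made for this example only.

```python
from collections import defaultdict

def incident_rate_by_slice(trips: list[dict], key: str) -> dict[str, float]:
    """Incident rate per value of `key` (for example, per weather condition)."""
    totals: dict[str, int] = defaultdict(int)
    incidents: dict[str, int] = defaultdict(int)
    for trip in trips:
        totals[trip[key]] += 1
        incidents[trip[key]] += int(trip["incident"])
    return {value: incidents[value] / totals[value] for value in totals}

def flag_slices(rates: dict[str, float], overall: float,
                factor: float = 2.0) -> list[str]:
    """Slices whose incident rate exceeds `factor` times the overall rate."""
    return sorted(s for s, rate in rates.items() if rate > factor * overall)

if __name__ == "__main__":
    # Hypothetical trip log: 16 dry trips (1 incident), 4 rainy trips (2 incidents).
    trips = (
        [{"weather": "dry", "incident": False}] * 15
        + [{"weather": "dry", "incident": True}]
        + [{"weather": "rain", "incident": False}] * 2
        + [{"weather": "rain", "incident": True}] * 2
    )
    overall = sum(t["incident"] for t in trips) / len(trips)
    rates = incident_rate_by_slice(trips, "weather")
    print(rates)
    print(flag_slices(rates, overall))
```

An overall incident rate alone would hide the problem here, because the rainy trips are a small share of all trips.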

Managing Technical Debt

Again, technical debt is not an inherently bad thing that needs to be avoided at all cost, but ideally, taking on technical debt should be a deliberate decision. Rather than inadvertently omitting provenance tracking or monitoring in production, one should consider the potential long-term costs of these decisions and deliberately decide whether skipping or delaying such infrastructure investment is worth the benefit, such as moving faster toward a first release. Making deliberate decisions about technical debt may require learning about best practices and tools first.

Not all technical debt can be repaid equally easily. For example, skipping risk analysis upfront may lead to a fundamentally flawed system design that will be hard to fix later: we may realize too late that our robots really should have redundant sensors, because otherwise we cannot assure safe operation on sidewalks. Other delayed actions may be easier to catch up on; for example, if we do not build a monitoring infrastructure today, we have little visibility into how the model performs, but we can add such infrastructure later without redesigning the entire system. In this sense, design-related technical debt is usually much harder to fix than automation-related technical debt.

In many cases, inadvertent technical debt can be avoided through education or by making it easier to do the right thing. For example, if using version control is part of the team culture, it is less likely to acquire technical debt from poor versioning. Similarly, if risk analysis is a normal part of the process or team members regularly ask questions about fairness in meetings, engineers are less likely to skip those steps without a good reason. Especially in more mature organizations, it may be easier to provide or even mandate a uniform infrastructure that automates important steps or prevents certain shortcuts, such as ensuring that all machine learning is conducted in a managed infrastructure that automatically versions data, tracks data dependencies, and tracks provenance (see Test Automation and Provenance chapters).

If a decision is made to take on technical debt, the debt should be managed. At a minimum, the necessary infrastructure and maintenance work should be tracked in an issue tracker. For example, when deciding to skip building a robust pipeline, it is a good idea to add a reminder to invest in a robust pipeline as a todo item for later (either as a short-term entry in a product backlog or as a long-term strategic plan in a product roadmap), as well as to add a regular reminder or process step to manually ensure that the deployment of a new version has actually succeeded. Ideally, tracked technical debt is assigned to a responsible person, the extra maintenance cost incurred is monitored, and specific goals are set for whether and when to pay it back. Some organizations adopt “fixit” days or weeks where engineers interrupt their usual work to focus on addressing a backlog of technical debt. Others may hire developers who specialize in infrastructure and support the existing teams.
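
The tracking described above (an owner, a monitored maintenance cost, and a payback goal) could be captured in a small debt register along the following lines. This is a sketch under stated assumptions: the `DebtItem` fields, the break-even heuristic, and all numbers are invented for illustration, and in practice the same information would usually live as fields or labels in the team's issue tracker.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DebtItem:
    title: str
    owner: str                  # person responsible for the item
    weekly_cost_hours: float    # extra maintenance effort observed per week
    payoff_effort_hours: float  # estimated effort to repay the debt
    due: date                   # agreed goal for paying it back

    def weeks_to_break_even(self) -> float:
        """Weeks until repaying the debt costs less than living with it."""
        if self.weekly_cost_hours <= 0:
            return float("inf")
        return self.payoff_effort_hours / self.weekly_cost_hours

def overdue(items: list[DebtItem], today: date) -> list[str]:
    """Titles of items whose payback goal has passed."""
    return [item.title for item in items if item.due < today]

def by_priority(items: list[DebtItem]) -> list[DebtItem]:
    """Items that pay for themselves soonest come first."""
    return sorted(items, key=lambda item: item.weeks_to_break_even())

if __name__ == "__main__":
    # Hypothetical register entries with made-up estimates.
    register = [
        DebtItem("Add production monitoring", "alex", 1.0, 30.0, date(2022, 3, 1)),
        DebtItem("Automate model deployment", "sam", 4.0, 40.0, date(2021, 12, 1)),
    ]
    print([item.title for item in by_priority(register)])
    print(overdue(register, today=date(2022, 1, 15)))
```

Ranking by break-even time is only one possible heuristic; it ignores risk, which matters for debt such as skipped safety testing where the cost is not a steady weekly overhead.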

Summary

Technical debt is a good metaphor to communicate the idea of taking shortcuts or delaying important work in order to get some short-term benefits (usually faster releases) at some long-term cost (usually lower productivity, higher maintenance cost). Ideally, taking on technical debt is a deliberate and prudent decision with debt then managed and actively repaid. In practice though, technical debt can often occur inadvertently and recklessly, often through inexperience or external pressure.

Machine learning brings many additional challenges that can easily result in high long-term maintenance costs if not addressed aggressively through design, infrastructure, and automation as discussed throughout this book. It is easy to build and deploy a machine-learned model in a quick-and-dirty fashion, but it may require significant engineering effort to build a maintainable production system that can be operated with reasonable cost and confidence. Delaying design, infrastructure, and automation investment might be prudent but should be considered deliberately, not done out of ignorance. If deciding to delay such important work, it is a good idea to track the resulting technical debt and ensure that time is eventually allocated to pay it back.


Written by Christian Kästner

associate professor @ Carnegie Mellon; software engineering, configurations, open source, SE4AI, juggling
