Introduction to Machine Learning in Production

Christian Kästner
Sep 17, 2021

This chapter covers material from the introductory lecture of our Machine Learning in Production course. For other chapters see the table of contents.

Video recording of the first lecture of our corresponding course, which covers many of the same topics.

Recently, machine learning has enabled incredible advances in the capabilities of software systems, allowing us to design systems that would have seemed like science fiction only one or two decades ago, such as medical assistance, personalized recommendations, interactive voice-controlled bots, and self-driving cars. Machine learning components are routinely integrated in web applications, mobile phone apps, and off-the-shelf devices, such as automatically generating personalized playlists, automatically improving photos taken with smartphones, and instructing camera traps to record only certain animal activities. Machine learning is an incredibly popular and active field, both in research and practice.

Yet, building and deploying products that use machine learning turns out to be incredibly challenging. Consultants report that 87% of machine-learning projects fail and 53% do not make it from prototype to production. Building an accurate model with machine-learning techniques is already difficult, but building a product and a business also requires collecting the right data, having the right business model, building an entire software system around the model, and successfully deploying, scaling, and operating the system with the machine-learned models. Pulling this off requires a wide range of skills, including data science and statistics, but also domain expertise, data management skills, software engineering capabilities, and business skills, among many others.

In this book, we focus on the software engineering aspects of building software systems with machine-learning components that can actually be used in production. We discuss how introducing machine learning creates new challenges for software projects and how to address them, and how data scientists and software engineers need to work together and understand each other to build not only models but production systems. This book has an engineering mindset: building systems that are usable, reliable, scalable, responsible, safe, secure, and so forth, but doing so with limited time, budget, and information. We rarely ever have enough time, budget, and information to build the perfect system; instead, we need to make compromises, navigate tradeoffs, and use responsible engineering judgment to make decisions.

Example: Automated Transcription Startup

Let us explore the challenges of building a production machine-learning system with a scenario, set a few years ago when deep neural networks first started to dominate speech recognition technology.

Assume that Sidney, a data scientist, has spent the last couple of years at a university pushing the state of the art in speech recognition technology. Specifically, the research focused on making it easy to specialize speech recognition for specific domains to detect technical terminology and jargon in those domains. The key idea was to train neural networks for speech recognition with lots of data (e.g., PBS transcripts) and to combine this with transfer learning on very small annotated domain-specific datasets. Sidney demonstrated the feasibility of the idea by showing how the models can achieve impressive accuracy for talks in medicine, discussions on poverty and inequality research, and talks at Ruby programming conferences, and managed to publish the work at high-profile academic machine-learning conferences.

During this research, Sidney talked with friends in other university departments who frequently conduct interviews for their research. They often need to transcribe recorded interviews (say, 30 interviews at 40 to 90 minutes each) as text for further analysis, and they are frustrated with current transcription services. Most researchers use transcription services that employ other humans to transcribe the audio recordings (e.g., by hiring crowdsourced workers on Amazon’s Mechanical Turk in the backend), usually priced at about $1.50 per minute and with a processing time of several days. At this time, a few services for machine-generated transcriptions and subtitles exist, such as YouTube’s automated subtitles, but they are not of great quality, especially when it comes to technical terms (systems have since improved much; consider, for example, temi.com as a state-of-the-art system in this field). Similarly, Sidney found that conference organizers are increasingly interested in providing live captions for talks to improve accessibility for conference attendees. Again, existing solutions often have humans in the loop and are expensive, or they provide poor-quality captions if entirely automated.

Seeing the demand for automated transcription of interviews and live captioning, and having achieved good results in her research, Sidney decides to commercialize the domain-specific speech-recognition technology by creating a startup with some friends. The goal is to sell domain-specific transcription and captioning tools to academic researchers and conference organizers, undercutting the prices of existing services while offering better quality.

Sidney quickly realizes that, even though the models trained in research are a great starting point, it will be a long path toward a stable product and a viable business. The team faces many challenges:

  • In academic papers, Sidney’s models outperformed the state of the art on accuracy measured on test data by a significant margin, but audio files received from customers are often noisier than the public-radio recordings used for training and evaluation in academic research.
  • During research, it did not matter much how long it took to train the model or transcribe an audio file. Now customers get impatient if their audio files are not transcribed within 15 minutes, and live captioning needs to be almost instantaneous. Making model inference faster and more scalable, so that many transcriptions can run in parallel, suddenly becomes an important focus for the team. Live captioning may even be unrealistic unless expensive specialized hardware is shipped to the conference venue to achieve acceptable latency in real-time settings.
  • The startup wants to undercut the market significantly and offer transcription at very low costs. However, both training and inference (the actual transcription with the model) are computationally expensive, so the amount of money paid to the cloud service provider substantially eats into the profit margins. It takes a lot of the team’s time to figure out a reasonable and competitive price to charge to customers.
  • While previously fully focused on data-science research, the team now needs to build a website where users can upload audio files and see results — which is not necessarily what they enjoy doing or have experience with. The user experience makes it clear that the website was an afterthought — it’s tedious to use and looks dated. Developers now also need to deal with payment providers like credit-card companies. Realizing in the morning that the website has been down all night or that some audio files have been stuck in a processing queue for days is no fun. They know that they should make sure that customer data is stored securely and privately, but they have little experience with this, and it is not a priority right now. Hiring a frontend web developer has helped with making the site look better and easier to change, but communication between team members with different backgrounds and making sure that the model, backend, and user interface work well together turns out to be much more challenging than anticipated.
  • The models were previously trained with many manual steps and a collection of scripts. Now, every time the model is improved or a new domain is added, someone needs to spend a lot of time on re-training the model and dealing with problems, restarting jobs, tuning hyperparameters, and so forth. Nobody has updated the TensorFlow library in almost a year, out of fear that something might break. Last week, a model update went spectacularly wrong, causing a major outage and a long night of trying to revert to a previous version and rerunning a lot of transcriptions for affected customers.
  • Customer feedback is mostly positive, but some customers are constantly unhappy and some report egregious mistakes (e.g., a diagnosis incorrectly transcribed with high confidence in a medical setting, or consistently low-quality transcriptions for speakers of African American Vernacular English). Several team members spend most of their time chasing problems, but debugging remains challenging, and every fixed problem surfaces three new ones. Unfortunately, unless a customer complains, the team has no visibility into how the model is really doing. They are also only now starting to collect basic statistics about whether customers return.
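Several of the operational concerns above, such as audio files stuck in a processing queue, boil down to fairly conventional backend engineering around the model. The following is a minimal sketch (not the startup's actual code) of a background job queue that decouples slow model inference from the website; `transcribe` is a hypothetical stand-in for the real model, and a production system would add persistence, retries, and monitoring so that jobs cannot silently get stuck.

```python
import queue
import threading
import time

jobs = queue.Queue()   # uploaded audio files waiting for transcription
results = {}           # job id -> transcript (a real system would use a database)

def transcribe(audio):
    """Placeholder for expensive model inference."""
    time.sleep(0.01)
    return f"transcript of {audio}"

def worker():
    """Process jobs one at a time so slow inference never blocks the web frontend."""
    while True:
        job_id, audio = jobs.get()
        try:
            results[job_id] = transcribe(audio)
        except Exception:
            results[job_id] = "error"  # surface failures instead of hanging forever
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

jobs.put((1, "interview1.wav"))
jobs.put((2, "interview2.wav"))
jobs.join()  # in a web app, clients would poll for results instead of blocking
```

Even this toy version makes design questions visible that never came up in the research prototype: how many workers to run, what to do with failed jobs, and how to report progress to waiting customers.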

Most of these challenges are probably not surprising, and many are not unique to projects with machine learning. Still, this example illustrates how going from an academic model or prototype to a production system is far from trivial and requires substantial engineering skills.

For the transcription service, the machine-learned model is clearly the essential core of the entire system. Yet, this example illustrates that, when building a production-ready system, there are many more concerns beyond training an accurate model. Even though the website for uploading audio files, showing transcripts, and accepting payment, as well as the backend for queuing and executing transcriptions, may be fairly standard, they involve substantial engineering effort and forms of expertise different from those required for model training.

In this book, we will systematically cover engineering aspects of building production systems with machine-learning components, such as this transcription service. We will argue for a broad view that considers the entire system, not just the model. We will cover requirements analysis, design and architecture, quality assurance, and operations, but also how to integrate work on many different artifacts involving team members with different backgrounds in a planned process, and how to make sure that responsible engineering practices to achieve, for example, fairness, safety, and security are not ignored. Many of the challenges discussed above can be anticipated, and some upfront investment in planning, design, and automation can avoid many problems down the road.

Data Scientists and Software Engineers

Building production systems with machine-learning components requires many forms of expertise. Among others,

  • we need the business skills to identify the problem and build a company,
  • we need domain expertise to understand the data and frame the goals for the machine-learning task,
  • we need the statistics and data science skills to select a suitable machine-learning technique and train a model,
  • we need the software engineering skills to build a system that integrates the model as one of its many components,
  • we need user-interface design skills to understand and plan how humans will interact with the system (and the model and its mistakes),
  • we may need system operations skills to handle deployment, scaling and monitoring of the system,
  • we may want help from data engineers to extract, move and prepare data at scale,
  • we may need legal expertise from lawyers who check for compliance with regulations and develop contracts with customers,
  • we may need specialized safety and security expertise to ensure the system does not cause harm to users or the environment and does not disclose sensitive information,
  • we may want social science skills to study how our system could affect society at large, and
  • we need project management skills to hold the whole team together and keep it focused on delivering a product.

In this book, to keep a manageable scope, we particularly focus on the roles of data scientists and software engineers working together on the core technical components of the system. We will touch on other roles occasionally, especially when it comes to deployment, human-AI interaction, project management, and responsible engineering, but generally we will focus on these two roles.

Data scientists and software engineers tend to have quite different skills and educational backgrounds, which are both needed for building production systems with machine-learning components. For example:

  • Data scientists tend to have an educational background (often even a PhD degree) in statistics and machine-learning techniques. They usually prefer to focus on building models (e.g., feature engineering, model architecture, hyperparameter tuning), but also spend a lot of time on gathering and cleaning data. They use a science-like exploratory workflow, often in computational notebooks like Jupyter. They tend to evaluate their work in terms of accuracy on held-out test data, and may investigate fairness or robustness of models, but rarely focus on other qualities like inference latency or training cost, or on deployment in a concrete system.
  • A typical data science course either focuses on how machine learning algorithms work or on applying machine learning algorithms to model specific problems, typically with a given dataset.
  • Software engineers tend to focus on delivering software products that meet the user’s needs, ideally within a given budget and time. This may involve steps like understanding the user’s requirements, designing the architecture of a system, implementing, testing, and deploying it at scale, and maintaining and improving it over time. Software engineers often work with limited information and budgets, and apply engineering judgment to navigate tradeoffs between various qualities, including usability, scalability, maintainability, security, development time, and cost.
  • A typical software engineering curriculum covers requirements engineering, software design, quality assurance (e.g., testing, test automation, static analysis), but also topics like distributed systems and security engineering.

The distinctions above are certainly oversimplified and overgeneralized, but they characterize many of the differences we observe in practice. We certainly do not argue that one group is superior to the other, but that they have different kinds of expertise, both of which are needed within a project to build a production system with machine-learning components. For our transcription scenario, we need data scientists to build the transcription models that form the core of the application, but also software engineers who build and maintain a product around the model.

Some developers have a broad range of skills that spans data science and software engineering. Such people are often called “unicorns,” since they are rare or even considered mythical (see the chapter on interdisciplinary teams). In practice, most people specialize in one area of expertise. Indeed, many data scientists report that they prefer modeling but do not enjoy work on infrastructure, automation, and building products. In contrast, many software engineers have developed an interest in machine learning, but research has shown that, without formal training, they tend to approach machine learning rather naively (e.g., little focus on feature engineering, rarely testing for generalization, thinking of more data and deep learning as the only next steps when stuck with low-accuracy models).

In practice, we need to bring people with different expertise who specialize in different aspects of the system together in interdisciplinary teams. However, to make those teams work, team members from different backgrounds need to be able to understand each other and appreciate each other’s skills and concerns. This is one of the key goals of this book: Rather than comprehensively teaching software engineering skills to data scientists or comprehensively teaching data science skills to software engineers, we will provide a broad overview of all the concerns that go into building production systems involving both data science and software engineering parts. We hope this will provide enough context that data scientists and software engineers appreciate each other’s contributions and work together, thus educating T-shaped team members (an idea we will explore in more detail in the chapter on interdisciplinary teams).

Machine-Learning Challenges in Software Projects

Within the software engineering community, there is still an ongoing debate on whether machine learning fundamentally changes how we engineer software systems (beyond the fact that some functions are now learned from data rather than coded manually) or whether we just need more of the same engineering practices that have long been taught to aspiring software engineers.

Let us look at three challenges, which we will explore in much more detail in later chapters of this book.

Lack of Specifications

In traditional software engineering, abstraction, reuse, and composition are key building blocks that allow us to decompose systems, work on parts in parallel, test units in isolation, and assemble the building blocks into the final system. A key requirement for such decomposition is that we can come up with a specification of what each component is supposed to do, so that we can test against that specification and so that others can rely on the component knowing only the specification, not the details of how the component is implemented. This also allows us to work with opaque components where we do not have access to the source code or do not understand the implementation.

/**
* compute deductions based on provided adjusted
* gross income and expenses in customer data.
*
* see tax code 26 U.S. Code A.1.B, PART VI
*
* Adjusted gross income must be positive;
* returned deductions are not negative.
*/
float computeDeductions(float agi, Expenses expenses);

With machine learning, we have a hard time coming up with good specifications. We use machine learning precisely because we do not know how to specify and implement certain tasks. For example, we can say that we learn a function to transcribe audio into text, but it is difficult to provide any more concrete description that we could use for testing or that would provide reliable contracts to the clients of the function.

/**
* Return the text spoken within the audio file
* ????
*/
String transcribe(File audioFile);

Machine learning introduces a fundamental shift from deductive reasoning (logic-based, applying logic rules) to inductive reasoning (science-like, generalizing from observations). As we will discuss at length (see the quality assurance chapters), we can no longer say whether a component is correct, because we have no specification of what it means to be correct; we can only evaluate whether it works well enough (on average) on some test data. We do not actually expect a perfect answer from a machine-learned model for every input, which also means our system must be able to tolerate some incorrect predictions, and this influences the way we design and validate systems.
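This inductive style of evaluation can be made concrete for our transcription example. Instead of testing `transcribe` against a specification, we measure its average word error rate on held-out test data; the sketch below uses a standard edit-distance computation, with `model` standing in for whatever transcription function we are evaluating.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance, normalized by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming edit distance over word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def evaluate(model, test_set):
    """Average word error rate over (audio, reference transcript) pairs."""
    return sum(word_error_rate(ref, model(audio))
               for audio, ref in test_set) / len(test_set)
```

Note what this does and does not tell us: a low average error rate says the model works well enough on data like the test set, but it makes no guarantee about any individual input, which is exactly the gap a traditional specification would have filled.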

While this shift seems drastic, software engineering has a long history of building safe systems with unreliable components, and machine-learned models may be considered one more type of unreliable component. In practice, fully-specified, formal specifications of software components are rare; instead, engineers routinely work with missing or vague textual specifications, compensating with agile methods, communication across teams, and lots of testing, including testing in production. Machine learning may push us further down this route than many traditional software projects, but we already have engineering practices for dealing with the challenges of missing or vague specifications. We will focus on these issues throughout the book, specifically in the quality assurance chapters.

Interacting with the Real World

Most software systems, including those with machine-learning components, interact with the environment (the real world™). For example, shopping recommendations influence how people behave, self-driving cars operate tons of steel through physical environments, and our transcription example may influence what medical diagnoses are recorded (with potentially life-threatening consequences from wrong transcriptions). Bias in our transcription service could harm certain populations by providing them with low-quality service or by publicly presenting them as incomprehensible. Such systems often raise safety concerns: when things go wrong, we may physically harm people or the environment or cause stress and anxiety.

In addition, machine-learning components are often trained based on data in the real world which may have been previously influenced by prior predictions from machine-learning models, resulting in feedback loops. For example, YouTube used to recommend conspiracy-theory videos much more than other videos, because its models realized that people who watch these types of videos tend to watch them a lot; by recommending these videos more frequently, YouTube could keep people longer on the platform, thus making people watch even more of these conspiracy videos and making the predictions even stronger in future versions of the model (a problem that YouTube only reduced by hard-coding rules around the machine-learned model).
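The dynamics of such a feedback loop can be illustrated with a toy simulation (not a model of YouTube's actual system): items are recommended in proportion to their past watch counts, and each recommendation produces another watch, so a small early advantage compounds, much like a Pólya urn.

```python
import random

random.seed(0)

watches = [10, 10, 10, 10]  # four items, initially (almost) equally popular
watches[0] += 1             # one item gets a tiny head start

for _ in range(1000):
    # the recommender picks an item weighted by current popularity...
    item = random.choices(range(4), weights=watches)[0]
    # ...and the recommendation generates another watch, reinforcing the weights
    watches[item] += 1
```

Running this repeatedly shows the popularity distribution becoming increasingly uneven even though the items are interchangeable; the "model" (the weights) is being trained on data that its own past predictions produced.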

As a system with machine-learning components influences the world, users may adapt their behavior in response, sometimes in unexpected ways, changing the nature of the very objects that the system was designed to model and predict in the first place. For example, users of our transcription service could modify their pronunciation to avoid common mistranscriptions. Through adversarial attacks (see security and privacy chapter), users may identify how models operate and try to game the system, for example, by evading face recognition algorithms or attaching stickers to road signs to crash self-driving cars. User behavior may shift over time, intentionally or naturally, resulting in data drift (see data quality chapter).
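Detecting such data drift in production usually comes down to comparing the distribution of inputs the model sees now against the distribution it was trained on. The following is a minimal sketch under simplifying assumptions: it compares word frequencies in recent transcripts against training-time frequencies using a smoothed KL divergence, and the threshold is purely illustrative.

```python
from collections import Counter
import math

def distribution(tokens, vocab):
    """Relative frequencies with add-one smoothing so every word has nonzero mass."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def kl_divergence(train_tokens, prod_tokens):
    """How surprising the production distribution is relative to training data."""
    vocab = set(train_tokens) | set(prod_tokens)
    p = distribution(prod_tokens, vocab)
    q = distribution(train_tokens, vocab)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)

def drifted(train_tokens, prod_tokens, threshold=0.5):
    """Flag drift when divergence exceeds a (here, made-up) threshold."""
    return kl_divergence(train_tokens, prod_tokens) > threshold
```

A real monitoring setup would track such a metric over sliding windows and alert operators, closing the visibility gap the startup in our scenario suffered from.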

Yet software systems interacted with the real world long before machine learning. Software without machine learning has caused harm, such as delivering radiation overdoses or crashing planes and spaceships. To prevent such issues, software engineers focus on requirements engineering to understand how the system interacts with the environment, to analyze risks, and to design safety mechanisms into the system. The use of machine learning may make this harder, because we are introducing more components that we do not fully understand and that are based on data that may not be neutral or representative. This may make it even more important to take risk analysis, requirements engineering, and monitoring seriously in projects with machine-learning components. We will focus on these issues extensively in the requirements engineering and responsible engineering chapters.

Data Focused and Scalable

Machine learning is often used to train models on massive amounts of data that do not fit on a single machine. Systems with machine-learning components often benefit from having many thousands or millions of users, as the system may collect data from those users, use that data to improve the system, which again may attract more users (known as the machine-learning flywheel, discussed in the testing in production chapter). To operate at such a scale, models are often deployed using distributed computing on top of a cloud infrastructure.

The Machine Learning Flywheel (graphic by CBInsights)

That is, when using machine learning, we may put much more emphasis on operating with huge amounts of data and at a massive scale, demanding substantial hardware and software infrastructure and causing substantial complexity for data management and deployment.

Yet data management and scalability are not entirely new challenges. Systems without machine-learning components have been operated in the cloud for well over a decade and have managed large amounts of data with batch processing (e.g., MapReduce) or stream processing infrastructure. However, demands and complexity for an average system with machine-learning components may well be higher than those for a typical software system without machine learning. We will discuss the design and operation of scalable systems primarily in the design and architecture chapters.

From Traditional to ML-Enabled Systems

Our conjecture in this book is that machine learning introduces many challenges to building production systems, but that there is also a vast amount of prior software engineering knowledge and experience that can be leveraged. Few challenges introduced by machine learning are uniquely new, but the use of machine learning often introduces complexity or risk that may require more careful engineering than some traditional software systems.

Rather than considering systems with machine-learning components as their own new category, we see a spectrum from simple to complex and from low-risk to high-risk software systems. We tend to have a good handle on building simple, low-risk software systems, such as content management systems, room scheduling systems, or inventory management systems, but we also have (more expensive) methods for building complex or high-risk software systems, such as airplanes, stock exchange systems, and massively distributed cloud storage. We argue that machine learning tends to push us toward the more complex or more risky end of the spectrum.

Speculation based on our observations: Most systems with machine-learning components tend to fall toward the more complex or more risky end of the spectrum of possible software systems, compared to traditional systems without machine learning.

Overall, while we may get away with sloppy engineering practices for simple and low-risk software systems, complex and high-risk systems push us to adopt more rigorous and more expensive engineering methods (typically with a stronger focus on requirements engineering, risk analysis, design, and quality assurance). While there are also plenty of low-risk machine-learning projects, for most systems with machine-learning components we may need to level up our software engineering practices. We should not pretend that they are easy and simple projects when they are not, but need to acknowledge that they may pose risks, may be harder to test, may need substantially more software and hardware infrastructure, and may be more expensive to maintain. Throughout this book, we give an overview of many of these practices that can be used to gain more confidence in even more complex and risky systems.

Summary

Machine learning has enabled many great and novel applications. With a lot of attention focused on novel machine-learning algorithms and applications, the engineering challenges of transitioning from a model prototype to a production system are often underestimated. In a full product that could be deployed in production, machine-learned models are only one of many components, though often an important or central one. Many challenges arise from building a system around a model, including building the right system (requirements), building it in a scalable and robust way (architecture), ensuring it can cope with mistakes made by the model (requirements, UI design, quality assurance), and ensuring it can be updated and monitored in production (operations).

Building production systems with machine-learning components is a truly interdisciplinary effort that requires a wide range of expertise, often beyond the capabilities of a single person. It really requires data scientists, software engineers, and others to work together, to understand each other, and to communicate effectively. This book hopes to help facilitate a better understanding.

Finally, machine learning may introduce characteristics that are different from many traditional software engineering projects, for example, through the lack of specifications, interactions with the real world, or data-focused and scalable designs. Machine learning often introduces additional complexity and possibly additional risks that call for responsible engineering practices. Whether we need entirely new practices, need to tailor established practices, or just need more of the same is still an open debate, but most projects can clearly benefit from more engineering discipline.

The rest of this book will dive into many of these topics in much more depth, including requirements, architecture, quality assurance, operations, teamwork and process, and responsible engineering practices.

Further Readings

As all chapters, this text is released under Creative Commons 4.0 BY-SA license.


Christian Kästner

associate professor @ Carnegie Mellon; software engineering, configurations, open source, SE4AI, juggling