Machine Learning in a Nutshell for Software Engineers
This chapter covers terminology used throughout our Machine Learning in Production course and book. It is not intended as standalone reading. For other chapters, see the table of contents.
While we expect that most readers of our book are familiar with machine-learning basics, in the following, we briefly define key terms to avoid ambiguity and describe the key concepts from a software-engineering perspective.
Basic Terms: Machine Learning, Models, Predictions
Machine learning is the subfield of artificial intelligence that deals with learning functions from observations (training data). More broadly, artificial intelligence includes many other techniques, including expert systems based on logic and probabilistic reasoning, that do not rely on observations. Deep learning describes a specific class of machine-learning approaches based on large neural networks.
A machine-learning algorithm (sometimes also called modeling technique), implemented in a machine-learning library or machine-learning framework, such as scikit-learn or TensorFlow, defines the training procedure of how the function is learned from the observations. The learned function is called a model; we often use the term machine-learned model to distinguish it from the many other kinds of models common in software engineering, which are usually manually created, not learned. The action of feeding observations into a machine-learning algorithm to learn a model is typically called model training. The model can compute outputs for inputs from the same domain as the training data; for example, a model trained to generate captions for images takes an image and returns a caption. These outputs are often called predictions, and the process of computing the predictions is called model inference.
Notice the difference between the machine-learning algorithm used during model training and the machine-learned model used during model inference. The former is the technique used to create the latter; the latter is what is actively used in a software system.
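As a minimal sketch of this distinction, consider the following Python example using scikit-learn; the tiny dataset is purely illustrative:

```python
# A minimal sketch of training vs. inference with scikit-learn.
from sklearn.tree import DecisionTreeClassifier

# Training data: observed inputs and the corresponding expected outputs
training_inputs = [[20, 0], [35, 1], [50, 1], [18, 0]]
training_labels = [0, 1, 1, 0]

# Model training: the machine-learning algorithm learns a model from observations
model = DecisionTreeClassifier()
model.fit(training_inputs, training_labels)

# Model inference: the learned model computes a prediction for a new input
prediction = model.predict([[42, 1]])
print(prediction)
```

After training, only the `model` object is needed to compute predictions; the learning algorithm itself plays no further role.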
Technical Concepts: Model Architectures, Model Parameters, Hyperparameters, Model Storage
There are many different machine-learning algorithms. In the context of this book, their internals and theory largely do not matter beyond the insight that they determine the capabilities and various quality attributes of both the learning process and the learned model, as discussed in more detail in the Model Tradeoffs chapter, together with a more technical description of decision trees and deep learning as examples of two very different machine-learning algorithms.
The process of training a model is often computationally intensive (up to years of machine time for large neural networks in deep learning). Models to be learned typically follow a basic internal structure specific to the learning algorithm, such as if-then-else chains in decision trees or certain sequences of matrix multiplications in deep neural networks; in deep learning, this structure is called the model architecture. The machine-learning algorithm identifies the values of the constants and thresholds in this structure (e.g., the conditions of the if-then-else statements or the values of the matrices); these learned constants and thresholds are called model parameters in the machine-learning community. Machine-learned models can have hundreds to millions of these learned parameters.
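For illustration, consider linear regression, which learns one parameter per input feature plus an intercept. A minimal sketch with scikit-learn, using made-up data points:

```python
# A sketch of learned model parameters, using linear regression in scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # inputs
y = np.array([2.1, 3.9, 6.2, 8.0])          # observed outputs

model = LinearRegression().fit(X, y)

# The learned constants are the model parameters: here a slope and an intercept.
print(model.coef_, model.intercept_)  # roughly [2.0] and 0.05
```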
For many machine-learning algorithms, the learning process itself is non-deterministic; that is, repeated learning on the same training data may produce slightly different models. Configuration options that control the learning process itself (e.g., when to stop learning) are called hyperparameters in the machine-learning community. The machine-learned models themselves tend to be side-effect-free, pure, deterministic functions. For large deep-learning models, computing a single prediction can require nontrivial computational effort (millions of floating-point computations); for most other models, inference is fast.
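In scikit-learn, for example, hyperparameters are typically passed as arguments when setting up the learning algorithm, whereas parameters are learned during training. A minimal sketch (the specific values are illustrative):

```python
# A sketch of hyperparameters: configuration options that control the
# learning process, chosen before training rather than learned from data.
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=5,         # hyperparameter: when to stop growing the tree
    min_samples_leaf=2,  # hyperparameter: minimum observations per leaf
    random_state=42,     # makes non-deterministic parts of learning repeatable
)
# The model parameters (split conditions and thresholds) are learned only now:
model.fit([[20, 0], [35, 1], [50, 1], [18, 0]], [0, 1, 1, 0])
```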
From a software engineering perspective, consider the following analogy: Where a compiler takes source code to generate an executable function, a machine-learning algorithm takes data to create a function. Just like the compiler, the machine-learning algorithm is no longer used at runtime, when the function is used to compute outputs for given inputs. In this analogy, hyperparameters correspond to compiler options.
Machine-learned models are typically not stored as binary executables, but in an intermediate format (“serialized” or “pickled”) describing the learned parameters in a given model structure. The model in this intermediate format can be loaded and interpreted by some runtime environment. This is not unlike Java, where the compiler produces bytecode, which is then interpreted at runtime by the Java virtual machine. Some machine-learning infrastructure also supports compiling machine-learned models into native machine code for faster execution.
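A minimal sketch of this serialization, using pickle from Python's standard library on a toy scikit-learn model (joblib and framework-specific formats are common alternatives):

```python
# A sketch of storing and loading a model in a serialized intermediate format.
import pickle
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier().fit([[20, 0], [35, 1]], [0, 1])  # toy model

# Serialize: write the model structure and learned parameters to a file
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later, possibly in a different process: load and interpret the stored model
with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict([[42, 1]]))
```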
Machine Learning Pipelines
Using a machine-learning algorithm to train a model from data is usually one of multiple steps in the process of building machine-learned models, which is typically characterized as a pipeline.
Once the purpose or goal of the model is clear (model requirements), but before the model can be learned, one has to acquire training data (data collection), identify the expected outcomes for that training data (data labeling), and prepare the data for training. The preparation often includes steps to identify and correct mistakes in the data, fill in missing data, and generally convert data into a format that machine-learning algorithms can handle well (e.g., many learn better on numeric data following certain distributions). These data-preparation activities are called data cleaning and feature engineering.
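A sketch of what such data preparation might look like with pandas; the file and column names (training_data.csv, age, income, city) are hypothetical:

```python
# A sketch of common data-cleaning and feature-engineering steps with pandas.
import pandas as pd

df = pd.read_csv("training_data.csv")

# Data cleaning: drop implausible rows, fill in missing values
df = df[df["age"].between(0, 120)]
df["income"] = df["income"].fillna(df["income"].median())

# Feature engineering: convert data into numeric features that
# machine-learning algorithms handle well
df["income_normalized"] = (df["income"] - df["income"].mean()) / df["income"].std()
df = pd.get_dummies(df, columns=["city"])  # one-hot encode categorical data
```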
After the model has been trained with a machine-learning algorithm (model training stage), it is typically evaluated. If the model is deemed good enough, it can be deployed, and is then often monitored in production.
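A minimal sketch of training followed by evaluation on held-out data, using scikit-learn's bundled iris dataset as a stand-in for real data:

```python
# A sketch of model training and evaluation on data not seen during training.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = DecisionTreeClassifier().fit(X_train, y_train)   # model training

# Model evaluation: compare predictions on held-out data to expected outcomes
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy on held-out data: {accuracy:.2f}")
```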
The entire process to develop models is highly iterative, incrementally tweaking different parts of the pipeline toward better models (more details in the Process chapter). For example, if the model evaluation is not satisfactory, data scientists might try different machine-learning algorithms or different hyperparameters for them, might try to collect more data, or might prepare data in different ways.
Most steps of the machine-learning pipeline are implemented with some code. For example, data preparation is often performed with programmed transformations (e.g., removing all outlier rows, normalizing values in a column) rather than manual changes to individual values. Depending on the machine-learning algorithm, training is typically done with very few lines of code calling the machine-learning library to set up the model architecture and hyperparameters and pass in the training data. Deployment and monitoring may require substantial infrastructure, as we will discuss.
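For example, a programmed transformation and a few lines of training code can be chained explicitly, sketched here with a scikit-learn Pipeline (the concrete steps are illustrative):

```python
# A sketch of pipeline steps as code: a programmed data transformation
# chained with a few lines that set up and train the model.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
pipeline = Pipeline([
    ("normalize", StandardScaler()),               # programmed transformation
    ("train", LogisticRegression(max_iter=1000)),  # model setup in few lines
])
pipeline.fit(X, y)
```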
During exploration, data scientists typically work on code snippets of various stages, one snippet at a time. Code snippets tend to be short and rely heavily on libraries for data processing and training. Computational notebooks like Jupyter are common for exploratory development with code cells for data cleaning, feature engineering, training, and evaluation. The entire process from receiving raw data to deploying and monitoring a model can be automated with code. This is typically described as pipeline automation.
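A sketch of what such an automated pipeline might look like as a single rerunnable script; the file name raw_data.csv, the label column, and the accuracy threshold are hypothetical placeholders:

```python
# A sketch of pipeline automation: all stages from raw data to a stored model
# in one script that can be rerun or scheduled.
import pickle
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def run_pipeline():
    df = pd.read_csv("raw_data.csv")                    # data collection
    df = df.dropna()                                    # data cleaning
    X, y = df.drop(columns=["label"]), df["label"]      # feature extraction
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    model = DecisionTreeClassifier().fit(X_train, y_train)    # model training
    accuracy = accuracy_score(y_test, model.predict(X_test))  # model evaluation
    if accuracy > 0.9:                       # illustrative quality threshold
        with open("model.pkl", "wb") as f:
            pickle.dump(model, f)            # hand off the model for deployment

if __name__ == "__main__":
    run_pipeline()
```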
On Terminology
Machine learning emerged from many different academic traditions, and terminology is not always used consistently. Different authors refer to the same concept with different words or even use the same word for different concepts, especially when crossing academic disciplines or domains. Above and in later chapters of the book, we introduce concepts explicitly with common and internally consistent terms, point out potential pitfalls (e.g., the meaning of “parameter” and “hyperparameter”), and try to use terms consistently throughout this book. We generally try to resolve ambiguity where terms may have multiple meanings, but avoid listing many alternative terms for the same concept to avoid additional confusion. For example, learning and training are often used interchangeably, and machine-learning algorithms may be referred to as modeling techniques.
Summary
It is important to distinguish the machine-learning algorithm, which defines how to learn a model from data (usually as part of a pipeline), from the learned model itself, which provides a function typically used for predictions by other parts of the system. Most steps of the machine-learning pipeline are often performed with relatively little code, and the entire pipeline can be automated.
Further Readings
- There are many books that provide an excellent introduction to machine learning and data science. For a technical and hands-on introduction, we recommend Géron, Aurélien. “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.” 2nd edition, O’Reilly Media, 2019.
- Many machine-learning books additionally focus on how machine-learning algorithms work internally and on the theory and math behind them. Interested readers might seek dedicated textbooks, such as Flach, Peter. “Machine Learning: The Art and Science of Algorithms That Make Sense of Data.” Cambridge University Press, 2012, or Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. “Deep Learning.” MIT Press, 2016.