Building Machine Learning Products with Interdisciplinary Teams
This chapter covers the “Fostering Interdisciplinary Teams” lecture of our Machine Learning in Production course (slides, video). For other chapters see the table of contents.
As discussed throughout this book, building production systems with machine-learning components requires interdisciplinary teams, combining the expertise of team members with different backgrounds.
Conflicts between team members with different backgrounds and roles are common in many projects. Frequently, members from one group dominate, while members from another group feel left out or are involved only very late. Throughout this book, we emphasized the differences and challenges between data scientists and software engineers and how either can feel underappreciated or ignored when building ML-enabled software systems. DevOps is frequently discussed as a way to alleviate similar tensions between developers and operators. Similarly, designers and user-experience experts often complain that they are brought into software projects too late, and security experts and lawyers likewise complain that they are involved only after the fact, when disaster strikes or when a checkbox needs to be checked, rather than proactively, when they could actually shape the product in a meaningful way.
In this chapter, we take a closer look at the different roles involved in building production systems with machine-learning components and how to manage some conflicts that can emerge in teams, especially interdisciplinary ones. We will also explore what we can learn from successful teams in general and DevOps culture as a specific example.
Scenario: Fighting Depression on Social Media
Consider a large video-heavy social media company that gets serious about concerns that social media use may foster addiction and depression among young users and wants to develop interventions to improve mental health and benefit society at large. There are many possible design solutions, such as not showing how often other posts have been liked or using natural-language analysis to detect and discourage toxic interactions, but let us assume the team has the ambitious plan of trying to predict whether users develop signs of depression based on their behavior, media consumption, and postings on the site. For depression prognosis, a set of sentiment-analysis techniques could be applied to video and text, and likely more specialized techniques would need to be developed. There is also a whole set of interventions that the company might consider once signs of depression are detected, including changing news-feed curation algorithms to filter negative content or certain ads for these users, popping up offers with information material about depression, designing small-group features, or even informing parents of young users.
This is a highly nontrivial and ambitious project. Even though the company has many experienced managers, software engineers, and data scientists, it expects to rely heavily on external help from medical experts in clinical depression, researchers who have studied interventions for depression, and ML researchers at top universities. There are many challenges in depression prognosis on a technical level, but also many design challenges in how to present results, how to design interventions, and how to mitigate risks and harms from false prognoses. This project will likely involve dozens of people with different expertise, within a large organization and with external collaborators.
Unicorns are not Enough
Discussions of skills and teams for bringing machine learning into production frequently talk about unicorns, those rare developers who combine all the needed expertise and experience in statistics, in engineering, and in the domain of interest (e.g., clinical depression). Depending on who you talk to, unicorns also have business expertise, people skills, or can do all the data collection.
As the term unicorn suggests, such multi-talented people are rare or might not even exist. Many developers will know, or at least have heard of, a few people who fit the description to some degree, but such people are extremely difficult and expensive to hire. Often, a better strategy is to go with an interdisciplinary team, where multiple people with expertise in different topics work together.
Working as a team or with multiple teams is often a necessity. Teams are necessary for division of labor and division of expertise for almost all nontrivial systems. Division of labor is important as systems become too big for a single person to build, where instead we want to divide the work so that multiple people can contribute. Division of expertise is important, as a single person is rarely an expert in all relevant topics or even has the cognitive capacity to develop deep expertise in all relevant topics. Again, teams can bring together people with different expertise.
Many skills are associated with more or less established roles and job profiles. For example, throughout much of this book we have contrasted the roles of data scientists and software engineers. There are of course further specializations within each of these, such as data architects, statisticians, and NLP experts on the machine-learning side and requirements engineers, software architects, and software testers on the software-engineering side. Specializations can be fluid, and role titles and descriptions often evolve over time. Beyond data scientists and software engineers, many other skills associated with other roles are needed in most ML-enabled systems. For example, in the depression project, we might want to work with clinical experts in depression who provide expertise and evaluate the system, with social workers and social scientists who design interventions, with designers who propose a user-interaction design, with operators who deploy and scale the system in production, with a security expert who audits the system, and probably with a project manager who tries to keep things together.
All this points to the fact that teams are inevitable. As systems grow in size and complexity, the times where systems are built by a lone genius are largely over. Teams are needed, and those teams need to be interdisciplinary.
Conflicts Within and Between Teams
Conflicts can arise both within and between teams. Within teams, members often work together more closely, but conflicts still arise when members have conflicting goals (often associated with different roles) or simply do not understand each other well. Between teams, collaboration is more distant and conflicts can arise more easily. Conflicts often arise when the responsibilities are not clear, when teams are sorted by role but do not communicate much with each other (e.g., a siloed software engineering team versus a siloed data science team), when teams have conflicting goals or incentives, and when power structures in an organization let one team define priorities without listening to the needs of others.
A typical ML-enabled software project involves many different roles, each with their own specialties, biases, and goals. It is not surprising that conflicts abound when not managed carefully. Examples we have seen include:
- Managers making promises to customers without involving data scientists to check feasibility and without any awareness of the needed data quality and quantity.
- Data scientists not taking engineering work seriously and ignoring what is needed to operate the system.
- Software engineers not involving data scientists in planning meetings.
- Organizations paying little attention to teams collecting, entering, and labeling data until conflicts over data quality arise.
- Data scientists focused on solving an interesting and challenging machine-learning problem while ignoring that a much simpler solution meets the business goals.
- User interface designers asked to create a user interface after problematic data collection and interaction modes were already decided, leaving little room for designs that reduce consequences of mistakes.
- Domain experts not being available to data scientists until poor design decisions and a poor understanding of the data render the models useless.
- Legal experts only being consulted at the last minute before the release, when they have little influence to shape privacy and fairness decisions.
- Data engineers, infrastructure teams, and operators blamed for problems but provided with few resources and little advance notice.
- Product owners not sharing data with data science teams over concerns about data security and privacy or over priority conflicts.
Focusing more narrowly on data scientists and software engineers, let us take the example of two anonymized teams from organizations we have interviewed.
In organization A, a team of four data scientists, two of whom have a software-engineering background, builds and deploys a model to be used in a product (quality control for a production process) within a different company. Interactions between the customer and the data-science team are rather formal, as they span company boundaries and involve formal contracts. The model is integrated into a product by the engineering team of the client, which has limited experience in machine learning — hence work on the final system is split across two teams in two organizations. The data-science team is given data and a modeling task and has no power in shaping data collection or model requirements; their decisions are restricted to local modeling decisions. The data scientists report that they need to invest a lot of time in educating their client about the importance of data quality and quantity and that they have to fight back against unrealistic expectations. At the same time, they struggle to get access to domain experts for the provided data in the client’s team. As the provided data is proprietary and confidential, the client imposes severe restrictions on how the data scientists can work with it, limiting their choices and tools. The data scientists had to manage a lot of complex infrastructure themselves and would have preferred to do less engineering work, but found it hard to convince their management to hire more engineers. Since the data scientists were not given any requirements about latency, explainability, fairness, or robustness, they did not feel responsible for exploring or testing those qualities.
In organization B, an interdisciplinary team with eight members works on an ML-enabled product (health domain) for a government client. The team is composed mostly of software engineers, with one data scientist. Even though it is technically a single team with low internal communication barriers, the team members form internal groups within the team. The sole data scientist reports feeling isolated, not well integrated into the decision-making process by the other team members, and having little awareness of the larger system and how the model fits into it. Similarly, team members responsible for operations report poor communication with others in the team and ad-hoc processes. The software engineers on the team, who have limited experience in machine learning, find it challenging to wrap and deploy ML code and feel like they need to educate others on the team about code quality. All team members focus on their own parts, with little effort toward integration testing, online testing, or any joint evaluation of the entire product. The team is responsible for identifying suitable data themselves and for building both the model and the product; communication with the client is rare, and they have little access to domain experts or end users. Multiple team members report communication challenges due to the different backgrounds of the team members and a lack of documentation, and they find it challenging to plan milestones as a team. The client’s requirements are neither well defined nor stable, leading to lots of unplanned changes. The client often makes requests that seem unrealistic with the available data, but the data scientist is not involved when the team lead or software engineers communicate with the client, so they cannot push back on such requests early. While the client is eventually happy with the functionality, the client does not like the user interface of the product, which has been neglected in all discussions and requirements collection.
In the following, we explore several factors that underlie common team problems and, conversely, strategies to strengthen interdisciplinary teams by deliberately designing team structures and processes.
Coordination Costs
As teams grow in size, so do coordination costs and the need for structure. For example, team members may complain about spending too much time in meetings and having to coordinate with too many others. While a team with 3 members can usually coordinate their work easily, a team with 10 members is harder to coordinate, and a project with more than 20 people, as likely needed for the depression project, definitely needs to be organized as multiple smaller teams. These process costs are well understood, and dedicated management practices are designed to reduce them.
The key observation is that the number of communication links over which people may need to coordinate grows quadratically with the number of members in a team, but that introducing subgroups that coordinate through representatives can reduce that number.
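To make the quadratic growth concrete, here is a small back-of-the-envelope calculation (our own illustration of the argument, with made-up team sizes):

```latex
% Pairwise communication links among n team members:
\text{links}(n) = \binom{n}{2} = \frac{n(n-1)}{2}
% e.g., links(5) = 10, \quad links(20) = 190
% Splitting 20 people into 4 teams of 5 that coordinate only
% through one representative per team:
4 \cdot \binom{5}{2} + \binom{4}{2} = 4 \cdot 10 + 6 = 46 \text{ links}
```

Even in this rough model, structuring the 20-person project into four teams cuts the number of possible communication links from 190 to 46.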
Brooks’s law, a famous observation in software engineering stating that “adding more people to a late software project makes it later,” can be partially explained this way: In addition to the onboarding effort for new team members, each additional team member adds coordination overhead that slows everybody down. Overall, there are many observations that large teams are not much faster than small teams, while introducing more bugs at the same time.
Large teams (29 people) create around six times as many defects as small teams (3 people) and obviously burn through a lot more money. Yet, the large team appears to produce about the same amount of output in only an average of 12 days’ less time. This is a truly astonishing finding, though it fits with my personal experience on projects over 35 years. — Phillip Armour, 2006, CACM 49:9
There are many different ways to structure a project into multiple teams and to manage communication patterns, a topic beyond the scope of this book. Almost all of them aim for teams of 3 to 9 members each, with some coordination structure between teams, possibly but not necessarily with some hierarchy.
Socio-technical congruence
A key organizational principle is that the software structure should align with the team structure, which is known as socio-technical congruence or Conway’s law. Ideally, a project is broken down into components at the architectural level and different teams will work on different components. Conway’s observation was that this will often happen naturally, as developers design system components to align with the existing team structure (informally “if you have three groups working on a compiler, you’ll get a 3-pass compiler; if you have four groups working on a compiler, you’ll get a 4-pass compiler”).
Socio-technical congruence is important because communication within a team is much easier than communication across teams. For example, team members may share progress daily in standup meetings, but may only learn about important updates from other teams through their managers or from mailing lists with a low signal-to-noise ratio. Cases where system structure does not align with team structure, that is, a single component is maintained by developers from different teams, have been shown to be associated with slower development and more defects in many studies.
In projects with machine-learning components, the need for diverse expertise combined with socio-technical congruence makes a strong argument for creating interdisciplinary teams that work on individual components, rather than sorting people by specialty into teams (called siloing), where multiple teams would then have to coordinate on the same component. In our example, we likely want software engineers to be part of the team that develops the depression prognosis model, and the team that integrates the model into the user interface will likely benefit from having machine-learning expertise on the team to better anticipate and mitigate mistakes and to design the telemetry and monitoring system.
Not every team can justify hiring a full-time specialist for every problem, especially when specialists are rare or involved only for short periods of time. Teams building machine-learning components should be deliberate about bringing in software engineers and domain experts, or at least establish clear communication links to recruit outside help when needed. We will discuss different forms of team organization below in the context of conflicting goals.
Information hiding and interfaces
Communication within a team is usually somewhat informal and less problematic (e.g., standup meetings, shared offices), though teams with poor internal cohesion can fracture into smaller groups, as we have seen in organization B above. In contrast, communication across teams usually needs more planning. Team members who represent the team and coordinate with representatives of other teams (often people with some managerial responsibilities) carry a higher burden of identifying what needs to be communicated and coordinated across teams. They need to understand the concerns of the entire team as it interfaces with others. For example, the representative of a mostly engineering-focused team also needs to understand the concerns of the data scientist, user-interface designer, or operator on their team when interfacing with others to negotiate requirements.
In general, structuring a group into smaller teams reduces coordination overhead, but risks that important information is lost between teams. When individual team members notice a need to coordinate, they may try to reach out directly to counterparts in other teams without involving their managers (if organizational structures and culture permit it). However, when structures erode and team members from one team have to coordinate with many members from many other teams, we are back to the communication overhead of having only a single very large team. Hence, it is important to foster good communication patterns and to have representatives who can speak for the concerns of the entire team.
At a technical level, the traditional strategy to limit the need for communication is information hiding. The idea is that a module is represented by an interface, and others only need to know that interface, but not any implementation details behind that interface. Ideally, teams responsible for different modules only need to coordinate when deciding on the interfaces or when changing interfaces, for example, to accommodate changing requirements.
Information hiding behind interfaces is the key reason why we argued for modular implementations throughout the architecture and design chapters: If we can isolate non-ML components from ML components (including inference and pipeline components) and document their interfaces, then separate teams can work on different components without permanent coordination. As discussed, though, practices for defining and documenting interfaces are not well developed when it comes to machine-learning components, which explains why friction often emerges at the interface between data-science and software-engineering teams.
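To make this concrete, the sketch below shows what a minimally documented interface for the depression-prognosis inference component could look like. It is a hypothetical illustration: the function name, types, and documented qualities are our own assumptions, not a standard format.

```python
from dataclasses import dataclass

@dataclass
class DepressionRisk:
    """Prediction result returned by the model inference component."""
    score: float        # estimated risk in [0, 1]
    confidence: float   # calibrated confidence in [0, 1]
    model_version: str  # version of the model that produced the prediction

def predict_depression_risk(user_activity: dict) -> DepressionRisk:
    """Hypothetical interface between the model team and the UI team.

    Documented assumptions (illustrative, to be negotiated and revised):
    - Input: 30 days of interaction logs in the agreed-upon schema.
    - Target distribution: active users aged 13-25; other populations
      are out of scope and untested.
    - Qualities: evaluated offline on held-out data (see model documentation);
      expected median inference latency below 50 ms per request.
    - Known limitation: low confidence for accounts younger than 30 days.
    """
    ...  # implementation owned by the model team, hidden behind this interface
```

Even such a lightweight contract tells the UI team what they can rely on and tells the model team which qualities other teams care about.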
Information hiding also relies on interfaces being stable. It is less likely that teams can agree on and write down stable interface specifications for machine-learning components during an early design phase. Given the exploratory and iterative nature of data science (see chapter Data Science and Software Engineering Process Models), teams often explore what is possible as the project evolves. Teams might need to renegotiate over time, such that the interface emerges and evolves as the project matures. For example, in the depression prognosis project, there might be a long exploratory phase in which a data-science team tests the feasibility of detecting depression with various models and from different data, before they can even consider writing an interface definition or documenting requirements more deliberately. Data scientists may explore to what degree explanations can be provided for predictions only late in the project, when pushed by user-interface designers, and may assess fairness only in an audit once a prototype with acceptable accuracy has been released; they may redesign parts of the model and revisit assumptions about data and other components only when deployment at scale is planned or when privacy concerns are raised by lawyers. It seems unlikely that multiple teams could agree on stable interfaces upfront and then work largely in isolation in this project.
The dark side of divide and conquer
Information hiding reduces communication and coordination work by focusing coordination on the interfaces of components and then letting teams work on each component largely independently. Friction commonly occurs when interfaces need to be renegotiated and changed, which often happens when the initial interfaces were not documented clearly or when important concerns were overlooked while designing them. Friction also very commonly stems from coordinating system-level concerns that cut across multiple components, such as the usability and safety of a system.
Compartmentalizing work into components often seems appealing, only for teams to realize later that nuances got lost when interfaces were defined prematurely or incompletely, without a full understanding of the system. In our depression prognosis example, it would be tempting to delegate data acquisition and cleaning to a separate team, but defining this data-collection task well requires an understanding of social media, of depression, of the data scientists’ needs when building models, of how personal data can legally be processed in various jurisdictions (lawyers), of what telemetry data can realistically be collected (user-interface developers), and more. It is unlikely that a data-collection component can be well specified in an early planning phase of the project without cross-team coordination and substantial exploration. Starting with a poorly defined task provides the illusion of being able to make progress in isolation, only to discover problems later. On the positive side, even a preliminary understanding of interfaces will allow the data-collection team to identify which other teams to engage when coordination and renegotiation of the interface becomes necessary.
This compartmentalization of work can lead to a diffusion of responsibility, where local interfaces do not indicate a need to consider safety, fairness, privacy, usability, operational efficiency, and other system-level concerns. Each team may assume that others will take care of them. Even worse, problems may only manifest when the system is deployed and interacts with the environment in a specific context. Team members concerned about these system-wide issues can have a hard time coordinating across many teams when the need for this coordination is not apparent from the component interfaces. For example, in the depression prognosis project it may be difficult to identify which team is responsible for the fairness of the system toward different demographics. We can assign a fairness audit as a responsibility to the model team, but might easily ignore how usage patterns of teen girls, combined with the implemented telemetry and reporting interface, may lead to skewed outcomes that discriminate against certain populations — fairness cuts across the entire system and cannot be assessed in any single component, yet it may not rise to the level of a serious concern in most components and may be largely ignored when defining component interfaces.
Overall, projects need to balance attention to system-wide concerns with the need to divide work into individual components, whose interfaces cannot realistically capture every possible interaction among components and their design decisions. Project managers and system architects who oversee the definition of components and the negotiation of interfaces have a particular responsibility to pay attention to such concerns. It is often a good idea to bring in specialists at early stages when considering how to divide the work.
Awareness
A second, often complementary, strategy to coordinate work across teams is to foster awareness. Here, information is not hidden but actively broadcast to all who might be affected. Examples of activities that foster awareness include cross-team meetings, sending emails to a mailing list or announcing changes on Slack, but also observing an activity feed of the version-control system (e.g., on GitHub) and subscribing to changes in an issue tracker. Broadcasting every decision and change typically leads quickly to information overload, not unlike the process cost in large teams, where everybody spends a lot of time in meetings and important information may easily get lost in the noise. However, more targeted awareness mechanisms can be effective, for example, informing other teams proactively about important changes or setting up automated notifications for code changes. Filtering can be done by the sender (e.g., informing only select contacts in other teams) or the receiver (e.g., subscribing to the right mailing list or setting up automated alerts on keyword matches). For example, a developer who maintains the user-facing frontend may subscribe to code changes in the machine-learning pipeline that relate to training data, to observe whether additional data is used that may suggest changes to explanations shown in the user interface or to end-user documentation; a data scientist planning to update a model’s internal architecture, anticipating changes to inference latency, may proactively reach out to all users of the model to see whether the change has any implications for them.
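As a small illustration of receiver-side filtering, the sketch below notifies the frontend team only about commits that touch training data. The file paths, channel name, and notification function are hypothetical placeholders for whatever repository layout and messaging integration a team actually uses.

```python
# Hypothetical receiver-side awareness filter: only notify the frontend team
# when a commit touches training data in the ML pipeline.
WATCHED_PREFIXES = ("pipeline/training_data/", "pipeline/labeling/")  # assumed repository layout

def relevant_changes(changed_files):
    """Keep only the changed files the frontend team cares about."""
    return [path for path in changed_files if path.startswith(WATCHED_PREFIXES)]

def notify_team(channel, message):
    """Placeholder for the team's actual messaging integration (e.g., a chat webhook)."""
    print(f"[{channel}] {message}")

def handle_commit_event(commit_message, changed_files):
    """Called for every commit event, e.g., by a version-control webhook."""
    relevant = relevant_changes(changed_files)
    if relevant:
        notify_team("#frontend-ml-updates",
                    f"Training data changed ({commit_message}): " + ", ".join(relevant))

# Example invocation with a made-up commit
handle_commit_event("add labeled samples for new user cohort",
                    ["pipeline/training_data/2024-01.csv", "docs/notes.md"])
```

The mechanism matters less than the principle: relevant changes reach the people affected without flooding everyone with every notification.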
Personal awareness strategies, such as opening informal cross-team communication links, often rely on personal relationships and are not robust when team members leave, frequently leading to subsequent collaboration problems. It can be a good idea to formalize such communication links, for example, by documenting whom to contact for various issues or by creating topic-specific mailing lists. For example, in the depression project, the data scientists may maintain a list of contacts in the engineering teams of the social media platform who can explain (and possibly change) how data about user interactions is logged in the database, especially if that is not documented anywhere.
Awareness is less effective if teams span multiple organizations with formal boundaries and contracts. Team members may still reach out informally to teams in other organizations, such as data scientists trying to find domain experts in the client’s organization, but it is usually more challenging to establish communication channels and even more challenging to sustain them to establish awareness.
Most multi-team software projects use some combination of information hiding and awareness. Both have their merits and limits. At least with current common practices, information hiding seems to be a less effective strategy for machine-learning components because of the difficulty of clearly documenting stable interfaces. Hence, teams in ML-related projects should consider either investing in better practices for stabilizing and documenting interfaces or putting more thought into deliberate awareness mechanisms, to keep relevant people in other teams updated on developments without overwhelming them with meetings or messages.
Conflicting Goals
A well-known dysfunction in (interdisciplinary) teams is that different team members have different goals, which may be in conflict with each other. In traditional software development, developers may try to get functionality done as quickly as possible, whereas testers aim to find defects, operators aim to use computational resources efficiently, and compliance lawyers try to avoid the company getting sued over privacy violations. With data scientists, another group of team members enters the picture with their own goals, typically to maximize the accuracy of a model or to find new insights in data.
In our depression prognosis example, data scientists and lawyers may have very different preferences about what kind of data can be analyzed to further their respective goals, operators may prefer smaller models with lower latency, and user-experience designers may value accurate confidence estimates and meaningful explanations of predictions more than higher average accuracy across all predictions, so that they can design mitigation strategies for wrong predictions. If team members optimize only for their own goals, they may contribute to the project in ways that are incompatible with what others want and need, for example, learning depression prognosis models that are expensive to use in production and produce false positives that drive away users or even cause some users to seek therapy unnecessarily. Different team members may be evaluated by different metrics in performance reviews, e.g., software engineers by features shipped and data scientists by accuracy improvements achieved. Team members in some roles, such as fairness advocates and lawyers, commonly struggle to quantify their contributions for reviews, since their work usually manifests only in the absence of problems. Team members each optimizing only for their own goals commonly leads to inefficient outcomes for the system. To improve teamwork, members of a project, often across multiple teams, need to communicate to resolve their goal conflicts.
T-shaped people
To resolve goal conflicts, it is first necessary to identify them. Here, it is useful if team members are transparent about what they try to achieve in the project (e.g., maximize prediction accuracy, ship the feature as fast as possible, minimize system downtime in operation, minimize legal liabilities, maximize visibility for marketing) and how they measure success. Team members need to make an effort to understand the goals of other team members and why those goals are important to them. This often requires some understanding of a team member’s expertise, e.g., their concerns about modeling capabilities, about presenting results to users, about operating costs, or about legal liabilities.
While, as discussed above, it is unlikely that team members are experts in all of the skills needed for the project (“unicorns”), it is very useful if they have a basic understanding of the various concerns involved. For example, a data scientist does not need to be an expert in privacy laws, but they should have enough of a basic understanding to appreciate the concerns of a privacy expert. Here, the idea of T-shaped people (mentioned already in the introductory chapter) is a powerful metaphor for thinking about how to build interdisciplinary teams: Instead of hiring people who narrowly specialize in one topic (experts, I-shaped) or generalists who cover lots of topics but are not experts in any of them, T-shaped people combine deep expertise in one topic with general knowledge of other topics. T-shaped people make good team members because, in addition to bringing deep expertise to a project (e.g., in machine learning), they can also communicate with experts in other fields and appreciate their concerns and goals. For example, a T-shaped data scientist in the depression prognosis project may learn domain expertise about clinical depression, may have basic software-engineering skills (as discussed throughout this book), may know the basics of distributed systems, and may know how to speak to an expert on privacy laws. Ideally, teams are composed of T-shaped members to cover all relevant areas with deep expertise while also allowing communication between the different specialists.
If team members understand each other and their goals, it is much easier to compromise and agree on goals for the entire project as a team. For example, the team may identify that certain privacy laws impose hard constraints that cannot be avoided, and the data scientists can identify which requests from the software engineers (e.g., explainability) are important for the system’s success when weighing them against other qualities, such as prediction accuracy, in their work. Similarly, software engineers who understand the basics of machine learning may have a better sense of which requests for explainability and latency are realistic for the system and of how to anticipate mistakes of the model. Understanding each other will also help to explain contributions to project success in performance reviews, even when they are not easily reflected in, or even negatively impact, a member’s own metrics, such as a data scientist accepting lower accuracy (their typical success metric) to support explainability needed by other team members (benefiting their metrics and overall project success) or fairness considerations for the system.
Generally, it is useful to expose all views and to consider conflicts a useful part of the process that warrants team-wide discussion. However, it is equally important to come to a decision as a team, to commit to specific constraints and tradeoffs, and to stand by that decision. Contributions of team members should then be evaluated against that compromise, which codifies the goals for the entire team. Writing down the agreement is essential.
Team organization
In large organizations, there is the additional challenge of how to organize multiple project teams and how to allocate experts to teams, which has been extensively studied in the management literature. Grouping people by expertise, such as a data-science team and a security team, allows experts to work together with others in their field and share expertise, but risks isolating the experts from others in the organization, reducing collaboration across disciplines. We often speak of siloed organizations.
Project focus versus information sharing. Management literature typically contrasts two common forms of organization: matrix organization and project organization. In a matrix organization, experts are sorted into departments, and they contribute to projects across departments (e.g., IT department, sales department, data science group). Each department has its own management structure, its own goals, and its own hiring priorities. Members of the different departments are lent to specific projects; they often work on one or multiple projects and also on tasks in their home departments. Each project member reports both to the team lead and to their home department. In contrast, in a project organization, team members belong exclusively to a project and do not have an additional home department; the organization may not have any departments at all. Each project recruits team members from the pool of employees in the organization or hires additional team members externally; when the project is completed, team members join other projects.
Matrix organizations foster information sharing across projects and can help develop deep expertise in each department. For example, data scientists working on different projects may communicate to share their latest insights and suggest interesting recent papers to their colleagues. However, in a matrix organization, individuals often report to both the project and the department, setting up conflicts. In a project organization, team members can focus exclusively on the project, but specialists may have fewer opportunities to learn from other specialists in their field working on other projects, and they may independently reinvent and reimplement similar functionality in several projects. This tradeoff between focus within projects and information sharing across teams is well known; for example, Mantle and Lichty report their observations of this conflict at the video game company Brøderbund:
“As the functional departments grew, staffing the heavily matrixed projects became more and more of a nightmare. To address this, the company reorganized itself into “Studios”, each with dedicated resources for each of the major functional areas reporting up to a Studio manager. Given direct responsibility for performance and compensation, Studio managers could allocate resources freely.
The Studios were able to exert more direct control on the projects and team members, but not without a cost. The major problem that emerged from Brøderbund’s Studio reorganization was that members of the various functional disciplines began to lose touch with their functional counterparts. Experience wasn’t shared as easily. Over time, duplicate efforts began to appear.”
To address this tradeoff, many hybrid forms have been explored to balance these team issues. For example, informal in-house communication networks among specialists on different teams can provide opportunities for exchange and shared learning, such as through in-house meetups or invited talks.
Allocation of specialists. Another drawback of a pure project organization is that some projects may need expertise without being able to justify a full-time position in that role, wanting to bring in specialists only for a short duration or part time. For example, the data scientists for the depression prognosis system may want to consult with fairness specialists occasionally early on and later recruit them for an audit, when the project cannot afford a full-time position dedicated to fairness and none of the team members has sufficient expertise for the task (or the interest or aptitude to gain that expertise). Research shows that, in practice, many machine-learning projects struggle particularly to allocate resources for responsible machine learning, including fairness work.
When considering how to allocate rare expertise across project teams in machine-learning projects, we can learn from other fields. The software engineering community went through similar struggles about two decades earlier with regard to how to allocate security expertise in software teams:
- Security is a quality that is affected by all engineering activity. Some engineers were very invested in making software secure, but for most this was not a focus or was even perceived as a distraction — something slowing them down when pursuing their primary goal of developing new features. Similarly, these days fairness and responsible engineering more broadly are widely seen as important, but only a few team members really focus on them, often without much structural support.
- Attempts to educate all engineers with deep knowledge in security were met with resistance. Since security is a broad and complicated topic, becoming an expert requires significant training, and continuous learning is needed as the field evolves. Engineers acknowledged that security is important, but usually did not consider it a primary concern in their own work. Security workshops were perceived as compliance tasks with little buy-in. Similarly, fairness is a deep and nuanced issue that is hard to capture in a short training; becoming an expert requires a significant investment that many team members will perceive as contributing little to their main work and as mere checkbox compliance.
- A more successful strategy to improve software security was to provide basic training on security that conveys just the key concerns and principles, without covering all technical details. This established security literacy: developers could recognize when security might become important in their work and knew when and how to bring in help from experts in the organization. In addition, certain security practices could be enforced with tools (e.g., adopting more secure libraries, automated checks) or as part of the process (e.g., requiring signoff on architectural designs by a security expert). Similarly, creating awareness for fairness issues and bringing in experts as needed is likely a better strategy than enforcing in-depth training for everybody, and compliance (involving experts) can be integrated into processes. Note how this approach mirrors the idea of T-shaped people who know the concerns and know when and how to ask experts for help, especially when resources are limited and not every team can afford to hire its own expert.
For developing production ML-enabled systems, a team will likely need multiple specialists in software engineering, data science, and operations, and siloing off those roles will likely foster conflicts that communication by T-shaped people within a team could much more easily resolve. For example, the depression prognosis team might not hire a privacy lawyer just for this project, but may develop an understanding of when and whom to contact in an outside organization or, if it exists, in the organization’s own legal department.
Groupthink
Another common dysfunction in teams is groupthink, the tendency of a group to minimize conflict by (1) avoiding the exploration of alternatives, (2) suppressing dissenting views, and (3) isolating the group from outside influences. Groupthink leads to irrational decision making, where the group agrees with the first suggestion or with the suggestion of a high-status group member. Symptoms include overestimating the team’s abilities, general closed-mindedness, pressure toward uniformity, and self-censoring. Such a mindset can limit innovation and ignore risks.
In our running example, the depression prognosis team may explore deep learning to identify depression in social media use without considering any other learning techniques, just because a manager or experienced data scientist suggested it. Even though some team members may think that other techniques would be more suitable (e.g., less expensive in operations, easier to debug and explain, equally accurate), they do not voice their opinion because they do not want to be perceived as contrarian, because they feel they have less expertise, or because they fear a bad performance review when speaking against the team lead. Similarly, if software engineers dominate the project and do not value contributions from operators (or do not even invite them to meetings), operators may stop voicing concerns or making alternative suggestions. The team may then spend months of engineering effort on inefficient solutions, revisiting the decision only late in the project once drawbacks threaten the launch of the entire product. As a final example, some team members may have concerns about fairness or safety issues in the depression prognosis system, fearing that it may perform poorly for underrepresented demographics or lead to severe stress on false positives, but they do not analyze this further because team leads always discard such concerns whenever they are mentioned.
Groupthink has been studied in many contexts, is common across many teams, and several common causes have been identified.
- First, high group cohesiveness and homogeneity make it less likely that alternative views are identified and explored, for example, when all data scientists have specialized in the same machine-learning techniques and maybe have even taken the same classes at the same university. While diversity in teams is not a sufficient solution in itself, more diverse teams are more likely to bring different viewpoints to an issue and detect possible problems; for example, team members who have previously faced discrimination may be more sensitive to potential fairness issues in depression prognosis.
- Second, organizational structures can isolate leadership or entire teams from feedback mechanisms or foster a culture that discourages the exploration of competing ideas. Interdisciplinary teams already bring different perspectives at a technical level, but it is also important to make sure that all team members can contribute and voice their opinions and concerns. For example, siloing off operators may lead to a team that prescribes an ML platform and deployment mechanisms to data scientists but does not listen to requests for specialized use cases or for simpler, more agile processes.
- Finally, situational contexts can drive groupthink, such as stressful external threats, recent failures, and moral dilemmas. These are not unlikely in the competitive field of ML-enabled systems, given the evolving and poorly regulated nature of AI ethics. For example, if the social media company is under public scrutiny because of intense media attention on recent high-profile depression-related suicides, the team might cut corners and deploy a model knowing that it has low accuracy and that the mitigation strategies in place do not yet work well to protect users from the stress (and potential overtreatment and social stigma) of false positives labeling them as depressed.
Many interventions have been suggested and explored for groupthink. In general, it is important to foster a culture in which team members naturally explore different viewpoints and discuss alternatives. Solutions include: (1) selecting and hiring more diverse teams, (2) involving outside experts, (3) ensuring all team members are asked for their opinions, (4) having a process in which high-status team members always speak last, (5) actively moderating team meetings, and (6) always exploring a second solution. Techniques such as devil’s advocate (having team members explicitly voice contrary opinions for the sake of debate, without labeling them as their own) and agile development techniques such as planning poker and on-site customers also help avoid certain groupthink tendencies.
Learning from DevOps and MLOps Culture
Conflicts between teams and team members with different roles and goals are common and well studied. In the software-engineering community, approaches to improve teamwork are particularly well explored between developers and operators in the context of what is now known as DevOps. The DevOps approach, and its more recent extension toward ML pipelines as MLOps, also provides a promising template for interdisciplinary collaboration in production machine-learning projects.
A culture of collaboration
Historically, developers and operators have often worked with conflicting goals, with developers prioritizing new features and being quick to market, while operators aim to minimize outages and server costs. There are many public stories of operators frustrated with the code they get from developers or data scientists, because the code misses important dependencies, does not document environmental assumptions (e.g., operating-system version and firewall configuration), and simply is not stable or fast enough in production. Similarly, there are many stories of developers frustrated with how slow and conservative operators are with releases — for example, installing updates only every three months or planning a 2–4am downtime a month in advance to update systems. Developers commonly produce a minimally viable product and just test it locally, assuming that “works for me” extends to “works in production.”
DevOps as a practice aims to bring developers and operators closer together. We already introduced the main principles and tools in chapter Planning for Operations. Principles include a strong focus on automating testing, automated deployment, and observability, with the goal of releasing software frequently and providing rapid feedback from production systems. These principles are supported by test-automation and continuous-integration tools, by container and orchestration tools, and by monitoring tools, among others. MLOps pursues the same principles and goals when bringing machine-learned models into production, with plenty of tools for automating pipeline and deployment steps. For example, MLOps tools can make it easy to deploy an updated depression prognosis model directly from a notebook into an A/B experiment in production.
From a teamwork perspective, the key contribution of DevOps and MLOps is how they establish a culture of collaboration with joint goals, a joint vocabulary, and joint tools. Instead of an “us versus them” mentality between developers and operators who operate in their own silos with their own priorities, DevOps reframes the discussion to focus on the joint goal of frequently delivering value and learning from rapid feedback. Developers and operators work together toward this joint goal. DevOps puts the focus on the resulting product rather than on development or operation in isolation. Developers and operators integrate their work tightly through joint tooling at the interface of their roles, such as developers wrapping their software in containers for deployment and operators providing access to live telemetry through A/B testing infrastructure and monitoring dashboards. Developers and operators share the joint responsibility for delivering the product.
This culture of collaboration is supported by a framing in which developers and operators mutually benefit from working closely together. From a developer’s perspective, developers invest extra effort into automating testing, instrumenting their software to produce telemetry, and containerizing their software to be release-ready, but they benefit from seeing their code released in production soon after committing it (within minutes, not weeks or months). Beyond the satisfaction of seeing their code live, they are empowered to make their own deployment decisions and they can gather rapid feedback from users to inform subsequent development. Easy deployment into A/B testing infrastructure allows them to experiment in production and make data-driven decisions. From an operator’s perspective, having developers take care of containerization and preparing telemetry frees operators to focus on infrastructure for reliably operating systems and experimenting in production. Operators can invest their time in automating container orchestration and in infrastructure for canary releases and A/B tests, rather than worrying about installing the right library dependencies and manually rolling back unsuccessful updates from a backup at 3 am. The benefits are analogous for data scientists in MLOps, who can rapidly experiment in production at the cost of investing in infrastructure for packaging their models in a production-ready format (which might be as easy as understanding how to use a library like BentoML).
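To illustrate what such production-ready packaging might look like without committing to any particular MLOps library, here is a minimal, hypothetical sketch of a model wrapped as a containerizable inference service using plain Flask; the model path, loading function, and input schema are illustrative assumptions.

```python
# Minimal sketch of wrapping a model as an inference service that operators can
# containerize and deploy; all names and paths are illustrative placeholders.
from flask import Flask, request, jsonify

app = Flask(__name__)
MODEL_VERSION = "v3"  # reported with every prediction so operators can trace results
model = None          # loaded lazily so this sketch imports without a real model file

def load_model(path):
    """Placeholder: load the serialized model from the agreed artifact location."""
    raise NotImplementedError("replace with the team's actual model-loading code")

@app.route("/predict", methods=["POST"])
def predict():
    global model
    if model is None:
        model = load_model("models/depression_prognosis_v3.pkl")  # assumed artifact path
    features = request.get_json()
    score = model.predict([features["activity_vector"]])[0]
    return jsonify({"score": float(score), "model_version": MODEL_VERSION})

if __name__ == "__main__":
    # For local experimentation only; in production this would run behind a
    # production-grade server inside a container managed by the operators.
    app.run(port=8080)
```

The point is not the specific web framework, but that the packaged service, its version information, and its telemetry form the shared artifact at which data scientists, developers, and operators meet.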
To facilitate collaboration in DevOps, both developers and operators agree on shared terminology (release, container, telemetry metrics) and work with joint tools, especially containers, versioning infrastructure, A/B testing infrastructure, and telemetry and monitoring infrastructure. These tools sit at the interface between both roles, and both use them jointly. For example, rather than waiting for an operator to manually process a ticket in a workflow system, developers know how to push a new container version to a repository from where the operators’ systems will automatically deploy it as an experiment; they know where to look for the collected telemetry that their system produces in production and where to find the experiment’s results. MLOps pursues the same kind of collaboration between data scientists and operators when training models or deploying models as inference services within a system, through common tools for pipeline execution, model versioning, model packaging, and so forth.
In a sense, DevOps and MLOps force developers and data scientists to learn about some aspects of an operator’s role and to understand their goals, and vice versa. Using joint tools requires a minimum understanding of shared concepts and effectively enforces a joint vocabulary — fostering the broader horizon characteristic of T-shaped engineers.
Changing practices and culture
Adopting a DevOps or MLOps mindset and culture in an organization with previously siloed teams is not easy. Inertia from past practices can be hard to overcome (“this is how we always did things”), and the initial learning cost of adopting some of the DevOps tooling can be high for both developers and operators. Shifting from an “us versus them” mentality to a blameless culture of openly shared artifacts and joint responsibility requires substantial adjustment, as does moving from ad-hoc testing to rigorous test automation in continuous deployment pipelines. Cultural values are often deeply embedded in organizations and resistant to change without concerted efforts and buy-in from management. Developers who are not used to rapid feedback and telemetry from production systems may not anticipate the benefits without first-hand experience (but developers who have seen the benefits may never want to work in an organization without them again). Operators who are drowning in day-to-day emergencies may not have the time to explore new tools. Managers of siloed teams may worry about losing their importance in the organization; individual team members may fear automating their own jobs away. Haphazard adoption of some DevOps practices may produce costs without providing the promised benefits — for example, adopting release automation without automated tests or introducing container orchestration without a telemetry and monitoring plan.
Successful adoption of DevOps typically requires a culture change. Culture change can be effected bottom-up and top-down in an organization, but is very hard to achieve without supportive management. DevOps is often introduced through the advocacy of individuals, who convince colleagues and management. Education about principles and benefits is important to generate buy-in. Experienced hires or consultants can help with changing minds and with adopting complex tooling. It is usually a good idea to demonstrate benefits on a small scale before rolling out changes more broadly; for example, a small team could pioneer DevOps practices and DevOps culture on a single component and then proselytize the benefits in the rest of the company. Projects are more likely to demonstrate success if they focus on current bottlenecks and if they make sure key enablers like test automation and telemetry are implemented. In our depression prognosis scenario, if the organization does not already have a DevOps culture, we may focus on automating model deployment and experimentation as one of the areas benefiting most from iterative development and rapid feedback.
DevOps can be implemented with separate development and operations teams that communicate and collaborate effectively across team boundaries with DevOps tools, but also with interdisciplinary teams that bring developers and operators together in the same team. The latter is particularly common in microservice architectures, where each team deploys their own service and updates it independently of other parts of the system. In this context, MLOps is particularly attractive for deploying model inference services because they are naturally modular components, as we discussed throughout the architecture and design chapters.
Beyond DevOps and MLOps
DevOps and MLOps focus on the deployment of software and models, but they also provide a model for how to think about a culture of collaboration among other roles when developing ML-enabled systems. As discussed, friction is commonly observed between data scientists and software engineers, between data scientists and the people gathering the data, between developers and lawyers, and at many other interfaces between roles. In all cases, we would like to see interdisciplinary collaboration toward joint goals instead of “us versus them” conflicts.
DevOps and MLOps show the power of setting joint goals and of focusing on delivering a product rather than on individual components and silos, but also the effort required to effect culture change in a siloed organization. Importantly, DevOps and MLOps highlight the mutual benefits each group gains from the collaboration and from taking on extra work to help the other group. Tools at the interface between the groups define a common language and shared artifacts that codify the collaboration.
Let us consider the collaboration between data scientists and software engineers discussed throughout this book. That is, let us focus on the entire system and not just on the deployment of a single ML component, which MLOps commonly addresses. Our key argument is to work toward joint goals by understanding each other’s goals, concerns, and vocabulary. What would MLDev culture look like? What MLDev tools could support it? At the time of this writing, this area is not well explored. We can only speculate about practices and tools at the interface to
- collect system requirements and trace them to component and model requirements,
- analyze and document anticipated mistakes, design mitigation strategies, and plan incident response strategies,
- design the system architecture for scalability, modifiability, and observability,
- document and test data quality at the interface between components,
- document and test various model qualities,
- perform integration testing of ML and non-ML components in a system,
- facilitate debugging of models and their interactions in the system in production,
- facilitate experimentation with models within a software system,
- jointly version ML and non-ML components in the system, and
- perform threat modeling of the entire system and its components to identify security threats.
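As one small illustration of what tooling at such an interface could look like, consider the handoff of user-activity data between the data-collection component and the model-training pipeline in our depression prognosis scenario. The sketch below makes the data contract explicit and testable in plain Python; the field names and constraints are purely hypothetical, and in practice a dedicated schema or data-validation library could take over this role.

```python
# Hypothetical sketch: an explicit, testable data contract at the interface
# between data collection and model training. Fields and constraints are
# illustrative only.

EXPECTED_SCHEMA = {
    "user_id": str,
    "minutes_watched": float,
    "posts_last_week": int,
    "sentiment_score": float,  # expected to lie in [-1, 1]
}

def validate_record(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record is fine."""
    problems = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} has type {type(record[field]).__name__}, "
                            f"expected {expected_type.__name__}")
    score = record.get("sentiment_score")
    if isinstance(score, (int, float)) and not -1 <= score <= 1:
        problems.append("sentiment_score outside [-1, 1]")
    return problems

# Tests like these can run automatically in the pipeline of either team.
def test_valid_record_passes():
    record = {"user_id": "u1", "minutes_watched": 42.0,
              "posts_last_week": 3, "sentiment_score": -0.2}
    assert validate_record(record) == []

def test_missing_field_is_reported():
    record = {"user_id": "u1", "minutes_watched": 42.0, "posts_last_week": 3}
    assert "missing field: sentiment_score" in validate_record(record)
```

Either team can run such checks automatically; when they fail, the error messages point to a documented contract rather than triggering a blame game between teams.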
For example, we think that better tooling to describe the target distribution of a model and better tooling to report model quality in a more nuanced, multi-faceted way (possibly inspired by documentation mechanisms like Model Cards and FactSheets; see chapter Deploying a Model) will be powerful at the points where team members in different roles interact. Such tools can instill a shared understanding of concerns at the interface: when data scientists fill in documentation following a template, they are nudged to consider qualities that software engineers may find important; when software engineers draft a test plan, they need to consider which model qualities are relevant to them. Tooling that automatically reports evaluation results from shared infrastructure could ease the process and replace ad-hoc communication with mature automation. Similarly, we think that including data scientists in requirements analysis, risk analysis, or threat modeling of the system, through checkpoints in tools or artifacts, can encourage the kind of T-shaped people and collaborative culture we want in our teams.
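As a rough sketch of what such automatically generated, multi-faceted reporting could produce, the snippet below evaluates predictions separately per subpopulation and collects the results in a dictionary that could populate a Model-Card-style report. It uses scikit-learn for the metrics; the slicing by age group, the chosen metrics, and the report structure are our assumptions rather than a standard format, and the labels are made up purely for illustration.

```python
# Rough sketch: automatically computed, sliced evaluation results that could
# feed a Model-Card-style report. Data, slices, and metrics are illustrative.
from sklearn.metrics import accuracy_score, recall_score

def sliced_report(y_true, y_pred, slice_labels):
    """Compute metrics overall and per slice (e.g., per age group)."""
    report = {"overall": {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "n": len(y_true),
    }}
    for slice_name in sorted(set(slice_labels)):
        idx = [i for i, s in enumerate(slice_labels) if s == slice_name]
        y_t = [y_true[i] for i in idx]
        y_p = [y_pred[i] for i in idx]
        report[slice_name] = {
            "accuracy": accuracy_score(y_t, y_p),
            "recall": recall_score(y_t, y_p, zero_division=0),
            "n": len(idx),
        }
    return report

# Tiny made-up example with two age-group slices.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
slices = ["under_18", "under_18", "under_18", "adult",
          "adult", "adult", "under_18", "adult"]
print(sliced_report(y_true, y_pred, slices))
```

A report like this, regenerated for every model version by the evaluation infrastructure, gives data scientists and software engineers a shared artifact to discuss, rather than a single headline accuracy number passed along in a chat message.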
In all these steps, data scientists and software engineers may take on additional tasks, but they may also gain benefits, hopefully resulting in a win-win situation. For example, software engineers may have an easier time integrating models and can focus more on mitigating anticipated mistakes and providing better user experiences. Data scientists may receive better-quality data, better guidance, and clearer requirements; their models may be more likely to make it from prototype to production.
While cultures and tools for areas like MLDev, LawDev, DataExp, SecDevOps, MLSec, SafeML, and UIDev (or many other silly abbreviations we could invent for combinations of various roles) may not be as popular or well understood as DevOps or MLOps, there will be people who have thought about these collaborations and may have valuable insights and even tools. We believe that the way DevOps and MLOps embrace operators in a collaborative culture is a success story from which to seek inspiration for successful interdisciplinary collaboration.
Summary
Machine learning projects need expertise in many fields, which is rarely found in a single person. A better strategy for scaling work, in terms of both division of labor and division of expertise, is to assemble interdisciplinary teams. Scaling teams can be challenging due to process costs and various team dysfunctions, such as conflicting goals and groupthink, which can be exacerbated by the increased complexity of machine-learning projects and their need for more diverse expertise compared to many traditional software projects.
To address process costs, it is essential to think about the interfaces between teams: to what degree are stable and well-documented interfaces possible, and to what degree do teams need to continuously iterate together on the product? Team members should be cognizant of potential problems and establish suitable communication and collaboration channels.
The interdisciplinary nature of machine-learning projects brings additional challenges that are well understood in teamwork more broadly, such as conflicting goals and groupthink. Education, T-shaped people, deliberate structuring of teams and processes to make use of specialized expertise, and a culture of constructive conflict are all key steps to explore to make interdisciplinary teams work better. DevOps culture, focused on joint goals and supported by joint tools, provides an example of success that may well offer lessons for collaboration among other roles in a team.
Further readings
- Study of the different roles of data scientists in software teams at Microsoft: Kim, Miryung, Thomas Zimmermann, Robert DeLine, and Andrew Begel. “Data scientists in software teams: State of the art and challenges.” IEEE Transactions on Software Engineering 44, no. 11 (2017): 1024–1038.
- Presentation arguing for building interdisciplinary teams with T-shaped people: Ryan Orban. “Bridging the Gap Between Data Science & Engineering: Building High-Performance Teams.” Presentation, 2016.
- Detailed case studies of production machine learning discussing the need for interdisciplinary collaborations with various domain experts: Lvov, Ilia. “Project Management in Social Data Science: integrating lessons from research practice and software engineering.” PhD diss., University of St Andrews, 2019.
- Classic literature on teamwork in software engineering: Brooks Jr, Frederick P. The mythical man-month: essays on software engineering. Pearson Education, 1995. ● DeMarco, Tom, and Tim Lister. Peopleware: productive projects and teams. Addison-Wesley, 2013.
- Plenty of advice on team management in software teams: Mantle, Mickey W., and Ron Lichty. Managing the unmanageable: rules, tools, and insights for managing software people and teams. Addison-Wesley Professional, 2019.
- Classic work on team dysfunctions: Lencioni, Patrick. “The five dysfunctions of a team: A Leadership Fable.” Jossey-Bass (2002).
- Classic paper on information hiding and the need for interfaces between teams: Parnas, David L. “On the criteria to be used in decomposing systems into modules.” In Pioneers and Their Contributions to Software Engineering, pp. 479–498. Springer, Berlin, Heidelberg, 1972.
- Exploration of the role of awareness in software engineering to coordinate work: Dabbish, Laura, Colleen Stuart, Jason Tsay, and Jim Herbsleb. “Social coding in GitHub: transparency and collaboration in an open software repository.” In Proceedings of the ACM 2012 conference on computer supported cooperative work, pp. 1277–1286. 2012. ● de Souza, Cleidson, and David Redmiles. “An empirical study of software developers’ management of dependencies and changes.” In 2008 ACM/IEEE 30th International Conference on Software Engineering, pp. 241–250. IEEE, 2008. ● Steinmacher, Igor, Ana Paula Chaves, and Marco Aurélio Gerosa. “Awareness support in distributed software development: A systematic review and mapping of the literature.” Computer Supported Cooperative Work (CSCW) 22, no. 2–3 (2013): 113–158.
- Discussion of challenges when trying to integrate fairness concerns into organizational practices within and across teams: Rakova, Bogdana, Jingying Yang, Henriette Cramer, and Rumman Chowdhury. “Where responsible AI meets reality: Practitioner perspectives on enablers for shifting organizational practices.” Proceedings of the ACM on Human-Computer Interaction 5, no. CSCW1 (2021): 1–23.
- A detailed study on how to establish a DevOps culture in an organization: Luz, Welder Pinheiro, Gustavo Pinto, and Rodrigo Bonifácio. “Adopting DevOps in the real world: A theory, a model, and a case study.” Journal of Systems and Software 157 (2019): 110384.
- Fictional story that conveys DevOps principles and the struggle of changing organizational culture in an engaging way: Kim, Gene, Kevin Behr, and George Spafford. The Phoenix Project. IT Revolution, 2014.
- Influential paper on documenting machine learning models, which is right at the interface between producers and consumers of models that typically have different roles: Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. “Model cards for model reporting.” In Proceedings of the conference on fairness, accountability, and transparency, pp. 220–229. 2019.
- Examples of coordination and collaboration failures around data quality in ML-enabled software projects: Sambasivan, Nithya, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M. Aroyo. ““Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI”. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–15. 2021.
Thanks to Shurui Zhou for helpful comments on this chapter. As with all chapters, this text is released under the Creative Commons BY-SA 4.0 license.