The purpose of this article is to answer the following question: “How does one deliver high-quality software on schedule with 100+ developers distributed across multiple time zones?” It’s meant to serve as a guide for a technical lead or project manager who is responsible for defining the development workflow in a large-scale, globally distributed software project. It will help guide decisions about which processes and tools to choose to achieve high-quality software at the highest level of efficiency.
Focus on Bottlenecks
Within most Agile Software Development methodologies, a lot of focus is put on “blockers.” Blockers are called out in daily meetings and retrospectives, and are often escalated rapidly (depending on who has encountered them). While this focus is often warranted, what’s lost in this whirlwind of blockers are the silent killers of productivity: bottlenecks.
With a large-scale distributed team of developers, the identification and elimination of bottlenecks is a massive contributor to overall throughput and even quality (for example, if a quality gate becomes a bottleneck, schedule pressure may force developers to skip it).
As a project evolves, it’s important to constantly identify and resolve bottlenecks. Project leads, managers, and other facilitators should carefully consider the impact of each bottleneck when prioritizing work. An isolated blocker could prevent one developer from working for an entire day, but a bottleneck could cause 100+ developers to work at half speed for months. Ask questions like “What’s slowing you down?” or “What’s preventing you from working efficiently?” instead of just “Are you blocked?” If the answers to “What’s slowing you down?” are common across the project, these items should be prioritized above low-impact blockers.
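The arithmetic behind this prioritization is worth making explicit. As a rough sketch (the team size, duration, and slowdown figures below are illustrative assumptions taken from the example above, not measurements), the cost of each issue can be compared in lost developer-days:

```python
# Compare the cost of an isolated blocker vs. a team-wide bottleneck
# in lost developer-days. All figures are illustrative assumptions.

def lost_dev_days(developers: int, days: int, slowdown: float) -> float:
    """Developer-days lost when `developers` work at a fractional
    `slowdown` (0.0 = no loss, 1.0 = fully stopped) for `days` days."""
    return developers * days * slowdown

# One developer fully blocked for one day:
blocker_cost = lost_dev_days(developers=1, days=1, slowdown=1.0)

# 100 developers at half speed for roughly three months (~60 working days):
bottleneck_cost = lost_dev_days(developers=100, days=60, slowdown=0.5)

print(blocker_cost)     # 1.0
print(bottleneck_cost)  # 3000.0
```

Even with generous rounding, the bottleneck costs three orders of magnitude more than the blocker, which is why it deserves the higher priority.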
Invest in Automation and Dedicated Support Roles
Although overhead is a reality of working on a large-scale project, that doesn’t mean it all has to fall onto the developers themselves. Many developers are frustrated by tedium like merging change sets; long delays caused by triggering and then waiting for test cases or builds to complete; and being blocked waiting for someone else to test or review their changes. Over time, this will lead to shortcuts and declining morale.
Find ways to spread out and streamline the work. Invest heavily in solving these problems up front and creating the necessary automation. Otherwise, any given development workflow could quickly collapse under the weight of 100+ developers across the globe working in the same code base.
Branching Strategy
A project’s chosen branching strategy will have a very significant impact on the overall development workflow. There are several possible branching strategies, but for a large-scale project, a stable main development line and small scoped changes are critical for taking advantage of the principles of continuous integration. Thus, two potential options for a branching strategy are:
- A heavy branching strategy where each small work item is developed on its own branch and integrated back to the trunk after a predetermined set of quality checks are completed.
- A branchless strategy where large-scale automation and tooling are set up to protect the trunk and allow each small work item to be validated and automatically merged.
Each strategy has its pros and cons, but the decision here will likely hinge on the project’s other available (or planned) tooling. The most critical of these tools is the version control software (VCS), which will be discussed in more detail later. VCS that doesn’t support branching well may push the project into a branchless strategy. A branchless strategy can be simpler conceptually and can avoid common pitfalls related to branching and merging in a VCS. However, additional tooling may need to be developed and/or purchased to support collaboration and merging operations.
As mentioned above, a stable trunk is critical in a large-scale project. Ideally, the project should invest heavily in automation up front to protect the trunk. Some examples of tooling that should be developed:
- Tooling that allows developers to submit changes (e.g., via patch file or by submitting a branch name), and the underlying automation will verify and merge the changes to trunk on success.
- Tooling that automatically executes (or verifies) the various quality checkpoints that are put in place on the project.
- Tooling that runs quality checkpoints on trunk and can automatically identify (and even revert) changes that fail these checks.
- Tooling that provides developers with known stable versions (or branches) of trunk that they can develop new changes against.
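The first three items above can be pictured as a merge queue that stands between developers and the trunk. The sketch below is a minimal illustration under assumed names (the `Change` structure and gate names are hypothetical; real systems such as Gerrit or Zuul implement this far more robustly): each submitted change runs through every quality gate, and only changes that pass all gates are merged.

```python
# Minimal sketch of a trunk-protecting merge queue: each submitted
# change runs through every quality gate; only passing changes merge.
# The Change structure and gate names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Change:
    author: str
    description: str
    results: Dict[str, bool] = field(default_factory=dict)

# Each gate is a callable that inspects a change and returns pass/fail.
# In a real system these would trigger builds, test runs, and analysis jobs.
QualityGate = Callable[[Change], bool]

def process_queue(queue: List[Change], gates: Dict[str, QualityGate]) -> List[Change]:
    """Run every gate against every queued change; return the changes
    that passed all gates (i.e., the ones safe to merge to trunk)."""
    merged = []
    for change in queue:
        for name, gate in gates.items():
            change.results[name] = gate(change)
        if all(change.results.values()):
            merged.append(change)  # in a real system: merge to trunk here
    return merged
```

A failing gate simply leaves the change out of trunk, so trunk stability is maintained without any manual policing, and the per-gate results tell the author exactly what to fix.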
If the project can’t afford to invest in such automation up front, a less work-intensive and less technically complex solution is to develop processes to protect trunk. The balance of processes versus automated tooling should mostly depend on the length of the project. Even with proper training, enforcement, and support, agreed-upon processes will never be as effective as automated tooling for protecting trunk, as they have several weaknesses in a large team environment:
- Training: With a large team and multiple language/time-zone barriers, training on processes can be extremely difficult to prepare, schedule, and communicate. As the project evolves, the processes will undoubtedly change, requiring constant training to share these changes. Also, no matter how much training is provided, there will still be mistakes that could be prevented with automation.
- Enforcement: Manual enforcement will always be less efficient and less complete than automated tooling enforcement. Automated tooling never sleeps, never takes a vacation, and never makes mistakes.
- Support: With a large and distributed project, it’s rare that unanimous agreement among all development sites can be reached on every decision. Some cultures or groups of developers may find certain processes either too restrictive or too loose. Thus, it can be difficult to find a fully supportive advocate for each process at all development sites.
However, it can be difficult to predict all of the necessary tooling that will be required to support the development workflow at the start of the project. Therefore, it’s also important not to over-invest in tooling that might not end up being necessary. Instead of trying to predict all possible use cases up front, a minimal set that supports the standard workflow is the best place to start. New tooling can be developed on an as-needed basis to reduce bottlenecks and prevent common process violations.
An example of an automated environment to support a large-scale workflow will be covered in the “Example Development Workflow” section.
Defining a Small Scope
As mentioned above, in any branching strategy, keeping the scope of work small and limiting the amount of time that changes are separate from the trunk will help ensure success. The longer a change remains unmerged, the longer it will take to merge, due to the increased likelihood of conflicts. That, in turn, keeps the changes separated longer, creating a vicious cycle.
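This feedback loop can be illustrated with a toy model (the 10% daily growth rate is an assumption chosen purely for illustration, not empirical data): if conflict-resolution effort compounds with each day a change stays separate from trunk, small delays become expensive quickly.

```python
# Toy model of the vicious cycle: the longer a change stays unmerged,
# the more conflicts accumulate, so the merge takes longer. The 10%
# daily growth rate is an illustrative assumption, not measured data.

def merge_effort(base_days: float, growth_per_day: float, days_waited: int) -> float:
    """Estimated effort (in days) to merge a change that has diverged
    from trunk for `days_waited` days, with conflict-resolution cost
    compounding at `growth_per_day`."""
    return base_days * (1 + growth_per_day) ** days_waited

# Merging promptly vs. after a month of divergence:
print(round(merge_effort(0.5, 0.10, 1), 2))   # 0.55
print(round(merge_effort(0.5, 0.10, 30), 2))  # 8.72
```

Under these assumed numbers, a half-day merge done the next day stays a half-day merge, while the same change left for a month costs more than a week, and that week of merging keeps the change diverging even further.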
Thus, the two biggest factors in limiting the lifespan of each code change are pre-merge validation overhead and scope definition per work item. The “Quality Checkpoints” section will discuss various ways to reduce pre-merge overhead. However, the scope definition for the amount of work done in each merge to the trunk may be the most complex piece of this formula. This topic is worthy of its own paper, as keeping the scope of these work items small requires a very specific technical skillset.
Another critical item is having an architecture in place that supports parallel development. The software must be separated into discrete and reasonably sized components that encapsulate specific functionality. APIs must exist as the borders between these components, and those APIs must be agreed upon as the first work item, since this definition is the bottleneck that releases all future parallel work that depends on those APIs. In other words, the project should seek to understand and follow the principles of Interface-Based Development.
It can be difficult to keep this small-work scope consistent over the length of the project. That’s because those defining the work and those implementing it may try to “save time” and avoid dependencies by combining work items, not realizing the impact this has on the success of the branching strategy. In other words, the importance of small scope for each change must be constantly stressed over the length of the project and to all parties involved.
Version Control Software
Version control software (VCS) will be the most impactful support tool in a large-scale development workflow. The choice should primarily depend on (or will impact) the project’s chosen branching strategy. Ideally, a branching strategy should be defined first, and then the project can determine the VCS that best supports it. However, if the VCS has already been chosen, the project may want to revise its branching strategy based on the VCS being used.
Another key consideration is the other available support tooling. Specifically:
- Repository replication software: Use of Apache Subversion (SVN) in a highly distributed project should come along with repository replication software to avoid performance issues for developers working far from the centralized SVN server. By contrast, distributed VCSs (e.g., Git or Mercurial) are designed to support this scenario by minimizing the operations that require server interaction.
- Operating system of the development environment: Git has limited support on Windows but is optimized for Linux. Mercurial and SVN support both Windows and Linux.
- Developer toolchain and hosting service: GitHub is widely praised, but only supports Git. Atlassian’s Bitbucket supports Git and Mercurial. Atlassian’s code review and indexing tools support SVN, but have performance issues with heavy branching.
Quality Checkpoints
Another key consideration in the overall workflow is where to place the various quality checks the project has defined. The project must first decide what level of quality should be maintained on the trunk. Then tests and processes should be set up to catch issues that could drop the stable trunk below that level of quality before changes are accepted there.
For maximum stability, the project should place as many quality checkpoints as possible before changes are merged to the stable trunk. However, for a branching strategy to remain successful, minimizing the delay in getting changes to the stable trunk can be as important as keeping that trunk stable. In general, the checkpoints that make the most sense to run before a merge to the trunk are the automated ones.
Example quality checkpoints include:
- Code reviews
- Unit testing
- Integration testing
- Manual testing
- Regression testing
- Static analysis
- Generation of documentation
We’ll dive into the specifics of a few of these below.
Code Reviews
Code reviews are a critical quality check in any software project. There’s no substitute for having another developer review your changes, even if it’s only a sanity check. Writing code can be monotonous and extremely mentally taxing, and simple mistakes can easily be caught with a second set of eyes. Code reviews are also a great way to train new developers on best practices and to set the code quality expectations across the development team.
Despite being a manual process, code reviews are more likely to be successful before the merge to the stable trunk, because developers will be more receptive to feedback before they have invested the time to integrate their changes. If the code review happens after the merge, the author will likely push back on stylistic and good-programming-practice suggestions, since modifying the change now represents a significant amount of extra work. If the team is made up entirely of experienced developers, perhaps a simple post-merge sign-off is fine.
However, with a large team, there’s likely to be a split of senior and junior developers. In this scenario, it’s worthwhile to have experienced developers take the time to review each change set. Authors should expect there will be feedback that results in modifications.
Creating an Efficient Code-Review Process
The most important thing when setting up an efficient code-review process in a large-scale project is to keep an eye out for possible bottlenecks and dependencies. Here are some example bottlenecks and guidelines for how to avoid them:
Bottleneck: Reviews contain too much code and thus take too long to complete.
Mitigation: Define small-scope work packages (see the “Defining a Small Scope” section).
Bottleneck: Tooling makes it slow to prepare a code review after a change is committed.
Mitigation: Use a tool that supports the project’s chosen branching strategy and VCS (see “Code Review Tooling” section below).
Bottleneck: Tooling doesn’t allow reviewers to provide clear feedback on changes to the author.
Mitigation: Use a tool that allows for efficient communication (see “Code Review Tooling” section below) or consider holding in-person reviews if the team distribution allows for it.
Bottleneck: Those who can “approve” a review are overloaded or have other responsibilities.
Mitigation: Designate responsive reviewers who have both the time and motivation for completing reviews.
If the project chooses to gate reviews on approval from a list of experienced developers, be aware of the impact this has on their total bandwidth. In a large team, experienced developers are likely to be in high demand; thus, they may be asked to work on several high-priority tasks. Also make sure that those designated as approved reviewers are responsive and reliable.
Regardless of who is doing the reviews, make sure there’s a clear motivation for reviewers to complete code reviews. In a large team, team members are often measured by the number of work items they complete. Also, the time they spend on those work items may be tracked. If the task of performing a review isn’t tracked or counted in the same way and weighted appropriately, there may be no motivation for team members to complete reviews in a timely manner.
Bottleneck: Time is wasted on discussions related to personal coding preferences, trivial formatting discrepancies, and “bike-shedding.”
Mitigation: Define coding standards and blocker criteria and then monitor reviews for bad behavior.
Outline a set of coding standards and naming conventions up front and make sure reviewers are aware of them.
Define what does/doesn’t constitute a blocker for a review to be finished. Make sure the code-review tool has a way to designate blocking/non-blocking feedback so that code authors know what they do and don’t need to address to pass the review stage.
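The blocking/non-blocking distinction can be made concrete in tooling. The sketch below is an illustrative data model (not any specific review tool’s API): a review passes once every blocking comment is resolved, while non-blocking comments never hold it up.

```python
# Sketch of blocking vs. non-blocking review feedback: only unresolved
# blocking comments hold up a review. The data model is an illustrative
# assumption, not any specific review tool's API.
from dataclasses import dataclass

@dataclass
class ReviewComment:
    text: str
    blocking: bool
    resolved: bool = False

def review_passes(comments: list) -> bool:
    """A review passes once every blocking comment is resolved;
    non-blocking (style/preference) comments never block."""
    return all(c.resolved for c in comments if c.blocking)

comments = [
    ReviewComment("Missing null check on input", blocking=True),
    ReviewComment("I would have named this differently", blocking=False),
]
print(review_passes(comments))  # False: the blocker is unresolved
comments[0].resolved = True
print(review_passes(comments))  # True: the style nit never blocked
```

With labels like these in place, authors know exactly which feedback gates the review and which they may address (or politely decline) at their discretion.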
Technical leads or auditors who defined these standards and rules should stay involved in reviews and drop into code-review sessions to make sure those with the authority to block reviews aren’t abusing this power, mislabeling trivial feedback, or focusing on inconsequential issues. Such behavior can artificially extend the length of reviews, and junior developers may feel they can’t push back on this kind of feedback by themselves.
Code Review Tooling
Assuming in-person reviews or pair-programming reviews are not the only code reviews being performed, the project will likely need a specialized tool to enable shared code reviews. A wide range of code-review tools are available. Instead of comparing specific tools, here are some key pieces of functionality that are necessary for a successful and efficient code review:
1. Ability to provide feedback on each change to the author.
2. Ability to clearly see the modified code and its context (i.e., the surrounding code).
3. Ability to navigate the source code.
The ability to provide feedback effectively to the author is a critical function of the code-review tool. Otherwise, the author could just email patches to the reviewers. The best tools allow reviewers to add comments in line with the code itself, so the author can see exactly the line of code they’re referencing. This increases code-review effectiveness and facilitates communication.
Typically, a key factor in achieving items #2 and #3 above is how well the code-review tool integrates with the project’s VCS of choice. The ideal scenario is a tool that is integrated directly into the VCS or built specifically for it. If the tool was instead built with a focus on the user experience, there could be issues when integrating it with a given VCS.
Unit Testing
Unit testing in a large distributed team is critical. Depending on the scope of each work package, it might be the only form of dynamic (i.e., runtime) verification each set of changes undergoes before being merged to the stable trunk. In addition, if each new test is added to a larger suite of unit tests, this can serve as a fast regression suite to prevent developers from undoing or breaking each other’s changes at a basic level.
The most important consideration for unit testing in a large team is speed. Otherwise, this step will quickly become a bottleneck in the workflow.
Make sure the tests themselves run fast, and if possible, ensure the framework pieces for the unit tests are set up prior to the start of full-scale development.
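Speed can be enforced rather than hoped for. Below is a minimal sketch using only the Python standard library (the one-second budget is an arbitrary illustrative threshold, not a recommendation): each test is wrapped in a timing check, so a slow test fails loudly instead of quietly dragging the whole suite down.

```python
# Fail any unit test that exceeds a time budget, so the suite stays
# fast enough to run on every pre-merge check. The 1.0 s budget is an
# arbitrary illustrative threshold.
import functools
import time
import unittest

def time_budget(seconds: float):
    """Decorator: fail the test if it runs longer than `seconds`."""
    def decorator(test_func):
        @functools.wraps(test_func)
        def wrapper(self, *args, **kwargs):
            start = time.monotonic()
            result = test_func(self, *args, **kwargs)
            elapsed = time.monotonic() - start
            # Fail the test itself when it blows its budget.
            self.assertLessEqual(
                elapsed, seconds,
                f"{test_func.__name__} took {elapsed:.2f}s (budget {seconds}s)")
            return result
        return wrapper
    return decorator

class ExampleTest(unittest.TestCase):
    @time_budget(1.0)
    def test_fast_enough(self):
        self.assertEqual(sum(range(1000)), 499500)
```

Making slowness a test failure turns suite speed into a property the automation protects, the same way it protects correctness.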
Static Analysis
Static-analysis tooling is an effective way to programmatically detect common coding mistakes and enforce certain coding standards and conventions. Its execution can typically be automated and, depending on the tool, can provide results much faster than a manual code review. In a large distributed workflow, make sure the static-analysis process is fast and supports the overall workflow. Tune the ruleset to minimize false positives, and choose a tool that supports your VCS and branching strategy.
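As a sketch of the kind of mistake an automated check can catch before any human reviews the change, here are two toy rules (illustrative only; a real project would tune an established tool such as pylint or clang-tidy rather than hand-roll checks):

```python
# Toy static-analysis check: flag lines that violate two illustrative
# coding-standard rules. Real projects would tune an existing tool
# (e.g., pylint, clang-tidy) rather than hand-roll checks like this.
import re

RULES = [
    (re.compile(r"\bprint\("), "debug print statement left in code"),
    (re.compile(r"\bTODO\b(?!\(#\d+\))"), "TODO without a ticket reference"),
]

def check_source(source: str) -> list:
    """Return (line_number, message) findings for each rule violation."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, message in RULES:
            if pattern.search(line):
                findings.append((lineno, message))
    return findings
```

Because a check like this runs in milliseconds, it fits naturally as a pre-merge quality gate, and tightening or loosening a rule means editing one pattern rather than retraining a hundred reviewers.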
Conclusion
Managing software construction at a global scale can be challenging. However, choosing the appropriate tools and processes can help teams efficiently achieve high-quality software.
Dylan Brandtner is Development Lead at Elektrobit.