It encourages learning and teams that deploy it get faster over time
Of all the tactics I have advocated as part of the lean startup, none has provoked as many extreme reactions as continuous deployment, a process that allows companies to release software in minutes instead of days, weeks, or months. My previous startup, IMVU, has used this process to deploy new code an average of fifty times a day. This has stirred up some controversy, with some claiming that this rapid release process contributes to low-quality software or prevents the company from innovating. If we accept the verdict of customers instead of pundits, I think these claims are easy to dismiss. Far more common, and far more difficult, is the range of questions from people who simply wonder if it's possible to apply continuous deployment to their business, industry, or team.
The particulars of IMVU’s history give rise to a lot of these concerns. As a consumer internet company with millions of customers, it may seem to have little relevance for an enterprise software company with only a handful of potential customers, or a computer security company whose customers demand a rigorous audit before accepting a new release. I think these objections really miss the point of continuous deployment, because they focus on the specific implementations instead of general principles. So, while most of the writing on continuous deployment so far focuses on the how of it, I want to focus today on the why. (If you're looking for resources on getting started, see "Continuous deployment in 5 easy steps.")
The goal of continuous deployment is to help development teams drive waste out of their process by simultaneously reducing the batch size and increasing the tempo of their work. This makes it possible for teams to get – and stay – in a condition of flow for sustained periods. This condition makes it much easier for teams to innovate, experiment, and achieve sustained productivity. And it nicely complements other continuous improvement systems, such as Five Whys.
One large source of waste in development is “double-checking.” For example, imagine a team operating in a traditional waterfall development system, without continuous deployment, test-driven development, or continuous integration. When a developer wants to check-in code, this is a very scary moment. He or she has a choice: check-in now, or double-check to make sure everything still works and looks good. Both options have some attraction. If they check-in now, they can claim the rewards of being done sooner. On the other hand, if they cause a problem, their previous speed will be counted against them. Why didn't they spend just another five minutes making sure they didn't cause that problem? In practice, how developers respond to this dilemma is determined by their incentives, which are driven by the culture of their team. How severely is failure punished? Who will ultimately bear the cost of their mistakes? How important are schedules? Does the team value finishing early?
But the thing to notice in this situation is that there is really no right answer. People who agonize over the choice reap the worst of both worlds. As a result, developers will tend towards two extremes: those who believe in getting things done as fast as possible, and those who believe that work should be carefully checked. Any intermediate position is untenable over the long term. When things go wrong, any nuanced explanation of the trade-offs involved is going to sound unsatisfying. After all, you could have acted a little sooner or a little more carefully – if only you’d known what the problem was going to be in advance. Viewed through the lens of hindsight, most of those judgments look bad. On the other hand, an extreme position is much easier to defend. Both have built-in excuses: “sure there were a few bugs, but I consistently over-deliver on an intense schedule, and it’s well worth it” or “I know you wanted this done sooner, but you know I only ever deliver when it’s absolutely ready, and it’s well worth it.”
These two extreme positions lead to factional strife in development teams, which is extremely unpleasant. Managers start to make a note of who’s in which faction, and then assign projects accordingly. Got a crazy last-minute feature? Get the Cowboys to take care of it – and then let the Quality Defenders clean it up in the next release. Both sides start to think of their point of view in moralistic terms: “those guys don’t see the economic value of fast action, they only care about their precious architecture diagrams” or “those guys are sloppy and have no professional pride.” Having been called upon to mediate these disagreements many times in my career, I can attest to just how wasteful they are.
However, they are completely logical outgrowths of a large-batch-size development process that forces developers to make trade-offs between time and quality, using the old “time-quality-money, pick two” fallacy. Because feedback is slow in coming, the damage caused by a mistake is felt long after the decisions that caused the mistake were made, making learning difficult. Because everyone gets ready to integrate with the release batch around the same time (there being no incentive to integrate early), conflicts are resolved under extreme time pressure. Features are chronically on the bubble, about to get deferred to the next release. But when they do get deferred, they tend to have their scope increased (“after all, we have a whole release cycle, and it’s almost done…”), which leads to yet another time crunch, and so on. And, of course, the code rarely performs in production the way it does in the testing or staging environment, which leads to a series of hot-fixes immediately following each release. These come at the expense of the next release batch, meaning that each release cycle starts off behind.
Many times when I interview a development team caught in the pincers of this situation, they want my help "fixing people." Thanks to a phenomenon called the Fundamental Attribution Error in psychology, humans tend to become convinced that other people’s behavior is due to their fundamental attributes, like their character, ethics, or morality – even while we excuse our own actions as being influenced by circumstances. So developers stuck in this world tend to think the other developers on their team are either, deep in their souls, plodding pedants or sloppy coders. Neither is true – they just have their incentives all messed up.
You can’t change the underlying incentives of this situation by getting better at any one activity. Better release planning, estimating, architecting, or integrating will only mitigate the symptoms. The only traditional technique for solving this problem is to add in massive queues in the forms of schedule padding, extra time for integration, code freezes and the like. In fact, most organizations don’t realize just how much of this padding is already going on in the estimates that individual developers learn to generate. But padding doesn’t help, because it serves to slow down the whole process. And as all development teams will tell you – time is always short. In fact, excess time pressure is exactly why they think they have these problems in the first place.
So we need to find solutions that operate at the systems level to break teams out of this pincer action. The agile software movement has made numerous contributions: continuous integration, which helps accelerate feedback about defects; story cards and kanban that reduce batch size; a daily stand-up that increases tempo. Continuous deployment is another such technique, one with a unique power to change development team dynamics for the better.
Why does it work?
First, continuous deployment separates out two different definitions of the term “release.” One is used by engineers to refer to the process of getting code fully integrated into production. Another is used by marketing to refer to what customers see. In traditional batch-and-queue development, these two concepts are linked. All customers will see the new software as soon as it’s deployed. This requires that all of the testing of the release happen before it is deployed to production, in special staging or testing environments. And this leaves the release vulnerable to unanticipated problems during this window of time: after the code is written but before it's running in production. On top of that, conflating the marketing release with the technical release dramatically increases the coordination overhead required to ship something.
Under continuous deployment, as soon as code is written, it’s on its way to production. That means we are often deploying just 1% of a feature – long before customers would want to see it. In fact, most of the work involved with a new feature is not the user-visible parts of the feature itself. Instead, it’s the millions of tiny touch points that integrate the feature with all the other features that were built before. Think of the dozens of little API changes that are required when we want to pass new values through the system. These changes are generally supposed to be “side effect free” meaning they don’t affect the behavior of the system at the point of insertion – emphasis on supposed. In fact, many bugs are caused by unusual or unnoticed side effects of these deep changes. The same is true of small changes that only conflict with configuration parameters in the production environment. It’s much better to get this feedback as soon as possible, which continuous deployment offers.
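One common way to keep the engineering release and the marketing release separate is a feature flag. The sketch below is only an illustration under that assumption: the `FeatureFlags` class, the `render_profile` function, and the `new_profile_badge` flag name are all hypothetical, and nothing here describes how IMVU actually did it.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """Hypothetical in-memory flag store; a real one would live in config or a database."""
    enabled: set = field(default_factory=set)

    def is_on(self, name: str) -> bool:
        return name in self.enabled

    def turn_on(self, name: str) -> None:
        self.enabled.add(name)


def render_profile(user: str, flags: FeatureFlags) -> str:
    # The new code path is already deployed to production, but it stays
    # dark until the flag is switched on (the "marketing release").
    if flags.is_on("new_profile_badge"):
        return f"{user} [badge]"
    return user


flags = FeatureFlags()
before = render_profile("alice", flags)   # deployed, not yet released
flags.turn_on("new_profile_badge")
after = render_profile("alice", flags)    # now visible to customers
```

The plumbing that passes `flags` through the system ships and gets exercised in production right away, while the visible change waits for a separate decision.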
Continuous deployment also acts as a speed regulator. Every time the deployment process encounters a problem, a human being needs to get involved to diagnose it. During this time, it’s intentionally impossible for anyone else to deploy. When teams are ready to deploy, but the process is locked, they become immediately available to help diagnose and fix the deployment problem (the alternative, that they continue to generate, but not deploy, new code, just serves to increase batch sizes to everyone’s detriment). This speed regulation is a tricky adjustment for teams that are accustomed to measuring their progress via individual efficiency. In such a system, the primary goal of each engineer is to stay busy, using as close to 100% of his or her time for coding as possible. Unfortunately, this view ignores the overall throughput of the team. Even if you don’t adopt a radical definition of progress, like the “validated learning about customers” that I advocate, it’s still sub-optimal to keep everyone busy. When you’re in the midst of integration problems, any code that someone is writing is likely to have to be revised as a result of conflicts. The same goes for configuration mismatches or multiple teams stepping on each other’s toes. In such circumstances, it’s much better for overall productivity for people to stop coding and start talking. Once they figure out how to coordinate their actions so that the work they are doing doesn’t have to be reworked, it’s productive to start coding again.
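To make the speed-regulator idea concrete, here is a toy model of a pipeline that halts for everyone when one deploy fails. The `DeployPipeline` class, its methods, and the change names are invented for illustration; a real deployment system would do far more than this.

```python
class DeployLocked(Exception):
    """Raised when someone tries to deploy while the pipeline is halted."""


class DeployPipeline:
    def __init__(self):
        self.locked_reason = None   # None means the pipeline is open
        self.deployed = []

    def deploy(self, change, checks_pass):
        if self.locked_reason is not None:
            # Nobody can deploy until the earlier failure is resolved.
            raise DeployLocked(self.locked_reason)
        if not checks_pass:
            # A failure stops the line for the whole team.
            self.locked_reason = f"{change} failed checks"
            return False
        self.deployed.append(change)
        return True

    def resolve(self):
        # A human has diagnosed and fixed the problem; reopen the line.
        self.locked_reason = None


pipeline = DeployPipeline()
pipeline.deploy("fix-typo", checks_pass=True)
pipeline.deploy("new-api-param", checks_pass=False)   # locks the pipeline
try:
    pipeline.deploy("unrelated-change", checks_pass=True)
    blocked = False
except DeployLocked:
    blocked = True
pipeline.resolve()
pipeline.deploy("unrelated-change", checks_pass=True)
```

The point of the lock is social as much as technical: the engineer holding "unrelated-change" has nothing better to do than help fix the failure.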
Returning to our development team divided into Cowboy and Quality factions, let’s take a look at how continuous deployment can change the calculus of their situation. For one, continuous deployment fosters learning and professional development – on both sides of the divide. Instead of having to argue with each other about the right way to code, each individual has an opportunity to learn directly from the production environment. This is the meaning of the axiom to “let your defects be your teacher.”
If an engineer has a tendency to ship too soon, they will tend to find themselves grappling with the cluster immune system, continuous integration server, and five whys master more often. These encounters, far from being the high-stakes arguments inherent in traditional teams, are actually low-risk, mostly private or small-group affairs. Because the feedback is rapid, Cowboys will start to learn what kinds of testing, preparation and checking really do let them work faster. They’ll be learning the key truth that there is such a thing as “too fast” – many quality problems actually slow you down.
Engineers who tend to wait too long before shipping have lessons to learn, too. For one, the larger the batch size of their work, the harder it will be to get it integrated. At IMVU, we would occasionally hire someone from a more traditional organization who had a hard time letting go of their “best practices” and habits. Sometimes they’d advocate for doing their work on a separate branch, and only integrating at the end. Although I’d always do my best to convince them otherwise, if they were insistent I would encourage them to give it a try. Inevitably, a week or two later, I’d enjoy the spectacle of watching them engage in something I called “code bouncing.” It's like throwing a rubber ball against the wall. In a code bounce, someone tries to check in a huge batch. First they have integration conflicts, which require talking to various people on the team to know how to resolve them properly. Of course, while they are resolving, new changes are being checked in. So new conflicts appear. This cycle repeats for a while, until they either catch up to all the conflicts or just ask the rest of the team for a general check-in freeze. Then the fun part begins. Getting a large batch through the continuous integration server, incremental deploy system, and real-time monitoring system almost never works on the first try. Thus the large batch gets reverted. While the problems are being fixed, more changes are being checked in. Unless we freeze the work of the whole team, this can go on for days. But if we do engage in a general check-in freeze, then we’re driving up the batch size of everyone else – which will lead to future episodes of code bouncing. In my experience, just one or two episodes are enough to cure anyone of their desire to work in large batches.
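A back-of-the-envelope calculation shows why a large batch almost never clears the pipeline on the first try. It rests on a simplifying assumption that is mine, not a measurement: every individual change has the same small chance of tripping the pipeline, independently of the others. The 5% figure below is purely illustrative.

```python
def first_try_success(batch_size: int, per_change_failure: float) -> float:
    """Chance that a batch clears the pipeline on the first attempt,
    assuming each change fails independently with the same probability."""
    return (1.0 - per_change_failure) ** batch_size


# Illustrative numbers only: a 5% chance that any single change causes trouble.
small = first_try_success(1, 0.05)    # one change at a time
large = first_try_success(50, 0.05)   # fifty changes saved up on a branch

print(f"batch of 1:  {small:.0%}")
print(f"batch of 50: {large:.0%}")
```

Under these assumptions a single change gets through about 95% of the time, while the batch of fifty gets through only about 8% of the time, and every failed attempt gives the rest of the team time to check in more conflicting changes.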
Because continuous deployment encourages learning, teams that practice it are able to get faster over time. That’s because each individual’s incentives are aligned with the goals of the whole team. Each person works to drive down waste in their own work, and this true efficiency gain more than offsets the incremental overhead of having to build and maintain the infrastructure required to do continuous deployment. In fact, if you practice Five Whys too, you can build all of this infrastructure in a completely incremental fashion. It’s really a lot of fun.
One last benefit: morale. At a recent talk, an audience member asked me about the impact of continuous deployment on morale. This manager was worried that moving their engineers to a more-rapid release cycle would stress them out, making them feel like they were always fire fighting and releasing, and never had time for “real work.” As luck would have it, one of IMVU’s engineers happened to be in the audience at the time. They provided a better answer than I ever could. They explained that by reducing the overhead of doing a release, each engineer gets to work to their own release schedule. That means, as soon as they are ready to deploy, they can. So even if it’s midnight, if your feature is ready to go, you can check-in, deploy, and start talking to customers about it right away. No extra approvals, meetings, or coordination required. Just you, your code, and your customers. It’s pretty satisfying.