I met with a client today who is trying to move to a DevOps model for software delivery. One question they asked caught me off guard: how do you deal with having to wake developers up in the middle of the night?
The challenge they faced was the DevOps expectation that, instead of handing production code off to a separate support team, each developer supports their own code, even when a problem crops up after hours.
I saw their point, although I’m not aware of any organizations that have this requirement (if yours has this policy, please post a comment to that effect). It did seem to me that such a policy would be a disincentive to putting buggy code into production, which would be a good thing. On the other hand, it would also encourage developers to play it safe and perhaps be less bold with their code than they otherwise would be, which might limit their creativity.
But then the client informed me that they actually paid on-call developers a bonus for every week they served in this support role. Cold hard cash, of course, changes the incentive picture completely. Now the question is whether the cash more than compensates for the hassle of middle-of-the-night support calls and whether, in the extreme case, the bonus might actually lead to intentional sloppiness.
The challenge with such governance questions is making sure that you are incentivizing the behavior you desire. In the case of DevOps, you want to incentivize not only bug-free, creative code but also the proper configuration of the operational environment, which means, in part, automated recovery from failure. Sure, bugs still crop up, but if you can automate your way around them, then fixing them can likely wait until morning.
It’s no accident that Cloud and DevOps go hand in hand in most organizations moving to a DevOps model, since automating the operational environment is an essential enabler of successful DevOps. But this automation is not just about streamlining deployment and the proper running of code in production. It’s also about automating responses to failure.
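To make "automating responses to failure" a little more concrete, here is a minimal sketch in Python of the idea: try an automated restart a few times, and only wake someone up if that does not work. Everything named here (`auto_recover`, `is_healthy`, `restart`, `page_on_call`, `max_restarts`) is hypothetical and invented for illustration. It is not the client's setup or any particular tool's API, and a real environment would more likely lean on its platform's own health checks and restart policies.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RecoveryResult:
    recovered: bool          # True if the service is healthy when we finish
    restarts: int            # how many automated restarts were attempted
    log: List[str] = field(default_factory=list)


def auto_recover(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    max_restarts: int = 3,
    page_on_call: Optional[Callable[[], None]] = None,
) -> RecoveryResult:
    """Try automated restarts first; page a human only as a last resort.

    All callables are supplied by the caller (hypothetical hooks):
    is_healthy checks the service, restart restarts it, and
    page_on_call wakes up whoever is on call.
    """
    if is_healthy():
        return RecoveryResult(True, 0, ["healthy; nothing to do"])

    log: List[str] = []
    for attempt in range(1, max_restarts + 1):
        restart()
        log.append(f"restart attempt {attempt}")
        if is_healthy():
            # The bug is still there, but the fix can wait until morning.
            log.append("recovered; file a ticket for the morning")
            return RecoveryResult(True, attempt, log)

    # Automation could not work around the failure, so escalate.
    log.append("automated recovery failed; paging on-call developer")
    if page_on_call is not None:
        page_on_call()
    return RecoveryResult(False, max_restarts, log)
```

The design choice worth noticing is the order of operations: the middle-of-the-night call becomes the exception path rather than the default, which changes the incentive question from "who gets woken up?" to "how good is our automated recovery?"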