If there are one thing developers like less than writing documentation, it’s responding to unnecessary escalations.
Running a successful website requires the cooperation of operations (ops) and developers (devs) thus the term “DevOps”. When there is a conflict, antagonisms, or disharmony, the website suffers. Simply telling people to “get along” is not effective. Authentic cooperation is the result of providing a structure that enables and encourages cooperation.
Often one such area of conflict and disharmony involves on-call escalations. The monitoring system alerts ops folks that there is a problem or a growing issue that could lead to an outage. Whoever is ‘on-call from operations must handle the issue or escalate the issue to developers, no matter the time of day or night.
There lies the potential for conflict. Too many escalations wear down the developers. The disharmony begins with exclamations like “I just fixed something easy! Why can’t those operations folks do their jobs?”
Operations get defensive. “How was I supposed to know?” or, “I just asked a question, and now they’re being jerks about it!”
The disharmony can start in operations too. “Oh great! Another surprise from the devs!”
You can’t force people to cooperate, but you can set up structures and glide paths that create an environment for cooperation.
One such paradigm is the dynamic runbook feedback loop.
Dynamic runbook feedback loops
A runbook is a set of procedures for how to respond in situations such as receiving an alert from the monitoring system. The goal of the feedback loop is to create a mechanism where subject-matter experts create runbooks, but both devs and ops are empowered to improve them in ways that reduce the number of escalations and improve cooperation.
The goal of this process is to establish the proper balance of effort versus value when crafting documentation. It is a waste of effort for someone to write a 100-page treatise about a simple issue; but a runbook that is too brief isn’t useful. This paradigm leads to the right balance. A runbook starts at the size the original author believed to be appropriate given the knowledge at hand, but evolves to the right size as it gets exercised. Runbooks that are rarely or never needed to receive less effort and frequently-used runbooks are updated, optimized, and possibly turned into automated responses.
This is in striking contrast to organizations where runbooks are created by “on high” without input from people with direct involvement. Often these either can’t be changed or change require a heavy-weight process that impedes any improvement.
Write it down
The easiest way to distribute team knowledge is to have good documentation, allowing anyone who encounters an unfamiliar issue to follow a tested process to resolve it. That’s what runbooks are supposed to be.
Our preferred format is a bullet list that ops can use to resolve alerts. When an alert arrives, ops follow the instructions in the runbook. If we get to the end of the bullet list and the issue is unresolved, then ops escalate to the developers.
An organization’s developers need to write the docs, obviously. But all too often, docs get assigned as low priority and get pushed to the back burner in order to ship new product, feature upgrades, and other work deemed mission-critical. They never get around to it. Management needs to include runbook creation as part of the project.
Developers are not always comfortable writing. Staring down a blank page can be intimidating. To overcome the blank page syndrome, give them a template. It won’t even feel like you’re asking them to write a document; you just want them to fill out this form. If you make the template a series of bullet points, each being the procedure to handle a specific alert, the document becomes almost trivial to complete. The basic piece of documentation here is how to deal with every alert.
To motivate devs, remind them that the more time spent writing runbooks, the less likely issues will escalate to them. Every company wants to reduce escalations—especially the developers who have to field those escalations during off hours—and the path to get there is through documentation.
The other side of this process—the alerts—should reinforce the idea that any alert has a corresponding process in the runbooks by including the runbook URL in the alert text. This increases the likelihood that procedures will be complied with.
Channeling feedback into better runbooks
Here’s where the feedback loop begins.
At some point, an ops person is going to get an alert that is insufficiently documented and needs to be escalated to the developers. Once you go through every step on the runbook and exhaust every idea you have, it’s time to talk to the folks who wrote the code.
This is the first point of feedback: the ops engineer reads the runbook and tries to implement the process documented therein. The dev may have documented every alert, but this is where the rubber hits the road; if a runbook doesn’t fix a problem when it says it does, the ops person should correct it immediately or at least identify the problem. When the ops person escalates to the developer, that’s the start of a feedback loop.
If a problem does get escalated, then that should trigger an update to the doc. Whether it’s the ops engineer adding what they were unsure about, noting the use case that triggered the escalation, or the developer augmenting the bullet list, the end result makes operations more self-sufficient and reduces future escalations. Maybe your clever ops engineer wrote a shell script that bundles ten bullets into a single command; edit the runbook and include it.
And yet, sometimes you run into unknown or unpredictable problems. At a previous company, I got a runbook that was just one bulleted item: “This should never happen. If it does, call the developers.” I spoke with the developer and, without accusing them of being lazy, asked if this was the best runbook entry they could make? It turned out it was! They were monitoring for an assertion that should never happen, but if it ever did happen, they wanted to know. It was the kind of situation where a fix could only be determined after the first time it happened.
It is important to remove speed bumps to fixing and updating runbooks. Management should create a culture where everyone is expected and encouraged to edit frequently, in real-time, not at the end of a project. I recommend that ops engineers go one step further. When you read a runbook, do it in edit mode. It removes the temptation to procrastinate about updates. You may think, ‘I’m in the middle of addressing a problem in production, I’ll come back later and edit this.’ No, you’re never going to come back. There’s never a later. If you’re in edit mode, you can fix that broken comma, paste in an updated command, or at least make a note that something needs improvement.
If a step doesn’t work, or you’re not sure what does, still put it in the edit. These are living documents, so not every edit has to be perfect. Put in a note to self or the developer: “that last statement is confusing” or “bullet three didn’t work.” At least the next person that sees this will know to be careful when they get to bullet item number three. Identifying the problem is better than silence.