
How to write an internal production failure incident communication

What do you say when the system is down?

Artwork: Ariel Davis


Kevin Riggle // Principal Consultant,

The ReadME Project amplifies the voices of the open source community: the maintainers, developers, and teams whose contributions move the world forward every day.

When something unacceptably bad happens in an organization, time is of the essence to address it. The more clearly people within the organization can communicate about what has gone wrong and where they are in the process of fixing it, the faster the organization can bring the right people and resources to bear on the problem, and the faster the problem can be fixed and the organization can return to normal operation.

In my experience, most organizations of any size will develop formal incident management processes with tooling and training to help manage the communication and coordination around bad things that happen, but none of that is strictly necessary. In fact, in order for that tooling and training to be successful, it’s important that it grows out of an organization’s healthy pre-existing culture around incidents, rather than being imposed from outside.

Even organizations as small as two or three people can benefit from thinking a bit about how we communicate around incidents. Conversely, I’ve been part of even large organizations which successfully used quite simple tools—a single email list, a couple of telephone conference bridges—and that success was possible because we had a strong culture around incident communication.

When responding to any bad thing, there are a few things that must be communicated, to some level of detail, as widely as possible within the organization:

  1. What we are perceiving which causes us to believe that something bad may be happening;

  2. Our best guess right now of how bad it is;

  3. How far along we are in our response to it;

  4. Which one person is directly responsible for coordinating the response;

  5. Where we’re coordinating;

  6. Who else is involved and in what capacity.

When I say “as widely as possible” I really do mean that. Certain people (e.g. many of my security colleagues) will always respond that certain incidents (e.g. security incidents) are special and need to have restricted distribution. This article is too short to fully address that concern, but, briefly, you can always leave sensitive details out of the message, and you should do so when appropriate.

In ten years in the industry, though, I’ve seen many incidents fail to receive appropriate attention because nobody ever sent the email declaring them as an incident (until I did, sometimes). On the flip side, I’ve never seen an incident made materially worse because someone sent the email about it—and I can think of a couple of examples that might have been but weren’t. I’m sure that counter-examples exist. They’re not the rule but the exception.

We should always err on the side of sharing information more broadly rather than less, and no healthy organization should punish or exploit such sharing.

A sample outreach might look something like this:

From: kevin@example.com
To: incidents-list@example.com
Subject: Production has failed over to secondary database

Severity 2, phase 1. I am incident commander. Coordinating via #2021-03-15-production-failover channel on Slack.

The primary database cluster stopped responding to requests at 0:02 UTC (17:02 PT). After thirty seconds, production failed over to the secondary database and alerted us. The site is still functioning; however, if whatever took down the primary database cluster also takes down the secondary cluster, the site will be hard-down.

Customer liaison: Jayla M.
Business liaison: Shruti S.
Subject-matter experts: Alícia S., Dave D.

This is a short email, but it contains all of the information I mentioned above.

  1. The subject line gives a quick summary of what we are perceiving.

  2. The severity describes how bad the problem is. Different organizations classify incidents differently, but here we’re using a scale from four (mild) to one (severe), making Severity 2 a serious but not severe incident. The details are mostly for your organization to determine, and will change as the organization changes—a potential loss of a million dollars might be severe (Severity 1) for a five million dollar company and moderate (Severity 3) for a fifty billion dollar company. What’s important is that everyone at the organization understands roughly what the severities mean. As an engineer, if I tell my manager that I’ve been pulled in to work on a Severity 1 incident, my manager should understand instantly that I’m protecting the future of the company. The severity can also be changed throughout the incident. If what we thought was a small problem turns out to be much more serious, then we should send a follow-up email with the new information and update the severity accordingly.

  3. The phase describes how far along we are in our response. Different organizations will again describe the steps differently, but in general all incidents go through roughly the same ones. The first is a (usually implied) phase of discovery. In a structure fire, for example, this would be when somebody smells smoke and calls the fire department. This is followed by a phase of quickly patching the immediate issue (here, phase 1), e.g. using hoses to put the fire out. Then comes a phase of more fully resolving the issue (phase 2), e.g. checking the structure for any lingering hot spots and boarding up doors and windows. Last is a phase of rebuilding and learning (phase 3), e.g. the long process of figuring out where the fire started and how it spread, dealing with homeowners’ insurance, rebuilding the structure, and putting changes in place so that such an incident is less likely to happen in the future. Every time the incident moves to a new phase, we should send a follow-up email indicating this.

  4. The incident commander is the one person who is directly responsible for the response. ‘Incident Commander’ is a term from the Incident Command System, the US government standard for emergency response, and in general, I try to align corporate and computer-related incident response processes with other emergency response processes. You don’t have to use this term, however. Other organizations may call this person the incident manager, the incident PM (project manager), or the incident DRI (directly responsible individual). Whatever term you use, this individual is ultimately responsible for everything that happens as part of the response, and conversely requires broad latitude to involve people from departments across the company and requisition resources in order to respond effectively. The incident commander need not remain the same person for the entire incident, and for serious and long-running incidents they should not. It is the incident commander’s particular responsibility to hand the incident off before it becomes necessary for the organization to relieve them of it, and the organization’s responsibility to ensure that there are fresh and capable incident commanders to receive it. Every time the incident is handed to a new incident commander, an email should be sent to reflect this.

  5. The venue is where coordination is taking place. Fundamentally, the reason we treat incident work differently from normal work is that normal coordination mechanisms aren’t fast and broad enough to accomplish what needs to be done, and coordination requires some venue. I personally dislike Slack and other IM systems for incident coordination, and would rather use a voice channel like Zoom or even a telephone conference bridge. It’s important for the response to have a “main thread of execution,” as it were. Slack is fine for a backchannel, but it’s much harder for the incident commander to keep everyone on the same page when discussions get lost in threads and everyone is typing at once. However, many organizations do use it successfully.

  6. Who else is involved? (I’ll talk about who in a minute.)

Every incident email will need at least a description, severity, phase, and incident commander. And every follow-up email should include at least the incident’s current severity, phase, and incident commander, including any changes.
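If you want a script or chat bot to hold people to this form, the required fields can be captured in a small data structure. The following is only a rough sketch in Python. The `IncidentUpdate` class, its field names, and its `render` method are hypothetical and not part of any real incident tool. The severity and phase ranges simply mirror the scales used in this article.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentUpdate:
    """One incident email, initial or follow-up (hypothetical structure)."""

    summary: str       # what we are perceiving; doubles as the subject line
    severity: int      # 1 (severe) to 4 (mild), the scale assumed in this article
    phase: int         # 1 = quick patch, 2 = full resolution, 3 = rebuild and learn
    commander: str     # the one person coordinating the response
    venue: str = ""    # where coordination is happening
    details: str = ""  # free-text description for the body
    roles: dict[str, list[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Refuse to build an email that lacks the minimum required fields.
        if not self.summary.strip():
            raise ValueError("an incident email needs a description")
        if not self.commander.strip():
            raise ValueError("an incident email needs an incident commander")
        if self.severity not in range(1, 5):
            raise ValueError("severity must be between 1 and 4")
        if self.phase not in range(1, 4):
            raise ValueError("phase must be between 1 and 3")

    def render(self) -> str:
        """Return the email as plain text, subject line first."""
        status = (
            f"Severity {self.severity}, phase {self.phase}. "
            f"Incident commander: {self.commander}."
        )
        if self.venue:
            status += f" Coordinating via {self.venue}."
        lines = [f"Subject: {self.summary}", "", status]
        if self.details:
            lines += ["", self.details]
        if self.roles:
            lines.append("")
            lines += [f"{role}: {', '.join(people)}" for role, people in self.roles.items()]
        return "\n".join(lines)


# Rebuild the sample email from earlier in the article.
update = IncidentUpdate(
    summary="Production has failed over to secondary database",
    severity=2,
    phase=1,
    commander="Kevin",
    venue="#2021-03-15-production-failover channel on Slack",
    roles={
        "Customer liaison": ["Jayla M."],
        "Business liaison": ["Shruti S."],
        "Subject-matter experts": ["Alícia S.", "Dave D."],
    },
)
print(update.render())
```

A follow-up email is just another `IncidentUpdate` with the current severity, phase, and commander filled in, so a change to any of them shows up in the first line of the body.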

There are several roles besides the incident commander that are important in an incident of any seriousness and in organizations of any significant size.

  1. In incidents that affect people outside the organization (for example, customers or the public), someone needs to communicate with them about what is happening, receive questions or problem reports from them, and coordinate any action which is required of them. Here that person is called the customer liaison, although in large incidents they may be the leader of their own incident team within the customer support department.

  2. If an incident is sufficiently serious, another role is to communicate with the business and executive teams about what is happening and field their questions and reports, coordinate public messaging with the customer liaison and the PR or media team, coordinate any legal review or regulatory notification which needs to take place, and handle anything else that might impact the business as a whole rather than as purely a technical enterprise. There’s nothing scarier in an incident than when the CEO shows up and starts hounding individual engineers or the incident commander for an update. Appointing one trusted individual (often themselves part of upper management) as a business liaison between the incident team and upper management helps to ensure that the business gets the information it needs to respond to the business consequences of the problem and that technical responders are free to respond to the technical issue.

  3. Finally and most important are the people who are hands-on-keyboard to understand and fix whatever has gone wrong, who are subject-matter experts (SMEs) in whatever technologies and systems are involved. They can be anyone at the company, not just people with titles like Engineer or Scientist. If your organization has declared an incident to coordinate your response to a corporate espionage operation, in which people are calling your staff and showing up on-site to try to steal sensitive information, one of the incident SMEs may be the engineer who manages the phone system, but another SME may well be the person who staffs your front desk and serves as your first line of defense. It may seem counter-intuitive, but the incident commander should almost never also serve as an SME. The incident commander’s role is to coordinate. If they are trying to fix the problem at the same time, heads-down, it is too easy for them to stop communicating with other stakeholders and lose sight of the bigger picture. This also means that incident commanders don’t need specific technical knowledge of the systems involved in the incident. I’ve often successfully run incidents involving systems I’d never heard of before the incident. Asking good questions, listening to the answers, sorting out disagreements, and collaboratively deciding on a path forward are the most important skills of a successful incident commander.

Not all incidents will need to have all roles filled. For instance, if an incident has no customer impact, then it’s unlikely that a customer liaison is needed to resolve it. However, customer support teams should always be aware of and able to participate in ongoing incidents. It’s often hard for backend technical teams to fully anticipate customer impact, and it’s easy to think that something doesn’t have customer impact when in fact it does.

I’ve often had people on the customer team notice that there’s an incident happening, connect it to complaints coming in from the field, and show up to help me fix it. Conversely, I’m always afraid that I’m only going to find out in the incident review meeting that customers were very mad, the company lost millions of dollars of business, and customer support felt that there was nothing they could do about it. The former is a much better outcome than the latter, and it’s only possible if customer support teams are able to see and participate in incidents.

It’s important to be clear that these are all just tools to communicate. There is no fundamental ground truth that makes an incident, say, severity 2 instead of severity 1, or severity 3 instead of severity 4.

Most companies will establish rough guidelines (say, a severity 2 incident is one that could lose us $1M or more) but severity ratings exist as much to communicate how the incident commander wants other people within the organization to respond to the incident as to communicate about the incident itself.

When I was at Akamai, I would often use this to guide how I prioritized incidents. If I got asked to run an incident and immediately wanted someone in management to help me explain to the C-suite what was going on, well, that had to be at least a severity 2 incident, because all incidents of severity 2 or more severe were required to have a business liaison.
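A rule like that one can be written down as a tiny lookup, which makes it easy to check an incident email before it goes out. This is only a sketch: the table, the role names, and the `missing_roles` helper are hypothetical, and the thresholds are whatever your organization decides. It assumes the scale used in this article, where 1 is the most severe and 4 the mildest.

```python
# Hypothetical policy table: each role is mandatory for incidents at this
# severity number or more severe (that is, numerically lower).
REQUIRED_AT_OR_ABOVE = {
    "incident commander": 4,  # every incident needs one
    "business liaison": 2,    # the example rule described above
}


def missing_roles(severity: int, assigned: set[str]) -> list[str]:
    """Return the mandatory roles that nobody has been assigned to yet."""
    return sorted(
        role
        for role, threshold in REQUIRED_AT_OR_ABOVE.items()
        # On a 1-to-4 scale, "at least as severe" means a lower-or-equal number.
        if severity <= threshold and role not in assigned
    )


# A severity 2 incident with only a commander still needs a business liaison.
print(missing_roles(2, {"incident commander"}))
```

Raising or lowering the severity after the fact just means running the check again with the new number.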

We would tell incident commanders not to spend more than two minutes picking a severity when they were writing the initial incident email—it was always possible to raise or lower the severity after the fact. The important thing was to get people communicating and coordinating quickly.

If you and your organization adopt this form for your incident communications and use it consistently, you don’t need any special tooling to run your incident process, although you can have it if you want. What’s important is that you clearly and consistently communicate the essential information about what’s going on, both to everyone involved in the incident and to the rest of the organization. Because what distinguishes organizations isn’t whether bad things happen at all—it’s how well we respond to them when they do.

A special thank you to Nelson Elhage, Riking, and others for their helpful feedback on an early draft of this article.

Hi there 🙂 I’m Kevin (@kevinriggle). While I learned to code while I learned to read and have loved computers my whole life, I realized pretty quickly after college that I wasn’t going to be happy spending the next forty years of my life sitting in a cubicle in a corner writing code and not talking to anybody. After a bit of a sojourn I landed in security, which I love because it’s the intersection of the hardest technical problems we know (like cryptography) and the hardest social problems we know (like making cryptography usable by people). I got my start in Infosec at Akamai, where I helped to redevelop our incident management process, train incident managers, and served as an incident manager myself on some gnarly and interesting incidents. These days I have my own little shop, and I’m available for consulting on incident management as well as several other areas of security, privacy, and the broader emerging field of software safety. In my “spare time” I make videos on security and safety topics. 😉
