How ready are you for your next incident?
That's a question most organisations want to answer, because as things change – tenured employees leaving, new products launching, new teams being hired – your state of readiness might change too. It hurts to discover a drop in readiness only when the next incident hits and takes much longer to resolve than it should.
Incident data can help measure this, and help prompt teams to take proactive steps when readiness decreases.
Recent experience is readiness
As the best preparation for incidents is practice, you can measure readiness by tracking how many of your team have responded to different types of incident over time.
Here's an example of this, measured over the last 12 months for the incident.io team:
This chart shows how many people have led incidents, been assigned specific roles in those incidents, or just responded to incidents within the last 35 days.
By using a 35 day window to capture recency, we can use this measurement to track who is ready to take on a similar role should an incident happen now. It's great at uncovering periods when even very experienced responders have managed to avoid an incident for some time and might need retraining, or showing a decrease when tenured responders leave the company.
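To make the windowing concrete, here's a minimal sketch of how a measurement like this could be computed. It isn't how incident.io builds the chart: the `Participation` record, the role names and the `ready_people` function are all hypothetical, and we've assumed the window includes the day of measurement but excludes the day exactly 35 days earlier.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Participation:
    """One person's involvement in one incident (hypothetical shape)."""
    person: str
    role: str          # assumed values: "lead", "assigned", "responder"
    occurred_on: date


def ready_people(participations, as_of, role, window_days=35):
    """Return the set of people who held `role` within the window ending at `as_of`."""
    cutoff = as_of - timedelta(days=window_days)
    return {
        p.person
        for p in participations
        if p.role == role and cutoff < p.occurred_on <= as_of
    }


# Made-up data, purely for illustration.
history = [
    Participation("alice", "lead", date(2023, 1, 10)),
    Participation("bob", "lead", date(2022, 11, 1)),
    Participation("carol", "responder", date(2023, 1, 20)),
]

ready = ready_people(history, as_of=date(2023, 2, 1), role="lead")
print(sorted(ready))  # ['alice']
```

In this made-up data, bob has led an incident before, but not recently enough to count: exactly the kind of experienced-but-rusty responder the window is meant to surface.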
Other than a dip around Christmas due to a decrease in incidents, our numbers look quite healthy. We have about five people with recent experience of leading incidents in the last few months, which – for our team of 15 engineers – means we'll always have someone working who knows how to follow our processes.
But this isn't the whole story: it's easy to have a broad range of people who deal with typical incidents, and much harder to ensure a variety of people lead the more serious incidents.
Let's filter this for incidents where severity is Major and above, and focus on those who've led incidents:
Despite us 4x'ing the team over the last year, the number of people responding to serious incidents has remained the same. That implies we're not effectively growing the number of people who can deal with those incidents, which is a concern for us.
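As a rough illustration of that filter, the sketch below counts distinct leads of Major-and-above incidents at a few checkpoints. Everything in it is an assumption for the sake of the example: the `Severity` scale, the tuple layout, the `serious_leads` function and the data itself are invented, and only the 35 day window and the "Major and above" cut-off come from the text above.

```python
from datetime import date, timedelta
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; higher means more serious."""
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


# Made-up incidents as (lead, severity, day) tuples, purely for illustration.
incidents = [
    ("alice", Severity.MAJOR, date(2023, 1, 5)),
    ("bob", Severity.MINOR, date(2023, 1, 12)),
    ("alice", Severity.CRITICAL, date(2023, 2, 20)),
    ("carol", Severity.MINOR, date(2023, 2, 25)),
    ("dave", Severity.MAJOR, date(2023, 3, 28)),
]


def serious_leads(incidents, as_of, min_severity=Severity.MAJOR, window_days=35):
    """Distinct people who led an incident at or above `min_severity` in the window."""
    cutoff = as_of - timedelta(days=window_days)
    return {
        lead
        for lead, severity, day in incidents
        if severity >= min_severity and cutoff < day <= as_of
    }


checkpoints = [date(2023, 1, 31), date(2023, 2, 28), date(2023, 3, 31)]
trend = [len(serious_leads(incidents, day)) for day in checkpoints]
print(trend)  # [1, 1, 1]
```

A flat series like this, set against a growing headcount, is the pattern to watch for: more people are handling incidents overall, but the pool leading the serious ones isn't growing with the team.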
Proactive steps
Now that the numbers have shown that only a small group of people have experience solving major incidents, we can take proactive action to improve the situation.
At incident.io, that means we've scheduled some Game Days where we'll simulate technical issues in our staging environments, responding just as we normally would for a real incident but using a test incident type. We'll make sure anyone without recent experience of major incidents participates in these drills, increasing the number of people trained to respond when the next incident appears.
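Picking who joins a drill can be as simple as a set difference between the team and those with recent major-incident experience. The sketch below assumes you already have both lists to hand; the `game_day_invitees` function and the names are hypothetical.

```python
def game_day_invitees(team, recently_experienced):
    """Team members with no recent major-incident experience, in a stable order."""
    return sorted(set(team) - set(recently_experienced))


# Made-up names, purely for illustration.
team = ["alice", "bob", "carol", "dave"]
invitees = game_day_invitees(team, {"alice", "dave"})
print(invitees)  # ['bob', 'carol']
```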
Proactive training is a great way to reduce risk when tenured employees leave teams, or before launching new products.