If you're looking for a quick entry point, or a round-up of the key points in the guide, this playbook is a great place to start.

On-call

  • On-call isn’t just for engineers: consider who else you might need in an emergency; they should be on-call too.
  • Invest in your training process: onboarding new on-callers well is critical. Each on-call rota should define a clear path to becoming ready to ‘be on call’, including learning domain-specific content as well as your incident response process.
  • Pay anyone that’s on-call: compensate them for the inconvenience of being on-call. We recommend paying per hour spent on-call, and adjusting your compensation based on expectations.
  • Be compassionate and understanding:
    • Allow on-call teams to define their own schedules that best suit them
    • Use overrides to give on-callers flexibility, and relieve pressure when things get tough
    • Look out for anyone taking too much of the burden
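
The per-hour payment recommendation above can be sketched as a small calculation. This is a minimal illustration only: the rates, the weekday/weekend split, and the function name `on_call_pay` are all assumptions made for the example, not figures or rules from the guide.

```python
from datetime import datetime, timedelta
from decimal import Decimal

# Hypothetical rates per hour on call; real figures depend on your own policy.
WEEKDAY_RATE = Decimal("2.00")
WEEKEND_RATE = Decimal("4.00")


def on_call_pay(start: datetime, end: datetime) -> Decimal:
    """Sum pay for each whole hour of a shift, using the rate for the day the hour starts on."""
    if end < start:
        raise ValueError("shift ends before it starts")
    total = Decimal("0")
    hour = start
    while hour + timedelta(hours=1) <= end:
        # weekday() returns 5 for Saturday and 6 for Sunday.
        rate = WEEKEND_RATE if hour.weekday() >= 5 else WEEKDAY_RATE
        total += rate
        hour += timedelta(hours=1)
    return total
```

Charging a higher rate for weekend hours is one way to "adjust compensation based on expectations"; other adjustments, such as expected response time, would follow the same pattern.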

Foundations

  • Create a shared understanding of an incident: an incident is anything that takes you away from planned work.
  • Declare more incidents: using your incident process frequently means that, when things go really wrong, your processes will run like a well-oiled machine.
  • Use 3-5 human-named severities: plain-English words such as minor, major and critical are easier for everyone to understand.
  • Every incident should have a lead: whether there’s one responder or 30, someone has to play the lead role and drive the incident to a resolution.
  • Only use the roles you really need: you can often lean on actions (and your incident lead) to understand who’s doing what.
  • Collect structured data: this allows you to communicate clearly during the incident, drive automation and gather insights after the incident has been resolved.
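
As a rough sketch of what "structured data" with human-named severities and a mandatory lead might look like, the record below uses the three severity names from the list above. The `Incident` class and its field names (`declared_at`, `affected_teams`) are hypothetical, chosen for illustration rather than taken from any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class Severity(Enum):
    # Plain-English names, as suggested above; a real scheme might use up to five.
    MINOR = "minor"
    MAJOR = "major"
    CRITICAL = "critical"


@dataclass
class Incident:
    name: str
    severity: Severity
    lead: str  # every incident has a lead, even with a single responder
    declared_at: datetime
    affected_teams: list[str] = field(default_factory=list)  # hypothetical field

    def __post_init__(self) -> None:
        # Reject records that would leave the incident without a lead.
        if not self.lead.strip():
            raise ValueError("an incident needs a lead")
```

Keeping severity as an enumeration rather than free text is what makes later automation and reporting straightforward.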

Response

  • When you declare an incident:
    • Create a fresh space which you can use to co-ordinate your response
    • Announce the incident in a shared space so everyone’s in the loop
    • Assemble the team that you need to start investigating
  • As you respond to an incident:
    • Identify what’s broken & understand the impact
    • Mitigate the immediate impact
    • Take a pause
    • Resolve the issue
    • Close everything off, and assign follow-up actions
  • Send regular, easy-to-digest internal updates: using a predictable format helps busy stakeholders get the context that they need. Long gaps between updates can cause confusion or stress.
  • Show your working: document your response in an incident channel, even if you’re the only one there. It’ll help you avoid bad assumptions or mistakes, and help your team learn from what you’ve done.
  • Keep your customers in the loop: clear and frequent communication builds trust, and can turn a negative into a positive. Use simple language, tell everyone what you’re doing and what they should do in the meantime.
  • Structure your thinking: use questions and theories to methodically work through a problem, being clear about any assumptions you make along the way.
  • Calm is contagious: take breaks and keep everyone well fed so your incident response can stay on track, even on the bad days.
  • When you’re remote, over-communicate: to avoid a fragmented response, make sure everything is in one place (the incident channel) and it’s really clear who’s doing what.
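
One way to get the "predictable format" for internal updates mentioned above is to render every update from the same small template. The sketch below is an assumption about what such a template could contain; the field names and the function `format_update` are illustrative, not a prescribed format.

```python
def format_update(
    severity: str,
    lead: str,
    impact: str,
    status: str,
    next_update_minutes: int,
) -> str:
    """Render an internal update with the same fields in the same order every time."""
    lines = [
        f"Severity: {severity}",
        f"Lead: {lead}",
        f"Impact: {impact}",
        f"Current status: {status}",
        # Stating when the next update is due helps avoid long, stressful gaps.
        f"Next update in: {next_update_minutes} minutes",
    ]
    return "\n".join(lines)
```

Because the fields never move, a busy stakeholder can skim straight to the line they care about.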

Learn and Improve

  • Hold a debrief when there’s value: the responders for an incident should have a good idea whether a debrief will be valuable. If debriefs become mandatory ‘red tape’, they’ll turn into a useless checkbox exercise.
  • Make debriefs truly blameless: start with the assumption that everyone came to work to do their best, and don’t hold individuals accountable for systemic failures.
  • Value the conversation over the artifact: having a document is a useful way to share knowledge asynchronously, but the most valuable part of a debrief is usually the conversation that precedes it.
  • Use incidents to level up your team: they broaden your horizons and teach you how to build resilient systems. Bring junior members into incidents, so that your teams get the full value from them.
  • Be transparent: building a transparent culture means that stakeholders and customers will trust you and give you space to fix what’s broken.
  • Practice your incident process: just like any other skill, practice makes perfect. Dry-run your incident process regularly to get everyone up-to-speed and find the rough edges while the stakes are low.

Insights

  • Value of incident data: when done right, aggregating incident data can provide unique insight into team health, and help identify trends or patterns that you can act on proactively.
  • Tracking workload: tools like incident.io automatically track incident activity, which can be used to calculate time spent on incidents. Unlike MTTR and similar metrics, time spent resolving incidents is a great proxy for incident 'pain', and can help confirm where operational work is coming from.
  • Operational readiness: the best preparation for incidents is practice, and historical incident data can help identify how many people in your organization have recent incident experience, and whether that number is going up or down.
  • Pager load: on-call rotas are disruptive, but we should aim to minimise disruption where possible. Combine data from paging tools with incidents to understand who is being paged and when, helping catch when someone should be offered cover or rest after a bad night.
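
The workload and pager-load ideas above come down to simple aggregation once the data is exported. The sketch below assumes two hypothetical inputs, activity records as `(team, start, end)` tuples and pages as `(person, paged_at)` tuples, and treats 22:00 to 07:00 as "night"; none of these shapes or thresholds come from a specific tool.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def hours_spent_by_team(activity):
    """Total hours of incident activity per team, from (team, start, end) tuples."""
    totals = defaultdict(timedelta)
    for team, start, end in activity:
        totals[team] += end - start
    return {team: spent.total_seconds() / 3600 for team, spent in totals.items()}


def night_pages(pages, night_start=22, night_end=7):
    """Count pages per person that arrived between night_start and night_end (local hours)."""
    counts = Counter()
    for person, paged_at in pages:
        # The night window wraps past midnight, so check both sides of it.
        if paged_at.hour >= night_start or paged_at.hour < night_end:
            counts[person] += 1
    return counts
```

A high count from `night_pages` for one person is the kind of signal that could prompt an offer of cover or rest after a bad night.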