The Incident Way

The Incident Way is our philosophy and set of core beliefs about incident management. It spans everything from defining what an incident is to leveraging automation, and from learning from incidents to using data to gain insights.

Incidents are urgent, reactive work

For many, incidents are large events that happen infrequently. We see the world differently. We believe an incident is anything that takes you away from planned work with a degree of urgency, however big or small.

Whether your entire service is offline, or you’re dealing with a minor bug affecting your largest customer, if you’re dropping other work to manage it straight away, it’s an incident.

Incidents are for everyone

Whilst they often start in engineering, incidents are rarely resolved with engineering effort alone. Customer support, executives, legal people and many others play a significant role in running incidents effectively.

Everyone can raise the alarm

Reporting incidents is a shared responsibility, and everyone should be encouraged to declare early and often when they see something wrong. It can be daunting to flag an issue, so whether it’s a real issue or a false alarm, reporting should be recognised and celebrated.

Incidents should be managed in the open

Incidents are best dealt with in the open, for all to see. Doing so allows for cross-pollination of approaches and best practices, and learning across domains.

Public incidents should be the default, and private ones the exception.

Incident management is a communication and coordination problem

Great incidents are founded on great communication, and friction that gets in the way of that should be minimised. Tools and processes should fit around your communication, and not the other way around.

Automation and AI are human accelerants, not human replacements

Automation has a place in incidents, but managing incidents is a fundamentally human process. No system can provide the necessary adaptability and resilience of humans.

Systems like AI are powerful accelerants, which should be used to aid decision making and make humans more efficient, not to replace them entirely.

Being open with customers builds trust

Publicly disclosing incident details is a way to build customer trust. Whilst counter-intuitive to some, the act of sharing is a net positive for most businesses.

Where possible, communication with customers should be frequent and in open channels like status pages.

Learning is the goal and actions are a healthy by-product

Incidents provide a valuable and concentrated way to learn about technical and organizational risks and blind spots. Incident data can be used as an organizational observability tool, allowing you to see how things truly work, and not just how you imagine them.

The goal of a healthy post-incident review is to more deeply understand what happened, and catalyze learning. Preventing reoccurrence through action items is an important and encouraged by-product of this process.

Data is always the starting point for an investigation

It can be tempting to draw conclusions from data alone, but incident data is usually the starting point for a deeper investigation. Raw data can be helpful to spot trends, but it fails to capture the nuances of complex and messy human actions, and rarely serves as the answer in itself.

Models are helpful to develop a shared understanding at speed

No tool can perfectly model the complexities of an incident, but severities, statuses and other data models are helpful aids to develop a shared understanding.

Is this Major or Minor? Are we in the ‘Investigating’ or ‘Fixing’ stage? The answer is rarely objective, but having a model to catalyse the conversation and assist with the decision-making process is a net positive.
