Measuring incident impact can be nuanced: what are the worst kinds of incidents? The ones that impact the most customers? The ones with the biggest monetary impact? The most severe?
One measure is totally undeniable, though: how much time people spent trying to resolve the incident.
Ignoring business-specific aspects, time spent handling operational work is time not spent building product or serving customers. It represents a normally invisible cost of supporting the service you already provide, and it can be spread across many people even in a single incident.
If you could assign an hourly value to work spent on incidents, how many questions might that help you answer?
There are a few ways to measure incident time, but only one approach gives a truly accurate insight into the complexity and cost of an incident. Here’s a look at the options:
Duration from start to finish: This relies on accurate start and finish times, which can be somewhat subjective. Do we start the clock when the impact began, or when the incident was declared? While this measure shows how long the incident lasted, it doesn’t capture how many people were involved or in what capacity.
Duration multiplied by the number of people involved: This offers a more detailed view of the aggregate time but still carries subjectivity—especially if not everyone was actively engaged. For instance, if five people join just to stay in the loop, should their time be counted?
Active person-hours: This approach tracks individual time spent by each person, and aggregates it across the incident. Though it can be tricky to measure accurately, tools like Microsoft Teams and Slack (for tracking written activity) and Zoom or Google Meet (for call duration) make it technically possible to get a precise view.
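To make the contrast concrete, here's a small sketch (the incident record and numbers are made up) showing how each option would score the same incident differently:

```python
from datetime import datetime

# Hypothetical incident record: start/end times plus per-person active hours.
incident = {
    "started_at": datetime(2024, 3, 1, 10, 0),
    "resolved_at": datetime(2024, 3, 1, 14, 0),
    "responders": {
        "alice": 3.5,   # hours of active work, however you track it
        "bob": 1.0,
        "carol": 0.25,  # joined mostly to stay in the loop
    },
}

# 1. Duration from start to finish: 4 hours, regardless of who was involved.
duration = (incident["resolved_at"] - incident["started_at"]).total_seconds() / 3600

# 2. Duration multiplied by number of people: 12 hours, even though carol barely participated.
duration_x_people = duration * len(incident["responders"])

# 3. Active person-hours: 4.75 hours, summing what each person actually spent.
active_person_hours = sum(incident["responders"].values())

print(duration, duration_x_people, active_person_hours)
```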
When we discuss workload metrics, we focus on accurate tracking of actual person-hours spent on incidents.
We’ve kept this guide generally free of incident.io specifics, but given the complexity of calculating workload accurately without it, the remainder of this section includes some details specific to the incident.io platform.
For incident.io customers, we provide a measure of time spent on an incident in the form of 'workload', and allow you to split that by the dimensions you assign to incidents, from custom fields to incident severity.
Split by whether that time was spent in working hours, late in the evening, or when we'd expect people to be sleeping, the last year of incident.io's data looks like this:
You can read this chart as how many hours a week we spent dealing with incident-related work. It's interesting that you can see details about our team represented in the graph, such as:
If you're tagging incidents with metadata, like affected functionality or services and impacted customers, you can slice workload to understand the biggest contributors to operational work by whatever dimension is relevant.
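As a rough illustration, assuming you can export workload as rows of hours plus the metadata you tag incidents with (the field names below are hypothetical), slicing by a dimension is a simple aggregation:

```python
from collections import defaultdict

# Hypothetical export: one row per (incident, responder) with calculated
# workload hours and whatever metadata the incident was tagged with.
workload_rows = [
    {"incident": "INC-101", "hours": 6.0, "affected_integration": "Jira"},
    {"incident": "INC-102", "hours": 1.5, "affected_integration": "PagerDuty"},
    {"incident": "INC-103", "hours": 4.0, "affected_integration": "Jira"},
]

def workload_by(rows, dimension):
    """Sum workload hours by any dimension to find the biggest contributors."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row["hours"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

print(workload_by(workload_rows, "affected_integration"))
# [('Jira', 10.0), ('PagerDuty', 1.5)]
```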
For incident.io, one of the primary drivers of incident work is third-party integrations. Whether we're adding a new integration or extending an existing one, working with third-party APIs across hundreds of customer installations generates bugs, from which we create incidents.
So a question we ask of workload is "which integrations create the most operational work?".
Plotting our roadmap against workload split by the "Affected Integration" custom field:
It's easy to see that:
Having this data means we know what to expect next time we work with Jira/Confluence, and can factor that into our project planning. It's also a nudge to check we didn't overload our support staff during these weeks, and to confirm we didn't miss other issues while this work was ongoing.
It's often easy to 'feel' incident themes as you work them, and to use your last week or month of memory to judge what might be causing problems. But this comes with recency bias, and very few people see every incident, even before considering how memories fade.
Workload fixes this, and helps capture even the slow-burn operational issues that can eat away at productivity over long periods of time. It's one of the most valuable uses of incident data, and can help inform discussions about technical debt and risk during product planning.
If you're like us, seeing this data for your team would be fascinating. But we'd struggle to trust it until we understood how it was measured, as time-tracking tools can often be very wrong.
In simple terms, we generate workload by watching activity taken against an incident, treating that activity as a signal that someone is actively working on it.
The rules are:
If we see someone message the incident channel at 10:11am, we'll immediately assign them 10 minutes of workload.
If that person then updates a Zendesk ticket attached to this incident at 10:15am, we'll adjust the first allocation to 4 minutes of activity (from 10:11 to 10:15) and allocate another 10 minutes from that point.
For the average responder focused on specific incidents, this provides an accurate picture of their time. But for incident managers who work across many incidents at a time, we apply a final calculation that trims their workload so it never overlaps across incidents, preventing us from counting more than one hour of work in any given hour.
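To make that concrete, here's a minimal sketch of the windowing logic in Python. It's our own simplified reading of the rules above, not incident.io's actual implementation: each activity opens a 10-minute window that gets trimmed when the next activity arrives, and a person's intervals are merged across incidents so no hour is counted twice.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)

def activity_intervals(events):
    # Turn activity timestamps for one person in one incident into (start, end)
    # work intervals: each activity earns up to 10 minutes, trimmed if another
    # activity arrives sooner.
    events = sorted(events)
    intervals = []
    for i, event in enumerate(events):
        nxt = events[i + 1] if i + 1 < len(events) else None
        end = min(event + WINDOW, nxt) if nxt is not None else event + WINDOW
        intervals.append((event, end))
    return intervals

def merge(intervals):
    # Merge overlapping intervals so time spent across several incidents in the
    # same period is only counted once.
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def workload_minutes(per_incident_events):
    # per_incident_events: one list of activity timestamps per incident,
    # all belonging to the same person.
    all_intervals = [
        interval
        for events in per_incident_events
        for interval in activity_intervals(events)
    ]
    return sum(
        (end - start).total_seconds() / 60 for start, end in merge(all_intervals)
    )

# The example from the rules above: messages at 10:11 and 10:15 in a single
# incident come to 4 + 10 = 14 minutes of workload.
events = [datetime(2024, 3, 1, 10, 11), datetime(2024, 3, 1, 10, 15)]
print(workload_minutes([events]))  # 14.0
```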
We're confident this calculation is representative of individual contributions, and can be used as a valuable proxy for effort put into an incident.
Not every hour of work is equal when it comes to incidents, especially when you might be asking people to work outside their normal hours.
You may have noticed the workload chart split hours by:
This is a dimension we apply to calculated workload to understand when a period of work happened relative to the user's timezone, which we align with their Slack timezone at the point the workload is seen to have occurred.
Slack is a great source of timezone data as it syncs with your devices automatically, helping ensure accuracy and making this data useful even for teams who regularly travel internationally.
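As an illustration of that idea, here's a rough sketch that buckets a moment of workload by the responder's local time using an IANA timezone like the one Slack stores. The hour boundaries are our own illustrative assumptions, not incident.io's exact definitions:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def time_of_day_bucket(timestamp: datetime, user_timezone: str) -> str:
    """Classify a moment of workload relative to the responder's local time.
    The boundaries (9-18 working, 18-23 evening, else sleeping) are assumptions."""
    local = timestamp.astimezone(ZoneInfo(user_timezone))
    if 9 <= local.hour < 18:
        return "working hours"
    if 18 <= local.hour < 23:
        return "evening"
    return "sleeping hours"

# 22:30 UTC is late evening for a responder in London,
# but mid-afternoon for one in San Francisco.
ts = datetime(2024, 3, 1, 22, 30, tzinfo=ZoneInfo("UTC"))
print(time_of_day_bucket(ts, "Europe/London"))        # evening
print(time_of_day_bucket(ts, "America/Los_Angeles"))  # working hours
```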
Workload really is the gold standard of "how bad was this incident?", avoiding the many pitfalls of the other methods.
Beyond "where is incident work coming from?", you might ask:
It's become a key measure we use across all our insights at incident.io, and one that will only improve as we build integrations that bring other parts of your organization, such as Customer Support and Sales, into incidents and can account for their involvement too.