Measuring incident impact can be nuanced: what is the worst kind of incident? The one that affects the most customers? The one with the largest monetary impact? The most severe?
One measure is undeniable, though: how much time people spent trying to resolve the incident.
Ignoring business-specific aspects, time spent handling operational work is time not spent building product or serving customers. It represents a normally invisible cost of supporting the service you already provide, and can be spread across many people even in a single incident.
If you could assign an hourly value to work spent on incidents, how many questions might that help you answer?
Introducing workload metrics
For incident.io customers, we provide a measure of time spent on an incident in the form of 'workload', and allow you to split that by the dimensions you assign to incidents, from custom fields to incident severity.
Split by whether that time was spent in working hours, late in the evening or when we'd expect people to be sleeping, the last year of incident.io's data looks like this:
You can read this chart as how many hours a week we spent dealing with incident-related work. I think it's interesting that you can see details about our team represented in the graph, such as:
- We've grown a lot in the last year and onboarded the majority of our customers, leading to an increase in operational work.
- We use incidents even for small bugs, and assign one person full-time each week to handle incoming work. It's unsurprising to see the last few months hover around 30 hours a week, which is about one person's worth of work.
- Incidents are usually caused by change, and you change less when fewer people are working: hence workload restarting as we come out of the summer holidays (Sept 22).
What's generating incidents?
If you're tagging incidents with custom fields, you can slice workload to understand the biggest contributors of operational work by whatever dimension is relevant to your business.
For incident.io, one of the primary drivers of incident work is third-party integrations. Whether we're adding a new integration or extending an existing one, working with third-party APIs across hundreds of customer installations generates bugs, from which we create incidents.
So a question we ask of workload is "which integrations create the most operational work?".
Plotting our roadmap against workload split by the "Affected Integration" custom field:
It's easy to see that:
- "Jira Sync" introduced a feature that synced incident data into a parent Jira ticket, and caused a substantial amount of long-tail work as we adapted to each customer's specific Jira setup.
- "Postmortems Track and Trace" built exporting post-mortem documents to Confluence, causing several bugs around permissioning and formatting as customers adopted it.
Having this data means we know what to expect next time we work with Jira/Confluence, and can factor that into our project planning. It's also a nudge to check we didn't overload our support staff during these weeks, and confirm we didn't miss other issues while this work was ongoing.
It's often easy to 'feel' incident themes as you work them, and use your last week or month of memory to judge what might be causing problems. But this comes with recency bias, and very few people see every incident, even before considering how people's memory fades.
Workload fixes this, and helps capture even the slow-burn operational issues that can eat away at productivity over long periods of time. It's one of the most valuable uses of incident data, and can help inform discussions about technical debt and risk during product planning.
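Slicing workload by a custom field, as in the "Affected Integration" example above, amounts to a grouped sum. The sketch below is only an illustration: the record shape (`hours`, `custom_fields`) and the sample values are hypothetical, and it is not how incident.io stores or queries workload.

```python
from collections import defaultdict


def workload_by_dimension(records, dimension):
    """Sum workload hours per value of one custom field, largest first.

    `records` is a hypothetical list of dicts shaped like
    {"hours": 1.5, "custom_fields": {"Affected Integration": "Jira"}}.
    """
    totals = defaultdict(float)
    for record in records:
        # Incidents without the field are grouped under a placeholder key.
        key = record["custom_fields"].get(dimension, "(unset)")
        totals[key] += record["hours"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Made-up sample data, purely for illustration.
sample = [
    {"hours": 1.5, "custom_fields": {"Affected Integration": "Jira"}},
    {"hours": 0.5, "custom_fields": {"Affected Integration": "Confluence"}},
    {"hours": 2.0, "custom_fields": {"Affected Integration": "Jira"}},
    {"hours": 0.25, "custom_fields": {}},
]
print(workload_by_dimension(sample, "Affected Integration"))
```

Sorting by total puts the biggest contributor first, which is usually the question being asked of the data.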
How is it measured?
If you're like us, seeing this data for your team would be fascinating. But we'd struggle to trust it until we understood how it was measured, as time-tracking tools can often be very wrong.
In simple terms, we generate workload by watching activity on an incident and treating that activity as a signal that someone is actively working on it.
The rules are:
- If we see incident activity, such as a message in the incident channel or work in integrations like issue trackers, we'll assume the related person has spent 10 minutes working on this incident.
- If the same user takes another action for the same incident within 20 minutes of the last, we'll assume they've been working this incident continuously since the last time we saw them.
As an example, if we see someone message the incident channel at 10:11am, we'll immediately assign them 10 minutes of workload.
If that person were to update a Zendesk ticket attached to this incident at 10:15am, we'd adjust to assume 4 minutes of activity (from 10:11 to 10:15) and allocate another 10 minutes after that, for 14 minutes in total.
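The two rules and the worked example can be sketched in a few lines of Python. This is a minimal illustration of the rules as described, not incident.io's implementation: the function names are hypothetical, and treating a gap of exactly 20 minutes as continuous is an assumption.

```python
from datetime import datetime, timedelta

ACTIVITY_CREDIT = timedelta(minutes=10)    # time assumed after each action
CONTINUITY_WINDOW = timedelta(minutes=20)  # max gap still treated as continuous work


def workload_intervals(timestamps):
    """Turn one user's activity on one incident into (start, end) work intervals."""
    intervals = []
    last_seen = None
    for ts in sorted(timestamps):
        if last_seen is not None and ts - last_seen <= CONTINUITY_WINDOW:
            # Continuous work: count the gap since the last action,
            # then allocate a fresh credit after this one.
            start, _ = intervals[-1]
            intervals[-1] = (start, ts + ACTIVITY_CREDIT)
        else:
            # First action, or too long since the last one: new interval.
            intervals.append((ts, ts + ACTIVITY_CREDIT))
        last_seen = ts
    return intervals


def total_workload(timestamps):
    """Sum the length of all work intervals for one user on one incident."""
    return sum(
        (end - start for start, end in workload_intervals(timestamps)),
        timedelta(),
    )


# The example from the text: a message at 10:11, a ticket update at 10:15.
actions = [datetime(2022, 9, 5, 10, 11), datetime(2022, 9, 5, 10, 15)]
print(total_workload(actions))  # 0:14:00
```

Keeping intervals rather than a running total makes it possible to reason later about when the work happened, not just how much of it there was.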
For the average responder focused on specific incidents, this provides an accurate picture of their time. But for incident managers who work across many incidents at a time, we apply a final calculation to trim their workload to ensure we never overlap workloads across incidents, preventing us from calculating more than one hour of work in any hour.
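The text does not say how the trimming decides which incident keeps the time, so the sketch below picks one plausible policy as an assumption: intervals that start earlier keep their time, and later ones are clipped to begin where the previous one ended. The function name and tuple shape are hypothetical.

```python
from datetime import datetime, timedelta


def trim_overlaps(intervals):
    """Trim one user's (incident_id, start, end) intervals so that none overlap.

    Assumed policy: earlier-starting intervals win, and later ones are clipped
    to start where already-attributed time ends. Intervals left empty are dropped.
    """
    trimmed = []
    cursor = None  # end of the time already attributed to this user
    for incident_id, start, end in sorted(intervals, key=lambda i: i[1]):
        if cursor is not None and start < cursor:
            start = cursor
        if start < end:
            trimmed.append((incident_id, start, end))
            cursor = end
    return trimmed


# Hypothetical incident manager active on two incidents at once.
day = datetime(2022, 9, 5)
overlapping = [
    ("INC-1", day.replace(hour=10, minute=0), day.replace(hour=10, minute=30)),
    ("INC-2", day.replace(hour=10, minute=20), day.replace(hour=10, minute=40)),
]
trimmed = trim_overlaps(overlapping)
total = sum((end - start for _, start, end in trimmed), timedelta())
print(total)  # 0:40:00
```

Untrimmed, these two intervals would add up to 50 minutes of work in a 40-minute window. After trimming, the total can never exceed the wall-clock time the person was active.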
We're confident this calculation is representative of individual contributions, and can be used as a valuable proxy for effort put into an incident.
Working/late/sleeping hours
Not every hour of work is equal when it comes to incidents, especially when you might be asking people to work outside their normal hours.
You may have noticed the workload chart split hours by:
- Working: 8am-7pm
- Late: 7pm-11pm
- Sleeping: 11pm-8am
This is a dimension we apply to calculated workload to understand when a period of work happened relative to the user's timezone, which we align with their Slack timezone at the point this workload is seen to have occurred.
Slack is a great source of timezone data as it syncs with your devices automatically, helping ensure accuracy and making this data useful even for teams who regularly travel internationally.
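As a rough sketch of that classification, the function below converts a moment into the user's local time and buckets it using the three windows listed above. The function name is hypothetical, and assigning each boundary (for example exactly 7pm) to the later bucket is an assumption. Any `tzinfo` works as `user_tz`, including a `zoneinfo.ZoneInfo` built from a timezone name.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo

WORKING_START = time(8, 0)    # 8am
LATE_START = time(19, 0)      # 7pm
SLEEPING_START = time(23, 0)  # 11pm


def classify_hour(moment: datetime, user_tz: tzinfo) -> str:
    """Label a timezone-aware moment as working, late or sleeping in local time."""
    # `moment` should be timezone-aware; naive datetimes would be
    # interpreted in the machine's local timezone by astimezone().
    local = moment.astimezone(user_tz).time()
    if WORKING_START <= local < LATE_START:
        return "working"
    if LATE_START <= local < SLEEPING_START:
        return "late"
    return "sleeping"  # 11pm up to 8am, wrapping past midnight


# The same instant lands in different buckets for users in different timezones.
moment = datetime(2022, 9, 5, 18, 30, tzinfo=timezone.utc)
print(classify_hour(moment, timezone.utc))                  # working
print(classify_hour(moment, timezone(timedelta(hours=1))))  # late
```

A period of work that spans a boundary would need to be split at that boundary before being classified, which this sketch leaves out.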
Gold standard
Workload really is the gold standard of "how bad was this incident", avoiding the many pitfalls that exist for other methods.
Beyond "where is incident work coming from?", you might ask:
- Is incident workload increasing or decreasing?
- What are the services that generate the most incidents?
- How has workload changed in proportion to my team size?
- Are we seeing out-of-hours workload increase?
It's become a key measure we use across all our insights at incident.io, and one that will only improve as we build integrations that bring other parts of your organization, such as Customer Support and Sales, into incidents, so we can account for their involvement too.