How to improve incident triaging for better organization-wide incident response

Incidents happen all day long—big and small, significant and trivial. Because of this, it’s important to have a dedicated process in place to streamline incident response and resolve issues as quickly as possible to avoid pile ups.

But what about when you aren’t exactly sure if an incident is, well, an incident? This is where triaging comes into play.

Triaging ensures that your incident response teams are prioritizing issues that actually have impact potential and aren’t going through the motions for issues that are insignificant.

In this post, I’ll talk through the incident triage process and explain how you can leverage it to improve your incident response—don’t worry, triage isn't as scary as it sounds.

On that note, we thought it was worth reminding you about our position on incidents. We’ve covered this before, but we feel that it’s most sensible to expand what qualifies as an incident: anything from security incidents like data breaches, DNS attacks, or phishing attacks to everyday issues like your checkout page being broken.

When you expand what gets categorized as an incident and lower the bar for what counts as one, you allow your organization to rally around a single process to respond to issues more predictably. No more ad hoc fixes or lack of visibility into how things are being resolved.

With that, let’s talk about triaging incidents.

What's incident triage?

Incident triage is the process of prioritizing potential incidents. The triage process begins as soon as you realize that an incident might be occurring. Once in the triage stage, a responder can then come in and investigate further and diagnose accordingly.

If the responder determines that it’s a true incident, it goes through the normal motions. Otherwise, it gets flagged and demoted to a non-issue.

In short, incident triage allows response teams to investigate the scope and scale of events without prematurely elevating them to full-blown incidents.

Once the triage process elevates an incident, team members know it deserves their full attention. As a result, they don't have to worry about wasting precious time rushing to resolve low-impact events.

Incident triage improves incident response by organizing the tremendous amount of investigation responders have to do once an incident is declared.

In the end, triage protects organizations from wasting limited resources in the wrong pursuit. It funnels higher-tiered responders to the appropriate incidents and resolves the rest.

How triaging works in the incident response process

Triage sits in the preliminary stages of an incident response plan. In short, it puts human eyes on every event before it gets funneled through the later stages of incident response.

Responders sort through an endless mass of alerts from an observability tool such as Datadog and divide suspicious events into two categories: false positives and possible incidents. They then decide whether to solve the issue on the spot or tie it to other events and elevate its priority.
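To make that sorting step a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Alert` fields, the `known_noise` flag, and the choice to tie related alerts together by service are hypothetical and don't come from Datadog or any other tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical fields, chosen only for this example.
    id: str
    service: str
    known_noise: bool  # a responder has already judged this harmless


def sort_alerts(alerts):
    """Split alerts into false positives and possible incidents.

    Possible incidents are grouped by service so that related
    events can be tied together and reviewed as one.
    """
    false_positives = []
    possible_incidents = defaultdict(list)
    for alert in alerts:
        if alert.known_noise:
            false_positives.append(alert)
        else:
            possible_incidents[alert.service].append(alert)
    return false_positives, dict(possible_incidents)


alerts = [
    Alert("a1", "checkout", known_noise=False),
    Alert("a2", "checkout", known_noise=False),
    Alert("a3", "search", known_noise=True),
]
false_positives, possible_incidents = sort_alerts(alerts)
```

In this toy run, the two checkout alerts end up grouped as one possible incident, while the search alert is set aside as a false positive.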

Responders document their findings and actions throughout the triage process, and again when it ends.

Incident triage steps can vary depending on several factors, such as an organization's priorities. Whatever the approach, though, most organizations adhere to the following workflow.

Step #1: Detecting a potential incident

The first stage of the incident management triage process is detecting and declaring potential issues. This typically happens through an observability tool, as I mentioned above. Next, responders collect data about each event to figure out whether it’s significant.

Step #2: Starting an investigation

The incident responder then uses the collected data to trace and find the root cause of the incident. If found, the responder quickly assesses the event's scope and potential impact.

At this stage, responder teams begin work to move the event to the proper category and assign the appropriate incident type. They then examine clues and try to determine the nature of the event. Moving to the next step requires knowing the origin and size of the event. They also look for other systems vulnerable to the same issue.

The team also looks for similarities to incidents they may have dealt with in the past. The team can then choose to escalate the event to incident status or call it a false positive. Unless a responder can classify the strange event as explainable and harmless, the event becomes an incident.
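That default-to-escalate rule is simple enough to write down. Here's a small Python sketch of it, assuming a responder records two hypothetical yes/no findings, `explainable` and `harmless`, at the end of the investigation. The names are made up for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    FALSE_POSITIVE = "false positive"
    INCIDENT = "incident"


@dataclass
class Investigation:
    # Hypothetical fields a responder fills in while investigating.
    explainable: bool
    harmless: bool


def decide(investigation: Investigation) -> Verdict:
    # Default to escalation: only an event that is both explained
    # and harmless is dismissed as a false positive.
    if investigation.explainable and investigation.harmless:
        return Verdict.FALSE_POSITIVE
    return Verdict.INCIDENT
```

The point of writing it this way is that the burden of proof sits on dismissal. An event that is explained but harmful, or harmless-looking but unexplained, still becomes an incident.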

For us, anything that impacts customers in some way always becomes an incident. But it typically gets triaged first if the incident is being declared by a non-engineer.

This is because these folks typically don’t have the technical context behind an incident, so it’s important to triage it to ensure that it’s not just a minor bug.

Step #3: Determining priority

In this stage, response teams perform a complete business impact assessment. First, they estimate the potential harm to the organization's stakeholders from the event.

Then, they calculate the impact so far. Major incidents posing the most severe consequences garner the most attention. The triage team puts them at the front of the line for resolution. This is also the stage where, if it’s a true incident, severity and priority levels will be assigned.
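As a rough sketch of how this ordering could work, the Python below scores each event and sorts the most severe to the front of the line. The 1 to 5 scores, the `Assessment` fields, and the severity thresholds are all assumptions for illustration. Your own severity and priority levels will look different.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    # Hypothetical 1-5 scores from the business impact assessment.
    name: str
    potential_harm: int
    impact_so_far: int


def worst_case(a: Assessment) -> int:
    # Rank an event by whichever is higher: what it could do
    # or what it has already done.
    return max(a.potential_harm, a.impact_so_far)


def severity(a: Assessment) -> str:
    # Illustrative thresholds only; tune these to your own levels.
    score = worst_case(a)
    if score >= 4:
        return "major"
    if score >= 2:
        return "minor"
    return "trivial"


def resolution_order(assessments):
    # Most severe first; ties keep their original order.
    return sorted(assessments, key=worst_case, reverse=True)


queue = resolution_order([
    Assessment("slow dashboard", potential_harm=2, impact_so_far=1),
    Assessment("checkout page broken", potential_harm=5, impact_so_far=4),
    Assessment("typo in footer", potential_harm=1, impact_so_far=1),
])
```

With these made-up scores, the broken checkout page lands at the front of the queue as a major incident, followed by the slow dashboard and then the footer typo.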

Step #4: Resolving the incident

The triage team resolves false positives and other harmless or straightforward issues. But when appropriate, the triage team escalates the incident by assigning it to the proper incident response team.

As always, triage teams should consider customer impact and reputation control when deciding the resolution response.

Either way, the triage team has now finished with the incident. They have either resolved it or assigned it to the right team with the appropriate priority level.

The last part of the triage process is documentation: summarizing each activity related to the event. This essential final step ensures responders can detect future incidents more quickly and with less effort.

Don’t skip over documentation!

With so much going on, documentation may be pushed to the back burner. However, this may lead you to chase the same false positives and regularly occurring events.

As a company, we’ve written quite a bit about how we’ve built a culture of writing, and we encourage other businesses to do the same. With a heavy emphasis on writing across your organization, documentation will be better and clearer, and your response process will be better for it.

How to triage your incidents with incident.io

We’ve highlighted the importance of adding a triage step to your incident response to better filter out issues that aren’t, well, incidents.

By adding this small but important step, you give your response teams more time back to focus on incidents of greater significance and potential impact.

With incident.io, triaging incidents is easy. Triaged incidents aren’t considered “live” and will require a few extra steps before they’re escalated or closed out.

An incident channel will still be created, but instead of working towards a resolution like you would for a typical incident, you would use the space to figure out if it’s actually an incident. Once you’ve decided how to move forward, you can either:

  • Accept the incident and begin your typical response process
  • Merge the incident if the incident is a duplicate of another one
  • Decline the incident if you conclude that the triage incident is not actually a problem

Triaged incidents can still be escalated and assigned roles and Workflows, and in many ways they will look like your typical incident resolution environment. But if you choose to decline or merge a triage incident, we’ll:

  • Condense the announcement post so it doesn't take up room
  • Remove the incident from your Insights statistics
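If it helps to picture the three outcomes and their side effects, here is a toy Python model of them. To be clear, this is not the incident.io API: the `Incident` class, its fields, and the function names are all hypothetical, and the sketch only mirrors the behavior described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    TRIAGE = "triage"
    LIVE = "live"
    MERGED = "merged"
    DECLINED = "declined"


@dataclass
class Incident:
    # Hypothetical model; field names are invented for this example.
    name: str
    status: Status = Status.TRIAGE
    merged_into: Optional[str] = None
    counts_in_insights: bool = True
    announcement_condensed: bool = False


def _require_triage(incident: Incident) -> None:
    # Only an incident still in triage can be accepted, merged, or declined.
    if incident.status is not Status.TRIAGE:
        raise ValueError(f"{incident.name} is not in triage")


def accept(incident: Incident) -> None:
    # Accepting starts the typical response process.
    _require_triage(incident)
    incident.status = Status.LIVE


def _dismiss(incident: Incident, status: Status) -> None:
    # Declined and merged incidents get a condensed announcement
    # and are left out of the statistics.
    _require_triage(incident)
    incident.status = status
    incident.counts_in_insights = False
    incident.announcement_condensed = True


def merge(incident: Incident, duplicate_of: str) -> None:
    _dismiss(incident, Status.MERGED)
    incident.merged_into = duplicate_of


def decline(incident: Incident) -> None:
    _dismiss(incident, Status.DECLINED)
```

For example, calling `decline(Incident("odd latency blip"))` leaves the incident with a `DECLINED` status, a condensed announcement, and no effect on the statistics, while `accept` simply moves it to `LIVE`.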

In the end, this helps lower the bar for creating incident channels but still lets you bring the right people into the room to diagnose the issue.

You can learn about our incident triage feature here or sign up for a demo to see the magic in action.

Luis Gonzalez
Content Marketing Manager
