How incident.io helped Giant Swarm take the stress out of managing incidents

Through automations, incident.io has enabled Giant Swarm to standardise their incident management process, minimising the stress of running incidents.

Key Benefits

  • Standardising the process through automation
  • Improved working with Customer Success
  • Better post-incident insights, more quickly

Now we can generate insights on incidents within minutes, instead of wrangling spreadsheets for hours.
Joe
VP Engineering

Giant Swarm provides cloud-native infrastructure, offering a fully managed, open source Kubernetes platform to enterprise companies and helping them to implement digital transformation.

Giant Swarm has a team of 80 employees, working fully remotely.

The challenge

As a fully remote team, Giant Swarm has always placed great importance on a structured, formalised approach to managing incidents. Before incident.io, the team used OpsGenie as their alerting tool and then turned to Slack, using a single “incidents” channel to kick off and run the response process. However, the approach was still fairly ad hoc, and it relied on the small number of people who were familiar with the process being available to guide the response.

The major challenge that Giant Swarm wanted to solve was maintaining a consistent approach to incident response as the team grew.

We found when we were bringing new on-call engineers in there was an awkward part with onboarding and upskilling, particularly with a distributed team. Telling people: "You have to read the P1 incident process, learn it and be able to do it at 3:00 AM when you've just been woken up" isn’t great.

Incidents are inevitably high pressure situations, and the team found that a lack of clarity in their existing process made decision making stressful, particularly for new joiners.

What were they looking for in an incident management tool?
  • Help to alleviate the pressure and stress from the team during incidents
  • Automations that could help to standardise incidents

The solution

Standardising the process through automation

Automations and workflows have enabled Giant Swarm to standardise their incident management process.

This has provided multiple benefits to the team:

  • Bringing consistency to incidents, making it easier to get a clear overall picture.
  • Reducing the cognitive load on the team during an incident, minimising stress.
  • Creating a scalable model, making it much easier for new joiners as the company grows.

We’re now able to encode the process itself, which really helps new on-callers. We are approaching it in terms of making our on-callers' lives easier and reducing that stress, and it allows us to make changes to our process without having to retrain everyone. I’d estimate that we save at least 3 hours onboarding each new on-caller, and expect that time saving to grow as we further incorporate incident.io into our processes.

Improved working with Customer Success

incident.io has helped Giant Swarm to get Customer Success representatives more involved with incidents. This has been useful in two ways:

  • The Customer Success team can now quickly and easily understand what is happening in an incident without interrupting the engineers working to resolve the issue. This allows them to deal directly with customers in the heat of an incident.
  • Customer Success have found the post-mortem generation particularly beneficial: once an incident has been closed, it quickly gives them the information they need to go back to customers and explain what went wrong and why.

Our time to generate RCAs for customers has gone from over an hour to several minutes, thanks to incident.io.

Better post-incident insights, more quickly

Giant Swarm has found that incident.io improves insight into incidents by making post-incident review and follow-up easier and faster to complete.

Custom fields mean we can generate insights on incidents within minutes, instead of wrangling spreadsheets for hours. We’re also much better at capturing follow-ups after an incident: before, we probably captured 40%, versus 90% now that we’re using incident.io.

About the interviewee

Joe is VP Engineering at Giant Swarm, aiming to drive engineering operational excellence across the company.

Industry: Cloud computing
Customer since: March 2021
Company size: 51-100