Through automations, incident.io has enabled Giant Swarm to standardise their incident management process, minimising the stress of running incidents.
Giant Swarm provides cloud-native infrastructure, offering a fully managed, open source Kubernetes platform to enterprise companies and helping them to implement digital transformation.
Giant Swarm has a team of 80 employees, working fully remotely.
As a fully remote team, Giant Swarm has always considered a structured, formalised approach to managing incidents very important. Before incident.io, the team used OpsGenie as their alerting tool and then turned to Slack, using a single “incidents” channel to kick off and run the response process. However, the approach was still fairly ad hoc, and it relied on the small number of individuals who were familiar with the process being available to guide the response.
The major challenge that Giant Swarm wanted to solve was maintaining a consistent approach to incident response as the team grew.
We found when we were bringing new on-call engineers in there was an awkward part with onboarding and upskilling, particularly with a distributed team. Telling people: "You have to read the P1 incident process, learn it and be able to do it at 3:00 AM when you've just been woken up" isn’t great.
Incidents are inevitably high-pressure situations, and the team found that a lack of clarity in their existing process made decision-making stressful, particularly for new joiners.
Automations and workflows have enabled Giant Swarm to standardise their incident management process.
This has provided multiple benefits to the team:
We’re now able to encode the process itself, which really helps new on-callers. We are approaching it in terms of making our on-callers' lives easier and reducing that stress, while allowing us to make changes to our process without having to retrain everyone. I’d estimate that we save at least 3 hours onboarding each new on-caller, and expect that time saving to grow as we further incorporate incident.io into our processes.
incident.io has helped Giant Swarm to get Customer Success representatives more involved with incidents. This has been useful in two ways:
Our time to generate RCAs for customers has gone from over an hour to several minutes, thanks to incident.io.
Giant Swarm has found that incident.io has improved insights into incidents by making it easier and faster to complete post-incident reviews and follow-ups.
Custom fields mean we can generate insights on incidents within minutes, instead of wrangling spreadsheets for hours. We’re also much better at capturing follow-ups after an incident: before, we probably captured 40%, versus 90% now that we’re using incident.io.