For Craig Kinloch-Melia and his team at Bold Commerce, incident.io is arguably the most flexible tool in their tech stack.
Not only is it their go-to for incident response, but it’s also leaned on for a wide spectrum of other work: product launches, security issues, Black Friday preparedness, maintenance events and more.
For them, treating incident management tools like incident.io as strictly “break glass in case of emergency” is a missed opportunity.
At Bold Commerce, incident management isn’t just reactive work for common and critical errors. It’s an opportunity to funnel tasks and fast-moving projects into a well-defined process that allows for better collaboration, learning, and outcomes across the board. incident.io powers this every step of the way and enables proactive work that helps prevent incidents entirely.
On board since the early days of incident.io
As Head of Technology at Bold Commerce, Craig oversees some of the company’s biggest and most impactful technical projects. Because Bold is the go-to checkout solution for companies around the world, Craig is one of the leaders responsible for making sure the product is ready and available at all times.
Given the scope of Bold Commerce’s product, there’s a lot at stake, and incident response plays a big part in keeping things running smoothly.
As an early adopter of incident.io, Craig understood the importance of having a robust incident management and response process in place.
“We've actually been incident.io customers before it even existed. We used the open-source version of it for a long time,” says Craig in reference to the early version of incident.io built by CPO Chris Evans when he was at Monzo Bank.
“At first, we only used the open-source version for incidents. It was a great way of coordinating responses. But just before Black Friday, our director of software said, ‘Let's try to use this for Black Friday and Cyber Monday.’ We tried it, and it was like, ‘Yeah, this is so much better than the old way we were managing things.’”
What started off as an experiment would very quickly balloon to cover a wide variety of use cases.
Workflows, improved
Before adopting this early version of incident.io, Craig and his team coordinated responses through Slack, often across scattered, ad hoc channels.
“We'd have people randomly Slack each other if there was an issue. We had a 911 channel that people would post into, and then that message would trigger an alert to the operations team. They’d then have to set up an incident channel even if it wasn’t an operational incident,” says Craig.
Once incident.io was further along in its development roadmap, one feature in particular piqued Craig's interest and helped everything else fall into place: Workflows.
“For us, with Workflows, it was like, ‘Okay, now we're getting more control. Now, we're getting nudges that can be customized. We now have different types of incidents we can develop.’ And as we looked at the development roadmap more broadly, we thought, ‘There could be other ways to utilize incident.io here,’” says Craig.
“And so all of those things added to the fact that other workflows can go through this. Just because it's called incident.io doesn't mean it's just for incidents. It's actually a workflow tool for us at this point.”
It may be unconventional, but it really works
Today, Craig and his team at Bold Commerce use incident.io in a handful of creative ways that go beyond core incident response.
For them, it’s simple: anything that needs a robust, tried-and-true process where folks need to communicate and work on something collaboratively is a prime candidate for incident.io.
Thankfully, a lot of very important work falls into this category. But none of it would be possible without a solution that is intuitive, seamless, and easy to use.
Security incidents
For a platform like Bold Commerce, security incidents need to be nipped in the bud with extraordinary precision and speed. So when incidents like these strike, Craig and his team turn to incident.io to help them through.
“If there's anything that’s a data issue, we have a default private incident that's generated, and then we use that channel to record everything about the incident. Sometimes everything in that private incident channel goes into a legal document as well,” says Craig.
“incident.io is doing all of the incident recording for us. I think some of the new AI features are actually putting in a lot of that information we know is important rather than us having to pull it all from Slack and put it into a document. We have a timeline that we can pull and be able to put that in along with the documentation summaries that we do.”
Tiger teams
Sometimes, at Bold Commerce, something that isn’t an incident still needs a fast response. For moments like these, a tiger team comes together, and an incident channel is created to manage the request.
“It's not necessarily a break in something, but maybe a feature request has come in that needs a whole bunch of people to get together quickly. When it’s something that needs a fast response, we'll create an incident channel where people are coming in, brainstorming, and quickly coming up with the solution and getting it out there,” says Craig.
Black Friday & Cyber Monday
Two of the most stressful times of the year for e-commerce platforms like Bold Commerce are Black Friday and Cyber Monday. During these busy days, everything needs to go according to plan. Any significant disruption can mean millions of dollars in lost revenue.
Needless to say, all hands are on deck to deal with incidents quickly, and C-suite folks are especially tuned in to ensure that anything that does come up is moving toward resolution.
“For us, this is where incident.io comes into its own. We have a channel where people go for Black Friday. This means that my co-founder, Eric, and I don't have to sit around and look at a channel all day saying, ‘OK, has an incident come in?’” says Craig.
“We actually have set up a Workflow to say, if you're working on Black Friday and create an incident, just ping it straight over to PagerDuty and just page both of us.”
The result of this automation is peace of mind: if something goes wrong, they’ll know about it right away.
Product launches
For engineering and product teams, pushing a new feature live is a huge moment. But so much can go wrong on launch day that it’s important to nail it from start to finish and not get lost in the excitement.
“If we’re launching something big, we create an incident. It’s a single place of contact while we’re going live to be able to coordinate actions rather than just creating a one-off Slack channel. But the use case isn’t just about creating a channel for us,” says Craig.
“It’s about actually allowing us to be able to record and document a lot of the things that are going on as well to be able to put it into a post-mortem document. We also create follow-ups as well. So once we’re done with the launch, we can go into incident.io, look up a list of follow-ups, and say, ‘This one didn’t get completed. Is that deliberate?’ And just make sure we close the loop on everything.”
Planned maintenance
Finally, while planned maintenance is a regular occurrence, it can be quite disruptive. For folks who aren’t looped in, it can be hard to tell whether an issue they’re seeing is related to ongoing maintenance. To get ahead of this, Craig and his team create maintenance incidents to keep everyone informed about what’s going on.
“Normally, with planned maintenance, there's documentation ahead of time, but there’s a maintenance window that we've created. We create an incident for it that lets everyone know that this is what's happening,” says Craig.
“So then if anything weird happens, teams can look and say, ‘Yeah, there's maintenance going on.’ I can jump into that and then quickly see if there's anything related to this, and we can correlate very quickly any fallout from maintenance. It also keeps people informed along the way.”