At incident.io, our number one priority in engineering is pace. The faster we can build great product, the more feedback we can get and the more value we can deliver for our customers.
But pace is a funny thing. If you optimise for pace over a single month, you’ll quickly find yourself slowed down by the weight of your past mistakes. Similarly, if you prioritise pace at the expense of quality, your customer support burden will sharply increase and you’ll be delivering less value than before.
To keep pace as a team, leverage is critical. Finding leverage allows you to ratchet up the pace without having to pay the cost back in either technical or product debt.
We’ve found many ways to build leverage, but one of the most impactful has been an explicit focus on developer experience. These are the foundations on which we’ve built our team, and we believe they will continue to support us for years to come.
Here are some of the things we’ve done to improve our developer experience, and help our team ship great product at pace:
Good developer experience is all about shortening feedback loops, so you can quickly see the impact of your changes. Having to wait to find out if something you’ve written actually works is incredibly frustrating, and it forces you to break focus. To build momentum when shipping code, you want to build, test and iterate quickly, over and over again.
Some of the things we’ve done to shorten our iteration loops:
toolbox, which allows us to write code that can be run from the CLI, either locally or in staging/production. In production, this is useful for running backfills or other code that we don’t want to expose over an API. When developing, it’s great to be able to run a command to (for example) emit a particular event, or run a specific Slack command. As an example, to run a specific nudge like 'do you want to take a break':

    # --incident and --organisation choose the target, --name chooses the nudge,
    # and --ignore-applies runs the nudge even if it isn't applicable.
    go run cmd/toolbox/main.go run-single-nudge \
      --incident 123 \
      --organisation 456 \
      --name take-a-break \
      --ignore-applies
We use strongly typed languages in both our backend (Go) and frontend (TypeScript). We try to write code in such a way that the compiler can help us out wherever possible. That means that while we’re writing code, our editor can tell us what we’ve done wrong, and many bugs are caught at compile time instead of being shipped into production.
One example is assertUnreachable, which looks something like:

    switch (status) {
      case ActionStatusEnum.Outstanding:
        return "Open";
      case ActionStatusEnum.Completed:
        return "Completed";
      case ActionStatusEnum.NotDoing:
        return "Not doing";
      default:
        assertUnreachable(status);
    }
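The post doesn’t show how assertUnreachable is defined, so here is a minimal sketch of one common way to write such a helper. The const-object version of ActionStatusEnum, its string values, and the statusLabel wrapper are assumptions made to keep the example self-contained.

```typescript
// Hypothetical stand-in for the ActionStatusEnum used in the switch above.
const ActionStatusEnum = {
  Outstanding: "outstanding",
  Completed: "completed",
  NotDoing: "not_doing",
} as const;
type ActionStatusEnum = (typeof ActionStatusEnum)[keyof typeof ActionStatusEnum];

// The parameter type is `never`, so this call only type-checks once every
// case has been handled. Adding a new status without a matching case turns
// into a compile error at the call site.
function assertUnreachable(value: never): never {
  throw new Error(`Unexpected value: ${JSON.stringify(value)}`);
}

// Hypothetical wrapper so the switch has somewhere to live.
function statusLabel(status: ActionStatusEnum): string {
  switch (status) {
    case ActionStatusEnum.Outstanding:
      return "Open";
    case ActionStatusEnum.Completed:
      return "Completed";
    case ActionStatusEnum.NotDoing:
      return "Not doing";
    default:
      return assertUnreachable(status);
  }
}
```

The runtime throw is a backstop for values that slip past the type system, for example an API response containing a status the frontend doesn’t yet know about.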
We rely on a few key well-tested abstractions (such as the engine), which we use as building blocks when building other features. It’s easy to over-abstract, or build abstractions too early that end up hindering instead of helping. We try to wait a little longer than is comfortable before building the abstraction, to help us get it right and have confidence it’ll be useful.
Some key examples are:
useFetchData and useMutation, which make requests to our API from the frontend, parse errors and handle loading states. For example, calling onSubmit from the code below will make the API call to create a new incident, parse the errors and (if possible) use setError (provided by react-hook-form) to provide the user with a contextual error. If it can’t parse the error (e.g. a 500), it’ll return genericError, which we can also render for the user. While it’s making the API call, saving will be true, so we can display a spinner instead of the submit button.

    const [onSubmit, { saving, genericError }] = useMutation(
      async (formData: CreateIncidentRequestBody) => {
        const { incident } = await apiClient.incidentsCreate({
          createRequestBody: formData,
        });
        return incident;
      },
      {
        onSuccess: (incident) => {
          onComplete(incident);
        },
        setError,
      }
    );
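The error-routing step described above could be sketched roughly as follows. This is not the real useMutation implementation: the error shape (a 422 status with a fieldErrors array), the routeApiError name and the generic message are all assumptions for illustration, and the React state handling for saving is left out.

```typescript
// Hypothetical error shape; the post does not show the real API error format.
type FieldError = { field: string; message: string };
type ValidationErrorBody = { status: 422; fieldErrors: FieldError[] };

// Loosely follows the call shape of react-hook-form's setError(name, { message }).
type SetError = (field: string, error: { message: string }) => void;

// Type guard: is this an error we know how to attach to individual fields?
function isValidationError(err: unknown): err is ValidationErrorBody {
  if (typeof err !== "object" || err === null) return false;
  const candidate = err as { status?: unknown; fieldErrors?: unknown };
  return candidate.status === 422 && Array.isArray(candidate.fieldErrors);
}

// Returns undefined when every error was attached to a form field, or a
// generic message when the error could not be parsed (e.g. a 500).
function routeApiError(err: unknown, setError: SetError): string | undefined {
  if (isValidationError(err)) {
    for (const { field, message } of err.fieldErrors) {
      setError(field, { message });
    }
    return undefined;
  }
  return "Something went wrong. Please try again.";
}
```

Keeping this logic in one shared place means every form gets the same behaviour: contextual errors where possible, and a generic fallback otherwise.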
As much as we’d love to spend all our time building things, a lot of engineering is about tracking down (and squashing) bugs. So making this fast is just as important as making it quick to build new features; maybe more so. One way we do this is with structured log lines that link to the relevant trace, for example:
    {
      // basic information about the source of this logline
      "level": "info",
      "ts": 1664214156.252743,
      "caller": "mw/observe.go:269",
      "role": "server",
      "component": "web",
      "event": "http_request",
      "msg": "📡 Goa handled HTTP request",
      // enabling us to link the logline to the relevant trace and span
      "span_id": "6d4cbc6d22dbe528",
      "trace_id": "911a198797eb4b5c25bc02e288fb4293",
      "trace_url": "https://console.cloud.google.com/traces/list?project=incident-io&tid=911a198797eb4b5c25bc02e288fb4293",
      // information about this specific request
      "duration": 0.025155625,
      "endpoint_service": "Webhooks",
      "endpoint_method": "GitHub",
      "http_status": 204,
      "http_duration": 0.025155625,
      "http_method": "POST",
      "http_path": "/webhooks/github",
      "http_bytes": 0,
      "ip": "127.0.0.1:60875",
      "request_id": "MeuYlpMI",
      "user_agent": "GitHub-Hookshot/3116550"
    }
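The trace-related fields in a log line like the one above could be derived from a trace ID along these lines. The field names and URL format are taken from the example; the traceFields helper itself is hypothetical and written in TypeScript purely for illustration (the backend described here is Go).

```typescript
// Field names mirror the example log line; the helper is a hypothetical sketch.
type TraceFields = {
  span_id: string;
  trace_id: string;
  trace_url: string;
};

// Builds the fields that let someone jump from a log line straight to its trace.
function traceFields(project: string, traceId: string, spanId: string): TraceFields {
  const query = `project=${encodeURIComponent(project)}&tid=${encodeURIComponent(traceId)}`;
  return {
    span_id: spanId,
    trace_id: traceId,
    trace_url: `https://console.cloud.google.com/traces/list?${query}`,
  };
}

// Example using the IDs from the log line above.
const fields = traceFields(
  "incident-io",
  "911a198797eb4b5c25bc02e288fb4293",
  "6d4cbc6d22dbe528",
);
```

Attaching a ready-made URL to every log line means nobody has to assemble it by hand while debugging.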
That’s a bit of a whirlwind tour of some of the things that accelerate us here at incident.io. Over time, we hope to write up (and maybe even open source) more and more of them.
We’ve got a laundry list of other improvements we’d love to make, but we’re also pretty happy with where we’ve got to so far. We have to make tough choices about investing in this kind of work vs. shipping features; this stuff is only worthwhile if we go ahead and use it to ship great product. But we firmly believe that investing in the right areas of developer experience is going to be key to our success as an engineering team.