Building a great developer experience at a startup

At incident.io, our number one priority in engineering is pace. The faster we can build great product, the more feedback we can get and the more value we can deliver for our customers.

But pace is a funny thing. If you optimise for pace over a single month, you’ll quickly find yourself slowed down by the weight of your past mistakes. Similarly, if you prioritise pace at the expense of quality, your customer support burden will sharply increase and you’ll be delivering less value than before.

To keep pace as a team, leverage is critical. Finding leverage allows you to ratchet up the pace without having to pay the cost back in either technical or product debt.

We’ve found many ways to build leverage, but one of the most impactful has been an explicit focus on developer experience. It is the foundation on which we’ve built our team, and we believe it will continue to support us for years to come.

Here are some of the things we’ve done to improve our developer experience, and help our team ship great product at pace:

🔁 Short feedback loops

Good developer experience is all about shortening feedback loops, so you can quickly see the impact of your changes. Having to wait to find out if something you’ve written actually works is incredibly frustrating, and it forces you to break focus. To build momentum when shipping code, you want to build, test and iterate quickly, over and over again.

Some of the things we’ve done to shorten our iteration loops:

  • Having a great hot-reloading setup across back and front-end, so that changes are almost instantly reflected in your development environment.
  • Keeping CI speedy so you know quickly if your PR passes all the tests & linting rules (ideally below 3 minutes; it has recently crept up past that but we’re keen to get it back down again!)
  • Keeping our reviews as quick as is reasonable: we try to get PRs reviewed within an hour or two, and prioritise reviewing other folks’ code over almost anything else.
  • Being able to register a Slack preview which uses Slack’s Block Kit Builder to show you what your message or modal would look like, without having to run the entire codepath.
  • A package called toolbox which allows us to write code that can be run from the CLI, either locally or in staging/production. In production, this is useful for running backfills or other code that we don’t want to expose over an API. When developing, it’s great to be able to run a command to (for example) emit a particular event, or run a specific Slack command. As an example, to run a specific nudge like 'do you want to take a break':
# --incident:        which incident should it target
# --organisation:    which organisation should it target
# --name:            which nudge should we run
# --ignore-applies:  don't check if the nudge is applicable, run it anyway
go run cmd/toolbox/main.go run-single-nudge \
  --incident 123 \
  --organisation 456 \
  --name take-a-break \
  --ignore-applies

💻 Making computers do what they do best

We use strongly-typed languages in both our backend (Go) and frontend (TypeScript). We try to write code in such a way that the compiler can help us out wherever possible. That means our editor can tell us what we’ve done wrong while we’re still writing the code, and it helps us avoid shipping bugs into production.

  • We’ve written a small library to give us a type-safe way of interacting with the database.
  • We use Goa as a web framework to generate both our OpenAPI spec and our types. That means we have type safety across the back and frontend (by auto-generating a TypeScript client from our OpenAPI spec) and can’t accidentally send incorrectly formed API requests. It also auto-generates docs we can use to help navigate our API!
  • We use enums extensively in our internal API spec, allowing us to lean on TypeScript to alert us if we haven’t handled a particular enum value. We use a pattern called assertUnreachable, which looks something like:
switch (status) {
  case ActionStatusEnum.Outstanding:
    return "Open";
  case ActionStatusEnum.Completed:
    return "Completed";
  case ActionStatusEnum.NotDoing:
    return "Not doing";
  default:
    // only type-checks if every enum value is handled above
    assertUnreachable(status);
}
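
The definition of assertUnreachable isn’t shown above. Here is a minimal sketch of how such a helper is commonly written, assuming it takes a parameter of type never. The ActionStatusEnum values and the statusLabel wrapper are hypothetical stand-ins for the generated client types, and a const object is used in place of a generated enum to keep the example self-contained:

```typescript
// Hypothetical stand-in for the generated enum; the string values are assumptions.
const ActionStatusEnum = {
  Outstanding: "outstanding",
  Completed: "completed",
  NotDoing: "not_doing",
} as const;
type ActionStatusEnum = (typeof ActionStatusEnum)[keyof typeof ActionStatusEnum];

// Because the parameter is `never`, a call only type-checks when the compiler
// has ruled out every possible value. At runtime it throws as a last resort.
function assertUnreachable(value: never): never {
  throw new Error(`Unhandled value: ${JSON.stringify(value)}`);
}

function statusLabel(status: ActionStatusEnum): string {
  switch (status) {
    case ActionStatusEnum.Outstanding:
      return "Open";
    case ActionStatusEnum.Completed:
      return "Completed";
    case ActionStatusEnum.NotDoing:
      return "Not doing";
    default:
      // `status` has been narrowed to `never` here.
      return assertUnreachable(status);
  }
}

console.log(statusLabel(ActionStatusEnum.Completed));
```

If a new value were added to ActionStatusEnum without a matching case, status would no longer narrow to never in the default branch, and the call to assertUnreachable would fail to compile.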

🧱 Building blocks that you can rely on

We rely on a few key well-tested abstractions (such as the engine), which we use as building blocks when building other features. It’s easy to over-abstract, or build abstractions too early that end up hindering instead of helping. We try to wait a little longer than is comfortable before building the abstraction, to help us get it right and have confidence it’ll be useful.

Some key examples are:

  • Engine conditions, a way of describing a set of incidents which match a particular set of conditions (e.g. where severity ≥ major) which powers features like workflows, policies and incident types. We’ve also got front-end components we can re-use to enable users to configure conditions in lots of different places.
  • Actors, a consistent way of describing who has done something in our system (was it a user? a workflow? an API key?).
  • A framework for building Slack modals, allowing us to put modals together quickly by handling the translation layers from easily-readable structures to the structures required by the Slack API.
  • Two hooks called useFetchData and useMutation, which make requests to our API from the frontend, parse errors and handle loading states. For example, calling onSubmit from the code below will make the API call to create a new incident, parse the errors and (if possible) use setError (provided by react-hook-form) to provide the user with a contextual error. If it can’t parse the error (e.g. a 500), it’ll return genericError which we can also render for the user. While it’s making the API call, saving will be true, so we can display a spinner instead of the submit button.
const [onSubmit, { saving, genericError }] = useMutation(
  async (formData: CreateIncidentRequestBody) => {
    const { incident } = await apiClient.incidentsCreate({
      createRequestBody: formData,
    });
    return incident;
  },
  {
    onSuccess: (incident) => {
      onComplete(incident);
    },
    setError,
  }
);
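
The internals of useMutation aren’t shown here. As a rough sketch of the error routing it describes, with React state left out, the flow might look something like the following. runMutation, the fieldErrors error shape, the onSavingChange callback and the generic error message are all assumptions for illustration, not the real implementation:

```typescript
// Hypothetical shape for an API error that can be mapped onto form fields.
type FieldErrors = Record<string, string>;

function hasFieldErrors(err: unknown): err is { fieldErrors: FieldErrors } {
  return typeof err === "object" && err !== null && "fieldErrors" in err;
}

type MutationOptions<TResult> = {
  onSuccess: (result: TResult) => void;
  // Plays the role of react-hook-form's setError: attach a message to a field.
  setError: (field: string, error: { message: string }) => void;
  // Stand-in for the hook's `saving` state.
  onSavingChange?: (saving: boolean) => void;
};

async function runMutation<TInput, TResult>(
  mutate: (input: TInput) => Promise<TResult>,
  input: TInput,
  options: MutationOptions<TResult>,
): Promise<{ genericError: string | null }> {
  options.onSavingChange?.(true);
  try {
    const result = await mutate(input);
    options.onSuccess(result);
    return { genericError: null };
  } catch (err) {
    if (hasFieldErrors(err)) {
      // Contextual errors: surface each message next to the relevant field.
      for (const [field, message] of Object.entries(err.fieldErrors)) {
        options.setError(field, { message });
      }
      return { genericError: null };
    }
    // Anything that can't be parsed (e.g. a 500) becomes a generic error.
    return { genericError: "Something went wrong" };
  } finally {
    options.onSavingChange?.(false);
  }
}
```

Keeping this routing in one place means each form only has to supply the API call and an onSuccess handler, which is the leverage the hook is described as providing.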

🕵️ Making bug-finding easy

As much as we’d love to spend all our time building things, a lot of engineering is about tracking down (and squashing) bugs. So making this fast is just as important as making it quick to build new features; maybe more so. Some of the ways we do this:

  • Everyone has their own realistic development environment, using real GCP infrastructure instead of emulators, to help reproduce issues locally and get us all familiar with our infrastructure.
  • We lean heavily on traces, both in production and in our dev environments, to help us understand the context of errors.
  • We push sourcemaps to Sentry (our bug tracker) for both frontend and backend, so that you can jump into the code easily from the error.
  • We push lots of structured data into our loglines and errors, so that you can gather the information you need quickly. For example, all our errors have both the organisation_id and organisation_name on them, so you can quickly see who’s impacted (is it one organisation? many organisations?). You can also navigate directly from an error in Sentry to the relevant trace. An example logline:
{
  // basic information about the source of this logline
  "level": "info",
  "ts": 1664214156.252743,
  "caller": "mw/observe.go:269",
  "role": "server",
  "component": "web",
  "event": "http_request",
  "msg": "📡 Goa handled HTTP request",

  // enabling us to link the logline to the relevant trace and span
  "span_id": "6d4cbc6d22dbe528",
  "trace_id": "911a198797eb4b5c25bc02e288fb4293",
  "trace_url": "https://console.cloud.google.com/traces/list?project=incident-io&tid=911a198797eb4b5c25bc02e288fb4293",

  // information about this specific request
  "duration": 0.025155625,
  "endpoint_service": "Webhooks",
  "endpoint_method": "GitHub",
  "http_status": 204,
  "http_duration": 0.025155625,
  "http_method": "POST",
  "http_path": "/webhooks/github",
  "http_bytes": 0,
  "ip": "127.0.0.1:60875",
  "request_id": "MeuYlpMI",
  "user_agent": "GitHub-Hookshot/3116550"
}
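
One way to keep trace fields like these consistent is to build every logline through a single helper. The backend in this post is Go, so the TypeScript below only illustrates the idea. buildLogline and its parameters are hypothetical, and the trace_url format is copied from the example above:

```typescript
type TraceContext = { traceId: string; spanId: string };

type Logline = Record<string, string | number>;

// Builds a structured logline that always carries the identifiers needed to
// jump from the log entry to its trace.
function buildLogline(
  projectId: string,
  trace: TraceContext,
  event: string,
  msg: string,
  fields: Record<string, string | number> = {},
): Logline {
  const traceUrl =
    "https://console.cloud.google.com/traces/list" +
    `?project=${encodeURIComponent(projectId)}&tid=${encodeURIComponent(trace.traceId)}`;
  return {
    // Request-specific fields go first so they can't overwrite the core fields.
    ...fields,
    level: "info",
    ts: Date.now() / 1000,
    event,
    msg,
    span_id: trace.spanId,
    trace_id: trace.traceId,
    trace_url: traceUrl,
  };
}

const line = buildLogline(
  "incident-io",
  { traceId: "911a198797eb4b5c25bc02e288fb4293", spanId: "6d4cbc6d22dbe528" },
  "http_request",
  "Goa handled HTTP request",
  { http_status: 204, http_path: "/webhooks/github" },
);
console.log(JSON.stringify(line, null, 2));
```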

That’s a bit of a whirlwind tour of some of the things that accelerate us here at incident.io. Over time, we hope to write up (and maybe even open source) more and more of them.

We’ve got a laundry list of other improvements we’d love to make, but we’re also pretty happy with where we’ve got to so far. We have to make tough choices about investing in this kind of work vs. shipping features; this stuff is only worthwhile if we go ahead and use it to ship great product. But we firmly believe that investing in the right areas of developer experience is going to be key to our success as an engineering team.

Lisa Karlin Curtis
Technical Lead