Keeping it boring: the incident.io technology stack

February 26, 2026 — 8 min read

At incident.io we run a deliberately simple technology stack. Keeping things boring has allowed us to scale from a few hundred customers to several thousand, while having only two platform engineers. In this post I'll walk through the stack, explain some of the choices we've made, and touch on the challenges we're facing as we grow.

Our cloud provider

Historically, parts of the stack ran on Heroku, but as we scaled it made sense to move to Google Cloud Platform (GCP). We chose GCP because many of our early engineers were familiar with it, and because many of its core primitives are simpler, nicer abstractions than those found elsewhere, e.g. for running containerized workloads.

Our compute stack

Kubernetes

Our default for compute is GCP's Kubernetes offering: GKE Autopilot.

We chose GKE Autopilot as it manages the underlying nodes for us. We don't have to worry about scaling, forecasting, or patching, which lets us focus on deploying the workloads that keep incident.io running.

We chose Kubernetes over other methods of running containers in GCP because it provides a better experience for debugging and observing our applications.

There are trade-offs, of course. Since GCP manages the nodes, we have no control over their lifespan, and by extension, the lifespan of our pods. There is a 10-minute grace period when a node is removed from service. You can apply an annotation to extend it, but it isn't a silver bullet.
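
The annotation in question can be sketched like this. This is a minimal example, not our production config, and the protection it buys still depends on GKE's scheduling, so treat it as a mitigation rather than a guarantee:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker  # illustrative name
  annotations:
    # On GKE Autopilot, this marks the pod as "extended duration",
    # shielding it from scale-down eviction for longer than the
    # default grace period.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - name: worker
      image: example/worker:latest  # illustrative image
```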

We also don't have access to some of the sensitive system namespaces, so we're reliant on whatever software GCP chooses for core parts of Kubernetes. This bit us recently: we couldn't change the default DNS settings for GKE Autopilot and had to implement a workaround in the workload itself.
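
The post doesn't name the specific setting, but as an example of what "a workaround in the workload itself" looks like, Kubernetes lets you override DNS behaviour per pod via `dnsConfig`, without touching the cluster-level kube-dns config. A common tweak is lowering `ndots`:

```yaml
# Fragment of a pod spec (illustrative):
spec:
  dnsConfig:
    options:
      # Lower ndots so external hostnames resolve directly, rather
      # than first being tried against cluster search domains.
      - name: ndots
        value: "1"
```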

While GKE Autopilot has some limitations, it has overall been a net positive which has allowed us to focus on more impactful work.

Virtual machines

Not everything runs in Kubernetes. Some workloads have specific requirements that make Kubernetes an unattractive option, so we run them on virtual machines instead using plain Google Compute Engine instances.

These requirements vary, but fall into a few broad categories (and often a workload will dictate more than one of these): long-lived processes, large storage requirements, or large resource requirements.

Our virtual machines are configured with cloud-init, as opposed to more complex configuration management tools, meaning that hosts are treated as immutable and short-lived, rather than long-lived with changes continually applied over their lifetime. Applying or reverting a change is as simple as merging a pull request.
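
A minimal sketch of what that looks like in practice (file paths and service names here are invented for illustration): everything the host needs is declared once and applied at first boot, and changing anything means replacing the instance with one built from the updated config.

```yaml
#cloud-config
# Illustrative cloud-init for an immutable host. This runs once at
# boot; to change it, you merge a PR and roll a fresh instance.
write_files:
  - path: /etc/myapp/config.yaml
    content: |
      listen_addr: 0.0.0.0:8080
runcmd:
  # Start the workload once its config is in place.
  - systemctl enable --now myapp.service
```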

Database

Postgres is our application database of choice, so the natural option here is GCP's Cloud SQL. Other databases are available on GCP's platform, but they often come with trade-offs or higher complexity, and are not as simple and battle-tested as a plain Postgres instance.

For our Cloud SQL deployments, at least in production, we use the "Enterprise Plus" tier. This gives us much higher availability guarantees and less downtime for maintenance events, which reduces how much we need to architect around these concerns in our application logic.
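
In Terraform terms (a sketch, with illustrative names and a machine tier chosen for the example, not our actual config), the edition is a single setting on the instance:

```hcl
resource "google_sql_database_instance" "primary" {
  name             = "app-primary" # illustrative
  database_version = "POSTGRES_16"
  region           = "europe-west2"

  settings {
    edition           = "ENTERPRISE_PLUS"
    tier              = "db-perf-optimized-N-8" # Enterprise Plus tier family
    availability_type = "REGIONAL"              # HA standby in a second zone
  }
}
```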

Much like GKE Autopilot, Cloud SQL lets us focus on what matters rather than the minutiae of managing database servers.

Queuing

Our application architecture is highly event-driven, so a large proportion of the compute workload is driven through asynchronous tasks. We use GCP Pub/Sub as the message queue that powers this.

Unlike most of our other infrastructure, we manage our Pub/Sub topics and subscriptions in application code rather than traditional infrastructure-as-code like Terraform. This means a product engineer can create and start using a new topic or subscription with a single code change, rather than needing to coordinate across multiple systems.
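
As a sketch of what managing queue topology in application code can look like (the names and structure here are ours for illustration, not incident.io's actual code): a declarative registry of topics, plus an idempotent "ensure" step run at startup. In production the admin object would wrap the google-cloud-pubsub client; it's kept abstract here so the pattern is clear without GCP credentials.

```python
from dataclasses import dataclass, field


@dataclass
class TopicDecl:
    """A Pub/Sub topic and its subscriptions, declared next to the
    code that uses them."""
    name: str
    subscriptions: list[str] = field(default_factory=list)


# Adding an entry here is the "single code change" a product engineer
# makes to start using a new topic. (Names are illustrative.)
TOPICS = [
    TopicDecl("incident-updated", ["incident-updated-notifier"]),
]


def ensure_topology(admin, decls):
    """Idempotently create any missing topics and subscriptions.

    `admin` is anything exposing topic_exists / create_topic /
    subscription_exists / create_subscription.
    """
    for t in decls:
        if not admin.topic_exists(t.name):
            admin.create_topic(t.name)
        for sub in t.subscriptions:
            if not admin.subscription_exists(sub):
                admin.create_subscription(t.name, sub)


class InMemoryAdmin:
    """Test double standing in for the real Pub/Sub admin API."""

    def __init__(self):
        self.topics = set()
        self.subscriptions = {}  # subscription name -> topic name

    def topic_exists(self, name):
        return name in self.topics

    def create_topic(self, name):
        self.topics.add(name)

    def subscription_exists(self, name):
        return name in self.subscriptions

    def create_subscription(self, topic, name):
        self.subscriptions[name] = topic


admin = InMemoryAdmin()
ensure_topology(admin, TOPICS)  # first boot: creates everything
ensure_topology(admin, TOPICS)  # subsequent boots: a no-op
```

The nice property is that the ensure step is safe to run on every deploy, so topology changes ship through the same pipeline as the code that depends on them.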

Deployments

Argo CD

We use Argo CD to deploy our Kubernetes resources. It fits nicely into the GitOps workflow that we follow for our Kubernetes resources: all of these are managed in code, alongside "snapshots" to show the templated-out manifests, and when a pull request is merged, Argo CD automatically applies any changes.
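
For readers unfamiliar with Argo CD, the unit of deployment is an `Application` resource pointing at a path in a Git repo; with automated sync, merging to that repo is the deploy. A minimal example (repo URL, paths, and names are invented for illustration):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-app # illustrative
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/k8s-manifests # assumed repo
    targetRevision: main
    path: apps/web-app
  destination:
    server: https://kubernetes.default.svc
    namespace: web-app
  syncPolicy:
    automated:
      prune: true    # delete resources removed from Git
      selfHeal: true # revert out-of-band changes
```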

We chose Argo CD as it has one job, and it does it well: it's quick, well-architected and has a nice UI too. It is also a widely used tool, so there is a lot of community support and ongoing development.

Buildkite

Buildkite is our CI/CD tool. That means it orchestrates tasks like building our app, running tests, triggering deploys, etc.

We run large virtual machines in our GCP account, and Buildkite's control plane assigns jobs to the agents on those machines. If you want the full story on this, we wrote about it in The quest for the five minute deploy.

Keeping CI/CD fast is an ongoing project. As our engineering organization grows, and AI tooling allows engineers to be more productive, the amount of work our pipeline has to handle keeps increasing.

Terraform

Terraform manages our cloud infrastructure. Defining infrastructure in code lets us preview changes before we make them, and gives us an audit trail to review when things go wrong.

To deploy Terraformed changes, and ensure good separation of privileges, we use Spacelift as the deployment runner. Of course we could just build our own processes for this around existing CI/CD infrastructure, but Spacelift provides a load of useful primitives like access control, queuing and a developer-friendly UI, making it well worth the money.

Monitoring

Grafana Cloud is our monitoring provider, covering metrics, logs, traces, and profiles. Running a full observability stack takes a lot of work, and using Grafana Cloud allows us to run a lean platform team.

While we store our metrics, logs and profiles in Grafana Cloud, we store our traces ourselves.

Tempo is Grafana's trace storage engine, and we run it ourselves on a virtual machine. We self-host because the size and volume of our traces would be prohibitively expensive to keep in Grafana Cloud.

To ship telemetry data to Grafana (and Tempo), we use Grafana Alloy. Our application and other components are configured to emit data in Prometheus or OpenTelemetry format which Alloy natively handles.
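
The trace path in Alloy's configuration language looks roughly like this (a sketch; the endpoint is invented, and a real config would also wire up metrics and logs pipelines):

```alloy
// Receive OTLP from the application over gRPC and forward traces
// to our self-hosted Tempo. Endpoint is illustrative.
otelcol.receiver.otlp "default" {
  grpc {}

  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo.internal:4317"
  }
}
```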

We configure Grafana to alert on the data we collect. Important alerts get forwarded to incident.io (naturally), and lower priority alerts go to Slack.

We manage our alert definitions in code, as these are just as critical to get right as application logic. We've chosen to use Jsonnet here, which abstracts away the complexity of Grafana's alert configuration, making it straightforward for engineers to define alerts. Terraform then deploys the alerting resources in Grafana from the Jsonnet output.
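
As a flavour of what that abstraction enables (the helper library and its fields here are hypothetical, not our actual schema), an engineer defines an alert with just the parts they care about, and Jsonnet expands it into Grafana's full alert-rule format for Terraform to apply:

```jsonnet
// 'alerts.libsonnet' is a hypothetical in-house helper that expands
// these few fields into Grafana's full alert-rule schema.
local alerts = import 'alerts.libsonnet';

[
  alerts.rule({
    name: 'HighErrorRate',
    expr: 'sum(rate(http_requests_total{status=~"5.."}[5m])) > 1',
    for_: '10m',
    severity: 'page',  // routed to incident.io; 'ticket' goes to Slack
  }),
]
```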

Conclusion

For the most part, our stack is boring and that's by design. We use the fewest pieces of technology needed to get the job done. We lean on third-party providers for things outside our core competencies, which lets a small platform team support a growing engineering organization. But as you can see, there are areas where we've taken different approaches in order to tailor the platform towards the needs of our workloads and engineering team.

As incident.io continues to grow, we're going to run into more of these challenges, and we'll need more platform engineers to tackle them. If that sounds like your kind of problem, take a look at our open roles.

Matthew Barrington
Platform Engineer

