
Managing your resources in Terraform can be literally easy and actually fun

The Problem

We approached building a Terraform integration with a sense of trepidation. One of the things that motivates us is building features we think people are going to love using, and Terraform integrations are often not that.

Terraform integrations have a number of common pitfalls. Building resources by hand is tedious, and requires a deep understanding of their specification. Importing and managing existing resources is often painful too. It can be unclear which resources are being Terraformed, leading to state drift as others update them elsewhere.

Inevitably, when you need to make a change in a hurry, you end up having to debug discrepancies in a part of the state you don't care about in that moment.

There must be a better way.

The Solution

We think we’ve ended up with a solution that’s pretty great, and makes us feel good about encouraging our customers to use Terraform for their incident configuration. If you want a quick TL;DR you can refer back to, see the summary at the end.

Resources have a visual builder

Many of the resources that you might want to Terraform are pretty complex: indeed, that complexity is exactly why you want to control them in Terraform in the first place.

In our case (and likely yours too!), we already had a visual builder in the form of our Workflows and Schedules UI, so it made sense to allow our users to generate Terraform configuration using a flow they were already familiar with.

This is a straightforward way to create new resources, and it's much faster and more pleasant than writing out Terraform configuration by hand. You don't need to know what the specification looks like, and because we validate the configuration as it's built, you can be confident the generated output is accurate.

In the case of On-Call Schedules, this came with the added bonus of being able to use our Preview: you get a clear, visual representation of the result of your changes without having to commit them. This is especially valuable in critical areas like On-Call.

We also chose to dedicate time to ensuring the generated configuration was readable, so it'd be easy to understand once copied out of the UI. Small things like comments labelling the harder-to-understand parts of your spec can go a long way.

steps = [
  {
    # "Assign the incident follow-up to a user"
    id   = "01HY306G7ZDQW16XH3SPFEM37B"
    name = "follow_up.assign"
    param_bindings = [
      # "Follow-up"
      {
        value = {
          # "Follow-up"
          reference = "followup"
        }
      },
      # "Assignee"
      {
        value = {
          # "Incident → Reporter"
          reference = "incident.role[\"01HM6HNBF0GK0ZJF3SAC8CEA5A\"]"
        }
      },
    ]
  },
]
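
As a rough illustration of how a generator can attach those labels, here is a minimal Python sketch that renders one parameter binding with its human-readable label as a comment. The function name `render_binding` and its inputs are hypothetical, and this is not the generator described above. It relies on JSON string escaping, which handles the quotes in this example but not every HCL edge case (template sequences such as `${` would need extra handling).

```python
import json


def render_binding(label: str, reference: str, indent: int = 0) -> str:
    """Render one param binding as HCL-style text, labelled with a comment."""
    pad = " " * indent
    lines = [
        # The label is what the user saw in the visual builder.
        f"# {json.dumps(label, ensure_ascii=False)}",
        "{",
        "  value = {",
        # json.dumps escapes embedded double quotes as \" for us.
        f"    reference = {json.dumps(reference, ensure_ascii=False)}",
        "  }",
        "},",
    ]
    return "\n".join(pad + line for line in lines)


print(render_binding("Assignee", 'incident.role["01HM6HNBF0GK0ZJF3SAC8CEA5A"]'))
```

Keeping the label next to the opaque reference means a reviewer can read the diff without looking up IDs.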

Users can import and update existing resources using the visual builder

Being able to generate new resources with a visual builder is of limited use if you can't go back there to make changes! For us, this looked like displaying managed resources alongside those created by hand, allowing you to use the visual builder to make changes and simply re-generate your new configuration when you were done.

Lastly, we were aware many users had existing resources they'd want to migrate to being managed via Terraform. You can use the same flow as for creating a resource, with added steps for importing your existing resource into your state and telling your system that it's now managed externally.

To make this process as smooth as possible, we've automated this second part. You'll never need to tell us you've imported a resource: the moment you run terraform apply, we'll update its management information automatically.
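
One way to get this behaviour is to record management information as a side effect of any write made by the Terraform provider. The sketch below is a minimal Python illustration under two assumptions made purely for the example: that provider requests can be recognised by a `terraform-provider` prefix in the User-Agent header, and that management records live in a plain dict. Neither describes the real API, and `record_write` is a hypothetical name.

```python
# Assumed marker for requests coming from the provider (illustration only).
TERRAFORM_UA_PREFIX = "terraform-provider"


def record_write(managed_by: dict, resource_type: str, resource_id: str,
                 user_agent: str) -> None:
    """Mark a resource as Terraform-managed when the provider writes to it."""
    if not user_agent.startswith(TERRAFORM_UA_PREFIX):
        return  # Writes from the dashboard or other clients claim nothing.
    managed_by[(resource_type, resource_id)] = "terraform"


managed_by = {}
record_write(managed_by, "schedule", "abc123", "terraform-provider-incident/3.0.0")
record_write(managed_by, "schedule", "def456", "Mozilla/5.0")
print(managed_by)  # only the schedule written by the provider is claimed
```

Because the claim happens on write, an import followed by an apply needs no separate "I've imported this" step from the user.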

It's clear which resources are being Terraformed, and where they're being managed from

I mentioned earlier that we display your Terraform-managed resources alongside those managed in our UI. This might have given you pause: surely someone's going to edit one of my workflows and leave me with an out-of-sync state.

This is a common gotcha, and one we were keen to avoid without forcing you to make an all-or-nothing choice between Terraform and the UI.

Why might you want a mix?

Resources might be managed by a diverse group of people, with differing requirements. For example, at incident.io we use Schedules for all sorts of things: Schedules for urgent support might need to be managed in Terraform, but the “Which product manager owns the Changelog this week” rota doesn’t, and requiring our PMs to use Terraform to update it would make them and us sad.

So, to avoid confusion, we:

  • Made it obvious when a resource is being externally managed
  • Allowed you to specify where it's being managed so people can easily find the configuration if they do want to make a change
  • Prevented you from using the visual builder to make permanent changes to externally managed resources

There's a "break glass in case of emergency" option

Preventing accidental mutation of externally managed resources is great, and can save you lots of pain debugging a de-synced state. It comes with risk, though: you could end up unable to push changes during an incident or emergency due to state being out of sync, even if the conflicts are unrelated to what you need to fix.

We think it's important to have an escape hatch in situations like this, so we made it possible to "un-claim" a resource from within the visual builder. This allows you to persist your urgent changes there and then, leaving the state cleanup until after the incident has passed.
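
The guard and its escape hatch can be sketched together. This is a minimal Python illustration rather than the real implementation: `ExternallyManagedError`, `save_from_ui`, and `unclaim` are hypothetical names, and both stores are plain dicts.

```python
class ExternallyManagedError(Exception):
    """Raised when the UI tries to persist changes to an externally managed resource."""


def save_from_ui(resources: dict, managed: dict, resource_id: str, changes: dict) -> None:
    """Persist UI edits unless the resource is claimed by an external manager."""
    if resource_id in managed:
        owner = managed[resource_id]
        # Point the user at where the configuration actually lives.
        raise ExternallyManagedError(
            f"{resource_id} is managed by {owner['managed_by']} at {owner['source_url']}"
        )
    resources[resource_id].update(changes)


def unclaim(managed: dict, resource_id: str) -> None:
    """Break glass: drop the management record so the UI can save again."""
    managed.pop(resource_id, None)


resources = {"abc123": {"name": "Urgent support"}}
managed = {
    "abc123": {
        "managed_by": "terraform",
        "source_url": "https://github.com/my-company/infrastructure",
    }
}

try:
    save_from_ui(resources, managed, "abc123", {"name": "Urgent support (EU)"})
except ExternallyManagedError as err:
    print(f"blocked: {err}")

unclaim(managed, "abc123")  # emergency override; reconcile Terraform state later
save_from_ui(resources, managed, "abc123", {"name": "Urgent support (EU)"})
print(resources["abc123"]["name"])
```

Un-claiming only removes the management record. The Terraform configuration still needs updating to match afterwards, and the next terraform apply claims the resource again.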

Future-proofing

Once we started implementation, we quickly realised we'd likely want to do more of this in future. We’d make our future lives much easier (another thing we care lots about at incident.io!) by decoupling a resource’s management information from the resource itself.

To accomplish this, we created a managed resource abstraction.

{
  "managed_resource": {
    "managed_by": "terraform",
    "resource_id": "abc123",
    "resource_type": "schedule",
    "source_url": "https://github.com/my-company/infrastructure",
    "annotations": {
      "incident.io/terraform/version": "3.0.0"
    }
  }
}

This allows us to have a centralised way of keeping management information up to date, serialising it alongside our resources, and checking whether they are internally or externally managed. It also means we can add management information to existing resources without having to touch their data model at all.

Annotations give us a flexible way of storing additional information related to whatever is managing the resource - for example, knowing which Terraform version is responsible for a Workflow might be useful if we run up against any compatibility-related rough edges in future.
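
To make the decoupling concrete, here is a minimal Python sketch of the idea. The field names mirror the JSON above, but the `serialise` helper, the in-memory `index`, and the choice to emit `None` for internally managed resources are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ManagedResource:
    managed_by: str
    resource_id: str
    resource_type: str
    source_url: str
    # Free-form extras, e.g. the version of the tool managing the resource.
    annotations: dict = field(default_factory=dict)


def serialise(resource: dict, resource_type: str, index: dict) -> dict:
    """Attach management info at serialisation time.

    The resource's own data model is never modified: management records
    live in a separate index keyed by (resource_type, resource_id).
    """
    record = index.get((resource_type, resource["id"]))
    return {**resource, "managed_resource": asdict(record) if record else None}


index = {
    ("schedule", "abc123"): ManagedResource(
        managed_by="terraform",
        resource_id="abc123",
        resource_type="schedule",
        source_url="https://github.com/my-company/infrastructure",
        annotations={"incident.io/terraform/version": "3.0.0"},
    )
}

managed = serialise({"id": "abc123", "name": "Urgent support"}, "schedule", index)
unmanaged = serialise({"id": "xyz789", "name": "Changelog rota"}, "schedule", index)
print(json.dumps(managed, indent=2))
print(unmanaged["managed_resource"] is None)
```

Because the lookup is keyed by type and ID, supporting a new resource type means adding records to the index, not columns to that resource's table.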

Summary

Here’s a quick TL;DR of what we think you should keep in mind when adding Terraform support to your resources:

  • Provide a visual builder for both creating and updating resources
  • Make it easy for users to import existing resources
  • When generating config, a little readability goes a long way
  • Make it clear what’s being Terraformed and what isn’t, and where that configuration is kept
  • Externally managed resources should be immutable in your UI…
  • …but provide an escape hatch
Lisa Karlin Curtis
Technical Lead
