Building safe-by-default tools in our Go web application

At incident.io, we're acutely aware that we handle incredibly sensitive data on behalf of our customers. Moving fast and breaking things is all well and good, but keeping our customer data safe isn't something we can compromise on.

We run incident.io as a multi-tenant application, which means all of our customers share a single database (and a single application). We could opt to run an entirely separate stack for each individual customer, and indeed this would give us the most confidence in data segregation, but it is a lot more complex to own and run, and would limit our ability to move fast and build a great product.

We've implemented a few different strategies to give us confidence that we're keeping our customers' data safe, aiming to make our application safe by default.

1. Set up robust, automated testing

To make sure our code works, we test it. This usually starts with running the code locally or on a staging environment and 'clicking around' to see if it does what we expect, but we also rely on using some flavour of automated testing to make sure everything is behaving exactly as intended. At incident.io we run our full suite of automated unit and integration tests in CI before every single deploy.

However, this still doesn't provide the kind of watertight guarantees we want around customer data. Firstly, we can only be confident things are working as expected if our tests cover all scenarios and, secondly, it still relies on engineers not writing bugs. It's easy to write a test which looks like it protects against this kind of bug, but doesn't cater for every possible case (I should know, I wrote one).

Testing is still crucial, but we wanted to remove the possibility that a mistake could slip through unnoticed. For that, we'd need something different.

2. Check for organization scoping at the API layer

We wanted something that we could build quickly, and would give us high confidence that we wouldn't encounter similar problems in the future.

Whatever we build should provide an invariant that we can rely on: we should never respond to an API request with resources without an organization scope, or a response that contains resources from mixed organizations.

We landed on a middleware: CheckOrganisationScope, which is applied by default to all of our authenticated API endpoints. This middleware expects all the resources our API returns to have an OrganisationID field on each struct. It looks for this field, and errors if either:

There is no OrganisationID field present
The OrganisationID value doesn't match the OrganisationID provided by the authentication middleware

Here is a condensed snippet of the code we wrote. You can see the full version here.

// checkOrganisationScope ensures a val, which is expected to be a pointer to a struct,
// has a valid OrganisationID field.
func checkOrganisationScope(orgID string, val reflect.Value) error {
	organisationField := val.Elem().FieldByName("OrganisationID")
	if !organisationField.IsValid() || organisationField.IsZero() {
		return ErrCheckOrganisationScopeMissingID
	}

	if resourceOrgID := organisationField.Interface().(string); resourceOrgID != orgID {
		return ErrCheckOrganisationScopeIncorrectID{
			ExpectedOrganisationID: orgID,
			ResourceOrganisationID: resourceOrgID,
		}
	}

	return nil
}

We like this approach for two main reasons:

We check every single request in all environments. If someone in production finds an edge case that we haven't anticipated, we are still safe, as we'll block the request.
It's in the outer layer of our application, so wherever we're getting the data from (e.g. database or cache) we still apply the check.

This helped us sleep a bit better at night! But it wasn't complete. This protects us when users are using the dashboard, but not if they're interacting via Slack, which makes up a significant part of our product interactions.

3. Enforce organization scope on database interactions

We use gorm to interact with our database. Gorm allows you to specify hooks which act as query middlewares, permitting you to modify the query or perform some check before it hits the database.

As we denormalise the organization ID onto all our database tables, it made sense to write a hook that ensures all queries have a where organisation_id = some_id clause. This means that it's not possible to accidentally query without filtering by organization: you have to explicitly disable the hook.

// EnforceScopeByOrganisationID restricts read and write database queries by failing
// whenever the query lacks an organisation ID. This allows us to protect against
// accidental exposure, or mis-scoping.
func EnforceScopeByOrganisationID(db *gorm.DB) {
	enforceScope := func(operation string, db *gorm.DB) {
		if skip, _ := db.Statement.Context.Value(skipScopeByOrganisationID).(bool); skip {
			return
		}

		switch operation {
		// These operations have where conditions, which we can search for organisation_id
		case "update", "query":
			columnName := "organisation_id"
			if db.Statement.Table == "organisations" {
				columnName = "id"
			}
			// The matched value isn't needed in this condensed version, so discard it.
			_, found := hasColumnWhereClause(db, columnName, operation)
			if !found {
				db.AddError(ErrEnforceScopeByOrganisationID{buildInfo(db, operation)})
				return
			}
		}
	}

	var hookName = "safedb:enforce_scope_by_organisation"

	db.Callback().Query().Before("gorm:query").
		Register(hookName, func(db *gorm.DB) { enforceScope("query", db) })
}

With this check in place, we can be confident that all data being read from and written to the database is scoped to a specific organisation, whether the originating request is triggered from the web dashboard or from Slack.

4. Write a safe-by-default interface to our caching service

For our cache we chose a different approach: we use the organization ID as part of the cache key for all organization-scoped cached information.

Our cache.Service interface looks like:

type Service interface {
	Get(ctx context.Context, org *domain.Organisation, key string) (val string, flags uint32, cas uint64, err error)
	Set(ctx context.Context, org *domain.Organisation, key, val string, flags, exp uint32, ocas uint64) (cas uint64, err error)
	Clear(ctx context.Context, org *domain.Organisation, key string) (err error)
	GetGlobal(ctx context.Context, key string) (val string, flags uint32, cas uint64, err error)
	SetGlobal(ctx context.Context, key, val string, flags, exp uint32, ocas uint64) (cas uint64, err error)
	ClearGlobal(ctx context.Context, key string) (err error)
	Ping(ctx context.Context) error
	Quit()
}

This means that in order to query our cache without specifying an org, you have to call the ...Global methods. While this might look like a very different implementation to the database hooks, it's using the same approach. We're making our interfaces safe by default: to do something dangerous (i.e. query without specifying an organization) we need to opt in to this dangerous behaviour.
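To make the key-scoping idea concrete, here is a small in-memory sketch. It is not our cache implementation: the memoryCache type, the Organisation struct, the simplified method signatures, and the "org:" and "global:" key prefixes are all assumptions chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Organisation stands in for domain.Organisation in this sketch.
type Organisation struct{ ID string }

// memoryCache is a hypothetical in-memory cache with simplified signatures.
type memoryCache struct {
	mu    sync.Mutex
	items map[string]string
}

func newMemoryCache() *memoryCache {
	return &memoryCache{items: map[string]string{}}
}

// scopedKey prefixes the key with the organisation ID. It assumes IDs never
// contain ':'; otherwise two organisations could produce the same key.
func scopedKey(org *Organisation, key string) (string, error) {
	if org == nil || org.ID == "" {
		return "", errors.New("cache: organisation is required for scoped keys")
	}
	return "org:" + org.ID + ":" + key, nil
}

// globalKey uses a separate prefix so global and scoped keys never overlap.
func globalKey(key string) string { return "global:" + key }

func (c *memoryCache) Set(org *Organisation, key, val string) error {
	k, err := scopedKey(org, key)
	if err != nil {
		return err
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[k] = val
	return nil
}

func (c *memoryCache) Get(org *Organisation, key string) (val string, found bool, err error) {
	k, err := scopedKey(org, key)
	if err != nil {
		return "", false, err
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	val, found = c.items[k]
	return val, found, nil
}

func (c *memoryCache) SetGlobal(key, val string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[globalKey(key)] = val
}

func (c *memoryCache) GetGlobal(key string) (val string, found bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	val, found = c.items[globalKey(key)]
	return val, found
}

func main() {
	cache := newMemoryCache()
	orgA := &Organisation{ID: "org-a"}
	orgB := &Organisation{ID: "org-b"}

	if err := cache.Set(orgA, "settings", "dark-mode"); err != nil {
		fmt.Println("set failed:", err)
		return
	}

	_, foundForA, _ := cache.Get(orgA, "settings")
	_, foundForB, _ := cache.Get(orgB, "settings")
	_, foundGlobal := cache.GetGlobal("settings")
	fmt.Println(foundForA, foundForB, foundGlobal)
}
```

Because the organisation is folded into the key, a value written for one organisation simply doesn't exist when another organisation asks for the same key, and the scoped methods refuse to run without an organisation at all.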

Where does this leave us?

Sometimes there is no silver bullet, and you need to tackle the problem from many angles.

We're pretty happy with the safeguards we now have in place, and feel more confident that our code can't behave in unexpected ways.

If you're looking to apply these strategies yourself, we're happy to chat about our experience @incident_io. And for anyone who wants to try the gorm database hooks, here's a gist of our code for inspiration: safedb.go.