
Organizing ownership: How we assign errors in our monolith

At incident.io, we run on a monolith.

This brings a whole load of benefits that we don’t want to give up any time soon. We don’t have to worry about the speed of internal network requests, complex deployments, or optimizing work that touches multiple services.

This blog post isn’t about the relative benefits of monoliths though (but we’ve written more about that here if you are interested)!

Ownership in monoliths is tricky.

Telling the right people about errors is really important. No-one wants to be paged in the middle of the night for code that they don’t have context on! Microservices naturally provide clear boundaries between code, but in a monolith it can feel like everything is in the same big bucket, without a sense of who owns things.

In this post, I’ll share exactly how we link our code to the team that owns it, so errors and alerting are routed to the right place with minimal maintenance burden. It’s changes like this that allow us to keep moving quickly and enjoying working with our monolith, as our organization and codebase continue to scale.

Defining ownership

What is the highest level of abstraction in your codebase that can reasonably be assigned to a single team?

To have clear and easily traceable ownership we want to divide our codebase up into chunks that can sensibly be assigned to a single team.

For us, the core business logic in our codebase is split up in a few ways, each with ownership assigned to a single team.

  1. Apps /apps/... These are any applications that we run separately to our monolith server. This includes our status page Vercel app and our React web dashboard.
  2. Packages /server/app/... Core backend packages grouping a single feature or section of business behavior. This could be follow-ups, custom fields, or billing.
  3. Integrations /server/integrations/... Any integrations with external providers. For us this includes Sentry, Jira, Datadog, etc.

Every single one of the subfolders in these three locations should be assigned to a single team.

If you’re splitting up ownership for the first time this will be difficult. Newer features have clear dividing lines, but older code, or shared packages can be more political. Getting together as a team and agreeing on an initial ownership that balances pager load and context is the best way to get to an agreement.

Lower-level packages that contain application logic or control how your code is run don’t need to be assigned, since they’re always called by higher-level code in packages that do have owners assigned.

Encoding your ownership

Now that we have clear definitions of ownership within our codebase, we need to encode them somewhere that both people and tooling can read.

For us, this looks like having a single required module file in the root of each subfolder. We write these in jsonnet, but you can use any language you prefer.

// module.jsonnet
{
  name: std.thisFile,
  // Owner is the team that owns this package - all errors will be routed to them 
  owner: 'on-call',
  // Criticality is how important this package is for customers
  // We’ll define rules around required reviewers or test coverage based on this
  criticality: 'medium',
  // Features are a list of real customer facing features this package powers
  // So we can connect our backend packages to what our customers understand. 
  features: [
    'holidays',
    'on-call-pay',
  ],
}

Make these module files required. Every one of our “features” must have an owning team. We have a CI step that ensures every single package we care about has clear ownership, meaning if someone adds a new feature package, there is no way of them forgetting to tag it.

MUST_HAVE_OWNERS=(
  apps/*
  server/app/*
  server/integrations/*
)

Roll up these values into a single CODEOWNERS file

To easily search for the owner of a particular package, we want to roll up all these package ownerships into a single file.

# CODEOWNERS

/server/app/action @incident-io/response
/server/app/ai @incident-io/post-incident
/server/app/alert @incident-io/on-call

Write a script that searches for all module files and drops the package along with the owner into this file, and run it in CI.
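A minimal sketch of the rendering half of such a script is below. It assumes the module files have already been evaluated into a simple list of directory and owner pairs (evaluating jsonnet is left out), and that a team slug maps directly onto a GitHub team under one organization. The Module type and renderCodeowners function are hypothetical names for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Module is a hypothetical, already-evaluated view of one module file:
// the package directory it lives in and the owner it declares.
type Module struct {
	Dir   string // e.g. "server/app/alert"
	Owner string // e.g. "on-call"
}

// renderCodeowners turns modules into CODEOWNERS lines, sorted by path so
// the generated file is stable between runs.
func renderCodeowners(org string, modules []Module) string {
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]Module(nil), modules...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Dir < sorted[j].Dir })

	var b strings.Builder
	for _, m := range sorted {
		fmt.Fprintf(&b, "/%s @%s/%s\n", strings.Trim(m.Dir, "/"), org, m.Owner)
	}
	return b.String()
}

func main() {
	fmt.Print(renderCodeowners("incident-io", []Module{
		{Dir: "server/app/alert", Owner: "on-call"},
		{Dir: "server/app/action", Owner: "response"},
	}))
}
```

Sorting the output keeps diffs small, which matters if CI regenerates the file and fails when the committed copy is out of date.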

Routing errors

Application errors are sent from code to monitoring tooling (like Sentry), and from there to alerting tooling (like incident.io).

To route our errors correctly, and to decide how to handle them and who to page, we need each error to be tagged with a team as soon as it leaves our app.

We’ve just defined clear owners for each of our packages, so our next step is to apply this to any errors that we’re throwing.

1. Extract team from stack trace

When errors leave our app, we want to assign a particular team in our error wrapping middleware (where you normally might apply tags, format your error, and send it to the right place). Our aim is to apply the correct tags so our error knows where to go in its onward journey.

We want to find the lowest level team assignment we can, since that is closest to the root of the problem, so we’ll keep recursively searching until we get there.

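import (
	"context"
	"errors"
)

// InferMetadata uses the module owners to determine which team is responsible
// for a given error. It returns nil when no owning team can be found.
func InferMetadata(ctx context.Context, err error) (team *string) {
	// Base case: we've unwrapped past the root of the error chain
	if err == nil {
		return nil
	}

	// Recursively search down our error chain
	// We want to always assign the team that has the deepest error
	// So we unwrap our error and find the team assigned to that
	childTeam := InferMetadata(ctx, errors.Unwrap(err))

	// If we found a team attached to the child of our error, let's use that
	if childTeam != nil {
		return childTeam
	}

	// Otherwise let's find the team for the current level of error we're on and use that
	return findOwner(ctx, err)
}

// stackTracer is implemented by errors that carry a stack trace. Frame stands in
// for whatever frame type your error library provides.
type stackTracer interface {
	StackTrace() []Frame
}

// findOwner takes an error and finds the lowest level team
// that owns the package it came from
func findOwner(ctx context.Context, err error) (team *string) {
	tracer, ok := err.(stackTracer)
	if !ok {
		// This error has no stack trace, so we can't infer an owner from it
		return nil
	}

	// Go through each stack frame and calculate the owning team
	for _, frame := range tracer.StackTrace() {
		// 1. Find source file from our stack frame
		// 2. Find entries in CODEOWNERS where the owned package exists in our source file path
		// 3. Pick an owner by finding the packages which are involved in our source file path
		// ownerForFrame is a placeholder for those three steps
		if team := ownerForFrame(frame); team != nil {
			return team
		}
	}

	// No frame matched an owned package
	return nil
}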
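The three steps inside that loop come down to matching a source file path against the entries in CODEOWNERS. Here is one way that lookup might be sketched, assuming CODEOWNERS has been loaded into memory as a string and that the most specific (longest) matching package should win. parseCodeowners and ownerForFile are illustrative names, not our real implementation.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseCodeowners reads "path owner" lines into a map keyed by package path,
// skipping blank lines and # comments. Only the first owner on a line is kept.
func parseCodeowners(contents string) map[string]string {
	owners := map[string]string{}
	scanner := bufio.NewScanner(strings.NewReader(contents))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 2 || strings.HasPrefix(fields[0], "#") {
			continue
		}
		owners[strings.Trim(fields[0], "/")] = fields[1]
	}
	return owners
}

// ownerForFile returns the owner of the most specific package whose path
// appears in file, or nil if no owned package matches.
func ownerForFile(owners map[string]string, file string) *string {
	var best string
	var team *string
	// Prefix a slash so relative paths match the same way as absolute ones.
	path := "/" + file
	for pkg, owner := range owners {
		// The surrounding slashes stop "server/app/ai" matching "server/app/ai-tools".
		if strings.Contains(path, "/"+pkg+"/") && len(pkg) > len(best) {
			owner := owner
			best, team = pkg, &owner
		}
	}
	return team
}

func main() {
	owners := parseCodeowners("/server/app/alert @incident-io/on-call\n")
	if team := ownerForFile(owners, "/build/server/app/alert/escalate.go"); team != nil {
		fmt.Println(*team)
	}
}
```

Returning a nil pointer for "no owner" keeps the result optional, so the caller can decide what to do with errors that originate outside any owned package.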

2. Tag team on error

Next, we assign the team we identify to all errors that we’re sending outside of our app.

Use your standard approach for tagging errors or logs to add another tag that your monitoring or alerting tooling can parse.

// team_override allows us to manually override the team on a particular error if needed
if fields["team_override"] != nil {
	fields["team"] = fields["team_override"]
} else if team != nil {
	fields["team"] = *team
}

Defining fallbacks

Your routing should always be best effort. Whilst most of your errors will come from your tagged modules, it’s essential that you don’t drop errors that don’t nicely fit into your defined ownership.

There will be scenarios where you can’t find a team, and it’s important to handle them in the rest of your stack.

For us, that looks like routing any unowned alerts to our On-call team in our alerting tooling (incident.io). You might want to handle the case where no team is assigned differently in the different places you handle your errors, so I’d recommend leaving these unowned errors unassigned as they leave your app, and treating this field as optional when you’re routing them.
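If you were writing the receiving side yourself, treating the tag as optional might look something like the sketch below. It assumes the tags arrive as a map with an optional "team" entry; the routeTo function and the "on-call" fallback slug are illustrative choices for this example.

```go
package main

import "fmt"

// fallbackTeam receives any alert that arrives without a team tag.
// The slug is an assumption made for this sketch.
const fallbackTeam = "on-call"

// routeTo picks the team to notify for an alert, treating the "team" tag
// as optional and falling back when it is missing, empty, or not a string.
func routeTo(fields map[string]any) string {
	if team, ok := fields["team"].(string); ok && team != "" {
		return team
	}
	return fallbackTeam
}

func main() {
	fmt.Println(routeTo(map[string]any{"team": "response"}))
	fmt.Println(routeTo(map[string]any{"message": "unowned error"}))
}
```

Keeping the fallback in the router rather than in the app means each downstream consumer can choose its own default.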

Summary

Monoliths can be great, but having all your code running in a single place can make defining ownership tricky.

By breaking our codebase into clearly defined chunks, and enforcing ownership by a specific team, automated error routing can be really easy. The maintenance burden is tiny for this - it just works! Enforcing ownership through CI and monitoring tooling means it’s impossible for someone to write new code and forget to assign it to themselves.

Tagging errors by team means routing them is smooth, reducing the burden on on-call engineers and keeping things ticking. It’s this system that allows us to scale while still enjoying the benefits of our monolith.

Martha Lambert
Product Engineer
