Improved CI/CD

One of our company values is ‘Raise the pace.’ We are constantly looking for ways to speed up and get value out to customers faster.

As our company grew, the time it took to deploy a change rose to 13 minutes.

When we first tried to improve this, we ran into rough edges with our CI/CD that prevented substantial time savings:

Parallelizing our checks was hitting an unacceptable cost-to-speed ratio
We were unable to effectively use the Go build cache
We were unable to effectively use our Yarn modules cache

13 minutes was too slow for us, and with our hiring plans, we knew now was the time to invest in improving things.

Engineer speed improvements

First, we invested time in making local development quicker.

While building a feature, we regularly run a suite of checks to ensure it works and meets our standards. An engineer will run these checks at least once per change they are making, so time here can add up quickly.

The big change we made was running our CI/CD checks on servers we own. This gave us more control over the resources we needed to run things. It allowed us to parallelize much better, make use of caching, and remove the start-up costs associated with checks.

We managed to halve the time taken here, bringing it from 8 minutes to 4 minutes.
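To illustrate why parallelizing helps, here is a minimal Go sketch of running independent checks concurrently, so the total time approaches that of the slowest check rather than the sum of all of them. It is not our actual pipeline: the Check type, the RunParallel function, ErrCheckFailed, and the check names are all hypothetical, and the checks are simulated with sleeps.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrCheckFailed is a hypothetical sentinel error used by the simulated checks.
var ErrCheckFailed = errors.New("check failed")

// Check is a hypothetical named check (lint, unit tests, and so on).
type Check struct {
	Name string
	Run  func() error
}

// RunParallel starts every check in its own goroutine, waits for all of
// them to finish, and returns the failures keyed by check name.
func RunParallel(checks []Check) map[string]error {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		failures = make(map[string]error)
	)
	for _, c := range checks {
		wg.Add(1)
		go func(c Check) {
			defer wg.Done()
			if err := c.Run(); err != nil {
				mu.Lock() // the map is shared across goroutines
				failures[c.Name] = err
				mu.Unlock()
			}
		}(c)
	}
	wg.Wait()
	return failures
}

// simulated returns a fake check that sleeps for d and then returns err.
func simulated(d time.Duration, err error) func() error {
	return func() error {
		time.Sleep(d)
		return err
	}
}

func main() {
	checks := []Check{
		{Name: "lint", Run: simulated(20*time.Millisecond, nil)},
		{Name: "unit-tests", Run: simulated(30*time.Millisecond, nil)},
		{Name: "typecheck", Run: simulated(10*time.Millisecond, ErrCheckFailed)},
	}

	start := time.Now()
	failures := RunParallel(checks)
	fmt.Printf("ran %d checks in %v, %d failed\n",
		len(checks), time.Since(start).Round(time.Millisecond), len(failures))
}
```

In practice the gain depends on having enough CPU and memory to run the checks side by side, which is where controlling the underlying servers matters.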

Deployment improvements

Improving things for ourselves is only half the battle; we also needed to make it quicker to get these changes out to you!

A change being available has multiple stages:

Run the full suite of checks again
Deploy the change to our pre-production environment
Deploy the change to our production environment
Run some post-deploy checks and steps
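The stages above run one after another, and a failure in an early stage should stop a change before it reaches production. A minimal Go sketch of that ordering follows. The Stage type, RunStages, ErrStageFailed, and the stage names are hypothetical stand-ins, not our real deploy tooling.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrStageFailed is a hypothetical sentinel error for the simulated stages.
var ErrStageFailed = errors.New("stage failed")

// Stage is a hypothetical named step in a deploy.
type Stage struct {
	Name string
	Run  func() error
}

// RunStages runs the stages in order and stops at the first failure.
// It returns the names of the stages that completed, plus the error (if any).
func RunStages(stages []Stage) ([]string, error) {
	completed := make([]string, 0, len(stages))
	for _, s := range stages {
		if err := s.Run(); err != nil {
			return completed, fmt.Errorf("stage %q failed: %w", s.Name, err)
		}
		completed = append(completed, s.Name)
	}
	return completed, nil
}

// succeed and fail are placeholder stage bodies for the example.
func succeed() error { return nil }
func fail() error    { return ErrStageFailed }

func main() {
	stages := []Stage{
		{Name: "full checks", Run: succeed},
		{Name: "deploy to pre-production", Run: succeed},
		{Name: "deploy to production", Run: succeed},
		{Name: "post-deploy checks", Run: succeed},
	}

	completed, err := RunStages(stages)
	if err != nil {
		fmt.Println("deploy stopped:", err)
		return
	}
	fmt.Printf("deploy finished after %d stages\n", len(completed))
}
```

Because the stages are sequential, the total deploy time is the sum of the stage times, so time saved in the check suite carries straight through to the deploy.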

By building on the savings we made for engineers and adding some deploy-specific changes, we are now able to get changes out to you in 7 minutes, down from 13 minutes.

We will be writing a more in-depth blog post on this work in the coming weeks, so stay tuned if you are interested in learning more!

What else we’ve shipped

New

You can now run a Backstage sync from catalog-importer in 'dry-run' mode to understand what changes will be made when you run the sync
Escalation paths in Catalog now have an 'all users' attribute, containing everyone on that escalation path, both directly and via schedules
We now support zelt.app's calendar feed for displaying holidays on schedules
Adding a branch in an escalation path now duplicates the existing path to the "else" branch, and converts your level nodes to low urgency
You can now filter by created_at in our Alerts API

Improvements

Adjusted the spacing on the on-call schedules page to prevent columns from overlapping
When changing your expression for alert priority in an alert source, the preview now reflects that change
The escalation timeline now shows acknowledgements that are more than a minute apart as separate items
We now warn about unsaved changes when you hit Esc while editing a summary
Improved documentation for usage of channel configs on an alert route in Terraform
You can no longer click Add without selecting a country when attaching public holidays to a schedule

Bug fixes

Fixed a bug where you couldn't add another escalation rule to some alert routes
You can now navigate through to alert attributes from the variable picker when setting custom fields on an alert route without the popover closing
Suggested summaries no longer cause unpredictable cursor behaviour when edited