Building On-call: Continually testing with smoke tests

With the release of On-call, our system’s reliability had to be solid from the outset.

Our customers have high expectations of a paging product—and internally, we would not be comfortable with releasing something that we weren’t sure would perform under pressure.

While our earlier product, Response, was the core of a customer’s incident response process after an incident was detected, we’re now the first notification an engineer gets when something’s wrong.

This needs to be bulletproof: our number one product feature is the fact that when things go wrong, we’ll always page you.

As part of this change in product offering, we introduced a number of technical and organizational changes to center reliability as our most critical technical goal, while also making it possible for us to develop improvements and enhancements to our product day after day without breaking things.

A key component of this was a library of Smoke Tests which continually test core journeys of our product. In this post, I’ll explain how we’re using them, and what we learned from building our own Smoke Test framework.

Our Smoke Test suite

Our Smoke Test suite consists of a number of tests which run through a variety of scenarios throughout our product: from simple things like ingesting an alert, to ensuring that a user is paged in the right way when that alert is set to escalate to them.

Every minute, we run this full suite of tests in production and staging environments. These run end-to-end in actual environments, so they exercise our infrastructure and third party integrations. If any of these tests fail consistently, we are alerted about them and can proactively investigate.
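To make the shape of this concrete, here's a minimal sketch of a continuously running suite, assuming a Go codebase; the SmokeTest type, test names, and timings below are illustrative rather than our actual framework.

```go
package main

import (
	"context"
	"log"
	"time"
)

// SmokeTest is one end-to-end scenario run against a live environment.
type SmokeTest struct {
	Name string
	Run  func(ctx context.Context) error
}

// runSuite runs every test once, logging failures. In a real setup it's the
// consistently failing tests that would page someone.
func runSuite(suite []SmokeTest) {
	for _, test := range suite {
		ctx, cancel := context.WithTimeout(context.Background(), 45*time.Second)
		if err := test.Run(ctx); err != nil {
			log.Printf("smoke test %q failed: %v", test.Name, err)
		}
		cancel()
	}
}

func main() {
	suite := []SmokeTest{
		// Placeholder bodies: real tests ingest alerts, check escalations, etc.
		{Name: "ingest_alert", Run: func(ctx context.Context) error { return nil }},
		{Name: "alert_escalates_to_user", Run: func(ctx context.Context) error { return nil }},
	}

	// Run the full suite once a minute, forever.
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		runSuite(suite)
		<-ticker.C
	}
}
```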

Alongside this continuous background testing, we also run these tests on every change to our application, both when pull requests are raised, and as that merged change is deployed. In these contexts, the tests run using containers on a single machine, so they run quickly and haven't had any effect on our ability to deploy regularly.

We also test a number of weirder edge cases that prove elements of our system are fundamentally sound. For example, we let users parse alerts in our system using JavaScript—but what if that JavaScript is malformed? Do we still ingest those alerts?
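As an illustration of what one of those edge-case tests might look like, this sketch deliberately configures an alert source with broken JavaScript and then checks the alert still lands; the SmokeTestClient interface and its methods are hypothetical stand-ins for a real API client.

```go
package smoketests

import (
	"context"
	"errors"
	"fmt"
)

// SmokeTestClient is a hypothetical wrapper around the public API.
type SmokeTestClient interface {
	ConfigureAlertSourceParser(ctx context.Context, sourceID, javascript string) error
	SendAlert(ctx context.Context, sourceID string, payload map[string]any) (alertID string, err error)
	GetAlert(ctx context.Context, alertID string) (found bool, err error)
}

// RunMalformedParserTest checks that a broken user-supplied parsing
// expression doesn't stop us ingesting the underlying alert.
func RunMalformedParserTest(ctx context.Context, client SmokeTestClient, sourceID string) error {
	// Deliberately broken JavaScript for the alert source's parser.
	if err := client.ConfigureAlertSourceParser(ctx, sourceID, `this is not ( valid JavaScript`); err != nil {
		return fmt.Errorf("configuring parser: %w", err)
	}

	alertID, err := client.SendAlert(ctx, sourceID, map[string]any{"title": "smoke test alert"})
	if err != nil {
		return fmt.Errorf("sending alert: %w", err)
	}

	// Even though parsing failed, the raw alert should still exist.
	found, err := client.GetAlert(ctx, alertID)
	if err != nil {
		return fmt.Errorf("fetching alert: %w", err)
	}
	if !found {
		return errors.New("alert with malformed parser was not ingested")
	}
	return nil
}
```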

It’s the combination of these tests running on pull requests and continuously that lets us make changes to even the deepest parts of our escalations system, and still be confident that the application is operating as expected.

Running these tests as part of our code change process is great, but it's key that we also run them continually against our production and production-like environments. If we’re introducing a change, it’s not enough to know that it works in some hypothetical test environment that we spin up as part of a deployment—we need to know that the change has landed in a safe way.

It’s the impact of those changes on actual users of our product that matters to us. While we use infrastructure as code to define our environments and can spin up new ephemeral environments when needed, there are always nuances in your production environment, whether from the scale or the shape of your production data, that are hard to test against.

By testing in production (after we’ve carried out our first rounds of technical due diligence), we become aware of the realities of the change we’ve just introduced for our customers much sooner.

Learnings

Start with a clean slate

It’s tempting to follow this pattern: I have a test which creates some entities, then ensures that those entities are as I expect. Then, at the end of the test, I destroy those entities, because I don’t want to leave them hanging around.

This works okay—until the state you’re testing gets more complicated. What if your test fails, but it got halfway through running?

Then you’ve left some half-done configuration hanging around in your environment. It’s often the case that this prevents further successful runs of your Smoke Tests.

Repairing this in a production-like environment is really difficult: your tests are failing because they can’t bootstrap themselves, which means you’re having to manually intervene to make repairs; you’re reaching for CASCADE deletes to clean things up.

The easiest way to resolve this is to have every test start by tearing down anything left over from previous runs. Doing this removes the "Frankenstate" monster from the equation entirely.

You can even write tests for your cleanup function to ensure it’s deleting everything you’d expect it to!
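Here’s a rough sketch of that teardown-first shape, with a hypothetical EntityStore interface and namespace tagging standing in for our real setup code; written this way, the cleanup function is also easy to unit test.

```go
package smoketests

import (
	"context"
	"fmt"
)

// EntityStore is a hypothetical slice of the API used by this test: list and
// delete entities by the namespace tag the test owns.
type EntityStore interface {
	ListByNamespace(ctx context.Context, namespace string) ([]string, error)
	Delete(ctx context.Context, id string) error
}

// cleanup deletes everything owned by this test's namespace. Because it's
// plain code, it can be unit tested to check it removes what we expect.
func cleanup(ctx context.Context, store EntityStore, namespace string) error {
	ids, err := store.ListByNamespace(ctx, namespace)
	if err != nil {
		return fmt.Errorf("listing entities: %w", err)
	}
	for _, id := range ids {
		if err := store.Delete(ctx, id); err != nil {
			return fmt.Errorf("deleting %s: %w", id, err)
		}
	}
	return nil
}

// RunEscalationTest starts from a clean slate rather than tearing down at the
// end: if a previous run died halfway through, its leftovers are removed here.
func RunEscalationTest(ctx context.Context, store EntityStore, namespace string) error {
	if err := cleanup(ctx, store, namespace); err != nil {
		return fmt.Errorf("cleaning up previous runs: %w", err)
	}

	// ... create fresh entities under the namespace and run assertions ...
	return nil
}
```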

Use your standard rails

While starting with a clean state means that you’re partly insulated from weird database state impacting your test’s execution, you can avoid this happening in the first place if you’re sensible in how you access the database.

It’s tempting to use this access to peek behind the curtain, to put your hands on the scales. What if I create this little bit of config using a direct update to the table? What if I set up my test using my ORM?

If you’ve sensibly built abstractions in your codebase above direct database access, you should tread with caution here. Your abstractions likely offer extra guarantees and validation around what actions are acceptable to take against the database.

Your tests should use those methods.

By doing that, you’re testing a much more accurate version of your system’s behavior, with all the high-level guard rails you and your team have developed. This not only tests your system in a more faithful way, but also means you’re less likely to leave weird state from test runs lying around.

These were sensible abstractions when you wrote them—use them!
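To illustrate the difference, here’s a hedged sketch of setting up test state through a service-layer method rather than raw SQL; the EscalationPathService interface and the table name in the comment are invented for illustration, not our actual schema or API.

```go
package smoketests

import "context"

// EscalationPathService is a hypothetical stand-in for the service-layer
// abstraction the product itself uses, with its validation and side effects.
type EscalationPathService interface {
	Create(ctx context.Context, orgID, name string) (pathID string, err error)
}

// createTestPath sets up test state through the service layer, not raw SQL.
//
// The tempting alternative, something like
//
//	db.ExecContext(ctx, `INSERT INTO escalation_paths (...) VALUES (...)`, ...)
//
// bypasses the validation and side effects (audit logs, events, cache
// invalidation) the application depends on, and makes it much easier to
// leave invalid state behind after a test run.
func createTestPath(ctx context.Context, svc EscalationPathService, orgID string) (string, error) {
	return svc.Create(ctx, orgID, "smoke-test-path")
}
```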

Test your user’s assumptions

One of the tests in our suite covers the simplest end-to-end path in On-call: we attempt to create an alert and, when that alert is created, make sure we get escalated to as we expect.

It’d be tempting to just check the state of the database here: “Does the alerts table contain the alert we expect? Yes! What about the escalations table? Also yes! All good!”.

However, this alone isn’t enough to guarantee that your system is working correctly from a user’s perspective.

When users are interacting with a system like On-call, their expectations are often based around how quickly certain things are happening. If I create an alert, how quickly does that escalation come through?

Our system is asynchronous, and often depends on external providers to do things like sending notifications. While we have monitoring and alerting against all of these external dependencies, a "sum of the parts" is often the best measure for determining if the system is healthy or not.

To match this user expectation, our tests use blunt, time-based thresholds to determine whether a test passes or fails. If an alert takes a long time to progress through our system, our users aren’t seeing the behavior they expect, and something’s wrong!
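In practice that means the assertion isn’t just "did the escalation happen" but "did it happen within a deadline a user would consider acceptable". A rough sketch, with a hypothetical checker interface and an illustrative threshold:

```go
package smoketests

import (
	"context"
	"fmt"
	"time"
)

// EscalationChecker is a hypothetical slice of the API: has the given alert
// escalated to the user we expect yet?
type EscalationChecker interface {
	HasEscalated(ctx context.Context, alertID string) (bool, error)
}

// waitForEscalation fails the test if the escalation doesn't arrive within a
// threshold chosen from what users would tolerate, not just from correctness.
func waitForEscalation(ctx context.Context, checker EscalationChecker, alertID string, within time.Duration) error {
	deadline := time.Now().Add(within)
	for time.Now().Before(deadline) {
		escalated, err := checker.HasEscalated(ctx, alertID)
		if err != nil {
			return fmt.Errorf("checking escalation: %w", err)
		}
		if escalated {
			return nil
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("alert %s did not escalate within %s: users would consider this broken", alertID, within)
}
```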

By thinking in this manner you’re not just proving correctness, which you can reliably test through other mechanisms, but checking what the live, out-in-the-world system that users actually depend on is doing.

Think about what your users’ expectations of your system are: what would cause them to think "this feels a bit broken"?

Prepare to unearth the unexpected

As engineers, it’s regularly the case that we’re working with systems built from design decisions and assumptions that were correct at the time but eventually need to be rethought as the system scales out.

You codify those assumptions as you develop your applications. "No organization will ever have more than a hundred schedules” goes from a belief held unchallenged within the team to manifesting itself as a tight loop over an organization’s schedules in the lifetime of an HTTP request.

Building a Smoke Test suite that runs continuously as part of a new product launch is a great way to test those assumptions over the near to medium term.

After running constant traffic through our alerts system for a period of time, some cracks began to show: database migrations that had previously been instantaneous began to take longer, and batch operations we could previously do in an unsophisticated way required a bit more thought about their performance.

If you’ve got a customer that’s a heavy user of your product, their traffic profile can very quickly exceed the boundaries of your continuous testing.

This change in thinking regarding these teething issues wasn’t just confined to our team: at incident.io, we have a monolith and On-call is deeply integrated into the rest of our product. This means that changes introduced for a single team’s operations can have a large impact on the rest of the engineering org. Running automated tests was a great way of getting the whole organization to think about our unstated assumptions—we helped level up the entire organization because of this technical interdependency.

These were all things that already existed in our system and that, had we been successful, we would have hit very quickly. It’s just that those teething issues would’ve had an upset customer on the end of the line, not an automated system you control.

Conclusions

I hope that’s been a useful overview of how we’re using and thinking about smoke tests at incident.io.

Our Smoke Testing setup is just a small part of the overall reliability posture we’ve adopted, alongside numerous technical and cultural changes, all of which were part of releasing On-call.

Releasing a product that’s such a core part of customers’ technical and business operations is a big step for us—we needed to be totally comfortable that the system we built is robust, and that we can keep making changes as we continue to develop and improve the platform at pace.

Rory Malcolm
Product Engineer
