Engineering

Building On-call: Time, timezones, and scheduling

Working with time

Our On-call has been in the wild for a few months now, and in this post, I want to talk about building a time-sensitive system and what we did to handle some of the challenges. I’ll cover what our scheduler is responsible for, the basics of working with time, and talk a bit about how we tested our system.

What’s involved with building an on-call scheduler?

Our On-call product spans four areas: alerting, scheduling, escalations, and paging. This piece is going to focus on our scheduler, responsible for calculating which users are on shift at any time for a given schedule configuration.

So what does a schedule configuration look like? Our schedules include one or more independent rotations, where each rotation includes one or more on-call users and a set of rules for determining:

  1. At what times the rota is active
  2. At what times on-call duty switches over between users on the rota

On the first point, rotas fall into two broad categories: always-on, or (more commonly) with defined working hours that specify when the rota is active on a given day of the week. A familiar use case is a 9-to-5 rota that puts users on call during office hours only. On the second, most rotas assign one or more users to a shift at any one time, on a round-robin basis.

Add in a few minor complications (e.g. shift overrides, irregular handover periods) and you’ve got a scheduler that’s flexible enough to handle most of the on-call configurations we see in the wild.

On the face of it, building a system to do this is straightforward. Modern databases and standard libraries generally have excellent tools for working with time (Javascript notwithstanding) and the basic principles and gotchas have been covered at length across StackOverflow. Having said that, it never hurts to take it back to basics (cast an eye over this excellent piece from Zain Rizvi for a reminder of just how much complexity lies behind the things that we take for granted).

Components of time

When working with time, it’s useful to be precise. In day-to-day life, we can get away with ambiguous references to temporal concepts because edge cases are rare. Clocks change only twice a year under daylight saving time (DST) rules, and most of us change timezones infrequently enough that “let’s chat tomorrow at 10” works as an identifier of a point in time: most of us would put something in our calendar at 10 a.m. local time on the following date.

Needless to say, when you’re building a scheduler that has to work all year round, and in every time zone, you need to think a little more carefully. Returning to the above example: what happens if the two participants are in different time zones? What happens if their shared calendar system is in a third time zone? If DST kicks in overnight? What if we know the local time offset, but not the exact DST rules that our companion is using? Or, worst case, what happens if the meeting was scheduled for midnight on a day on which the clocks go back an hour? (We’re still not exactly sure what the right answer is to that one.)

The first step in making sense of these complexities is to be precise about exactly what we mean when we refer to a time.

Local time

The most familiar conception of time—known in computing as "wall clock" time—is the time that we deal with every day. When we think of 10 a.m., we don’t think about timezones, DST, or dates. Instead, we look at the (hopefully correct) clock on the wall and we know it’s 10 a.m. We can talk to the people around us about upcoming events and have confidence that we’re all referring to the same time.

Local time is relevant for us as our users often want to work in local time (e.g. they want rotas that start at 0900 every working day) and see outputs in their timezone.

Time offsets

A time offset is the difference between local time and UTC (or its closely related but subtly different cousin, GMT). Most of us encounter offsets either when traveling east or west, or because we live in a region that observes daylight saving rules, where the local offset changes twice a year. Offsets range from minus 12 to plus 14 hours (a total spread of 26 hours) across the timezones in the IANA tz database.
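A standalone sketch of the distinction: a parsed timestamp keeps its offset but not the timezone that produced it. In Go, an RFC 3339 string like `+05:30` tells you how far from UTC you are, but not whether the writer observes DST.

```go
package main

import (
	"fmt"
	"time"
)

// offsetSeconds parses an RFC 3339 timestamp and returns its offset from
// UTC in seconds. Note that the offset alone doesn't identify a timezone:
// +05:30 could be India (no DST) or somewhere with entirely different rules.
func offsetSeconds(rfc3339 string) (int, error) {
	t, err := time.Parse(time.RFC3339, rfc3339)
	if err != nil {
		return 0, err
	}
	_, offset := t.Zone() // seconds east of UTC
	return offset, nil
}

func main() {
	off, _ := offsetSeconds("2024-06-01T10:00:00+05:30")
	fmt.Println(off) // 19800 seconds, i.e. 5h30m east of UTC
}
```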

Timezones

A timezone is a label (e.g. GMT, EDT, CET) that refers to a particular time offset. Each label’s offset is fixed, so for the roughly 40% of countries that observe daylight saving rules (in which clocks go backward and forward by a fixed amount at two points in the year), a clock change means a change of timezone, not a change of the offset associated with that timezone.

The rules by which DST changes happen, and the record of their history, are standardized by convention in the tz database mentioned above, which is maintained by UCLA professor Paul Eggert and hosted by IANA. The particular DST scheme that a region observes is given a name combining a continent with a place name (usually a city, but not always!), such as America/New_York, Asia/Sakhalin, or Antarctica/DumontDUrville.

Here in London, we’re used to the Europe/London timezone that alternates between GMT and BST (UTC+0 and UTC+1 respectively), changing twice a year as we transition to and from daylight saving time.

(Side note: For anyone working in aerospace, the moon and other celestial bodies will have their own timezone by the end of 2026.)

Timestamps/Instants

Known by various names in different contexts, a timestamp (e.g. the core component of time.Time in Go, Temporal.Instant in the proposed JS Temporal API, or an Instant in the Java standard library) represents a specific point in time, often stored as the number of nano/micro/milliseconds since a well-known point on the UTC timeline (for the languages above, the start of the UNIX epoch at 00:00:00 UTC on January 1, 1970). Setting aside relativistic concerns, timestamps unambiguously represent a moment in time (generally ignoring leap seconds, of which 27 have been added since their introduction in 1972) and can be manipulated and compared without needing to worry about timezones and daylight saving rules.

Our system is designed to deal with instants where possible, converting to local or offset times only at the edges of the system (i.e. when we’re displaying data to users) or when we’re doing calculations that require local time—but more on that later.

Durations

The aforementioned concepts refer to points in time, but we frequently need to specify the passing of time or measure the distance between points in time. We call these measures durations. Some durations are absolute: 3 hours, 3,600 seconds, or 120 hours, for example. These represent an unambiguous distance between two instants and can be added to or subtracted from each other to yield new durations, or combined with an instant to yield another instant.

Others have a little more ambiguity, and similarly to the local times above, encode some simplifying assumptions about the usual passing of time. For instance, “From now until the same time tomorrow” generally means “in 24 hours”, but that can vary if there’s an overnight DST change or if we’re crossing borders. It might seem trivial, but when you’re building a scheduler that serves customers around the world in local time, these things matter.

The build

On the server

We use Postgres for application data storage and, as a result, we can make use of the somewhat confusingly named timestamptz column type. The name stands for “timestamp with time zone,” although you’ll notice that Postgres actually stores these columns as UTC instants, with no timezone (or offset) information. The idea is that the Postgres developers have done a lot of the thinking and heavy lifting around consistent timezone conversion behavior, and using timestamptz delegates responsibility for that behavior to Postgres itself, which:

  • Should eliminate the possibility of timezone conversion errors in the application layer
  • Standardizes timezone conversion behavior across the product, which is particularly relevant when dealing with ambiguous DST-crossing local times

With Postgres handling our timezone conversion, we designed our application layer to work with instants as much as possible. Any conversion to and from local times (e.g. when dealing with schedule working hours that are expressed in wall-clock time) happens in clearly defined and tested places and we don’t leak them across the rest of the scheduling logic.

Go has a solid standard time library (setting aside the controversial decision to stray from strftime format strings when parsing and interpolating time information) that handles almost everything we need, but we did find the Monday library useful for date/time localization.

The UI

Storing and working with UTC is one thing, but our customers live and work across the world. They (usually) want to see times relevant to their local time zone, and it’s been interesting getting feedback from customers about how we should handle this. Unsurprisingly, what works for one user doesn’t always work for another, so we’ve been on a bit of a journey when it comes to displaying times in the app.

Our schedules are associated with one timezone but frequently involve users spread across several (e.g. follow-the-sun rotas that provide 24/7 coverage by scheduling responders across the world during their respective daytimes). Initially, we decided to display all of the times associated with a given schedule in the schedule’s timezone, believing that this consistency would be useful for on-call managers.

As it turns out, lots of users were quick to tell us that they’d actually rather be shown times in their local timezone, even when dealing with schedules that are built around a different timezone. For one customer, being able to view schedules in multiple time zones at the same time was a hard requirement and something we hadn’t anticipated.

Changing this behavior isn’t a huge lift, but it’s something worth considering early on when you’re building your UI. If your customer base is used to working across the globe, be ready to build some flexibility into how times are represented (we’ve had to add timezone selectors, copy, and tooltips in places that we hadn’t planned for) and take some, well, time to really think about how working across timezones affects your product flows.

One thing that turned out to be a hit was using the Sherlock library to allow users to enter natural language phrases to pre-fill our override creation form. It started as a quick experiment, but we’re seeing that people are using it regularly to save time filling out the temporal fields on the form.

Testing and reliability

Reliability is a critical property of a successful on-call product. Downtime has a direct impact on our customers’ revenue, and we have to make sure that the right people get paged at the right time, every time. Our scheduler is at the center of determining who those right people are, and it became clear that we needed a solid testing strategy.

Testing time-based systems adds an extra layer of difficulty; it’s not enough for your code to be correct, it has to run correctly at the right time. Realistically modeling the passing of time in automated tests needs care as:

  • The input space is large. It’s all of the usual inputs to your code plus the current time, multiplied by all of the edge cases that can happen when we advance through time (especially when we’re dealing with time zones). Sometimes the same input arguments are evaluated twice on the same day, sometimes on either side of midnight, sometimes over a DST rule change, sometimes a week apart over the new year, etc.
  • It’s easy to build a test rig that doesn’t reflect the details of how the system experiences time. How many of us have written tests that pass in our local timezone and then fail on a CI/CD instance running in UTC?

Architectural considerations

We invested a lot in building a performant way to evaluate schedules. In essence, this logic takes a schedule and a time period as arguments and returns a time series representing who’s going to be on call at what times during that period. We made it flexible, fast, and well-tested, giving us a rock-solid piece of code that we can reuse across the product whenever we need to power on-call features.

Test strategy

We started by unit testing the components of our scheduler and indeed designed the scheduler to make it easy to do so. By investing in ensuring our time-based building blocks are sound, we can compose them without fear whenever we need to build new time-based features. This gives us a good foundation on which to build the rest of our testing strategy.

Snapshot testing

Our lead engineer wrote a while back about testing “weird stuff”: using snapshot tests to verify systems that can’t be satisfactorily verified with just unit and integration tests.

Our scheduler is a good candidate for this kind of testing, especially in its current state, before we’ve had a chance to add too many features and too much new complexity. It’s easy for us to use the scheduler to generate some known-good snapshots (in this case they’re JSON representations of schedule evaluations at different points in time) and then spend some time manually verifying them. When we’re happy, we can regenerate them on every build and diff the snapshots with the newly generated state to ensure we’ve not broken anything.

If we’re making changes to the scheduler that should alter the generated schedules, we can flag this with danger.js and make sure that the PR receives the extra scrutiny that it deserves.

Runtime auditing

As well as build-time testing, we wanted some extra confidence that the system is performing as we expect at run-time. To do this, we introduced a regularly scheduled audit job that runs independently of the scheduler and verifies that we’re making the right determination of who’s currently on call. The job takes all our active schedules, evaluates them, and compares the result to our cached currently-on-call values.

Given that it uses the same evaluation logic as the scheduler itself, it won’t catch bugs in the evaluation logic. However, correctly evaluating a schedule at the wrong time can still lead to us paging the wrong people, and our audit job protects against this.
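In outline, the audit is just a comparison loop: recompute who should be on call right now for every active schedule, and flag any schedule where that differs from the cached value the system is acting on. A minimal sketch with hypothetical names:

```go
package main

import "fmt"

// auditMismatches compares cached currently-on-call values against a fresh
// evaluation and returns the IDs of schedules that disagree.
func auditMismatches(cached map[string]string, evaluate func(scheduleID string) string) []string {
	var bad []string
	for id, cachedUser := range cached {
		if evaluate(id) != cachedUser {
			bad = append(bad, id) // drifted: alert on it before we page the wrong person
		}
	}
	return bad
}

func main() {
	cached := map[string]string{"primary": "alice", "secondary": "bob"}
	eval := func(id string) string {
		if id == "secondary" {
			return "carol" // fresh evaluation disagrees with the cache
		}
		return cached[id]
	}
	fmt.Println(auditMismatches(cached, eval)) // [secondary]
}
```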

Wrap up

This has been a quick look at some of the high-level concepts to consider when working with time. Hopefully you’ve learned something new about time, timezones, or the tz database. We covered a few specific things we did to build and test our scheduler, and although none of it is groundbreaking, hopefully there’s something useful here that you can take back to your team.

If you’ve got any thoughts on the above (even if it’s to tell me how badly wrong I’ve got it) we’d love to hear them, so drop us an email or get in touch on our socials.

Henry Course
Product Engineer
