Building On-call: The complexity of phone networks

How on-call delivers to your phone

Arguably the most important part of an on-call product is knowing that you will be notified when things break, wherever you are. When it comes to SMS and phone call notifications, we have to leave the familiar realm of the internet and JSON responses, and deal with systems that provide limited observability and insight into what’s gone wrong.

How calls and SMS reach your phone

A lot of things can go wrong when you try to send an SMS or make a phone call. To understand why, it’s useful to understand how an escalation like "Production is down" goes from our systems to your phone.

When we make a call or send an SMS, we send an API request to a telecom API provider, who places the call or sends the SMS on our behalf.

Our provider then forwards that request to one of their telecom network partners. Which partner it’s sent to will depend on which number we’re sending from: for example, it might be AT&T when we call from a US number, and Three when we call from a UK number.

Once it reaches the network partner, it mostly works like a regular phone call or SMS: the carrier will connect to whichever network your phone is on, and establish the call or deliver the SMS, and the rest is in the hands of your carrier.

But with this many steps, things are bound to go wrong, especially because our telecom API provider tries to give us status updates on the call or SMS throughout all of this. For a call, we should be told when it’s established or why it failed. For an SMS, we should be told when it enters the provider’s systems, when it’s sent from their partner network, and when it’s received by your carrier.

Normally, this all works as it should. But telecom is an old business, and sometimes an error code of “the user didn’t pick up” doesn’t actually mean that the user didn’t pick up…

So here are some things you might’ve thought were true about phone calls and SMS.

Things you thought were true about phone calls and SMS

Busy means the user can’t pick up

When a user can’t pick up the phone, such as when they’re already on another call, you get the SIP response code 486 Busy Here. It should mean the person you’re trying to reach is busy. But some carriers reject calls from international numbers with a 486 instead of a 400 Bad Request or 403 Forbidden. This isn’t even tied to a specific country: it’s decided carrier by carrier, and there’s no registry of how different carriers respond. It’s all trial and error.
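To make the ambiguity concrete, here is an illustrative Python sketch of treating a 486 as "maybe busy, maybe blocked". It is not our production code: the outcome labels, the CallAttempt type, and the carrier list are all hypothetical, and the list would have to be filled in by the same trial and error described above.

```python
from dataclasses import dataclass

# Hypothetical outcome labels, not taken from any provider's API.
ANSWERED = "answered"
BUSY = "busy"
POSSIBLY_BLOCKED = "possibly_blocked"
FAILED = "failed"

# Carriers observed to answer 486 to international callers.
# Placeholder value for illustration only.
CARRIERS_REJECTING_INTERNATIONAL_WITH_486 = {"example-carrier"}


@dataclass(frozen=True)
class CallAttempt:
    sip_code: int      # final SIP response code reported for the call
    from_country: str  # country code of the number we called from
    to_country: str    # country code of the number we called
    carrier: str       # destination carrier, if we know it


def interpret(attempt: CallAttempt) -> str:
    """Map a SIP response code to an outcome, without trusting 486 blindly."""
    if attempt.sip_code == 200:
        return ANSWERED
    if attempt.sip_code == 486:
        international = attempt.from_country != attempt.to_country
        if international and attempt.carrier in CARRIERS_REJECTING_INTERNATIONAL_WITH_486:
            # The carrier may be rejecting us rather than the user being busy.
            return POSSIBLY_BLOCKED
        return BUSY
    return FAILED
```

A caller could treat POSSIBLY_BLOCKED as a cue to retry from a local number or another channel, rather than assuming the person is on another call.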

High-risk numbers are high-risk

Our primary telecom API partner, Twilio, has a concept of “high-risk numbers”. A high-risk number should be one that’s identified as having a high risk for toll fraud. So for our trusted users who want to get a phone call when production goes down, we shouldn’t need to worry about high-risk numbers, right? Wrong. High-risk numbers are number ranges, and while some numbers in that range might have a high risk of toll fraud, there will likely be plenty that are perfectly legitimate.

A Sent SMS means it’s been received by the target network

When we send SMS, they go through the stages Queued → Sent → Delivered. Sent indicates that the SMS was sent from our API partner’s telecom network partner (e.g. AT&T in the diagram above), and Delivered (or Undelivered if it’s rejected) indicates that the target carrier (your phone network) accepted the SMS.

At least, that’s how it’s supposed to work. Some carriers, like Ireland’s Eir network, don’t accept international SMS, but instead of responding with a rejection, we just get nothing! To work around this, we now mark outbound SMS as “stuck” if they don’t move to a terminal state quickly enough, and then retry sending from a different number.
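The "stuck" check can be sketched roughly as below. This is a simplified illustration in Python, not our actual implementation: the OutboundSMS type, the two-minute threshold, and the helper names are assumptions, and a real provider may report more terminal statuses than the two named here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence

# Terminal statuses named in the text; a provider may have others.
TERMINAL_STATUSES = {"delivered", "undelivered"}

# Assumed threshold for illustration; the real value is a tuning decision.
STUCK_AFTER = timedelta(minutes=2)


@dataclass
class OutboundSMS:
    to_number: str
    from_number: str
    status: str        # e.g. "queued", "sent", "delivered", "undelivered"
    sent_at: datetime  # when we handed the SMS to the provider


def is_stuck(sms: OutboundSMS, now: datetime) -> bool:
    """True if the SMS has not reached a terminal status quickly enough."""
    if sms.status in TERMINAL_STATUSES:
        return False
    return now - sms.sent_at >= STUCK_AFTER


def next_sender(sms: OutboundSMS, pool: Sequence[str]) -> Optional[str]:
    """Pick a different number to retry from, or None if there is none."""
    for number in pool:
        if number != sms.from_number:
            return number
    return None


# Usage: an SMS that has sat in "sent" for five minutes is retried
# from another number in the (hypothetical) pool.
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sms = OutboundSMS("+12025550100", "+447700900000", "sent", t0)
if is_stuck(sms, t0 + timedelta(minutes=5)):
    retry_from = next_sender(sms, ["+447700900000", "+447700900001"])
```

The key design point is that silence is treated as a signal: with no rejection to react to, a timeout is the only thing that can trigger the retry.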

Buying phone numbers is easy

In both the UK and the US, it’s pretty straightforward to buy an outbound number to send SMS or make calls from. You pick what you want to do with the number (make phone calls and send SMS), fill out a regulatory registration form to detail how you’re going to use the number, and pay some money. If you want a shortcode number for SMS, you fill out a shortcode application, wait a couple of weeks, and pay some money.

But try buying a phone number in Latvia: you can’t. Or a shortcode in Ireland? Tough luck, we were told the last time a shortcode became available was about a year ago! What about Germany, one of the biggest economies in the EU? You’ll need a German business address and business registration for that. Yikes.

If an SMS delivers now, we should be able to send an SMS to the same person in an hour

Some countries, notably China, apply rules for sending SMS depending on the time of day and the urgency of your SMS 🙃

Anyone can reply to a shortcode number

We use shortcode numbers, like 65 956, to send on-call notifications in the US and UK, to improve our deliverability. We also rely on people replying to those numbers to acknowledge escalations. However, some UK carriers prevent their customers from sending SMS to a shortcode number, and it’s not even configurable!

If someone can send SMS to a UK number, it means they can send SMS to our UK number

When we don’t have a local number to use for SMS notifications, we usually fall back to one of our UK numbers. Now, if we’re able to send an SMS to a user in another country, and they receive that SMS, surely they can also reply, right? Nope. Even if someone is able to send international SMS, and can send SMS to a different number in the UK (or in any other country we send from; we tried Sweden too), their carrier might still be unable to pass the SMS message to our telecom network 😭
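As a rough illustration of the fallback, the Python sketch below picks a local sender where one exists and otherwise uses a UK number, flagging that replies may not get through. The sender table, the numbers, and the function name are all hypothetical, and a local number does not guarantee that replies work either.

```python
from typing import Dict, Tuple

# Hypothetical sender numbers keyed by country code; not real numbers.
SENDERS: Dict[str, str] = {
    "US": "+12025550100",
    "GB": "+447700900000",
}
FALLBACK_COUNTRY = "GB"


def pick_sender(recipient_country: str,
                senders: Dict[str, str] = SENDERS) -> Tuple[str, bool]:
    """Return (sender number, whether it is local to the recipient)."""
    if recipient_country in senders:
        return senders[recipient_country], True
    # No local number: fall back to a UK number. The recipient may be
    # able to receive from it but still be unable to reply to it.
    return senders[FALLBACK_COUNTRY], False


number, is_local = pick_sender("LV")
# is_local is False here, so an SMS reply should not be the only
# way offered to acknowledge the escalation.
```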

Did we miss anything?

These are some of the things we’ve learned from our relatively short time working in the world of telecom. If you’ve spent some time banging your head against carrier regulations, SIP codes, and incomprehensible request diagrams (have you ever seen a SIP signaling diagram?!), and have a few more things you think we need to be aware of for building a world-class on-call product, we’d love to hear from you.

Leo Sjöberg
Product Engineer