
How to avoid bad assumptions during incidents


Trust, but verify!

When responding to an incident, your priority is to build an understanding of what happened and why, so you can work out how to fix it. That understanding is built on the discoveries you make during your investigation, and the data you capture along the way.

One of the most difficult situations to recover from is when the data you collected is either flat-out wrong or an unverified assumption posing as a fact.

I once worked on an incident where we hit this problem several times over, and suffered because of it. It’s a good example of why you should trust-but-verify past conclusions, and use incident gear-changes as prompts to raise the burden of proof required for your key assumptions.

How the incident started

It’s Wednesday, and a product team reports that their integration is failing when making requests to the public API.

They say their code is failing because it can no longer find the HTTP headers they expected. The headers are there, but things like Content-Type are now downcased to content-type, and their codepath doesn’t handle it.

An incident is raised, and the first piece of data comes in:

1. My manager emails me saying “ahh, I just realised I forgot to forward you this email!”.
It’s a Google Cloud Platform warning that says all Google Load Balancers will begin to downcase HTTP headers over the course of the next few months. We received this several months ago.

Perfect, we’ve found our issue!

We feel bad because we missed the notice, and there’s a certain karmic justice in us suffering this disruption as a result of our fumble. We open a ticket with Google asking if they can opt us out, but assume they’ll say no, so we change our app to handle the header downcasing properly (as it should anyway, since HTTP header names are case-insensitive per the standard).
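If you script against raw HTTP responses, the safe habit is to treat header names as case-insensitive everywhere. Here’s a minimal shell sketch, reusing the placeholder hostname that appears later in this post:

# Dump the response headers and match the name case-insensitively,
# so both Content-Type and content-type are handled
$ curl -s -D - -o /dev/null https://api.com/healthz | grep -i '^content-type:'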

That’s not the end of it, though. The next day, we get urgent complaints from some important customers explaining they’re having issues too, and can we “please revert this change!”

This is a clear external impact, so the incident shifts gears. As the stakes increase, so too does the burden of proof required for any underlying assumption that goes into your incident response. Our original theory was that Google implemented this change (see (1)), which would mean we can’t revert it ourselves, so it’s important we confirm this is the case before communicating to customers.

We attempt to refute (1) by checking whether the downcasing is happening at the Google Load Balancer layer, and in doing so capture our next data points:

We deploy a version of the public API from before we noticed this change and confirm that:

2(i). Requests to the application receive case-preserved headers (i.e. Content-Type)
2(ii). The same requests made to the Google Load Balancers in front of the application yield downcased headers (i.e. content-type)
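Concretely, the comparison looked roughly like this; api-internal.example is a hypothetical hostname for reaching an application instance directly, while api.com (the placeholder used later in this post) goes through the Google Load Balancers:

# Hypothetical direct-to-application request: header case preserved
$ curl -v https://api-internal.example/healthz
...
< Content-Type: application/json

# The same request through the load-balanced endpoint: header arrives downcased
$ curl -v https://api.com/healthz
...
< content-type: application/json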

From this, we conclude that:

3. Changes we deployed between the last known good time and the start of the incident were not what caused the HTTP headers to be downcased

This being the case, we continue the support chat with Google, respond to our customers explaining that we can’t change this ourselves, and help them adapt to the new header values. It’s painful, but we manage to support people through the change and avoid any serious impact.

An awkward incident, but no harm done, right? Well, kinda.

Where the incident went wrong

As it turned out, most of the data we relied on during this incident was either incomplete, inaccurate, or downright misleading.

Several days after we’d closed the incident, Google confirmed that they’d never applied this HTTP header downcasing to our load balancers. While they had wanted to roll it out, as per their email, a number of their customers had non-compliant HTTP integrations that were case-sensitive about header names, and the change had been abandoned as not worthwhile.

Suitably shaken, we went back to the drawing board on (2) and tried to understand what had happened.

As we’d kept good logs during the incident, we were able to see the sample request we’d made to confirm 2(ii):

$ curl -v https://api.com/healthz
> GET /healthz HTTP/2
< content-type: application/json
...

2(i) had confirmed the application was sending Content-Type, and this request confirmed the Google Load Balancer responded with content-type.

But we’d missed a confounding variable, which changed the meaning of this evidence entirely. By making the request with curl, we’d been automatically upgraded to HTTP/2, which has changed a bit since HTTP/1.1…

Just as in HTTP/1.x, header field names are strings of ASCII characters that are compared in a case-insensitive fashion. However, header field names MUST be converted to lowercase prior to their encoding in HTTP/2
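With hindsight, pinning the protocol in the test client would have removed this confounder; curl lets you do that explicitly (flag availability depends on your curl build):

# Force HTTP/1.1, so header case on the wire isn't affected by HTTP/2's
# mandatory lowercasing
$ curl -v --http1.1 https://api.com/healthz
> GET /healthz HTTP/1.1
...

# And explicitly negotiate HTTP/2, where lowercase names are required
$ curl -v --http2 https://api.com/healthz
> GET /healthz HTTP/2
...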

Sigh. This means the conclusion that our recent code changes hadn’t impacted headers (3) was founded on incorrect observations, so we looked closer.

After some digging, we realised an upgrade to HAProxy had been deployed around the time we’d started seeing issues internally. Looking at the changelog, we saw that this HAProxy version bump included the following:

When HAProxy receives an HTTP/1 request, its header names are converted to lower case and manipulated and sent this way to the servers

Head meets desk, and cue some audible groans.
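One cheap check worth baking into this kind of investigation is confirming exactly which proxy version is installed on the hosts serving traffic, and matching it against the deploy timeline. Assuming shell access to the proxy hosts, something like:

# Confirm the HAProxy version on the proxy hosts, to compare against the
# changelog and the deploy history
$ haproxy -v
# -vv additionally prints build options, useful when behaviour depends on them
$ haproxy -vv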

But why didn’t we catch this? HAProxy is an obvious place for a change like this to originate, so why didn’t we look closer?

I’d like to say it was because the deployment time didn’t perfectly line up with when we first noticed the issue, but that wasn’t the main reason we’d dismissed it. Nor was it the number of changes we’d deployed in that window (of which there were a lot!) or unfamiliarity with the tool.

The biggest reason it was ruled out was that, with (1) and the subsequent confirmation from (2), we had such a compelling story that we didn’t think we needed to consider alternatives.

3 lessons we learned during this incident investigation

Any good investigation builds on assumptions, and it’s worth remembering that even the assumptions you think you’ve proved might turn out to be false.

Catching these errors can be difficult, but there are some strategies we’ve found useful for reducing your chances of hitting them.

Show your working

Whenever you perform a test, run a command, or verify a statement, share all the data you have in a log that everyone responding to the incident can see. Being able to go back into your incident Slack channel and see exactly which commands you ran can be the difference between understanding an issue and facing an unsolvable problem.
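One low-effort way to do this is to record your whole terminal session to a file you can paste into the channel or attach to the incident afterwards; the standard script utility is enough:

# Record everything typed and printed in this shell to incident-notes.log
$ script -a incident-notes.log
# ... run your investigation commands ...
$ exit    # stop recording and write the file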

In short, don’t let yourself end an incident with “what ifs?” about the actions you took.

Cross-check

Use your incident process to help prompt you to revalidate your assumptions. If you’ve changed the incident severity, the burden of proof you applied to old conclusions might no longer be appropriate for what you’re now dealing with.

Go back and recheck your work. Make sure you’re not hinging a company-extinction-level incident on a gut-feel call someone made days ago, when the incident looked more like a one-off bug.

Avoid the allure of a good story

The idea that we were suffering because we’d screwed up our GCP support setup fed into my personal biases. I knew we weren’t great here, and I’d been expecting to eventually face consequences for this type of underinvestment.

Be aware of how you might be swayed, and discard the information that isn’t relevant. Don’t let your biases sway others either: get other people to cross-check your data.

Sometimes you’ll do all of these things and still get it wrong, and that’s life for incident managers! But incorporating these tricks into your incident response playbook should reduce the chance you lean too hard on an assumption that has shaky foundations.


Lawrence Jones
Product Engineer
