We recently moved our infrastructure fully into Google Cloud. Most things went very smoothly, but there was one issue we came across last week that just wouldn’t stop cropping up.
What follows is a tale of rabbit holes, red herrings, table flips and (eventually) a very satisfying smoking gun. Grab a cuppa, and strap in.
Our journey starts, fittingly, with an incident getting declared... 💥🚨
The trigger for this incident was a larger-than-normal number of Postgres connection timeouts. This wasn’t something we’d seen with our previous infrastructure, and we’d done plenty of testing beforehand, as you'd expect when migrating millions of requests and huge amounts of data.
Even so, it’s always hard to pre-empt everything, so concerned we’d missed something, we got to work.
When we looked at the number of new connections being opened per second, it was in the hundreds, spiking up and down like crazy.
We weren’t quite sure what was happening, but we had a few initial theories. One of those was a potential race condition where connections were expiring, many goroutines were trying to re-use the same dead connection at the same time, and each was subsequently opening up new connections en masse.
We figured it was worth trying two things:
Shortly after making those changes, we noticed an immediate improvement. The rate of new connections dropped from ~200/s to ~1/s. Phew, back to normal. Still plenty to investigate, but healthy graphs, at least.
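We won’t reproduce the exact changes here, but as a rough sketch of the kind of connection-pool tuning involved, assuming Go’s standard database/sql pool (the driver choice and values below are illustrative, not what we actually shipped):

```go
package main

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq" // hypothetical driver choice, for illustration only
)

func openDB(dsn string) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}

	// Keep a pool of warm connections around rather than letting them all
	// expire and then re-dialling in a thundering herd.
	db.SetMaxIdleConns(10)
	db.SetMaxOpenConns(50)

	// Recycle connections gradually instead of all at once.
	db.SetConnMaxLifetime(30 * time.Minute)
	db.SetConnMaxIdleTime(5 * time.Minute)

	return db, nil
}
```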
Postgres wasn’t the only part of our infrastructure having a tough time at this point. In our product, we use a memcached instance to hold transient data and speed up some API responses. Usage of this cache was also experiencing sporadic connection timeout issues.
Our memcache client has a default network timeout of 100ms. That’s pretty generous, given that our memcache server is generally <1ms away, but does mean that any TCP packet that's been dropped will probably result in a timeout being hit.
From our perspective, that's fine and actively desirable—we don't want to delay responses on the basis of the cache being unavailable.
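In code, that means treating cache writes as best-effort: if a set times out, log it and move on rather than failing the request. A minimal sketch, assuming a gomemcache-style client (the helper, TTL, and logging here are illustrative, not our actual code):

```go
package cache

import (
	"log"

	"github.com/bradfitz/gomemcache/memcache"
)

// setCacheBestEffort writes a value to memcache but never fails the caller:
// a dropped packet or timeout gets logged and otherwise ignored.
func setCacheBestEffort(mc *memcache.Client, key string, value []byte) {
	item := &memcache.Item{Key: key, Value: value, Expiration: 300} // 5-minute TTL, illustrative
	if err := mc.Set(item); err != nil {
		// e.g. "read tcp ...: i/o timeout": worth counting, not worth failing on.
		log.Printf("saving value to cache: %v", err)
	}
}
```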
We were using the default of 2 idle connections, and there was a suggestion here that increasing this value could help with timeouts. We didn’t buy that, especially as the same person later followed up, saying that it should work fine with the default value.
Having read the library code, this was also our understanding.
We did notice that there's nothing that validates the health of a connection after checking it out from the pool, which means that if you last shoved a connection back into the pool a couple of days ago, it could still be picked up and we might try to send something down it just to receive a response of “there’s nobody here anymore!” However, we didn’t think this was actually what was happening—it definitely looked more like connection timeouts.
We opted to try increasing MaxIdleConns from 2 to 20. The library says it should be set to “higher than your peak parallel requests,” and we certainly did hit that threshold, so it seemed reasonable to give it an order-of-magnitude bump and see what happened.
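For reference, with a gomemcache-style client (the defaults described above match bradfitz/gomemcache, though treat the exact library and values as assumptions), the change looks something like this:

```go
package cache

import (
	"time"

	"github.com/bradfitz/gomemcache/memcache"
)

// newMemcacheClient builds a client with a short network timeout and a larger
// idle-connection pool, so bursts of parallel requests reuse warm connections
// instead of dialling new ones.
func newMemcacheClient(addr string) *memcache.Client {
	mc := memcache.New(addr)

	// Default is 100ms; the server is usually <1ms away, so this mostly
	// catches dropped packets rather than genuinely slow responses.
	mc.Timeout = 100 * time.Millisecond

	// Default is 2. The library suggests setting this higher than your peak
	// parallel requests; 20 is the order-of-magnitude bump described above.
	mc.MaxIdleConns = 20

	return mc
}
```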
The difference was almost immediate:
However, a few days later, we started seeing many more “read i/o” timeouts. The plot thickens…!
In an effort to reduce unknown variables, we decided to temporarily move away from Google’s hosted memcached to our own, running inside our Kubernetes cluster. This would, in theory, mean even lower latency, fewer network hops, and fewer unknowns or deviations compared to the managed service.
The first blue line is when we made that switch, and you’ll notice it didn’t really seem to make much difference:
That second blue line though, that looks extremely promising. What happened there? Well…
While all this had been going on, another unrelated issue had cropped up. Thanks to an accidental database join, we’d been making thousands of duplicate network calls to an external third party. Not the end of the world, but not ideal, either.
Guess when the fix for that got merged? You got it—the second blue line in the image above marks when that fix landed in production, and all the database connection and cache issues pretty much disappeared.
This strongly suggested there weren’t any fundamental issues with GKE or GCP networking. We therefore figured it was more likely that we’d got ourselves into a really bad place by making so many network calls in parallel as a result of the bad query.
The frustrating question, though, was why? Why would doing lots of external network calls unrelated to our database or our cache lead to an increase in Postgres and Memcache connection and request timeouts!?
This could take a while…
Although the database seemed OK now, the cache errors still hadn’t gone away entirely, so we dug deeper.
The error itself was pretty straightforward. The app was occasionally experiencing timeouts while setting a value in our cache.
```
saving value to cache: setting cache value: read tcp 10.16.4.33:35586->10.67.0.5:11211: i/o timeout
```
It was specifically hitting this most frequently when caching the list of timeline events for an incident (a record of all the things that happened, from messages to status page updates to code changes). This list includes data from a number of sources and is hit pretty frequently, so we cache it, to ensure users always get a speedy response.
This was only happening for a tiny handful of requests among hundreds of thousands, with seemingly no other consistent pattern when it came to which users, organizations, nodes, or requests it was affecting.
At this point, having made several improvements based on hunches, we still didn’t fully and confidently understand what was happening, let alone what we needed to do to eradicate the last few remaining errors.
We were left scratching our heads and decided to pause for the evening.
As Lawrence very astutely put it at the time:
The next day, we took a closer look at the logs for the Kubernetes node which had handled one of the affected requests. We could clearly see multiple cache failures happen during this request, all due to cache timeouts.
Taking a look at the trace for the request, we could see something pretty interesting. There were a few hundred cache set operations happening, seemingly simultaneously, for one request. There were also a ton of calls to generate URLs for objects in Google Cloud Storage at the same time.
These both make sense—when displaying a timeline, we pull in all the data we need, caching it as we do. For many items, there are also images or attachments we need to fetch. For longer timelines, each page of data is still substantial, so a few hundred items (and therefore a few hundred calls) isn’t all that surprising.
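One way to stop that kind of fan-out from bursting into hundreds of simultaneous connections is to cap the concurrency, for example with errgroup’s SetLimit. This is a hedged sketch of the pattern rather than what our timeline code actually does; the Item type and set callback are placeholders:

```go
package timeline

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// Item stands in for whatever a timeline entry looks like; purely illustrative.
type Item struct {
	ID string
}

// cacheItems writes each timeline item to the cache with at most maxParallel
// operations in flight, so a long timeline doesn't translate into hundreds of
// simultaneous cache sets (and hundreds of new connections).
func cacheItems(ctx context.Context, items []Item, set func(context.Context, Item) error, maxParallel int) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(maxParallel)

	for _, item := range items {
		item := item // capture loop variable (pre Go 1.22)
		g.Go(func() error {
			return set(ctx, item)
		})
	}
	return g.Wait()
}
```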
There are two things that are interesting here, though:
Looking at the node metrics, we also see a pretty big spike in the number of new connections being opened.
At the same time, we also see an increase in dropped packets on that node.
Looking at the logs for the node more broadly during this window, we can also see failed health checks because the node couldn’t reach the cache (a totally separate request). Something’s definitely up!
This revelation suggests that this isn’t a process-local issue. In other words, the network issues between our Kubernetes node and the cache aren’t necessarily caused by something like loads of goroutines trying to write to the cache at once (at least not directly); something is affecting the node (VM instance) as a whole.
This didn’t make a lot of sense at first. We’ve got a super simple GKE setup, and we haven’t seen this type of behavior in similar clusters that we’ve worked with in the past. The only relevant configurable behavior on a cluster like this was that we had Dataplane V2 turned on.
Still, at this point, even that deserved scrutiny and so we set about scouring the docs. Shortly after, we spotted what looked very much like a smoking gun:
> In certain cases, GKE Dataplane V2 agent Pods (anetd) can consume a significant amount of CPU resources, up to two or three vCPUs per instance. This occurs when there's a high volume of TCP connections being opened and closed rapidly on the node. To mitigate this problem, we recommend implementing keep-alives for HTTP calls and connection pooling for the relevant workloads.
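The mitigation the docs describe (HTTP keep-alives and connection pooling) looks roughly like this in Go: share one long-lived http.Client whose transport keeps idle connections around, instead of dialling per request. A sketch with illustrative values, not our exact configuration:

```go
package httpclient

import (
	"net/http"
	"time"
)

// newSharedClient returns a client intended to be reused across requests, so
// connections are kept alive and pooled rather than rapidly opened and closed
// (the pattern that pushes the anetd agent's CPU up).
func newSharedClient() *http.Client {
	transport := &http.Transport{
		// Keep-alives are on by default; the pool settings below control how
		// many idle connections survive between requests.
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 20,
		IdleConnTimeout:     90 * time.Second,
	}
	return &http.Client{
		Transport: transport,
		Timeout:   10 * time.Second,
	}
}
```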
Now that we had a workload (anetd) to look at, we checked its logs, but sadly there was nothing obvious there. We did see a decent jump in CPU, though:
When you consider that this isn’t a consistent issue, and that these metrics are counters that get smoothed out by the window function, it’s very possible, even likely, that for a couple of seconds the anetd pod was indeed really busy.
If you think about it, it all starts to line up pretty much perfectly with what we’d seen and done up until this point:
We did consider turning off Dataplane V2. However, this wasn’t an appealing option for a few reasons:
So many things, but a few key takeaways for us included:
Hopefully you enjoyed following us down the rabbit hole on this one. Although these bugs are always a little frustrating, it’s always nice to get to the bottom of them and share what we learned.
Also, if you have expertise in GCP, Kubernetes and building strong infrastructure foundations for fast moving product teams, we’re actively looking to hire our first few Platform Engineers to focus on this kind of challenge full time. Check out the job posting here.