It's always DNS (LLM edition)
Our Kubernetes cluster's DNS recently experienced partial degradation. We noticed it via DNS resolution errors in Sentry, mostly when calling internal services.
This incident took a bit to root cause - not that it was particularly thorny in retrospect, per se, but it involved diving into the guts of things one doesn't normally care much about. While this isn't the only interesting incident that happened this year 1, I rather enjoyed the investigation; and since it's not every day I get to drive an incident (thankfully!), I thought I'd log the experience, company postmortem aside.
Preamble
It was apparent the issue wasn't happening on 100% of DNS requests, so the impact was thankfully rather minimal. One critical internal service we call is pgbouncer, but resolving those hosts rarely failed, since that basically only happens at the app's startup. When they did fail, the pod would just restart; and since the problem wasn't hitting 100% of queries, the app was pretty much up for the whole duration of the incident. Other workloads that call internal services, such as ML services, run on Sidekiq and so retry on failure, so they were mostly unaffected. Most of the user-observable impact was some failed exports, since those jobs have a low retry count.
The incident technically started on a Friday at 3pm UTC, but since its impact was so minimal, we chalked it up as a temporary failure at the time and didn't pay much attention to it. It wasn't until Monday that I was tasked with getting to the bottom of it, as Sentry errors were very much piling up 2.
Let the investigation begin
Our infrastructure engineer was out Friday and Monday, which complicated things a bit, but he actually replied to me during his PTO (thanks Ricardo!): our initial theory was that DNS was failing due to excessive load, as we had seen happen in the past 3. Our kube-dns deployment was already scaled beyond what you'd normally consider reasonable, using kube-dns-autoscaler 4 (we use kube-dns since we're on GKE).
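For reference, the autoscaler reads its scaling parameters from a ConfigMap, and a quick way to eyeball them is something like the following - the ConfigMap name and the linear key are what I'd expect from a default GKE setup, so treat it as a sketch:
# Dump the autoscaler's scaling parameters (ConfigMap name assumed from the GKE default)
kubectl -n kube-system get configmap kube-dns-autoscaler -o jsonpath='{.data.linear}'; echo
# => e.g. {"coresPerReplica":6,"nodesPerReplica":1,"preventSinglePointFailure":true}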
My first thought was to restart kube-dns. Spoiler alert, this didn't solve the problem - I had actually tried it on Friday. Although, as we'll see later, it actually had a chance of working, for an unexpected reason!
To reproduce the problem, and actually see it with my own eyes, I simply exec'd into a pod and resolved a host in a loop - this gives a very immediate sense of the scale of the problem:
while :; do time getent hosts pgbouncer-web-ampledash.core; done
Occasionally, maybe once every 15s, I'd notice a slow request that would eventually time out. So I asked: what DNS server are we using, exactly? Good ol' /etc/resolv.conf has the answer, naturally:
root@ampledash-console-75467d5c8c-n5j9b:/app# cat /etc/resolv.conf
search core.svc.cluster.local svc.cluster.local cluster.local c.ampledash-prod.internal google.internal
nameserver 172.18.130.10
options ndots:1
Great! But who is 172.18.130.10?... Well, it's a Service, a ClusterIP. There's a bunch of kube-dns pods, so this is probably some sort of entrypoint load balancer thingy? Provided by GKE?
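(For what it's worth, there's no dedicated load balancer box behind that IP: the ClusterIP is virtual, and kube-proxy on each node rewrites it to one of the kube-dns pod IPs.) The Service and the pods behind it are easy enough to inspect - these are the standard kube-dns names on GKE:
# The ClusterIP that ends up in every pod's resolv.conf, plus the label selector
kubectl -n kube-system get svc kube-dns -o wide
# The actual kube-dns pod IPs it fans out to
kubectl -n kube-system get endpoints kube-dns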
At around this time, my colleague was experimenting with turning off our Kafka consumers to see if the issue really was load-related, since they were pretty busy. I kept running the DNS request loops in multiple pods.
At a certain point, the loops stopped breaking on slow requests! Rejoice! Was our problem solved?
For a while it really looked like we had been spared a much deeper investigation into the guts of kube-dns. But alas, the problem came back some minutes later.
False start
I then had the idea of talking to the kube-dns pods directly, instead of going through 172.18.130.10. Could the problem be this supposed load balancer? Listing the internal IPs of the kube-dns pods is easy enough with kubectl -n kube-system get endpoints kube-dns -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}. So I ran some loops against those IPs. Surprisingly, they worked correctly!
This is where we got a bit lost. Why would the main IP be flaky? It seemed more likely to be a problem in an individual kube-dns pod than in the main IP. I skipped lunch and spent a bunch of time trying to figure out how exactly kube-dns works, and why it could be flaky. I looked into the pod definitions; exec'd into them; found out they run dnsmasq; talked to it directly; checked the args, environment, config; ...
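To give an idea of the kind of poking around this involved, it was mostly variations on the following (the pod name is a placeholder, and the container list is what GKE's kube-dns typically ships - adjust to taste):
POD=kube-dns-xxxxxxxxxx-xxxxx   # placeholder: pick any kube-dns pod name
POD_IP=$(kubectl -n kube-system get pod "$POD" -o jsonpath='{.status.podIP}')

# Which containers does the pod run? (typically kubedns, dnsmasq, sidecar, ...)
kubectl -n kube-system get pod "$POD" -o jsonpath='{.spec.containers[*].name}'; echo

# Query that pod directly, bypassing the Service's ClusterIP
dig @"$POD_IP" kubernetes.default.svc.cluster.local +time=2 +tries=1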
In retrospect it was probably a bit too much, but we had no idea where the problem was! Not long after, though, I had a better idea: talk to all the kube-dns pods, and see which ones are failing! I had been brainstorming with Claude pretty much throughout the whole investigation - we even half-joked we should probably try sharing the screen and talking to Gemini 2, which had just come out.
And here comes the cool part: talking to all the pods, while seemingly easy when you're not in incident mode, seemed a bit too cumbersome after multiple hours of investigation. But Claude offered up a script that worked with just one minor follow-up. Here it is:
servers=(
  <kube-dns ips here>
)

monitor_server() {
  local server=$1
  while true; do
    dig @"$server" pgbouncer-web-ampledash.core.svc.cluster.local | grep "timed out" | xargs -I {} echo "$(date '+%Y-%m-%d %H:%M:%S') - Server $server - {}"
  done
}

# Start monitoring each server in the background
for server in "${servers[@]}"; do
  monitor_server "$server" &
done
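A small usage note that isn't part of the original script: the monitors run as background jobs of the current shell, each printing a timestamped line only when dig reports a timeout, so you can stop them all at once when you're done:
# Stop every background monitor loop started from this shell
kill $(jobs -p)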
With this script in hand, I quickly noticed that only two pods were consistently timing out! This prompted me to check which node they were running on, just for sanity, which is a simple kubectl command away. And bingo! They're running on the same node - and they're the only two kube-dns pods running on that node!
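The command in question was something along these lines (the k8s-app=kube-dns label is the usual one on GKE):
# -o wide adds a NODE column, which is all we needed here
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide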
This is why I said earlier that restarting all pods could have solved the issue - node placement could have changed such that no kube-dns pods ended up on the offending node. I believe pod placement, and nodes coming up and down, were also the reason that turning off our Kafka consumers briefly "solved" the issue!
From here, we could have investigated why the kube-dns pods on that particular node were timing out regularly. But we honestly didn't bother. We didn't assign a high probability to the issue happening again, and we had already looked into this for quite a while! So we decided to just remove the node from the cluster and move on. This was just a simple kubectl cordon and kubectl drain away. Once that was done, we stopped observing DNS timeouts! 🥳
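In concrete terms, that boils down to something like this (the node name is a placeholder; the drain flags are the usual ones for nodes running DaemonSets and emptyDir volumes):
# Stop new pods from being scheduled on the bad node
kubectl cordon <offending-node>
# Evict its remaining pods so they get rescheduled elsewhere
kubectl drain <offending-node> --ignore-daemonsets --delete-emptydir-data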
So, for me the main learning is: lean heavily on LLMs to generate quick 'n' dirty medium-complexity scripts that let you ask questions of the system! Normally you'd have to spend a little bit of time getting them right, but LLMs drastically lower the cost of asking such questions. One could imagine generating strace commands, eBPF, or even dtrace. I admit the final script, where we talk to each pod in a separate background loop, seems pretty trivial in hindsight, but it felt more magical during the incident!
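As an illustration of the kind of throwaway question I have in mind - not something we actually ran during this incident - a generated one-liner like this would watch DNS traffic on a node and print the delay between packets, surfacing slow answers just as well:
# -n: don't resolve names (we're debugging DNS, after all); -ttt: print time deltas between packets
sudo tcpdump -ni any udp port 53 -ttt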
Soon, autovacuum killer, soon I'll write something about you...↩︎
Our volume of Sentry errors also turned out to be a whole saga, which hopefully is solved now. We had to upgrade to the Business plan to get access to quota-protecting features such as rate limiting and "Delete and discard events". But even then, rate limiting is not perfect, since it's project-wide! To really protect our quota over, say, a weekend, we'd have to set it so low as to make Sentry basically useless for normal usage. As I understand it, Sentry doesn't support per-error rate limiting, for only slightly understandable reasons. I ended up having to set up custom per-error rate limiting (where our "unique key" of an error is slightly different from Sentry's) using the SDK's before_send callback and a Sidekiq limiter -- but I digress.↩︎
Our last DNS outage was a whole thing. Apart from the immediate impact, we rely heavily on DNS in general -- every day, I'm still astonished how much we use DNS at amplemarket! -- we make a bunch of decisions based on DNS records, and even persist them in our database. We crawl a lot of the interwebs - frankly, not enough! - so a bunch of things had to be recalculated. Since DNS is flaky even when it's working correctly - it's the wild west out there - I had centralized our explicit DNS queries (which took some effort!) into a single class some months ago. This DNS class does automatic retries with different servers; essentially, we want to make sure that if a record is empty, it's really empty, from multiple points of view. Thanks to this change, all our external DNS queries were completely unaffected! I intentionally made it so that the DNS class doesn't use our cluster's internal DNS resolver - both to reduce its load, and because it's slower. Maybe this deserves its own post, but for now it'll live in this footnote!↩︎
I ended up asking Claude to write me a small script to simulate the final number of kube-dns replicas based on the configurations of the autoscaler, which are a bit weird, and so I couldn't help but mention them.↩︎
nodes = 20.0
cores = 152.0
cores_per_replica = 6.0
nodes_per_replica = 1.0

replicas = [
  (cores.to_f / cores_per_replica).ceil,
  (nodes.to_f / nodes_per_replica).ceil
].max

puts "Number of replicas: #{replicas}"