Simpsonian 🍁︎

Sysadmin Sunday: my vacuum cleaner killed my WiFi

True story: my vacuum cleaner killed my home WiFi network.

I mean… kinda. Sorta. Close enough.

If you want the full story, you'll need to read an inordinate amount of arcane Unix DNS configuration detail. Or don't; I'm a blog post, not a cop.

In true Sysadmin Sunday fashion, I wanted to write down my debugging so that I can refer to it later—see that post to understand my weird notation below.

It's always DNS

Observation: when trying to surf the information superhighway, I am sporadically greeted by "Server not found" errors.

Knowledge: This is a DNS issue; it's always DNS. I've previously configured treebeard, my home server, to act as the DNS server for my local network, so we should look there.

Hypothesis: before checking the DNS server specifically, is something wrong with treebeard overall? Time to do my best Brendan Gregg impression…

josh@treebeard:~$ uptime
 16:37:16 up  7:18,  2 users,  load average: 4.12, 4.50, 4.60

Observation: Uhh, that seems way higher than I'd expect. (For the uninitiated: the "load average" in that output counts how many tasks have been waiting to run in the past little while [1]. Given that this server is usually idle, I'd expect those numbers to all be well under one.)
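To make that "well under one" expectation checkable, here's a minimal sketch that pulls the 1-minute figure out of uptime-style output and compares it against a threshold. The function names (load_1min, load_exceeds) are my own inventions, not standard tools, and the threshold is whatever "idle" means for your machine.

```shell
# Hypothetical helpers (not standard tools): parse uptime-style text on stdin.

# Print the 1-minute load average from a line like
# " 16:37:16 up  7:18,  2 users,  load average: 4.12, 4.50, 4.60".
load_1min() {
    awk -F'load average: ' 'NF > 1 { split($2, a, ", "); print a[1] }'
}

# Succeed (exit 0) only if the 1-minute load on stdin is above the limit in $1.
# If no load average could be parsed, this fails rather than guessing.
load_exceeds() {
    load_1min | awk -v limit="$1" '{ over = ($1 > limit) } END { exit !over }'
}
```

Something like `uptime | load_exceeds 1 && echo "busier than expected"` would then flag the situation above.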

A quick glance at top reveals that tailscaled is using 400% CPU. Let's check the logs:

josh@treebeard:~$ journalctl -u tailscaled -f
Jun 17 21:13:22 treebeard tailscaled[71934]: trying bootstrapDNS("derp24c.tailscale.com", "208.83.233.233") for "log.tailscale.com" ...
Jun 17 21:13:23 treebeard tailscaled[71934]: trying bootstrapDNS("derp7d.tailscale.com", "2403:2500:400:20::cfe") for "log.tailscale.com" ...
Jun 17 21:13:23 treebeard tailscaled[71934]: bootstrapDNS("derp7d.tailscale.com", "2403:2500:400:20::cfe") for "log.tailscale.com" error: Get "https://derp7d.tailscale.com/bootstrap-dns?q=log.tailscale.com": dial tcp [2403:2500:400:20::cfe]:443: connect: network is unreachable
Jun 17 21:14:03 treebeard tailscaled[71934]: [RATELIMIT] format("dns: resolver: forward: no upstream resolvers set, returning SERVFAIL") (19 dropped)
Jun 17 21:14:03 treebeard tailscaled[71934]: dns: resolver: forward: no upstream resolvers set, returning SERVFAIL
Jun 17 21:14:03 treebeard tailscaled[71934]: dns: resolution failed due to missing upstream nameservers.  Recompiling DNS configuration.
Jun 17 21:14:03 treebeard tailscaled[71934]: dns: Set: {DefaultResolvers:[] Routes:{} SearchDomains:[] Hosts:9}
Jun 17 21:14:03 treebeard tailscaled[71934]: dns: Resolvercfg: {Routes:{} Hosts:9 LocalDomains:[]}
Jun 17 21:14:03 treebeard tailscaled[71934]: dns: OScfg: {}
Jun 17 21:14:03 treebeard tailscaled[71934]: dns: resolver: forward: no upstream resolvers set, returning SERVFAIL
Jun 17 21:14:03 treebeard tailscaled[71934]: dns: resolver: forward: no upstream resolvers set, returning SERVFAIL
Jun 17 21:14:03 treebeard tailscaled[71934]: dns: resolver: forward: no upstream resolvers set, returning SERVFAIL
Jun 17 21:14:03 treebeard tailscaled[71934]: dns: resolver: forward: no upstream resolvers set, returning SERVFAIL
Jun 17 21:14:03 treebeard tailscaled[71934]: [RATELIMIT] format("dns: resolver: forward: no upstream resolvers set, returning SERVFAIL")

Hmmmmmmmm. And how about dnsmasq itself:

josh@treebeard:~$ journalctl -u dnsmasq -f
Jun 17 15:36:37 treebeard dnsmasq[12222]: failed to send packet: Operation not permitted
Jun 17 15:36:41 treebeard dnsmasq[12222]: Maximum number of concurrent DNS queries reached (max: 150)
# (After restarting dnsmasq)
Jun 17 15:39:36 treebeard dnsmasq[71406]: reading /run/dnsmasq/resolv.conf
Jun 17 15:39:36 treebeard dnsmasq[71406]: ignoring nameserver 192.168.8.250 - local interface
Jun 17 15:39:36 treebeard dnsmasq[71406]: using nameserver 100.100.100.100#53

Hypothesis: something is misconfigured somewhere in my DNS/Tailscale setup. Also, it seems like there are way too many DNS queries in flight at once; it's unclear whether that's related or a separate issue.

Experiment: let's just shut off Tailscale entirely for now and see if we can get dnsmasq back to a healthy state. Tailscale's DNS server lives at 100.100.100.100, so let's remove any references to that we find. (From the tailscaled logs, it looks like it's missing an upstream location to which to forward DNS requests it can't answer by itself—we could probably fix that misconfiguration, but let's first start by simplifying as much as possible.) I saw /run/dnsmasq/resolv.conf in the dnsmasq logs, so let's start there.

josh@treebeard:~$ cat /run/dnsmasq/resolv.conf
nameserver 192.168.8.250
nameserver 100.100.100.100
# Then, after editing /run/dnsmasq/resolv.conf:
josh@treebeard:~$ cat /run/dnsmasq/resolv.conf
nameserver 9.9.9.9

Observation: changes to that file seem to be picked up immediately by dnsmasq:

Jun 17 21:21:39 treebeard dnsmasq[72630]: reading /run/dnsmasq/resolv.conf
Jun 17 21:21:39 treebeard dnsmasq[72630]: using nameserver 9.9.9.9#53

Observation: browsing the web seems to work as expected again! Let's validate that with an explicit DNS query from my desktop:

[josh@galadriel ~]$ dig +short +identify cbc.ca
96.7.25.105 from server 192.168.8.250 in 63 ms.

Knowledge: 192.168.8.250 is treebeard's private IP address, so everything looks good here.
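If you'd rather not eyeball that output, here's a small sketch that extracts just the answering server from `dig +short +identify` lines. The helper name answering_servers is made up for illustration, and it assumes the "... from server ADDRESS in N ms." shape shown above.

```shell
# Hypothetical helper: read `dig +short +identify` output on stdin and print
# the address of the server that sent each answer, one per line.
answering_servers() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "server") print $(i + 1) }'
}
```

Piping a real query through it, e.g. `dig +short +identify cbc.ca | answering_servers | sort -u`, should print a single address if every answer came from the same place.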

Experiment: if I bring tailscaled back up, but with MagicDNS disabled, does everything still work?

josh@treebeard:~$ sudo systemctl start tailscaled
josh@treebeard:~$ sudo tailscale set --accept-dns=false
josh@treebeard:~$ uptime
 21:30:54 up 12:12,  2 users,  load average: 0.25, 0.19, 0.33

Eh, I still see the same weird DNS-related logs from tailscaled, but dnsmasq isn't on fire anymore and CPU usage seems normal. Plus my usual "does my tailnet work outside my home?" test is passing: on my phone, toggle WiFi off, data on, Tailscale on, then try to connect to one of my self-hosted services. It works!

Whose DNS config is it anyway?

Whew, this is a good place for a breather: my home WiFi is functioning as expected, and treebeard is looking healthy now. But this whole ordeal has exposed that my DNS setup on treebeard has been working more by chance than anything else—it'd be nice to build a deeper understanding of how all this is configured.

How does one gain deeper insight into Unix networking configuration? In 2025, the answer is obvious: ask ChatGPT, then corroborate with Arch Wiki, StackOverflow, and related Wikipedia pages. This section summarizes my learnings therefrom.

The big picture

Say some program on your (Unix) computer wants to do a DNS lookup. What happens next? The process will likely look something like this:

  1. The program calls a glibc function like getaddrinfo.
  2. getaddrinfo will call into glibc's internal Name Service Switch (NSS) machinery, which reads /etc/nsswitch.conf to figure out how it should resolve the host name to an IP address.
  3. If dns is listed as an option on the relevant /etc/nsswitch.conf line, the NSS resolver will read /etc/resolv.conf to find which DNS servers to query.

Of course, a program doesn't have to follow those steps—you could write a program that accepts a domain name from a user then explicitly sends a DNS query to 9.9.9.9 to resolve that name. But doing so is (usually) a Bad Idea™ that will anger your local sysadmin: if a user has some custom DNS configuration (presumably for a good reason!), your program will completely ignore it.
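From a shell, `getent ahosts some-hostname` is a handy way to exercise the NSS path described above (as far as I know it goes through getaddrinfo), whereas dig builds its own DNS queries and never consults /etc/nsswitch.conf. For step 3 specifically, here's a minimal sketch that lists which servers a resolv.conf-style file names. The helper name resolv_nameservers is hypothetical.

```shell
# Hypothetical helper: list the DNS servers named in a resolv.conf-style file
# (defaults to /etc/resolv.conf), in the order they appear.
# Comment lines and other directives such as "search" are ignored.
resolv_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```

Running `resolv_nameservers` with no argument shows what the dns source would query on the current machine.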

Let's walk through those steps on treebeard to ensure we can follow exactly what's going on.

/etc/nsswitch.conf

As we just learned, /etc/nsswitch.conf is the main entry point, so let's make sure it looks okay first:

josh@treebeard:~$ cat /etc/nsswitch.conf
# /etc/nsswitch.conf
#
# Example configuration of GNU Name Service Switch functionality.
# If you have the `glibc-doc-reference' and `info' packages installed, try:
# `info libc "Name Service Switch"' for information about this file.

passwd:         files systemd
group:          files systemd
shadow:         files
gshadow:        files

hosts:          files mdns4_minimal [NOTFOUND=return] dns
networks:       files

protocols:      db files
services:       db files
ethers:         db files
rpc:            db files

netgroup:       nis

Yep, /etc/nsswitch.conf looks fine to me. The hosts line is the one we care about:

hosts:          files mdns4_minimal [NOTFOUND=return] dns

That config line specifies that there are three possible "sources" for the hosts database; i.e., when trying to resolve a domain name, use these sources in order. In more detail those are:

  1. files: check for a match in /etc/hosts. (No network requests required!)
  2. mdns4_minimal: attempt to resolve the host via multicast DNS (herein mDNS).
    • [NOTFOUND=return]: since the .local domain is only intended for use over mDNS, if mDNS runs successfully but responds "that host doesn't exist," don't continue. Again, .local is only intended for use in personal networks; hitting an external DNS server to resolve such a host basically never makes sense. This directive ensures we abort before attempting to do so.
    • Somewhat related: mDNS was the subject of my first Sysadmin Sunday.
  3. dns: attempt "traditional" DNS resolution (i.e. actually send good ol' DNS requests on port 53), reading /etc/resolv.conf to discover the DNS servers to query.

Or if you prefer that in pictures:

[Figure: how DNS resolution (via NSS) works on treebeard]
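For the script-inclined, here's a rough sketch that prints the lookup order from the hosts line of an nsswitch.conf-style file. The helper name hosts_sources is mine, and it deliberately skips bracketed actions like [NOTFOUND=return] rather than trying to interpret them.

```shell
# Hypothetical helper: print the sources on the "hosts:" line of an
# nsswitch.conf-style file (defaults to /etc/nsswitch.conf), one per line,
# in lookup order. Bracketed actions such as [NOTFOUND=return] are skipped.
hosts_sources() {
    awk '$1 == "hosts:" {
        for (i = 2; i <= NF; i++)
            if ($i !~ /^\[/) print $i
    }' "${1:-/etc/nsswitch.conf}"
}
```

On treebeard's config this should list files, then mdns4_minimal, then dns.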

To give an example of when each source might be used, let's consider looking up the following hosts on treebeard:

The mystery of /etc/resolv.conf

So far, so good: with our newfound understanding of /etc/nsswitch.conf, we know we need to update /etc/resolv.conf to point right back to localhost, so that dnsmasq answers any of treebeard's own DNS queries. Let's see what we're dealing with there:

josh@treebeard:~$ cat /etc/resolv.conf
# resolv.conf(5) file generated by tailscale
# For more info, see https://tailscale.com/s/resolvconf-overwrite
# DO NOT EDIT THIS FILE BY HAND -- CHANGES WILL BE OVERWRITTEN

nameserver 100.100.100.100
search tailaf9b3.ts.net tailaf9b3.ts.net

If you've ever cracked open /etc/resolv.conf, or other config files, you've likely seen a similar "DON'T EDIT THIS YOUR CHANGES WILL DISAPPEAR" admonishment. This awkwardness is inherent when configuration data lives in a single file: in a world where a single superuser is editing that file by hand, there's no issue, but when multiple services (including the user) want to manage it, how can they collaborate? There's no foolproof algorithm to apply certain changes to an arbitrary config and make everyone happy. So in practice, one service tends to commandeer the config (like we're seeing here), and adds a scary warning to give the user a heads-up.

So who exactly is responsible for /etc/resolv.conf right now? Seems like it's Tailscale, given the comment, but running some other ChatGPT-supplied diagnostics shows a couple other possibilities:

josh@treebeard:~$ sudo which resolvconf
/usr/sbin/resolvconf
# So maybe resolvconf is in charge?
josh@treebeard:~$ systemctl is-active systemd-resolved
active
# ...but systemd also claims to be doing things?
josh@treebeard:~$ file /etc/resolv.conf
/etc/resolv.conf: ASCII text
josh@treebeard:~$ readlink -f /etc/resolv.conf
# ...but also the config isn't a symlink??
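The symlink question matters because, on many setups, whichever service manages /etc/resolv.conf does so by pointing a symlink at its own generated file. Here's a minimal sketch of that check as a reusable helper; the name describe_file is hypothetical, and it only reports what the path is, not who wrote it.

```shell
# Hypothetical helper: report whether a path is a symlink (and where it
# points) or a regular file that something wrote in place.
describe_file() {
    if [ -L "$1" ]; then
        echo "symlink -> $(readlink "$1")"
    elif [ -f "$1" ]; then
        echo "regular file"
    else
        echo "missing"
    fi
}
```

`describe_file /etc/resolv.conf` printing "regular file" matches what the transcript above found on treebeard.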

To be honest, I'm still not fully sure what happened here; I can only imagine it was a gruesome and gory battle between daemons, the detritus of which is all that remains for us unlucky viewers.

…but given that /etc/resolv.conf is not currently a symlink, I don't think anything else is going to try to monkey with it? So let's take full control ourselves and see if we get clobbered:

# (After editing in vim)
josh@treebeard:~$ cat /etc/resolv.conf
# Note to self: I want full manual control over this config (as opposed to having it managed by systemd/etc.).
# We're running dnsmasq on this server, so all DNS queries should go there (i.e. localhost).
# We'll configure dnsmasq separately to specify its upstream DNS servers.
nameserver 127.0.0.1

Remember, right now we're telling treebeard which IP address to use for DNS resolution—since dnsmasq is running on treebeard itself, we want localhost to be the nameserver.

But of course, dnsmasq on treebeard won't be able to answer most DNS queries by itself (i.e. it won't know the IP address for cbc.ca without consulting an external source); it will need to forward those to an external DNS server. We specify those nameservers with the server directive in /etc/dnsmasq.conf. Like before, let's use Quad9:

# (After editing in vim)
josh@treebeard:~$ grep ^server /etc/dnsmasq.conf
server=9.9.9.9

A /runner in the night

Earlier, we saw that treebeard was getting its upstream DNS servers from /run/dnsmasq/resolv.conf. By hand-editing that file to contain only the single nameserver I want, we were able to get things working. But that's not a long-term solution, because something keeps recreating /run/dnsmasq/resolv.conf every time I restart dnsmasq, which interferes with our lovingly crafted /etc/dnsmasq.conf:

josh@treebeard:~$ sudo systemctl restart dnsmasq
josh@treebeard:~$ ls -l /var/run/dnsmasq/resolv.conf
-rw-r--r-- 1 root root 52 Jun 19 12:42 /var/run/dnsmasq/resolv.conf
josh@treebeard:~$ sudo rm /var/run/dnsmasq/resolv.conf
josh@treebeard:~$ ls -l /var/run/dnsmasq/resolv.conf
ls: cannot access '/var/run/dnsmasq/resolv.conf': No such file or directory
josh@treebeard:~$ sudo systemctl restart dnsmasq
josh@treebeard:~$ ls -l /var/run/dnsmasq/resolv.conf
-rw-r--r-- 1 root root 52 Jun 19 12:42 /var/run/dnsmasq/resolv.conf

This isn't a practical problem, because adding no-resolv to /etc/dnsmasq.conf prevents dnsmasq from reading /var/run/dnsmasq/resolv.conf, but I want to get to the bottom of this—what's creating that file?
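One caveat with no-resolv: once dnsmasq stops reading any resolv file, its upstreams have to come from server directives (or the command line), so it's worth confirming the two settings travel together. Here's a minimal sketch of that sanity check. The helper name check_upstreams is made up, and it only looks at a single file, so directives pulled in from elsewhere (e.g. a conf-dir) would be missed.

```shell
# Hypothetical sanity check for a dnsmasq.conf-style file (defaults to
# /etc/dnsmasq.conf): if no-resolv is set, at least one server= line should
# supply an upstream. Only this one file is inspected.
check_upstreams() {
    conf="${1:-/etc/dnsmasq.conf}"
    if grep -q '^no-resolv' "$conf"; then
        if grep -q '^server=' "$conf"; then
            echo "ok: no-resolv with explicit upstreams"
        else
            echo "warning: no-resolv set but no server= lines"
            return 1
        fi
    else
        echo "note: upstreams may also come from a resolv file"
    fi
}
```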

I know I can use lsof to show files being held open by processes, but my guess is that whatever creates this exits immediately, so lsof won't spot it. (And if the main dnsmasq process is holding it open after reading it, that's not much help either.) How can we set up some kind of "monitor" to catch the file creation? ChatGPT suggests either auditd or inotifywait, but I don't have either of them installed. Before trying those, let's look at how dnsmasq is specifically being invoked on treebeard by peeking at the systemd unit file:

josh@treebeard:~$ systemctl cat dnsmasq
# /lib/systemd/system/dnsmasq.service
[Unit]
Description=dnsmasq - A lightweight DHCP and caching DNS server
Requires=network.target
Wants=nss-lookup.target
Before=nss-lookup.target
After=network.target

[Service]
Type=forking
PIDFile=/run/dnsmasq/dnsmasq.pid

# Test the config file and refuse starting if it is not valid.
ExecStartPre=/etc/init.d/dnsmasq checkconfig

# We run dnsmasq via the /etc/init.d/dnsmasq script which acts as a
# wrapper picking up extra configuration files and then execs dnsmasq
# itself, when called with the "systemd-exec" function.
ExecStart=/etc/init.d/dnsmasq systemd-exec

# The systemd-*-resolvconf functions configure (and deconfigure)
# resolvconf to work with the dnsmasq DNS server. They're called like
# this to get correct error handling (ie don't start-resolvconf if the
# dnsmasq daemon fails to start).
ExecStartPost=/etc/init.d/dnsmasq systemd-start-resolvconf
ExecStop=/etc/init.d/dnsmasq systemd-stop-resolvconf


ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

Hmm, these lines seem pretty suspicious:

# The systemd-*-resolvconf functions configure (and deconfigure)
# resolvconf to work with the dnsmasq DNS server. They're called like
# this to get correct error handling (ie don't start-resolvconf if the
# dnsmasq daemon fails to start).
ExecStartPost=/etc/init.d/dnsmasq systemd-start-resolvconf
ExecStop=/etc/init.d/dnsmasq systemd-stop-resolvconf

Snooping around in /etc/init.d/dnsmasq gives us all the details:

josh@treebeard:~$ grep -m 1 -A 18 "RESOLV_CONF" /etc/init.d/dnsmasq
# RESOLV_CONF:
# If the resolvconf package is installed then use the resolv conf file
# that it provides as the default.  Otherwise use /etc/resolv.conf as
# the default.
#
# If IGNORE_RESOLVCONF is set in /etc/default/dnsmasq or an explicit
# filename is set there then this inhibits the use of the resolvconf-provided
# information.
#
# Note that if the resolvconf package is installed it is not possible to
# override it just by configuration in /etc/dnsmasq.conf, it is necessary
# to set IGNORE_RESOLVCONF=yes in /etc/default/dnsmasq.

if [ ! "${RESOLV_CONF}" ] &&
   [ "${IGNORE_RESOLVCONF}" != "yes" ] &&
   [ -x /sbin/resolvconf ]
then
    RESOLV_CONF=/run/dnsmasq/resolv.conf
fi
josh@treebeard:~$ grep -A 18 "start_resolvconf()" /etc/init.d/dnsmasq
start_resolvconf()
{
# If interface "lo" is explicitly disabled in /etc/default/dnsmasq
# Then dnsmasq won't be providing local DNS, so don't add it to
# the resolvconf server set.
    for interface in ${DNSMASQ_EXCEPT}; do
        [ ${interface} = lo ] && return
    done

    # Also skip this if DNS functionality is disabled in /etc/dnsmasq.conf
    if grep -qs '^port=0' /etc/dnsmasq.conf; then
        return
    fi

    if [ -x /sbin/resolvconf ] ; then
        echo "nameserver 127.0.0.1" | /sbin/resolvconf -a lo.${NAME}${INSTANCE:+.${INSTANCE}}
    fi
    return 0
}

So it looks like dnsmasq sets RESOLV_CONF=/run/dnsmasq/resolv.conf, then calls resolvconf (if available) to fill out that config file. That makes some sense to me: dnsmasq recognizes that resolvconf should be the "canonical" source for this information, and so dnsmasq intentionally defers to resolvconf.

But where does resolvconf get its configs?? ChatGPT points me to /run/resolvconf/interface, which indeed seems to be the place:

josh@treebeard:~$ for FILE in /run/resolvconf/interface/*; do echo "$FILE"; cat "$FILE"; done
/run/resolvconf/interface/eth0.dhclient
domain lan
nameserver 192.168.8.250
/run/resolvconf/interface/lo.dnsmasq
nameserver 127.0.0.1
/run/resolvconf/interface/systemd-resolved
nameserver 100.100.100.100
search tailaf9b3.ts.net

We could go even deeper here (how were those configs created?), but my curiosity is satiated for today. Let's disable those ExecStartPost and ExecStop directives we saw in the unit file earlier and lay /run/dnsmasq/resolv.conf to rest once and for all. Manually editing /lib/systemd/system/dnsmasq.service seems unhygienic to me; ChatGPT redirects me to systemctl edit dnsmasq, which instead creates an override file—neat!

josh@treebeard:~$ systemctl cat dnsmasq | grep -E 'Exec(StartPost|Stop)'
ExecStartPost=/etc/init.d/dnsmasq systemd-start-resolvconf
ExecStop=/etc/init.d/dnsmasq systemd-stop-resolvconf
josh@treebeard:~$ sudo systemctl edit dnsmasq.service
# Add the following override in the editor:
# [Service]
# ExecStartPost=
# ExecStop=
josh@treebeard:~$ sudo systemctl daemon-reexec
josh@treebeard:~$ sudo systemctl daemon-reload
josh@treebeard:~$ sudo systemctl restart dnsmasq
# Unsure if I needed _all_ of those; copy-pasting from the AI...
josh@treebeard:~$ systemctl cat dnsmasq | grep -E 'Exec(StartPost|Stop)'
ExecStartPost=/etc/init.d/dnsmasq systemd-start-resolvconf
ExecStop=/etc/init.d/dnsmasq systemd-stop-resolvconf
ExecStartPost=
ExecStop=
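That grep output looks alarming at first, since the original lines are still there. My understanding is that for list-type settings like these, systemd treats an empty assignment as "reset the list", so the later empty lines from the override win. Here's a rough sketch that applies that rule to `systemctl cat`-style text; the helper name effective_directive is hypothetical, and it's a simplification of what systemd actually does (asking systemd directly with `systemctl show -p ExecStartPost dnsmasq` is the more authoritative check).

```shell
# Hypothetical helper: given unit-file text on stdin and a directive name in
# $1, print the commands still in effect after applying "empty assignment
# resets the list" in file order. A simplification, not systemd's parser.
effective_directive() {
    awk -F= -v key="$1" '
        $1 == key {
            if ($2 == "") n = 0                             # reset the list
            else vals[++n] = substr($0, length(key) + 2)    # text after "="
        }
        END { for (i = 1; i <= n; i++) print vals[i] }
    '
}
```

With that, `systemctl cat dnsmasq | effective_directive ExecStartPost` printing nothing would mean the override took.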

And now for the moment of truth:

josh@treebeard:~$ ls -l /run/dnsmasq/resolv.conf
-rw-r--r-- 1 root root 52 Jun 19 12:42 /run/dnsmasq/resolv.conf
josh@treebeard:~$ sudo rm /run/dnsmasq/resolv.conf
josh@treebeard:~$ ls -l /run/dnsmasq/resolv.conf
ls: cannot access '/run/dnsmasq/resolv.conf': No such file or directory
josh@treebeard:~$ sudo systemctl restart dnsmasq
josh@treebeard:~$ ls -l /run/dnsmasq/resolv.conf
ls: cannot access '/run/dnsmasq/resolv.conf': No such file or directory
# Success! File was not created after restarting the service.

🥳

Summary

Wow, that took longer than expected. But the end result is exactly what we wanted: everything on my local network uses treebeard to resolve DNS queries (including treebeard itself). dnsmasq on treebeard is configured to resolve most *.simpsonian.ca queries by itself, and anything it can't resolve gets routed upstream to 9.9.9.9. Tailscale's MagicDNS is disabled on treebeard—with everything we've learned, I'm sure I could get it to play nicely with the rest of our configurations, but I wasn't using MagicDNS anyways, so let's leave it off.

Root causes

So, what was the actual inciting incident that led to all this gnashing of the teeth? Well, I live in a fairly old apartment building (without central AC), and I was hosting some visitors from out of town. Accordingly, I had a portable AC unit running and started to take care of the vacuuming. But as it turns out, a 15-amp circuit can't support a refrigerator, portable AC, vacuum, and all my home electronics. Not for long, anyways.

Specifically, here's my best guess at the causal pathway:

  1. I update Tailscale over the course of months without actually ever restarting it.
  2. One day, I run the vacuum cleaner on a busy circuit, causing a fuse to blow.
  3. All my home electronics violently lose power.
  4. As a side effect, treebeard is restarted, and there's something wrong with the resulting mess of DNS-related config files post-reboot.
  5. Oh also, I think I might've partially fried my old router; that started showing double-digit packet loss on the internal network too. But that's a story for another day…

So like I said: my vacuum cleaner killed my WiFi. But at least now I know how to fix it.

Until next time, may all your DNS queries resolve successfully.

Addendum

For the brave souls that made it through that great slog, allow me to reward you with a story this whole saga reminded me of. Many moons ago, when I was still fresh-faced and full of zeal, I did an internship at Facebook. When something went bigly wrong at Facebook, the policy was to write up a postmortem detailing exactly what 'sploded, and how it came to pass. Well, one day some trees fell and severed some cables at a data centre, leading to a partial outage and this all-time great example of dry wit:

SEV-551 Postmortem

Description: a number of trees fell, severing all links to a data centre, causing 24% of all traffic to be dropped for 46 minutes.

Root cause: yes.


[1] Technically, it's the exponentially weighted moving average of that count, evaluated over the past 1, 5, and 15 minutes.