DNS Problems in IT Infrastructure: Why Alternatives Fail

Lou over at louwrentius.com recently published a thoughtful piece arguing that DNS, while useful for public-facing services, may be a liability inside an IT infrastructure, and that it might be worth exploring whether we can avoid it altogether. His proposed alternative, with help from Ansible, is to push direct IP configuration into application configs, with /etc/hosts available as a fallback for the cases where humans still need a name.

Oh my sweet summer child.

I want to be careful with the rest of this piece, because the underlying frustration is real. Anyone who has worked in operations long enough has internalized the haiku:

It's not DNS
There's no way it's DNS
It was DNS

DNS outages have, more than once, taken down companies the size of Meta. I have personally spent more hours than I care to count chasing stale cache entries, TTL misbehaviour, split-horizon weirdness, and resolver bugs across heterogeneous environments. The impulse to look at the blast radius of a borked DNS service and wonder if we could just not have it is not a crazy impulse. It is the same impulse that makes you eye your appendix on the way out of an emergency room and wonder if you really need it.

The rest of this post is about why, every time someone has tried to actually live without DNS in a non-trivial infrastructure, they have ended up reinventing DNS, badly. Usually with Ansible.

A few years ago I helped someone do exactly this

A few years ago I helped a customer move a large application portfolio from on-prem VMware to Azure. The migration itself was the easy part. Azure Migrate did a fine job of cloning the VMs into the new environment. The hard part was the same hard part it always is, which is that years of accumulated tribal knowledge had left the vast majority of these applications talking to each other by hard-coded IP address.

We could not avoid re-IPing the machines on the Azure side, because the new environment had to co-exist with the on-prem VMware deployment for the duration of the migration. So every cut-over became a multi-day game of whack-a-mole, tracking down every place in every configuration file, batch script, scheduled job, application property file, and stored procedure where someone, at some point in the last fifteen years, had hard-coded the IP of a host that no longer lived at that address.
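For a sense of what the whack-a-mole looks like when you try to automate it, here is a minimal sketch of a scanner for IPv4 literals in text files. It is not the tooling we used on that migration; the function names and the sample input are hypothetical, and it only sees plain-text files on disk. It will not find addresses inside stored procedures, binary workbooks, or anything assembled at runtime, which is exactly why the hunt took days.

```python
import ipaddress
import re
from pathlib import Path

# Candidate dotted quads that are not part of a longer dotted number
# (so a version string like 1.2.3.4.5 is skipped).
_CANDIDATE = re.compile(r"(?<![\d.])(\d{1,3}(?:\.\d{1,3}){3})(?![\d.])")


def find_ipv4_literals(text):
    """Return (line_number, address) pairs for each valid IPv4 literal in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in _CANDIDATE.finditer(line):
            candidate = match.group(1)
            try:
                ipaddress.IPv4Address(candidate)
            except ValueError:
                continue  # not a real address, e.g. 999.1.1.1
            hits.append((line_no, candidate))
    return hits


def scan_tree(root):
    """Yield (path, line_number, address) for every readable file under root."""
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue  # unreadable file; skip rather than abort the scan
        for line_no, address in find_ipv4_literals(text):
            yield path, line_no, address
```

Calling `find_ipv4_literals` on a config snippet returns the line numbers and addresses to chase down. Every hit is still a manual decision about whether that address is one of the hosts being moved.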

You know what would have made that not a problem? The thing this article suggests we get rid of.

RFC 882 was written specifically about this

For those not familiar with the historical context, DNS exists because HOSTS.TXT, the centrally distributed file that mapped names to IP addresses on the ARPANET, stopped scaling. The first three paragraphs of RFC 882, the original DNS specification from 1983, explain this directly. The mechanism for updating that file and publishing it to all the machines that depended on it was not keeping up with the growth of the network. So Paul Mockapetris wrote DNS, and the world got an autonomous, distributed, cacheable, pull-based naming system that could scale.

Lou's post acknowledges the maintenance burden of pushing /etc/hosts everywhere but suggests Ansible makes it tractable. I have to be blunt. Suggesting that a push-based, agentless configuration management tool will scale to hundreds of thousands of targets, with a fleet-wide push every time an IP changes anywhere, hundreds or thousands of times a day, is a junior-level idea at best. If I am being charitable, it is dark comedy. If I am not, it is professional malpractice. We have a name for the architecture being proposed. We invented DNS to replace it.

Or, to borrow a line from Rick and Morty: this is just DNS with extra steps.

Push versus pull is not a stylistic choice

The deeper architectural error in the proposal is the direction of data flow. DNS is pull-based. A client needs an IP, the client asks a resolver, the resolver answers. Caching, TTLs, and zone hierarchies make this work at internet scale with stunning efficiency.
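A minimal sketch of the pull model, assuming a single client-side cache with one fixed TTL. Real DNS takes the TTL from each record and caches at several layers; the `CachingResolver` class and its parameters are hypothetical names for illustration, not any library's API. The `lookup` callable stands in for whatever actually answers the question, here the operating system's resolver via `socket.gethostbyname`.

```python
import socket
import time


class CachingResolver:
    """Resolve names on demand and remember each answer until its TTL expires."""

    def __init__(self, lookup, ttl_seconds=30.0, clock=time.monotonic):
        self._lookup = lookup      # callable: name -> address
        self._ttl = ttl_seconds    # simplification: one TTL for every name
        self._clock = clock        # injectable so expiry can be tested
        self._cache = {}           # name -> (address, expires_at)

    def resolve(self, name):
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and cached[1] > now:
            return cached[0]       # still fresh: no upstream query
        address = self._lookup(name)  # expired or unknown: pull current truth
        self._cache[name] = (address, now + self._ttl)
        return address


def system_lookup(name):
    """Ask the operating system's resolver for one IPv4 address."""
    return socket.gethostbyname(name)
```

With `CachingResolver(system_lookup)`, a changed address reaches the client the next time the cached entry expires, with no action from whoever changed it. The point is the direction: nothing is sent to the client until it asks.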

Pushing the IP map out to every machine that might ever need it is the opposite of that. Every IP change now has to fan out to every potentially affected endpoint, every push has to succeed, and every failed push has to be detected and remediated before some unsuspecting service tries to connect to the stale address. You have not eliminated state synchronization, you have just moved it into a layer that does not have forty years of engineering behind handling it well. Ansible is a fantastic tool for what it is for. Real-time IP synchronization across a fleet is not what it is for.
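The "every push has to succeed" requirement is worth putting numbers on. This is back-of-the-envelope arithmetic, not a measurement: it assumes each host's push succeeds independently with the same probability, and the 99.9% figure is an illustrative assumption, not a statistic about Ansible.

```python
def probability_all_pushes_succeed(hosts, per_host_success):
    """Chance that every one of `hosts` independent pushes succeeds."""
    if hosts < 0 or not 0.0 <= per_host_success <= 1.0:
        raise ValueError("hosts must be >= 0 and per_host_success within [0, 1]")
    return per_host_success ** hosts


def expected_failed_pushes(hosts, per_host_success):
    """Average number of hosts left holding a stale map after one change."""
    return hosts * (1.0 - per_host_success)


if __name__ == "__main__":
    # Illustrative fleet sizes; 0.999 is an assumed per-host success rate.
    for fleet_size in (100, 1_000, 10_000):
        clean = probability_all_pushes_succeed(fleet_size, 0.999)
        stale = expected_failed_pushes(fleet_size, 0.999)
        print(f"{fleet_size:>6} hosts: P(clean push) = {clean:.3f}, "
              f"expected stale hosts = {stale:.1f}")
```

Under those assumptions, a 1,000-host fleet gets a fully clean push only about 37% of the time, and on average one host is left with a stale address after every single change. Multiply that by the number of IP changes per day and the remediation loop becomes the real system.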

And then there was the SQL macro

A while back I stood up a complete production NGINX deployment, configured as a TCP-mode reverse proxy, for a single Microsoft SQL Server instance. The reason was that the customer had moved the SQL Server from one cloud to another, the new instance had a new IP address, and somewhere inside the bowels of an Excel spreadsheet that the finance team relied on every month, a VBA macro had the old IP address hard-coded into it.

Nobody on the team was able or willing to update the macro. The original author was long gone. The file had a "there be dragons" reputation that nobody was eager to test, least of all on a workbook the finance team depended on for monthly close. So the only path forward was to keep the old IP alive in perpetuity, in the form of a small VM running NGINX that listened on the old address and forwarded the SQL traffic to the new one.

That is what hard-coded IPs cost you in practice. Not "ugh, we should refactor that someday." Permanent infrastructure that exists for the rest of time solely because nobody is allowed to change a string. Enterprises are weird like that, and DNS is one of the few tools that lets you route around the weirdness without standing up a permanent micro-VM as a memorial to a missing developer.

The cloud has already settled this

The cloud has, in the last decade, made the static-IP assumption that the proposal is built on functionally obsolete. Modern infrastructure is not just allowed to have ephemeral IPs, it requires them. Container instances get new addresses on restart. Kubernetes pods get new addresses on every reschedule. Serverless functions do not have an IP in any meaningful sense. Autoscaling groups churn instances all day. Load balancers move traffic between back-ends without telling anyone. Disaster recovery failovers bring whole workloads up in different subnets, sometimes in different regions.

The only thing that holds this together is a naming layer that can pull the current truth on demand. That naming layer is DNS. I once ran an Azure Container Apps workload where the underlying IP changed every time the container restarted. We had to lash it into Azure Private DNS specifically because the alternative was a small outage on every cold start. Pushing /etc/hosts updates at that pace, into a fleet of moving targets, with Ansible, while keeping the latency reasonable, is not a thing a sane person attempts.

What about the security points?

To Lou's credit, the security section makes real points. DNS traffic has historically been unencrypted. DNS has been used as a covert exfiltration channel. DNSSEC is operationally complex. None of these are invented concerns.

But the answer to "DNS can be a covert channel" is not "remove DNS." It is egress filtering, which the article itself recommends in the same breath. ICMP is a covert channel. NTP is a covert channel. HTTPS is a covert channel. Anything that can carry bytes across the perimeter is a covert channel, which is to say, everything. We do not solve the exfiltration problem by removing the protocols, we solve it by controlling the perimeter and the egress paths. The author arrives at exactly this conclusion themselves with the proxy-based egress allowlist suggestion. Good. That works for DNS too.

As for the encryption point, DNS over TLS (DoT) and DNS over HTTPS (DoH) have been broadly available for years. Most modern stacks support encrypted resolution. DNSSEC is indeed complex to run well, but the complexity of running it is dramatically smaller than the complexity of running a fleet-wide push-based naming system with sub-second consistency.

I get the temptation. The answer is still DNS.

I want to come back to where I started. The frustration that drives a post like Lou's is a real frustration. DNS, when it breaks, breaks loudly and breaks broadly. The blast radius is enormous. The dependencies are circular often enough that recovering from a serious DNS incident is its own special category of nightmare. The October 2021 Facebook (now Meta) outage was a textbook example: a backbone configuration change led the company's authoritative DNS servers to withdraw their BGP routes, and with DNS unreachable the entire company went offline, including the badge readers on the front doors. The instinct to look at that and say "we should depend on DNS less" is correct in spirit. It is the implementation that has to be argued with.

The answer to "DNS, when misconfigured, has huge blast radius" is better DNS practice. Redundant resolvers in independent failure domains. Sane TTLs. Out-of-band recovery paths that do not themselves depend on DNS to bring DNS back up. Monitoring. Postmortems that actually change behaviour. The discipline that any high-blast-radius dependency requires.
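As one small illustration of the first item, here is a sketch of failing over across resolvers. It assumes each resolver is represented by a lookup callable; the labels, the function name, and the exception class are hypothetical. In practice the operating system's stub resolver or a local caching resolver does this for you, so treat it as a picture of the behaviour rather than something to deploy.

```python
class AllResolversFailed(Exception):
    """Raised when no configured resolver produced an answer."""


def resolve_with_failover(name, resolvers):
    """Try each (label, lookup) pair in order.

    Returns (label, address) from the first resolver that answers.
    A lookup signals failure by raising OSError, which covers
    socket.gaierror and timeouts.
    """
    errors = []
    for label, lookup in resolvers:
        try:
            return label, lookup(name)
        except OSError as exc:
            errors.append(f"{label}: {exc}")  # remember why, try the next one
    raise AllResolversFailed(
        f"no resolver answered for {name!r}: " + "; ".join(errors)
    )
```

The design choice that matters is outside the code: the resolvers in the list only help if they sit in failure domains that do not go down together.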

The answer is not to revert to HOSTS.TXT and rebrand the result with Ansible. The answer is not to assume that ephemeral cloud workloads can be wired together by pushing static maps around. This is what happens when you take the "it was DNS" meme a little too seriously. The meme is a joke about operational pain. It is not an architectural recommendation.

So yes. It was DNS. It is always DNS. It will always be DNS. And that is fine. That is what DNS is for.

It's Just DNS With Extra Steps