JonLuca's Blog

Snowmen, Recruiters, and Terry Pratchett: The Web's HTTP Header Junk Drawer

I crawled the top 1,000 sites and read their response headers. Browser fossils, CDN breadcrumbs, leaked codenames, a snowman, and a tribute to Terry Pratchett.

HTML is what a site shows you. JavaScript is what it does. Headers are what it can’t help telling you.

They leak the habits of the machinery underneath: CDNs, frameworks, caches, security controls, dead browser workarounds, migration scars, and bits of infrastructure that were meant to be temporary and never left. I thought it would be interesting to see what unique or non-standard headers the most popular sites on the internet were serving, so I crawled the top 1,000 domains by traffic and saved their response headers.

For each domain I hit the root page, then one same-origin internal page when I could find one safely. I used a browser-like HTTP client and retried likely challenge pages once with Playwright. If a response still looked like a WAF, a bot wall, or an access-denied page, I logged it and dropped it from the stats so it wouldn’t skew the header counts.
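Here is a minimal sketch of the two decisions in that flow: choosing a same-origin internal page and flagging a likely challenge response. It is not the crawler's actual code. The names `pick_internal_page` and `looks_like_challenge`, the status codes, and the marker strings are all assumptions made for illustration, and a real "safe" link filter would need more rules than this.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class _LinkCollector(HTMLParser):
    """Collects href values from <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def pick_internal_page(root_url, html):
    """Return the first same-origin link that is not the root page, or None."""
    root = urlsplit(root_url)
    collector = _LinkCollector()
    collector.feed(html)
    for href in collector.hrefs:
        candidate = urlsplit(urljoin(root_url, href))
        same_origin = (candidate.scheme, candidate.netloc) == (root.scheme, root.netloc)
        if same_origin and candidate.path not in ("", "/"):
            # Drop the fragment: it never reaches the server.
            return candidate._replace(fragment="").geturl()
    return None


# Hypothetical heuristics, chosen only for this example.
CHALLENGE_STATUSES = {403, 429, 503}
CHALLENGE_MARKERS = ("captcha", "access denied", "just a moment")


def looks_like_challenge(status, body):
    """Flag responses that look like a bot wall rather than real content."""
    text = body.lower()
    return status in CHALLENGE_STATUSES or any(m in text for m in CHALLENGE_MARKERS)
```

Comparing scheme and host together is what makes the check same-origin and not merely same-domain: a link to a sibling subdomain is skipped.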

The dataset

Of the 1,000 domains I attempted, 417 returned clean pages I could analyze; the rest were WAFs, bot walls, or challenge pages I dropped. Those clean sites sent 677 distinct header names, and 619 of them aren’t in the IANA HTTP Field Name Registry.

All “not in IANA” means is that the name isn’t registered. The web runs on standards, but it also runs on convention, vendor prefixes, CDN metadata, and whatever someone shipped five years ago that still works. Learn only the standardized headers and you’ll have the formal grammar of the web while missing the dialect anyone actually speaks.
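The distinct-name count and the "not in IANA" count both come down to set arithmetic over lowercased header names. A rough sketch, assuming each response is a dict of headers; `REGISTERED` is a tiny hand-picked stand-in for the real registry, and `header_name_stats` is a hypothetical helper name.

```python
# Assumption: a tiny stand-in for the IANA HTTP Field Name Registry.
# A real run would load the full registry instead.
REGISTERED = frozenset({
    "cache-control",
    "content-type",
    "date",
    "server",
    "strict-transport-security",
    "via",
})


def header_name_stats(responses, registered=REGISTERED):
    """Return (distinct name count, sorted list of unregistered names)."""
    # Header names are case-insensitive, so normalize before deduplicating.
    names = {name.lower() for headers in responses for name in headers}
    return len(names), sorted(names - registered)
```

For example, `header_name_stats([{"Server": "nginx", "X-Cache": "HIT"}, {"server": "gws", "X-Olaf": "⛄"}])` returns `(3, ["x-cache", "x-olaf"])`.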

Browser fossils

The crawl turned up a whole fossil bed:

| header | sites | note |
| --- | --- | --- |
| x-xss-protection | 178 | Chrome’s old XSS auditor switch |
| pragma | 84 | deprecated, but still used for cache control |
| p3p | 24 | compact privacy policies for old Internet Explorer cookie behavior |
| x-ua-compatible | 13 | Internet Explorer document-mode hint |
| expect-ct | 6 | certificate transparency enforcement, now deprecated |
| feature-policy | 6 | predecessor to Permissions-Policy |
| content-md5 | 2 | obsoleted integrity header |

P3P is my favorite. It’s a privacy-policy header from the early 2000s, remembered mostly because setting any plausible-looking value could talk old IE into accepting third-party cookies. One value in the crawl is exactly the kind of thing you hope to dig up:

CP="This is not a P3P policy! See g.co/p3phelp for more info."

The header is obsolete. The scar tissue stays.

Security headers are uneven

Among the 417 eligible sites, adoption of common browser security headers ranged widely:

| header | sites | share |
| --- | --- | --- |
| strict-transport-security | 270 | 64.7% |
| x-frame-options | 214 | 51.3% |
| content-security-policy | 194 | 46.5% |
| referrer-policy | 105 | 25.2% |
| permissions-policy | 63 | 15.1% |
| cross-origin-opener-policy | 45 | 10.8% |
| cross-origin-resource-policy | 28 | 6.7% |
| cross-origin-embedder-policy | 2 | 0.5% |
| clear-site-data | 0 | 0.0% |

Plenty of sites have good reasons to skip some of these, so read the table as a map rather than a scorecard. COEP can break a page the moment it embeds a third-party resource that hasn’t opted in to being embedded. Clear-Site-Data is a sharp tool.

The spread still tells a story. HSTS is now normal. CSP is common but not yet universal. The newer cross-origin isolation headers remain rare. Browser security is a stack of migrations, and most migrations never finish.
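Each share in the table above is a count divided by the 417 eligible sites. A small sketch of that arithmetic, assuming a hypothetical `sites` mapping of domain to response headers; `adoption` is an illustrative name.

```python
def adoption(sites, header):
    """Return (count, percent) of sites whose headers include `header`."""
    wanted = header.lower()
    count = sum(
        1 for headers in sites.values()
        if wanted in {name.lower() for name in headers}
    )
    share = round(100 * count / len(sites), 1) if sites else 0.0
    return count, share
```

As a check against the table, `round(100 * 270 / 417, 1)` gives `64.7`, the HSTS share.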

Infrastructure leaks through

Some headers act as status lights on the machines behind the page.

| header | sites | what it reveals |
| --- | --- | --- |
| server | 303 | server or gateway family |
| x-cache | 147 | cache state |
| via | 146 | proxy/CDN path |
| x-served-by | 53 | edge node or cache layer |
| x-powered-by | 47 | framework or runtime |
| x-request-id | 31 | request tracing |
| x-generator | 7 | CMS or static-site generator |

Some Server values are dull: nginx, cloudflare, gws, AkamaiGHost. Others say more. The crawl found Express, Next.js, ASP.NET, and Drupal generator strings scattered around.

On its own, most of this is harmless operational metadata. But an outsider can use it to cluster sites by stack, host, CDN, framework, and sometimes deployment shape. The public web ships a lot of public implementation detail.
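Clustering by stack takes very little code. This sketch assumes a hypothetical `sites` mapping of domain to headers, and groups domains by the first product token of one header. `cluster_by_header` is an illustrative name, and the tokenizing rule is a simplification of how real Server values are structured.

```python
from collections import defaultdict


def cluster_by_header(sites, header="server"):
    """Group domains by the leading product token of one response header."""
    groups = defaultdict(list)
    for domain, headers in sites.items():
        lowered = {name.lower(): value for name, value in headers.items()}
        # Treat "/" as a separator so "nginx/1.25.3 (Ubuntu)" yields "nginx".
        tokens = lowered.get(header.lower(), "").replace("/", " ").split()
        if tokens:
            groups[tokens[0].lower()].append(domain)
    return {product: sorted(domains) for product, domains in groups.items()}
```

Running the same function with `header="x-powered-by"` or `header="x-generator"` clusters by framework or CMS instead of by server family.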

The largest headers were genuinely large

The biggest header block I saw belonged to state.gov, around 15.6 KB. The runners-up were big enough to notice:

| domain | page | approximate header bytes |
| --- | --- | --- |
| state.gov | root | 15,622 |
| state.gov | internal | 15,620 |
| eset.com | internal | 12,819 |
| mixpanel.com | internal | 12,554 |
| cursor.sh | root | 11,702 |

These are approximate, since the crawler redacts sensitive-looking values before analysis. The point holds: headers can grow into a real chunk of the response. The bulk usually comes from reporting endpoints, CSP directives, cookies, or CDN metadata. The body gets blamed for web bloat, but the prelude packs on weight too.
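One plausible way to estimate header size is to add up each field as it would appear on an HTTP/1.1 wire. This is a sketch of that idea, not necessarily how the numbers above were measured. It ignores the status line and says nothing about HTTP/2 or HTTP/3 header compression, and `header_block_bytes` is a hypothetical name.

```python
def header_block_bytes(headers):
    """Approximate HTTP/1.1 wire size of a header block, in bytes.

    `headers` is a list of (name, value) pairs so that repeated fields
    such as Set-Cookie are each counted.
    """
    # Each field is serialized as "name: value" followed by CRLF.
    return sum(len(f"{name}: {value}\r\n".encode("utf-8")) for name, value in headers)
```

By this measure `[("Server", "nginx")]` costs 15 bytes, so a 15 KB header block is roughly a thousand headers of that size, or a handful of very long CSP and cookie values.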

Some headers are just for fun

A few headers in the crawl weren’t metadata at all. Someone wrote them by hand.

| header | value | site |
| --- | --- | --- |
| x-clacks-overhead | GNU Terry Pratchett | mozilla.org, debian.org |
| x-hacker | a recruiting pitch (full text below) | wordpress.com |
| x-recruiting | a recruiting pitch (full text below) | otto.de |
| x-launch-status | Go Flight! | nasa.gov |
| x-olaf | ⛄ | wordpress.org |
| x-ballmer | bff-section | welt.de |
| x-frankenstein-eligible | true | bloomberg.com |
| x-minion | Varnish | washington.edu |
| x-asdf | l-70 | ivi.ru |

The best one is x-clacks-overhead. Mozilla, Firefox, Ubuntu, and Debian all send GNU Terry Pratchett, a tribute to the author that started as a fan project and never stopped. It does nothing. That’s the point.

Automattic uses headers to recruit. WordPress.com sends x-hacker: “Want root? Visit join.a8c.com/hacker and mention this header.” TechCrunch carries a VIP variant pointing at join.a8c.com/viphacker. The German retailer Otto runs the same play with x-recruiting: “Seems you like http headers. To write ours, apply at www.otto.de/jobs/ and mention this header.”

NASA sends x-launch-status: Go Flight! WordPress.org sends x-olaf: ⛄, a snowman.

Others are internal names that escaped. Welt.de has an x-ballmer. Bloomberg has x-frankenstein-eligible. The University of Washington calls its Varnish cache x-minion. Fox News and WPS ship x-debug-* headers straight to production, and Xerox leaks its whole feature-flag list through a stack of rollout-* headers that quietly admit it runs Remix on Contentful.
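If you want to check your own responses for this kind of leak, a glob match over header names is enough to start with. A minimal sketch: the two patterns come from the examples above, while `leaked_internal_headers` and the sample header names in the usage note are made up for illustration.

```python
from fnmatch import fnmatchcase

# Patterns taken from the examples above; extend for your own stack.
LEAK_PATTERNS = ("x-debug-*", "rollout-*")


def leaked_internal_headers(headers, patterns=LEAK_PATTERNS):
    """Return the sorted, lowercased header names that match any leak pattern."""
    return sorted(
        name.lower()
        for name in headers
        if any(fnmatchcase(name.lower(), pattern) for pattern in patterns)
    )
```

Given `{"X-Debug-Trace": "1", "rollout-new-nav": "on", "Server": "nginx"}`, it returns `["rollout-new-nav", "x-debug-trace"]`.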

Headers are where the web keeps its inside jokes.

A protocol, plus sediment

The clean mental model of HTTP is a request, a response, and a small set of standard fields. The real web is messier, and better for it.

It carries standardized headers and de facto CDN headers, knobs for browsers that barely exist, cache gossip, security policies stuck halfway through adoption, infrastructure quietly naming itself, and the occasional snowman.

That’s why headers are worth reading. They’re the receipt, and the receipt says the web is still alive, still migrating, and still hauling a lot of old furniture from apartment to apartment.