What a Web Crawler Is in 2026 (and Why AI Crawlers Changed Everything)

Date · 8 November, 2023
Cat · Notes
Read · 2 min
Rewritten · May 2026 — original URL preserved; body fully rewritten for 2026

For most of the web's history, "crawler" meant Googlebot. Maybe Bingbot. A handful of well-behaved bots that read your site, indexed it, and sent you human visitors in return. The trade was understood. Both sides benefited.

That trade broke in 2024. By 2026, "web crawler" is a much wider category, and the implicit social contract that used to govern it has fragmented. Here's what changed and what to do about it.

How crawlers work, technically

The mechanics haven't changed. A crawler:

  1. Maintains a queue of URLs.
  2. Fetches one of them via HTTP, respecting (or not) robots.txt.
  3. Parses the response, extracts links, and adds them to the queue.
  4. Stores the content for whatever purpose justifies the bandwidth.
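
The same loop, as a minimal Python sketch using requests and BeautifulSoup. The seed URL, user-agent string, and store() are placeholders, and a real crawler adds politeness delays, scope limits, deduplication across hosts, and error handling.

# Minimal crawl loop: queue -> fetch -> parse -> enqueue -> store.
from collections import deque
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup

USER_AGENT = "ExampleBot/1.0"    # placeholder identity
seen, queue = set(), deque(["https://example.com/"])

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()                    # a polite crawler checks this before fetching

def store(url, html):
    pass                         # whatever purpose justifies the bandwidth

while queue:
    url = queue.popleft()
    if url in seen or not robots.can_fetch(USER_AGENT, url):
        continue
    seen.add(url)
    resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    for link in soup.find_all("a", href=True):
        queue.append(urljoin(url, link["href"]))
    store(url, resp.text)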

What changed is who's running them, and what they want.

The three kinds of crawler today

Search engine crawlers

Googlebot, Bingbot. Still operate roughly under the original contract: read the site, index it, send users back. They honour robots.txt (Bingbot also honours crawl-delay; Googlebot doesn't support that directive). We don't block these.

AI training crawlers

OpenAI's GPTBot, Anthropic's ClaudeBot, PerplexityBot, Google's training crawlers, and dozens of smaller ones. They read your site to train or augment language models. Whether they send users back depends on the operator: anywhere from "occasionally, with attribution" to "no, the model just absorbed your content."

Most respect robots.txt when explicitly named. Some don't.

Scraper bots

Everything else. Price scrapers, content thieves, SEO research tools, security scanners. Often spoof their user agent. Often ignore robots.txt. Often run from rotating residential IPs to evade detection.

What to put in robots.txt in 2026

A defensible default for a content site:

# Allow established search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Decide consciously about AI training
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Block the rest by default
User-agent: *
Disallow: /

Sitemap: https://example.com/sitemap.xml

Whether you Allow or Disallow each AI crawler is a judgment about your business model. If you sell content, you probably want to Disallow. If you sell credibility and want to be cited by AI assistants, you probably want to Allow.
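
Whichever way you decide, it's worth checking the file does what you intend before deploying it. A quick sketch using Python's standard-library robotparser; the file path, user agents, and URL are illustrative.

# Which agents does the robots.txt above allow to fetch a page?
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
with open("robots.txt") as f:
    rules.parse(f.read().splitlines())

for agent in ["Googlebot", "Bingbot", "GPTBot", "ClaudeBot", "SomeRandomScraper"]:
    allowed = rules.can_fetch(agent, "https://example.com/notes/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")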

What robots.txt cannot do

Stop a determined scraper. It's a voluntary protocol. The bots that ignore it are the ones you most want to block.

Effective defences in 2026 layer multiple controls:

  • Cloudflare's bot management or equivalent — fingerprints traffic at the edge.
  • Rate limits per IP and per ASN — slows aggressive crawlers without affecting humans.
  • Honeypot links — links no human follows, with rules to block IPs that hit them.
  • Tar-pitting — serving a slow trickle of bytes to known-bad bots, costing them more than it costs you.
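
The last two are cheap to sketch. A minimal illustration using Flask; the route name and in-memory block set are placeholders, and in practice the block list lives at the edge or in the web server config, not in application memory.

# Honeypot + tar-pit sketch. The /do-not-follow route is linked from pages
# via a link no human can see; anything that requests it gets tar-pitted.
import time
from flask import Flask, Response, request

app = Flask(__name__)
blocked = set()    # illustrative; a real deployment blocks at the edge

@app.before_request
def tarpit_known_bad():
    if request.remote_addr in blocked:
        def trickle():
            # Drip one byte per second so the connection costs the bot
            # more time than the response costs us.
            for _ in range(60):
                time.sleep(1)
                yield b" "
        return Response(trickle(), mimetype="text/html")

@app.route("/do-not-follow")
def honeypot():
    blocked.add(request.remote_addr)
    return Response("", status=403)

if __name__ == "__main__":
    app.run()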

What we tell clients

You probably don't need to fight every crawler. You need to know which ones are valuable, which ones are extracting value without giving any back, and which ones are actively harmful. Make conscious decisions in your robots.txt, then let the edge handle the bad actors. The energy you spend running cat-and-mouse with every new crawler is energy you're not spending on the work that pays.