Most AEO teams I see have these two files backwards. They’re meticulous about llms.txt because it sounds like the AEO file. Their robots.txt is a copy-paste from a Stack Overflow answer in 2024.
That’s the wrong way around. Robots.txt is the file that actually changes whether AI engines can reach your content. llms.txt is a Markdown map that some engines read and most don’t. Both are useful, but they do different jobs, and treating them as interchangeable produces bad outcomes — usually pages that are perfectly optimized for citation but blocked from being fetched.
This is a reference piece on what each file actually does, what the public evidence supports, and how to configure both without overclaiming.
tl;dr
- robots.txt controls crawler access for systems that honor it. OpenAI and Anthropic both document separate bots for training, search, and user-triggered retrieval — so blocking “AI bots” as one group can accidentally block answer visibility you wanted.
- llms.txt is a proposed Markdown file that points LLMs at useful pages. It can help package documentation. It is not, on current evidence, a ranking or citation requirement.
- Google has explicitly stated llms.txt will not be used for AI Overviews. Other engines vary.
- The order of operations matters: fix access first, fix content structure second, add llms.txt only if it serves a real workflow.
What each file actually does
The two files answer different questions. Conflating them is where almost every mistake starts.
robots.txt answers “can the engine fetch this page?” It’s a plaintext file at the root of your site (example.com/robots.txt) that tells compliant crawlers which URLs they can request. Compliance is voluntary — there’s no enforcement mechanism — but the major AI companies say they honor it.
llms.txt answers “if the engine is here, what should it read first?” It’s a Markdown file at the root (example.com/llms.txt) that gives an LLM-friendly table of contents: a title, summary, sections of important links, and optional metadata. The official spec describes it as an aid for inference-time context, especially when context windows can’t hold an entire site.
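The spec’s expected shape is simple enough to show in full. A minimal sketch, with an invented project name, URLs, and descriptions for illustration:

```markdown
# Example Widgets

> Example Widgets is an API for managing widgets. These docs cover setup,
> authentication, and the REST endpoints.

## Docs

- [Quickstart](https://example.com/docs/quickstart): install and first request
- [API reference](https://example.com/docs/api): endpoints, auth, error codes

## Optional

- [Changelog](https://example.com/changelog): release history
```

An H1 title, a blockquote summary, and H2 sections of annotated links; the spec treats the “Optional” section as links an LLM can skip when context is tight.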
These are different jobs. robots.txt is access control. llms.txt is wayfinding. You can have wayfinding without access (useless — nothing fetches it) or access without wayfinding (fine — engines fetch what they fetch). You cannot substitute one for the other.
How robots.txt works for AI crawlers (and why most configs are wrong)
Most “block AI bots” robots.txt files I see were written when the AI bot landscape was simpler. There was GPTBot, you blocked it, you were done. That model is now wrong.
OpenAI and Anthropic have each split their crawlers into three jobs:
- Training crawlers — collect data to train future models
- Search/index crawlers — build an index that powers in-app search results
- User-triggered retrieval — fetch a specific page when a user (or a custom GPT, or a Claude tool) asks the model to read it
Each runs under a different user-agent string. Each can be allowed or denied independently. And the key consequence: if you block all three thinking you’re blocking “AI training,” you also block your visibility in ChatGPT search and your retrievability when a Claude user asks the model to read your page.
Here’s the current bot inventory for the two engines that document this clearly:
| Platform | Training | Search/index | User-triggered |
|---|---|---|---|
| OpenAI | GPTBot | OAI-SearchBot | ChatGPT-User |
| Anthropic | ClaudeBot | Claude-SearchBot | Claude-User |
OpenAI’s docs say the settings are independent. Allowing OAI-SearchBot while disallowing GPTBot is a documented, supported configuration — appearing in ChatGPT search without contributing training data.
Anthropic now matches that pattern. ClaudeBot for model development, Claude-SearchBot for search quality, Claude-User for user-directed fetches. Anthropic’s documentation specifically warns that disabling Claude-SearchBot may reduce visibility and accuracy in user search results, and disabling Claude-User can prevent retrieval when users ask Claude to read a specific page.
A working starter configuration that allows search visibility while blocking training:
```
# OpenAI
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /
```
This isn’t a universal recommendation. Some publishers want training crawls explicitly. Some want neither. The point is the structure: decide separately for training, search, and user-triggered access. Treating them as one decision is leaving signal on the table.
One caveat that’s emerged recently: OpenAI’s documentation now notes that OAI-SearchBot and GPTBot may share information internally. If you allow both, OpenAI may use a single crawl for both purposes. So allowing OAI-SearchBot while disallowing GPTBot works, but expect some operational noise.
What the public evidence on llms.txt actually says
This is the section where I want to be careful, because it’s the part most AEO content gets wrong.
Here’s what the public evidence supports, in order of confidence:
High confidence: Google won’t use it. At Google’s Search Central Deep Dive in Bangkok, Gary Illyes stated that Google does not support llms.txt and is not planning to. John Mueller has echoed the same. Google’s own AI features documentation says normal SEO best practices apply to AI Overviews and there are no extra technical requirements. If you’re optimizing for Google AI Overviews, llms.txt is not a lever.
Medium confidence: Some AI engines fetch it. There are reports of OpenAI’s crawlers requesting llms.txt files, including periodic re-fetches that suggest active monitoring. That’s a fetch, not a citation. A bot requesting your file proves the file is being fetched; it does not prove the contents influence what the model says.
Low confidence: It improves citation rates. I haven’t found a reproducible public experiment showing that adding an llms.txt to a site lifts citations in any major answer engine. Anecdotes exist. Vendor case studies exist. A controlled study with proper methodology does not. Treat the citation-impact claim as unproven until someone runs the experiment cleanly.
What we know it’s good for: Documentation packaging. OpenAI publishes one for their developer docs. Stripe publishes one. Anthropic redirects to one. These are real, useful artifacts for agents and developers deliberately loading docs into an LLM context. That’s a different value proposition from “boost AI search citations” — and it’s the value proposition the spec actually claims.
So the honest position: llms.txt is a developer-experience feature with possible AEO upside. Treating it as required AEO infrastructure is overclaiming. Treating it as worthless is underclaiming. It’s somewhere in between, and where exactly depends on your audience.
The four misconceptions worth naming
These come up enough that they’re worth flagging directly.
1. “llms.txt is robots.txt for AI.” It isn’t. The official spec says so explicitly. Robots.txt is access control. llms.txt is content guidance. They share a filename pattern and nothing else.
2. “Blocking AI training also blocks AI visibility.” Not necessarily. OpenAI and Anthropic both document separate bots. You can block training while keeping search visibility — that’s a documented, supported configuration.
3. “Allowing every AI bot is good AEO.” No. Bot policy is also a content licensing, server load, and brand governance question. Some training crawls produce real cost (high request volume) and zero attribution. AEO is a reason to make bot policy specific, not a reason to abandon governance.
4. “An llms.txt can compensate for weak content.” It can’t. If a human reader can’t tell what your page says in the first 30 seconds, an LLM-specific table of contents won’t fix it. Headings, claim density, citation density, schema, internal linking — all of those matter more than the wayfinding file.
A practical implementation order
Here’s the order I’d recommend, from highest leverage to lowest:
1. Build a crawler policy matrix. For each AI platform you care about, decide separately whether you want training access, search/index access, and user-triggered access. Write the decisions down with a one-line reason for each. This is the document you’ll want when someone asks “why do we let GPTBot crawl us?” six months from now.
2. Audit your live robots.txt against the matrix. Don’t trust the file in your repository. Many sites have managed robots.txt features, CDN rules, security middleware, and bot-fight settings that change what crawlers actually see. Fetch the file as a crawler would (curl https://yoursite.com/robots.txt) and compare to your policy matrix. Mismatches between intent and reality are common.
3. Check every subdomain. Docs subdomains, help centers, blogs, and marketing sites often have different robots files. A perfectly configured www.example.com/robots.txt doesn’t help if docs.example.com/robots.txt is wide open or fully blocked.
4. Make sure important content is fetchable without aggressive JavaScript. AI crawlers vary in JS execution. Google can render JS. Several others either don’t or do it poorly. If your main answer content only appears after a complex client-side render, some engines miss it entirely. Server-side rendering or static HTML for citable content is a real AEO win.
5. Then, optionally, add llms.txt. If your audience includes developers loading your docs into LLM contexts, an llms.txt is genuinely useful. Keep it short. Link to canonical pages. Don’t stuff it with claims that aren’t visible elsewhere on the site. Treat it as a navigation aid, not a marketing surface.
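Steps 1 and 2 can be wired together in a few lines: render a robots.txt from the policy matrix, then verify the rendered file with Python’s stdlib parser. This is a sketch; the bot names match the table above, but the allow/deny decisions and the reasons are placeholders, not recommendations:

```python
from urllib import robotparser

# Illustrative policy matrix: one access decision per bot, with a one-line reason.
# The decisions below are examples only, not a recommended policy.
POLICY = {
    "GPTBot":           (False, "no training access until licensing is settled"),
    "OAI-SearchBot":    (True,  "we want ChatGPT search visibility"),
    "ChatGPT-User":     (True,  "users may ask ChatGPT to read our pages"),
    "ClaudeBot":        (False, "no training access until licensing is settled"),
    "Claude-SearchBot": (True,  "we want Claude search visibility"),
    "Claude-User":      (True,  "users may ask Claude to read our pages"),
}

def render_robots(policy):
    """Render a robots.txt fragment from the policy matrix."""
    blocks = []
    for agent, (allowed, reason) in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        blocks.append(f"# {reason}\nUser-agent: {agent}\n{rule}")
    return "\n\n".join(blocks) + "\n"

def check(policy, robots_txt, url="https://example.com/docs/page"):
    """Verify the rendered file matches intent, using the stdlib parser."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {agent: rp.can_fetch(agent, url) == allowed
            for agent, (allowed, _) in policy.items()}

robots_txt = render_robots(POLICY)
print(robots_txt)
print(check(POLICY, robots_txt))  # every bot should map to True
```

The same `check` function works against your live file: fetch it, parse it, and compare against the matrix instead of eyeballing directives.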
The order matters. I see a lot of teams add llms.txt first because it sounds like the AEO move, then never get around to fixing the access layer. They’ve done the cosmetic work and skipped the structural work. The Monday morning list at the bottom of this piece reverses that order.
What I’d test next
The honest test for whether your robots.txt is the bottleneck: instrument your access layer and see what’s actually being blocked.
Server log review. Pull access logs for the relevant user agents. OAI-SearchBot, ChatGPT-User, GPTBot, Claude-SearchBot, Claude-User, ClaudeBot. For each: how many requests? What status codes? Which pages? Which getting 4xx/5xx? Which getting through to the application? This tells you the empirical truth of what crawlers are doing, which is often quite different from what you think your robots.txt is doing.
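A first pass over the logs doesn’t need special tooling. A sketch assuming Apache/nginx combined log format; the sample lines and user-agent strings are fabricated, and real UA strings differ:

```python
import re
from collections import Counter

AI_BOTS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User",
           "ClaudeBot", "Claude-SearchBot", "Claude-User")

# Combined log format: ip - - [time] "METHOD path HTTP/x" status size "ref" "ua"
LINE = re.compile(r'"\w+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')

def summarize(log_lines):
    """Count requests per (AI bot, status code); path is captured for later breakdowns."""
    counts = Counter()
    for line in log_lines:
        m = LINE.search(line)
        if not m:
            continue
        for bot in AI_BOTS:
            if bot in m.group("ua"):
                counts[(bot, m.group("status"))] += 1
    return counts

# Fabricated sample lines for illustration.
sample = [
    '1.2.3.4 - - [06/May/2026:10:00:00 +0000] "GET /docs HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot"',
    '5.6.7.8 - - [06/May/2026:10:01:00 +0000] "GET /blog HTTP/1.1" 403 0 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
]
print(summarize(sample))
```

A run of 403s for OAI-SearchBot in this output is exactly the intent/reality mismatch the audit step is looking for.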
Pre/post access change citation panel. Before changing access, run a stable prompt panel across the engines that matter. After changing access, wait at least 48 hours (OpenAI’s docs note ~24 hours for search systems to adjust to robots.txt changes), then re-run the same panel. Log differences. This won’t prove causation from one experiment, but if citation behavior shifts after access changes, that’s directional evidence worth following.
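The panel comparison itself is trivial to script once you log which domains each prompt cites. A sketch, with hypothetical prompts and domains:

```python
def panel_diff(before, after):
    """Compare two prompt panels, each mapping prompt -> set of cited domains."""
    diff = {}
    for prompt in before:
        gained = after.get(prompt, set()) - before[prompt]
        lost = before[prompt] - after.get(prompt, set())
        if gained or lost:
            diff[prompt] = {"gained": gained, "lost": lost}
    return diff

# Hypothetical panel results before and after a robots.txt change.
before = {"best widget api": {"competitor.com"},
          "widget rate limits": {"example.com", "competitor.com"}}
after = {"best widget api": {"competitor.com", "example.com"},
         "widget rate limits": {"competitor.com"}}
print(panel_diff(before, after))
```

One run is noise; the same diff logged across several access changes is the directional evidence the paragraph above describes.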
llms.txt fetch tracking. If you decide to add an llms.txt, log requests to it. Are AI user agents fetching it? How often? Are linked pages getting fetched afterward? Most teams add the file, never check whether it’s being read, and assume it’s working because adding it felt productive.
I haven’t run all of these myself for this piece. The robots.txt configuration patterns are well-documented enough that I’m confident in them. The “does llms.txt actually move citations” experiment is one I’d genuinely like to see someone run cleanly. If you do, send me the results.
What to do Monday morning
- Build a crawler policy matrix that separates training, search/index, and user-triggered retrieval for each AI platform you care about. Write down your decision and one-line reason for each cell.
- Fetch the live robots.txt for every subdomain that has citable content. Compare against the matrix. Fix mismatches.
- Check CDN rules, WAF settings, and bot-fight features. Hidden blocks live there, not in your repo.
- Audit important pages for JavaScript dependence. If the citable content needs a client-side render to appear, fix that before optimizing anything else.
- Add llms.txt only if it serves a workflow you can name (developer docs packaging, agent navigation, internal tool integration). Skip it if you can’t articulate the use case.
- After any access change, log the date, run a stable AI citation prompt panel after 48 hours, and record what shifted.
The boring takeaway: AEO infrastructure is mostly access control plus content structure. The exotic-sounding files are downstream of getting those right.
Sources
- AI features and your website (Google, accessed 2026-05-06)
- Overview of OpenAI Crawlers (OpenAI)
- Does Anthropic crawl data from the web, and how can site owners block the crawler? (Anthropic)
- The /llms.txt file (llmstxt.org)
- Google says normal SEO works for ranking in AI Overviews and llms.txt won’t be used (Search Engine Land)
- OpenAI Platform llms.txt (example file)
- Stripe Docs llms.txt (example file)