Robots.txt is still the file that determines whether several AI systems can crawl, search, or retrieve your content. Llms.txt is a useful map for some documentation and agent workflows, but it is not a proven way to earn AI citations.

TL;DR

  • robots.txt controls crawler access for systems that honor it. OpenAI and Anthropic both document separate bots for training, search, and user-triggered retrieval, so blocking "AI bots" as a single group can accidentally suppress answer visibility.
  • llms.txt is a proposed Markdown file that points LLMs toward useful pages. It can help package documentation, but public documentation does not support treating it as a ranking or citation requirement.
  • For Google AI Overviews and AI Mode, Google says the same SEO fundamentals apply and there are no extra technical requirements. If a page is not indexed and eligible for snippets, an llms.txt file is not the fix.

What is the mechanic?

The mechanic is access control and machine-readable content discovery. AEO teams are trying to answer two separate questions: can an answer engine reach this content, and can it understand which parts matter?

robots.txt answers the first question for compliant crawlers. It is a file at the root of a site that tells crawlers which URLs they may request. Google describes its crawling and indexing documentation as covering how site owners can control Google's ability to find and parse content for Search and other Google properties.

llms.txt tries to answer the second question. The official proposal describes a Markdown file at /llms.txt that gives LLM-friendly background, guidance, and links. Its format is intentionally simple: an H1 title, optional summary, explanatory text, and H2 sections containing lists of important URLs.
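
Under that format, a minimal file for a hypothetical product at example.com might look like the sketch below. Every name and URL is illustrative, not a requirement of the proposal.

# Example Product

> Example Product is a hypothetical billing API. This file points assistants at the pages that answer most questions.

All endpoints require an API key, and every page linked below is available as plain Markdown.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): install, authenticate, send a first request
- [API reference](https://example.com/docs/api.md): endpoints, parameters, error codes

## Optional

- [Changelog](https://example.com/docs/changelog.md): release history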

Those are different jobs. robots.txt says "you may or may not fetch this." llms.txt says "if you are looking for the important parts, start here." Treating them as interchangeable is the most common technical mistake in AEO infrastructure.

Why does this matter for AEO?

This matters because answer visibility depends on access before content quality can matter. A perfectly structured page cannot be cited by a system that is blocked from fetching it or from using it in the relevant search surface.

OpenAI's crawler documentation is the cleanest example. It documents OAI-SearchBot for ChatGPT search results, GPTBot for model training, and user-triggered agents such as ChatGPT-User. OpenAI says the settings are independent, so a site can allow OAI-SearchBot while disallowing GPTBot.

Anthropic now documents a similar split. ClaudeBot relates to model development, Claude-User supports user-directed fetches, and Claude-SearchBot improves search result quality. Anthropic says disabling Claude-SearchBot may reduce visibility and accuracy in user search results, while disabling Claude-User can prevent retrieval in response to a user's query.

That turns robots.txt from a defensive legal file into an AEO routing file. The wrong rule can protect training data while also suppressing search visibility. The right rule can permit search and user retrieval while limiting training crawls where the platform supports that separation.

How does robots.txt work differently for AI crawlers?

Robots.txt works differently for AI crawlers because several AI companies now separate training, search indexing, and user-triggered retrieval into different user agents. That lets site owners make more precise decisions than "allow all" or "block all."

For OpenAI, the distinction is explicit. OAI-SearchBot is tied to appearing in ChatGPT search answers. GPTBot is tied to training use. OpenAI says allowing one and disallowing the other is a valid configuration, and that it may take about 24 hours after a robots update for search systems to adjust.

For Anthropic, the distinction is also explicit. ClaudeBot supports model development. Claude-SearchBot supports search optimization. Claude-User supports user-requested page access. Anthropic says its bots honor standard robots.txt directives and that IP blocking may be a weaker opt-out method because it can stop the crawler from reading the robots file.

That means AEO teams need a bot policy table, not a copied blocklist. A starter pattern might look like this:

# Allow ChatGPT search visibility and user retrieval, block model-training crawl.
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /

# Allow Claude search and user retrieval, block model-training crawl.
User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: ClaudeBot
Disallow: /

That is not a universal recommendation. It is a working example of the decision structure: separate training, search indexing, and user-triggered retrieval where the platform gives you separate controls.

What does llms.txt actually do?

Llms.txt gives LLMs and agents a curated Markdown map of important content, but the proposal itself does not guarantee that any answer engine will fetch or trust it. The official spec says it is a proposal for helping LLMs use websites at inference time, especially when context windows are too small for an entire site.

The file is useful where users or tools deliberately load documentation into an assistant. Developer documentation is the obvious case. OpenAI and Stripe both expose llms.txt files for their documentation, and Anthropic's docs redirect to a Claude platform llms.txt. Those are real examples of a site making docs easier to package for LLM use.

The format is sensible. Markdown is easy for humans and language models to read. A short summary plus prioritized links can reduce the chance that an agent grabs the wrong page. An H2 section named Optional marks lower-priority URLs that can be skipped when context is tight.

The limit is adoption. The spec says the file can complement existing standards, but it does not create an access rule, a search index, or a citation contract. If an AI system does not fetch it, it has no direct effect. If it does fetch it, the system still has to decide whether the linked content is useful, current, trustworthy, and allowed.

Does llms.txt help rankings or citations?

We do not have enough public evidence to say llms.txt improves AI rankings or citations. The defensible claim is narrower: it may help agents and users find machine-friendly documentation when they intentionally use it.

Google is the clearest "do not overclaim" case. Google's own AI features documentation says normal SEO best practices remain relevant for AI Overviews and AI Mode, and that there are no additional requirements or special optimizations needed. Search Engine Land reported Gary Illyes saying Google does not support llms.txt and is not planning to use it for AI Overviews.

Search Engine Land's later critique of LLM-only pages makes the operational point well: build clean HTML, reduce JavaScript dependency for critical content, use structured data where official specs exist, and improve information architecture. Those are boring recommendations, which is usually a good sign in technical SEO.

For non-Google systems, the answer is less settled. Some sites report bot hits to llms.txt. That is not the same thing as proof of citation influence. A fetch can be monitoring, experimentation, indexing, validation, or a user-triggered tool. Until the platform documents usage or a reproducible experiment shows a lift, llms.txt should be treated as a helpful artifact, not a visibility lever.

How should a team implement this safely?

A team should implement crawler access first, then content structure, then optional llms.txt support. That order keeps you from spending time on a map while the road is closed.

Start with a crawler policy matrix:

Platform  | Search/index bot               | User retrieval bot | Training bot                                     | Desired policy
OpenAI    | OAI-SearchBot                  | ChatGPT-User       | GPTBot                                           | Decide separately
Anthropic | Claude-SearchBot               | Claude-User        | ClaudeBot                                        | Decide separately
Google    | Googlebot (Search eligibility) | Varies by feature  | Google-Extended (Gemini/Vertex training control) | Follow Google docs
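
If the matrix lives in version control, the robots.txt blocks can be generated from it so the file and the policy never drift apart. A minimal sketch in Python; the bot names come from the vendor documentation cited above, and the policy values are placeholders, not recommendations.

# Sketch: render robots.txt rules from a per-bot policy table.
# The decisions below are placeholders; set them from your own matrix.
POLICY = {
    "OAI-SearchBot": "allow",     # ChatGPT search visibility
    "ChatGPT-User": "allow",      # user-triggered retrieval
    "GPTBot": "disallow",         # model-training crawl
    "Claude-SearchBot": "allow",
    "Claude-User": "allow",
    "ClaudeBot": "disallow",
}

def render_rules(policy):
    blocks = []
    for agent, decision in policy.items():
        rule = "Allow: /" if decision == "allow" else "Disallow: /"
        blocks.append(f"User-agent: {agent}\n{rule}")
    return "\n\n".join(blocks)

print(render_rules(POLICY))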

Then inspect the live robots.txt, CDN rules, bot-management products, and edge workers. Many teams only check the file in the repository. That misses managed robots.txt features, security rules, and bot-fight settings that can change what crawlers actually see.

Next, confirm that important content is accessible without fragile client-side rendering. If the main answer, pricing, schema, or documentation appears only after complex JavaScript execution, some AI crawlers and retrieval tools may miss it. Google can render many JavaScript sites, but "Google can" is not the same as "every answer engine will."
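
A quick spot check is to fetch the raw HTML without executing any JavaScript and confirm that a key phrase from the main answer is present. A standard-library sketch; the URL and phrase are placeholders.

import urllib.request

URL = "https://example.com/pricing"   # placeholder: a page with citable content
PHRASE = "Pricing starts at"          # placeholder: text that should be in raw HTML

req = urllib.request.Request(URL, headers={"User-Agent": "content-audit/0.1"})
with urllib.request.urlopen(req, timeout=10) as resp:
    raw_html = resp.read().decode("utf-8", errors="replace")

# If this reports "missing" but the phrase is visible in a browser, the content
# likely depends on client-side rendering that some AI crawlers will not run.
print("present in raw HTML" if PHRASE in raw_html else "missing from raw HTML")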

Finally, add llms.txt if it helps users or agents navigate your content. Keep it short. Link to canonical pages. Do not stuff it with claims that are not visible on the human-facing site. The file should summarize the information architecture, not create a shadow version of the business.

What are the common misconceptions?

The first misconception is that llms.txt is "robots.txt for AI." It is not. The llms.txt proposal itself says robots.txt and llms.txt have different purposes: access control versus context and guidance.

The second misconception is that blocking training bots necessarily blocks all AI visibility. OpenAI and Anthropic both document separate roles. A site can choose to disallow some training crawlers while still allowing search or user-triggered retrieval where the platform supports that separation.

The third misconception is that allowing every AI bot is automatically good AEO. That can create legal, content-licensing, server-load, and policy issues. AEO is not a reason to abandon governance. It is a reason to make governance specific.

The fourth misconception is that an llms.txt file can compensate for weak pages. It cannot fix unclear headings, inaccessible content, stale claims, missing schema, poor internal linking, or weak off-site evidence. If a human cannot tell what your page says in the first 30 seconds, an LLM-specific table of contents is not the first problem.

What should you test next?

You should test whether crawler access, not prose style, is the bottleneck for your AI visibility. This is a small operational audit, not a ranking experiment.

First, fetch your live robots.txt as a crawler would see it:

https://example.com/robots.txt
https://www.example.com/robots.txt
https://docs.example.com/robots.txt

Check every subdomain that contains citable content. A docs subdomain, help center, blog, and marketing site can all have different rules.
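
Python's standard library can run this check against the live files. A sketch using urllib.robotparser; the hosts, sample path, and agent list are placeholders to adapt, and real use should add error handling for unreachable files.

from urllib.robotparser import RobotFileParser

HOSTS = ["https://example.com", "https://www.example.com", "https://docs.example.com"]
AGENTS = ["OAI-SearchBot", "ChatGPT-User", "GPTBot",
          "Claude-SearchBot", "Claude-User", "ClaudeBot"]
SAMPLE_PATH = "/docs/"   # placeholder: a URL path that should be citable

for host in HOSTS:
    parser = RobotFileParser()
    parser.set_url(host + "/robots.txt")
    parser.read()   # fetches and parses the live robots.txt
    for agent in AGENTS:
        verdict = "allowed" if parser.can_fetch(agent, host + SAMPLE_PATH) else "blocked"
        print(f"{host}{SAMPLE_PATH} -> {agent}: {verdict}")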

Second, review server logs or CDN logs for the relevant user agents. Look for requests from OAI-SearchBot, ChatGPT-User, GPTBot, Claude-SearchBot, Claude-User, and ClaudeBot. Record whether they request important pages, whether they hit errors, and whether they are blocked by robots or security controls.
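
A small script can summarize this from combined-format access logs. A sketch under that format assumption; the log path is a placeholder, and CDN or bot-management exports may need different parsing.

import re
from collections import Counter

LOG_PATH = "access.log"   # placeholder: your web server or CDN log export
AI_AGENTS = ["OAI-SearchBot", "ChatGPT-User", "GPTBot",
             "Claude-SearchBot", "Claude-User", "ClaudeBot"]

# Matches the request and status fields of a combined-format log line.
line_re = re.compile(r'"[A-Z]+ \S+ [^"]*" (?P<status>\d{3})')

hits, errors = Counter(), Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        agent = next((a for a in AI_AGENTS if a in line), None)
        match = line_re.search(line) if agent else None
        if not match:
            continue
        hits[agent] += 1
        if match.group("status")[0] in "45":
            errors[(agent, match.group("status"))] += 1

for agent, count in hits.most_common():
    print(f"{agent}: {count} requests")
for (agent, status), count in errors.most_common():
    print(f"{agent} hit HTTP {status}: {count} times")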

Third, run a prompt panel before and after access changes. Keep the prompts stable. Record the exact engine, model or mode where visible, prompt text, answer, citations, and date. If citations improve after access is fixed, you still cannot prove causation from one sample, but you can justify deeper testing.
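
Keeping that record in an append-only file is enough to start. A sketch of the logging step; the field names and example values are placeholders.

import csv
from datetime import date

FIELDS = ["date", "engine", "model_or_mode", "prompt", "answer_summary", "citations"]

def log_panel_run(rows, path="prompt_panel.csv"):
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if f.tell() == 0:   # brand-new file: write the header once
            writer.writeheader()
        writer.writerows(rows)

log_panel_run([{
    "date": date.today().isoformat(),
    "engine": "example-engine",                      # placeholder
    "model_or_mode": "unknown",
    "prompt": "What does Example Co's API cost?",    # keep prompts stable across runs
    "answer_summary": "mentioned pricing page",
    "citations": "https://example.com/pricing",
}])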

Fourth, add llms.txt only after the access layer is clean. Track whether any AI user agents request it, whether linked pages are fetched afterward, and whether citation behavior changes. If nothing changes, the file may still be useful for developer workflows. Just do not call it an AEO win yet.

What to do Monday morning

1. Create a crawler policy matrix that separates training, search indexing, and user-triggered retrieval for each AI platform you care about.
2. Fetch the live robots.txt for every subdomain with citable content and compare it with the policy matrix.
3. Check CDN, firewall, and bot-management settings for hidden blocks that do not appear in the repository file.
4. Make important pages readable in clean HTML before adding LLM-specific files.
5. Add a short llms.txt only if it helps agents or users find canonical documentation, product pages, policies, or research.
6. Re-run a stable AI citation prompt panel after any access change and log the date, engine, user agent evidence, answer, and cited URLs.

Sources