AI crawler access is no longer a minor robots.txt setting. It is a product decision about where a site's content can appear, which systems can retrieve it, whether it can be used for model improvement, and how much control the publisher wants over AI surfaces.

TL;DR

OpenAI, Anthropic, Perplexity, and Google no longer fit into one "AI bot" bucket. OpenAI separates OAI-SearchBot, GPTBot, and ChatGPT-User. Anthropic separates ClaudeBot, Claude-User, and Claude-SearchBot. Perplexity documents PerplexityBot as search indexing rather than model pretraining. Google treats Googlebot as the access control for AI features in Search, while Google-Extended is a separate control for other Google AI use.

That means the serious question is not "should we block AI?" The serious question is "which AI use cases do we want to allow, and which ones do we want to restrict?"

What changed in crawler policy?

Crawler policy became more granular because answer engines now use the web for different jobs. A crawler can support search inclusion, user-triggered browsing, product discovery, model training, search-quality improvement, or ordinary Google Search.

OpenAI's crawler documentation is the cleanest example of the split. OAI-SearchBot is for search visibility in ChatGPT search features. GPTBot is for crawling content that may be used to improve foundation models. ChatGPT-User is for user-initiated actions and page visits, not automatic web crawling.

Anthropic uses a similar split. ClaudeBot is tied to model development. Claude-User supports user requests. Claude-SearchBot is for search-result quality and relevance.

This split matters because a publisher can want one use and reject another. A site might allow search inclusion while blocking model-training use. A documentation site might allow user-triggered retrieval. A paid publisher might choose tighter controls.

Why is one blanket robots.txt rule weak?

One blanket rule is weak because it collapses different business decisions into one technical switch. Blocking every AI-related user agent might reduce training exposure, but it can also remove the site from search-like answer surfaces or user-directed retrieval.

The opposite mistake is allowing every bot because the site wants citations. That may unintentionally permit use cases the publisher did not evaluate.

The practical matrix is:

Use case | Example agents | Publisher question
Ordinary Google Search and AI features in Search | Googlebot | Do we want the page eligible for Google Search and supporting links?
ChatGPT search inclusion | OAI-SearchBot | Do we want the page to appear in ChatGPT search answers?
Model improvement | GPTBot, ClaudeBot | Do we permit this content to be used for future model training?
User-triggered retrieval | ChatGPT-User, Claude-User | Do we allow a user to ask an assistant to fetch this page?
Perplexity search indexing | PerplexityBot | Do we want the page in Perplexity's source index?

That table should be discussed by product, legal, SEO, editorial, and engineering. It is not just an SEO checkbox.

How should AEO teams read OpenAI's crawler split?

AEO teams should treat OAI-SearchBot as the OpenAI crawler most directly tied to ChatGPT search visibility. OpenAI says sites opted out of OAI-SearchBot will not be shown in ChatGPT search answers, though they can still appear as navigational links.

That is the practical AEO point. Blocking GPTBot and allowing OAI-SearchBot are not contradictory if the publisher wants search visibility without model-training use. The settings are independent.

For a public source site like OptimizeAEO, the default policy should be explicit, not accidental:

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /private/

That example is not a universal recommendation. It is a reminder that policy should map to purpose.
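One way to keep that mapping honest is to test the file the way a parser reads it. The sketch below feeds the example rules to Python's standard-library urllib.robotparser and checks two URLs. The example.com URLs are placeholders, and the standard-library parser only approximates how any particular vendor's crawler interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The example policy from the article, reproduced as a string for testing.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /private/
"""

# Placeholder URLs, used only for illustration.
GUIDE = "https://example.com/guides/crawlers"
PRIVATE = "https://example.com/private/research"


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Evaluate every (agent, url) pair so the policy can be reviewed as a grid.
results = {
    (agent, url): can_fetch(ROBOTS_TXT, agent, url)
    for agent in ("OAI-SearchBot", "GPTBot")
    for url in (GUIDE, PRIVATE)
}

for (agent, url), allowed in results.items():
    print(f"{agent:15} {url:40} {'allow' if allowed else 'block'}")
```

An agent with no matching group and no wildcard group, such as ClaudeBot in this file, is allowed by default. That is the kind of accidental policy an explicit file is meant to catch.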

How should AEO teams read Anthropic's crawler split?

Anthropic's documentation separates ClaudeBot, Claude-User, and Claude-SearchBot. The important AEO distinction is that disabling Claude-SearchBot can reduce visibility or accuracy in Claude search responses, while disabling ClaudeBot is a model-training preference.

That mirrors the OpenAI split. Training use and search/retrieval use should not be treated as the same thing.

Anthropic also says its bots respect robots.txt and support the non-standard Crawl-delay extension. That gives site owners a rate-control lever, but it also creates a maintenance obligation: robots.txt should be documented, versioned, and tested.
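A versioned robots.txt can carry a small regression test that pins the intended rate limit. The sketch below is one way to do that with Python's urllib.robotparser. The 10-second delay and the /drafts/ path are assumptions chosen for illustration, not values recommended by Anthropic.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the delay value and the path are illustrative only.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Crawl-delay: 10
Disallow: /drafts/
"""


def crawl_delay_for(robots_txt: str, agent: str) -> Optional[int]:
    """Return the Crawl-delay that applies to `agent`, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent)


claudebot_delay = crawl_delay_for(ROBOTS_TXT, "ClaudeBot")
gptbot_delay = crawl_delay_for(ROBOTS_TXT, "GPTBot")  # no group, so no delay

print("ClaudeBot delay:", claudebot_delay)
print("GPTBot delay:", gptbot_delay)
```

Running a check like this in CI means a later edit that drops or changes the delay fails loudly instead of drifting silently.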

How should AEO teams read Perplexity's robots.txt position?

Perplexity says PerplexityBot will not index full or partial text content from a site that disallows it via robots.txt. It also says Perplexity may still index the domain, headline, and a brief factual summary if a page is blocked.

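That creates a useful distinction for publishers. Blocking can reduce full-text source use, but it may not remove every reference to the domain or title. If a publisher wants complete removal from discovery surfaces, robots.txt may not be enough; noindex and other controls may be needed depending on the system. Keep in mind that a crawler has to be able to fetch a page to see a noindex directive on it, so a robots.txt block and noindex on the same URL can work against each other.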

Perplexity also says it does not build foundation models, so allowing PerplexityBot is not the same decision as allowing a model-training crawler.

What makes Google different?

Google is a different case because AI Overviews and AI Mode are Search features. Google's documentation says Googlebot is the robots.txt control for how pages are crawled for Search. It also says pages must be indexed and snippet-eligible to appear as supporting links in AI Overviews or AI Mode.

That means a site cannot block Googlebot and still expect Google Search AI features to use it as a supporting source. Google-Extended is a separate control for limiting use in some other Google AI systems, not the control for ordinary Search inclusion.

The practical rule: do not copy an "AI bot blocklist" into robots.txt without understanding whether it blocks ordinary search access.
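A copied blocklist can be checked before it ships. The sketch below compares two hypothetical blocklists: one that names specific agents, and one that also adds a wildcard rule. Both files and the example.com URL are made up for illustration, and Python's urllib.robotparser is used as a stand-in for how a compliant crawler would read them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical blocklist that names only the agents it wants to restrict.
TARGETED = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /
"""

# Hypothetical blocklist that also adds a wildcard rule.
BLANKET = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /
"""

PAGE = "https://example.com/guides/crawlers"  # placeholder URL


def blocked_agents(robots_txt: str, url: str, agents) -> list:
    """Return the agents from `agents` that may not fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Googlebot must stay allowed if the site wants Search and AI feature eligibility.
targeted_blocks = blocked_agents(TARGETED, PAGE, ["Googlebot", "Google-Extended"])
blanket_blocks = blocked_agents(BLANKET, PAGE, ["Googlebot", "Google-Extended"])

print("targeted blocks:", targeted_blocks)
print("blanket blocks:", blanket_blocks)
```

In the targeted file, restricting Google-Extended leaves Googlebot untouched. In the blanket file, the wildcard group catches Googlebot as well, which is the outcome the rule above warns against.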

What could go wrong?

The biggest failure mode is accidental invisibility. A CDN, firewall, security plugin, managed robots feature, or hosting rule can block crawlers even when robots.txt appears permissive.

Another failure mode is policy drift. A team sets a rule once, a vendor changes its user agents, a CDN adds bot-management defaults, and six months later the site is missing from surfaces it meant to allow.

A third failure mode is treating robots.txt as legal, technical, and product control all at once. Robots.txt is a signal to compliant crawlers. It is not a full access-control system.

What should the policy look like in practice?

The practical policy should be written as a table before it becomes robots.txt. The table should say what the business wants, then the technical file should implement it.

Example policy:

Crawler | Allow? | Reason | Review trigger
Googlebot | Yes | Google Search and AI feature eligibility | Any CDN/security change
OAI-SearchBot | Yes | ChatGPT search visibility | Monthly log review
GPTBot | Conditional | Allow public guides, block private research | Editorial policy change
ChatGPT-User | Yes | User-directed retrieval | Abuse or rate issue
Claude-SearchBot | Yes | Claude search relevance | Monthly log review
ClaudeBot | Conditional | Training preference | Legal review
PerplexityBot | Yes | Perplexity source visibility | Monthly log review

This table is more important than the first robots.txt draft. It forces the team to explain the business reason behind every allow and disallow rule. That prevents the two worst policies: blocking everything out of fear, or allowing everything because the team wants citations.
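If the table is the source of truth, the robots.txt file can be generated from it rather than edited by hand. The sketch below is one possible shape for that. The CrawlerPolicy class and render_robots function are hypothetical names, and the /private/ paths for the two "Conditional" rows are assumptions, since the table does not say which paths are restricted.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlerPolicy:
    """One row of the policy table (hypothetical structure)."""
    agent: str
    reason: str
    disallow: list = field(default_factory=list)  # empty list means allow all


def render_robots(policies) -> str:
    """Render one robots.txt group per policy, with the reason as a comment."""
    blocks = []
    for policy in policies:
        lines = [f"# {policy.agent}: {policy.reason}",
                 f"User-agent: {policy.agent}"]
        if policy.disallow:
            lines += [f"Disallow: {path}" for path in policy.disallow]
        else:
            lines.append("Allow: /")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


# Rows mirror the example table; restricted paths are assumed for illustration.
POLICIES = [
    CrawlerPolicy("Googlebot", "Google Search and AI feature eligibility"),
    CrawlerPolicy("OAI-SearchBot", "ChatGPT search visibility"),
    CrawlerPolicy("GPTBot", "Allow public guides, block private research",
                  disallow=["/private/"]),
    CrawlerPolicy("ChatGPT-User", "User-directed retrieval"),
    CrawlerPolicy("Claude-SearchBot", "Claude search relevance"),
    CrawlerPolicy("ClaudeBot", "Training preference", disallow=["/private/"]),
    CrawlerPolicy("PerplexityBot", "Perplexity source visibility"),
]

robots_txt = render_robots(POLICIES)
print(robots_txt)
```

Keeping the reason next to each group means a reviewer reading the deployed file sees why a rule exists, not only what it does.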

How should this change an AEO audit?

An AEO audit should now include crawler-purpose testing as a standard section. It is not enough to fetch the page once with a browser and call it accessible.

The audit should test:

  • browser user agent;
  • Googlebot;
  • OAI-SearchBot;
  • GPTBot;
  • ChatGPT-User;
  • ClaudeBot;
  • Claude-SearchBot;
  • PerplexityBot;
  • noindex and snippet controls;
  • CDN bot-management behavior;
  • server response variance by user agent.

The output should not be a vague recommendation like "review robots.txt." It should say, for example: "OAI-SearchBot receives 403 on guide pages because the CDN bot firewall classifies it as automated traffic. Googlebot receives 200. Fix CDN allowlist before evaluating ChatGPT search visibility."

That kind of finding changes the work. It tells engineering what to fix and tells editorial why rewriting the page is premature.
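The response-variance part of that audit can be sketched in a few lines. The version below sends the same request under different User-Agent values and reports any agent whose status code differs from the browser baseline. The short agent tokens are a simplification: real crawlers send longer User-Agent strings, and many CDNs also verify crawlers by IP range, so a spoofed request shows how the site treats the string, not necessarily how it treats the real crawler. The function names are hypothetical.

```python
import urllib.error
import urllib.request

BROWSER = "Mozilla/5.0"  # simplified browser baseline
CRAWLER_TOKENS = ("Googlebot", "OAI-SearchBot", "GPTBot", "ChatGPT-User",
                  "ClaudeBot", "Claude-SearchBot", "PerplexityBot")


def fetch_status(url: str, user_agent: str) -> int:
    """Fetch `url` with the given User-Agent and return the HTTP status."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses arrive as exceptions


def audit(url: str, fetch=fetch_status) -> dict:
    """Return {user_agent: status} for the browser and each crawler token."""
    return {agent: fetch(url, agent) for agent in (BROWSER, *CRAWLER_TOKENS)}


def findings(statuses: dict) -> list:
    """List every agent whose status differs from the browser baseline."""
    expected = statuses[BROWSER]
    return [f"{agent} receives {code} but the browser baseline receives {expected}"
            for agent, code in statuses.items() if code != expected]


# Usage against a live site (placeholder URL):
#   print(findings(audit("https://example.com/guides/crawlers")))
```

The fetch parameter is injectable so the comparison logic can be tested without network access, and so the same audit can later be pointed at a different HTTP client.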

What to do Monday morning

1. Create a crawler policy table with one row per bot and one column per use case.
2. Decide separately on search inclusion, user-triggered retrieval, and model-training use.
3. Test robots.txt with OAI-SearchBot, GPTBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, PerplexityBot, and Googlebot user agents.
4. Check CDN and firewall logs for blocked crawler requests.
5. Document every intentional block and the business reason for it.
6. Recheck the policy after major CMS, CDN, or security-plugin changes.
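The log check in step 4 can start as a small script. The sketch below assumes access logs in the common "combined" format; CDN and firewall logs often use a different layout, so the regular expression would need adjusting. Treating 401, 403, and 429 as "blocked" is also an assumption, and the sample lines are invented for illustration, with simplified User-Agent values rather than the strings real crawlers send.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer, and user-agent fields of a
# combined-format access log line (assumed format).
LOG_LINE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)
CRAWLER_TOKENS = ("Googlebot", "OAI-SearchBot", "GPTBot", "ChatGPT-User",
                  "ClaudeBot", "Claude-SearchBot", "PerplexityBot")
BLOCKED_STATUSES = {401, 403, 429}  # assumption: what counts as "blocked"


def blocked_crawler_hits(lines) -> Counter:
    """Count blocked responses per (crawler token, status code)."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        status = int(match.group("status"))
        if status not in BLOCKED_STATUSES:
            continue
        user_agent = match.group("ua").lower()
        for token in CRAWLER_TOKENS:
            if token.lower() in user_agent:
                hits[(token, status)] += 1
    return hits


# Invented sample lines; the User-Agent values are simplified placeholders.
SAMPLE = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET /guides/a HTTP/1.1" 403 512 "-" "example-agent OAI-SearchBot"',
    '203.0.113.6 - - [01/Jan/2025:00:00:02 +0000] "GET /guides/a HTTP/1.1" 200 9000 "-" "example-agent Googlebot"',
    '203.0.113.5 - - [01/Jan/2025:00:00:03 +0000] "GET /guides/b HTTP/1.1" 403 512 "-" "example-agent OAI-SearchBot"',
    '203.0.113.7 - - [01/Jan/2025:00:00:04 +0000] "GET /guides/a HTTP/1.1" 200 9000 "-" "example-agent GPTBot"',
]

hits = blocked_crawler_hits(SAMPLE)
for (token, status), count in hits.items():
    print(f"{token}: {count} requests answered with {status}")
```

A non-empty result for a crawler the policy table marks as allowed is the signal to look at CDN or firewall rules before touching content.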

Crawler access is now part of the product surface. Treat it that way.

Sources