Guide

The AEO Access Audit: How to Check Crawlers, CDNs, Robots.txt, and Source Eligibility

A step-by-step guide to auditing whether important pages can be crawled, indexed, retrieved, and used as answer-engine sources across Google, ChatGPT, Claude, Perplexity, and ordinary search.

By: Charles Morris
Published: May 14, 2026
Version: v1.0
Reading time: ~7 min

An AEO access audit checks whether the pages you want cited can actually be fetched, indexed, shown with a snippet, discovered through internal links, and retrieved by the systems you care about. It should happen before content rewrites, schema projects, or citation tracking.

TL;DR

Start with access. A page cannot become a dependable answer-engine source if robots.txt, noindex, canonical mistakes, CDN bot rules, firewall rules, JavaScript rendering, or weak internal links keep systems from finding and understanding it.

The audit is simple in principle: pick priority URLs, test them as ordinary users and relevant crawlers, confirm indexability and snippet eligibility, check the source graph, then document which AI-related bots are intentionally allowed or blocked.

Step 1: Pick source pages, not every URL

Audit the pages that are supposed to act as sources. Do not start with the whole site.

For an AEO site, priority pages usually include:

definition pages;
methodology pages;
guides;
research notes;
local tools;
case studies;
comparison pages;
glossary pages;
major category or destination hubs.

For a local directory, priority pages include city hubs, area guides, category hubs, strong listing pages, transport guides, itinerary pages, and comparison pages.

Create a small table:

URL	Page job	Target prompt	Desired source surface
`/guides/how-to-ship-pages-that-get-cited/`	Implementation guide	how to make pages get cited	ChatGPT, Google AI features, Perplexity
`/bangkok/where-to-stay/`	Destination decision page	best area to stay in Bangkok	Google, Perplexity, ChatGPT

This keeps the audit tied to actual AEO outcomes.

Step 2: Check ordinary crawl and index eligibility

Confirm the page clears the basic Search floor. Google's AI features guidance says pages must be indexed and eligible to be shown in Google Search with a snippet to appear as supporting links in AI Overviews or AI Mode.

For each URL, check:

HTTP status is 200;
canonical points to itself or the intended final URL;
neither meta robots nor the X-Robots-Tag response header includes noindex;
robots.txt does not block the path;
the page is in the appropriate sitemap;
Search Console URL Inspection does not show a crawl/indexing blocker;
the page can show a snippet (no nosnippet, no max-snippet:0, and no data-nosnippet on the key passage).

If a URL fails here, do not move to "AI optimization." Fix the access problem first.
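The checklist above can be partly automated once you have the status code, the HTML, and any X-Robots-Tag header value for a URL. The following is a minimal sketch, not a complete indexability checker: the names `HeadSignals` and `index_blockers` are hypothetical, it assumes the canonical is an absolute URL, it ignores user-agent-scoped X-Robots-Tag values, and it flags every non-self canonical so a person can decide whether that target is the intended final URL. It does not replace Search Console URL Inspection.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Collect meta robots directives and the canonical href from a page."""

    def __init__(self):
        super().__init__()
        self.robots = []       # directives from <meta name="robots">
        self.canonical = None  # href from <link rel="canonical">

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.robots += [d.strip().lower() for d in a.get("content", "").split(",")]
        elif tag == "link" and "canonical" in a.get("rel", "").lower().split():
            self.canonical = a.get("href")


def index_blockers(url, status, html, x_robots_tag=""):
    """Return a list of reasons the URL fails the basic Search floor."""
    parser = HeadSignals()
    parser.feed(html)
    # Merge directives from the HTML and from the response header.
    directives = parser.robots + [d.strip().lower() for d in x_robots_tag.split(",")]
    problems = []
    if status != 200:
        problems.append(f"status {status}")
    if "noindex" in directives or "none" in directives:
        problems.append("noindex")
    if "nosnippet" in directives or "max-snippet:0" in directives:
        problems.append("snippet blocked")
    # Trailing-slash differences are treated as the same URL in this sketch.
    if parser.canonical and parser.canonical.rstrip("/") != url.rstrip("/"):
        problems.append(f"canonical points to {parser.canonical}")
    return problems
```

An empty list means the URL cleared these particular checks; robots.txt and sitemap inclusion still need to be verified separately.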

Step 3: Test robots.txt by crawler purpose

AI-related crawlers now map to different purposes. Treat each one separately.

System	User agent	Purpose to evaluate
Google Search	Googlebot	Search and Google AI feature eligibility
OpenAI search	OAI-SearchBot	ChatGPT search inclusion
OpenAI training	GPTBot	Potential model-improvement crawling
OpenAI user action	ChatGPT-User	User-triggered page access
Anthropic training	ClaudeBot	Model-development crawling
Anthropic retrieval/search	Claude-User, Claude-SearchBot	User requests and search relevance
Perplexity	PerplexityBot	Perplexity indexing/source visibility

The goal is not to allow everything. The goal is to avoid accidental policy. A publisher may choose to block training crawlers while allowing search crawlers. That decision should be explicit.
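One way to make the policy visible is to evaluate a robots.txt file against every user agent in the table and print the result as a matrix. The sketch below uses Python's standard `urllib.robotparser`; the function name `robots_matrix` and the sample rules are illustrative assumptions, not a recommended policy. Note that Python's parser applies the first matching rule in file order, while Google documents a most-specific-rule match, so treat the output as a first pass on files that mix Allow and Disallow for the same path.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the crawler-purpose table above.
AUDITED_AGENTS = [
    "Googlebot", "OAI-SearchBot", "GPTBot", "ChatGPT-User",
    "ClaudeBot", "Claude-User", "Claude-SearchBot", "PerplexityBot",
]


def robots_matrix(robots_txt, path, agents=AUDITED_AGENTS):
    """Return {agent: allowed} for one path under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


# Hypothetical policy: block training crawlers, allow search crawlers,
# keep /private/ closed to everyone.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /private/
"""

# Usage:
# for agent, allowed in robots_matrix(SAMPLE_ROBOTS, "/guides/").items():
#     print(agent, "allow" if allowed else "block")
```

Running the matrix against each priority path makes it easy to spot a search crawler that was blocked by a rule meant for a training crawler.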

Step 4: Check CDN, firewall, and bot-management behavior

Robots.txt is not the only gate. Hosting, security plugins, CDN rules, bot-management settings, and WAF rules can block crawlers before robots.txt ever matters.

For each priority URL, test:

curl -I https://example.com/page/
curl -A "Googlebot" -I https://example.com/page/
curl -A "OAI-SearchBot" -I https://example.com/page/
curl -A "GPTBot" -I https://example.com/page/
curl -A "Claude-SearchBot" -I https://example.com/page/
curl -A "PerplexityBot" -I https://example.com/page/

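A spoofed user agent only exercises user-agent rules; a CDN that verifies bots by IP range or reverse DNS may treat the real crawler differently. Then check server logs or CDN logs for: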

403s;
challenges;
rate-limit responses;
redirect loops;
blocked user agents;
blocked IP ranges;
different HTML by user agent.

If the page returns 200 to a browser and 403 to a crawler user agent, the AEO problem is technical before it is editorial.
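The same comparison can be scripted so it runs across every priority URL. This is a rough sketch using only the standard library: `head_status` mirrors `curl -I` by sending a HEAD request, and `access_conflicts` flags any user agent whose status differs from a 200 browser baseline. Both function names are hypothetical, and the short user-agent tokens are the same simplification the curl commands use, not the crawlers' full user-agent strings.

```python
import urllib.error
import urllib.request


def head_status(url, user_agent, timeout=10):
    """Send a HEAD request (like `curl -I`) and return the HTTP status code."""
    request = urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="HEAD"
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 403, 429, 503 and similar are raised as exceptions


def access_conflicts(statuses, baseline="browser"):
    """Return {agent: status} for agents that fail while the baseline gets 200."""
    if statuses.get(baseline) != 200:
        # The page is broken for everyone; that is not a bot-rule conflict.
        return {baseline: statuses.get(baseline)}
    return {a: s for a, s in statuses.items() if a != baseline and s != 200}


# Usage (performs network requests, so it is left commented out):
# agents = {"browser": "Mozilla/5.0", "Googlebot": "Googlebot", "GPTBot": "GPTBot"}
# statuses = {name: head_status("https://example.com/page/", ua)
#             for name, ua in agents.items()}
# print(access_conflicts(statuses))
```

A non-empty result is a prompt to check CDN and WAF rules; whether a given 403 is intentional is a policy question the script cannot answer.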

Step 5: Confirm important content is in text

Important content should be present in crawlable text. Google specifically recommends making important content available in textual form for AI features.

Audit the rendered page and the raw HTML. The important answer should not exist only in:

images;
carousels;
maps;
JavaScript-only components;
hidden tabs;
third-party widgets;
PDFs without a text page;
decorative cards with no supporting copy.

For AEO, each important section should work as a passage:

H2: Which pages are eligible for Google AI Overviews?
First sentence: Pages must be indexed and eligible to appear in Google Search with a snippet before they can appear as supporting links in AI Overviews or AI Mode.
Evidence: Link to Google's AI features documentation.
Internal link: Link to the site's AI crawler/access guide.

That is passage-ready content.
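A quick way to test whether a passage exists in the raw HTML, rather than only after JavaScript runs, is to strip script and style content and search the remaining text. The sketch below is an assumption-laden approximation: `VisibleText` and `raw_html_contains` are hypothetical names, and it does not detect CSS-hidden elements or text baked into images.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping containers that never render as copy."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many skipped containers we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.chunks.append(data)


def raw_html_contains(html, phrase):
    """True if the phrase appears in the unrendered HTML's text content."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so inline tags and line breaks do not hide a match.
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return " ".join(phrase.split()).lower() in text
```

Feed it the response body from a plain HTTP fetch, not the browser's rendered DOM. If the key sentence is missing there but visible in the browser, the content depends on client-side rendering.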

Step 6: Check internal source paths

Answer engines need source clarity, and users need navigation. Internal links should explain why the page exists.

For each priority URL, confirm it has:

one link from a relevant hub;
one link to a deeper methodology or guide;
one link to a related glossary or concept page;
one link to a tool or template if relevant;
descriptive anchor text.

Do not rely on footer links alone. A source page should be reachable through the editorial architecture.
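The hub-link check can be approximated by parsing a hub page and listing the anchor text of every link that points at the priority URL. This is a sketch under stated assumptions: `LinkCollector`, `links_to`, `has_descriptive_link`, and the `GENERIC_ANCHORS` set are all hypothetical, and the code does not distinguish footer links from editorial links, so that part still needs a manual look.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Anchor texts treated as non-descriptive; extend to taste.
GENERIC_ANCHORS = {"click here", "here", "read more", "learn more", "more", "link"}


class LinkCollector(HTMLParser):
    """Collect (absolute_url, anchor_text) pairs from a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join("".join(self._text).split())))
            self._href = None


def links_to(source_html, source_url, target_url):
    """Return anchor texts of links from the source page to the target URL."""
    collector = LinkCollector(source_url)
    collector.feed(source_html)
    target = urlparse(target_url)
    # Compare host and path only, so fragments and query strings still match.
    return [text for url, text in collector.links
            if (urlparse(url).netloc, urlparse(url).path) == (target.netloc, target.path)]


def has_descriptive_link(source_html, source_url, target_url):
    """True if at least one link to the target uses non-generic anchor text."""
    return any(text and text.lower() not in GENERIC_ANCHORS
               for text in links_to(source_html, source_url, target_url))
```

Run it once per hub-and-target pair from the priority table; an empty list from `links_to` means the hub does not link to the page at all.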

Step 7: Check structured data against visible content

Structured data should match visible content. Google's AI features documentation explicitly calls out structured data matching visible text.

For each page, check:

Article schema uses the visible headline, author, and dates;
FAQ schema reflects visible questions and answers;
LocalBusiness schema reflects visible listing facts;
Product or service schema does not invent unsupported claims;
sameAs links point to real entity profiles;
dates update when the page materially changes.

Bad schema creates false confidence. Good schema clarifies a page that already deserves to be a source.
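As one narrow example of checking schema against visible content, the sketch below compares the `headline` in each JSON-LD block with the page's visible H1. It is illustrative only: `SchemaAndHeadline` and `headline_mismatches` are hypothetical names, it assumes the H1 is the visible headline, it expects an exact match, and it does not unpack `@graph` containers or validate authors and dates.

```python
import json
from html.parser import HTMLParser


class SchemaAndHeadline(HTMLParser):
    """Collect raw JSON-LD blocks and the text of the <h1>."""

    def __init__(self):
        super().__init__()
        self.json_ld = []  # one raw JSON string per ld+json script
        self.h1 = []       # text fragments found inside <h1>
        self._mode = None

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._mode = "ld"
            self.json_ld.append("")
        elif tag == "h1":
            self._mode = "h1"

    def handle_endtag(self, tag):
        if tag in ("script", "h1"):
            self._mode = None

    def handle_data(self, data):
        if self._mode == "ld":
            self.json_ld[-1] += data
        elif self._mode == "h1":
            self.h1.append(data)


def headline_mismatches(html):
    """Return (schema_headline, visible_h1) pairs that do not match."""
    parser = SchemaAndHeadline()
    parser.feed(html)
    visible = " ".join("".join(parser.h1).split())
    mismatches = []
    for raw in parser.json_ld:
        data = json.loads(raw)
        for item in data if isinstance(data, list) else [data]:
            headline = item.get("headline")
            if headline and headline.strip() != visible:
                mismatches.append((headline, visible))
    return mismatches
```

The same pattern extends to other fields, such as comparing `datePublished` with the visible publish date, but each comparison needs its own rule for what counts as a match.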

Step 8: Log the policy decision

Every crawler policy should have a reason. Keep a simple record:

Bot	Rule	Reason	Owner	Review date
OAI-SearchBot	Allow	ChatGPT search inclusion	SEO	2026-06-14
GPTBot	Disallow `/private/`	Exclude private research from training use	Editorial	2026-06-14
Googlebot	Allow	Search and AI feature eligibility	Technical SEO	2026-06-14

The review date matters because vendors change documentation, user agents, and product surfaces.
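If the policy log lives in code or a spreadsheet export, a small check can surface entries whose review date has arrived. This is a minimal sketch; the `CrawlerPolicy` class and `due_for_review` function are hypothetical, and the rows simply restate the example table above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CrawlerPolicy:
    bot: str
    rule: str
    reason: str
    owner: str
    review_date: date


def due_for_review(policies, today):
    """Return the bots whose policy review date is today or earlier."""
    return [p.bot for p in policies if p.review_date <= today]


# Rows copied from the example policy table.
POLICY_LOG = [
    CrawlerPolicy("OAI-SearchBot", "Allow", "ChatGPT search inclusion",
                  "SEO", date(2026, 6, 14)),
    CrawlerPolicy("GPTBot", "Disallow /private/",
                  "Exclude private research from training use",
                  "Editorial", date(2026, 6, 14)),
    CrawlerPolicy("Googlebot", "Allow", "Search and AI feature eligibility",
                  "Technical SEO", date(2026, 6, 14)),
]

# Usage: due_for_review(POLICY_LOG, date.today())
```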

Step 9: Check source-page depth before blaming access

Once access passes, check whether the page is actually worth retrieving. A technically accessible page can still be weak as a source.

Use this checklist:

Does the page state the direct answer in the first 100 words?
Does each major H2 answer one question or claim?
Is there at least one source, example, table, or concrete detail near each important claim?
Does the page explain limitations?
Does it link to a related guide, tool, glossary page, and methodology page?
Does the title match the prompt family?
Does the page include the entity names an answer system would need?

For example, a crawler-access guide should not only say "check robots.txt." It should name Googlebot, OAI-SearchBot, GPTBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, and PerplexityBot. It should explain why each one is different. It should show a policy table. It should include a testing sequence.

That level of specificity is what separates a source page from a generic blog post.

Step 10: Create an access audit record

The audit should leave behind a record, not just a Slack message. Use a table like this:

Field	Example
URL	`https://example.com/tools/ai-citation-tracker/`
Page job	Tool page
Target prompt	"free AI citation tracker"
Browser status	200
Googlebot status	200
OAI-SearchBot status	200
GPTBot status	403 intentional
Canonical	self
Meta robots	index, follow
Sitemap	yes
Internal link source	`/tools/`
Issue	no methodology link
Next action	add internal link and rerun prompt panel

This record is useful because it separates technical access from content quality. If the prompt panel fails later, you can rule out the obvious access blockers and focus on source quality.
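To keep these records comparable between audit runs, it may help to give them a fixed shape and export them to CSV. The sketch below mirrors the fields in the example table; the `AccessAuditRecord` class, its field names, and the `to_csv` helper are assumptions made for illustration, and a spreadsheet works just as well.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class AccessAuditRecord:
    url: str
    page_job: str
    target_prompt: str
    browser_status: int
    googlebot_status: int
    oai_searchbot_status: int
    gptbot_status: str  # free text so "403 intentional" can be recorded
    canonical: str
    meta_robots: str
    sitemap: bool
    internal_link_source: str
    issue: str
    next_action: str


def to_csv(records):
    """Serialize audit records to CSV text with one header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=[f.name for f in fields(AccessAuditRecord)]
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# The example row from the table above.
EXAMPLE = AccessAuditRecord(
    url="https://example.com/tools/ai-citation-tracker/",
    page_job="Tool page",
    target_prompt="free AI citation tracker",
    browser_status=200,
    googlebot_status=200,
    oai_searchbot_status=200,
    gptbot_status="403 intentional",
    canonical="self",
    meta_robots="index, follow",
    sitemap=True,
    internal_link_source="/tools/",
    issue="no methodology link",
    next_action="add internal link and rerun prompt panel",
)
```

Writing `to_csv([EXAMPLE])` to a dated file after each run gives you a history to compare against when a later audit finds a regression.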

Step 11: Rerun after every infrastructure change

Access audits expire. They should be rerun after:

CDN migration;
security plugin changes;
robots.txt edits;
sitemap plugin changes;
WordPress theme or template changes;
JavaScript rendering changes;
migration from blog URLs to canonical guides;
introduction of a new tool or page type;
major crawler documentation updates.

For a small site, a monthly audit of priority source pages is enough. For a site undergoing a rebuild, run it before and after every launch batch.

What usually fails first?

The most common failure is not a missing AI tag. It is a boring access conflict.

Common failures:

CDN blocks unknown or AI-looking user agents;
robots.txt blocks a directory that now contains canonical guides;
canonical points to an old URL;
sitemap contains noncanonical URLs;
noindex remains after staging;
important content is rendered only after client-side JavaScript;
tool pages are live but not linked from the hub;
security plugin challenges bots;
blocked training crawler is confused with blocked search crawler.

Each failure produces the same symptom: the team thinks the content is weak, but the source path is broken.

What to do Monday morning

1. Pick 20 priority source URLs.
2. Build a crawler-purpose table for Googlebot, OAI-SearchBot, GPTBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, and PerplexityBot.
3. Test each priority URL with browser and crawler user agents.
4. Check canonical, indexability, snippets, sitemap inclusion, and internal links.
5. Move important facts into visible text.
6. Log the policy decision and review date.

Access is not the whole AEO game, but it is the first gate. Do not optimize a page that the right systems cannot fetch.

Related reading

AEO Content Briefs for Coding Agents: A Complete Specification

Chunking Is Why Clear Sections Get Retrieved More Often

How to Ship Pages That Answer Engines Can Cite