An AEO access audit checks whether the pages you want cited can actually be fetched, indexed, shown with a snippet, discovered through internal links, and retrieved by the systems you care about. It should happen before content rewrites, schema projects, or citation tracking.
TL;DR
Start with access. A page cannot become a dependable answer-engine source if robots.txt, noindex, canonical mistakes, CDN bot rules, firewall rules, JavaScript rendering, or weak internal links keep systems from finding and understanding it.
The audit is simple in principle: pick priority URLs, test them as ordinary users and relevant crawlers, confirm indexability and snippet eligibility, check the source graph, then document which AI-related bots are intentionally allowed or blocked.
Step 1: Pick source pages, not every URL
Audit the pages that are supposed to act as sources. Do not start with the whole site.
For an AEO site, priority pages usually include:
- definition pages;
- methodology pages;
- guides;
- research notes;
- local tools;
- case studies;
- comparison pages;
- glossary pages;
- major category or destination hubs.
For a local directory, priority pages include city hubs, area guides, category hubs, strong listing pages, transport guides, itinerary pages, and comparison pages.
Create a small table:
| URL | Page job | Target prompt | Desired source surface |
|---|---|---|---|
| /guides/how-to-ship-pages-that-get-cited/ | Implementation guide | how to make pages get cited | ChatGPT, Google AI features, Perplexity |
| /bangkok/where-to-stay/ | Destination decision page | best area to stay in Bangkok | Google, Perplexity, ChatGPT |
This keeps the audit tied to actual AEO outcomes.
Step 2: Check ordinary crawl and index eligibility
Confirm the page clears the basic Search floor. Google's AI features guidance says pages must be indexed and eligible to be shown in Google Search with a snippet to appear as supporting links in AI Overviews or AI Mode.
For each URL, check:
- HTTP status is 200;
- canonical points to itself or the intended final URL;
- meta robots does not include noindex;
- robots.txt does not block the path;
- the page is in the appropriate sitemap;
- Search Console URL Inspection does not show a crawl/indexing blocker;
- the page can show a snippet (no nosnippet or max-snippet:0 directive).
If a URL fails here, do not move to "AI optimization." Fix the access problem first.
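Part of the checklist above can be scripted. The following is a minimal sketch, not a complete indexability test: it reads only the meta robots tag and the canonical link from HTML you have already fetched, and it does not look at the X-Robots-Tag header, robots.txt, sitemaps, or Search Console. The names `HeadSignals` and `index_blockers` are hypothetical.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Collect meta robots directives and the canonical URL from a page."""

    def __init__(self):
        super().__init__()
        self.robots = []       # directives from <meta name="robots">
        self.canonical = None  # href from <link rel="canonical">

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            content = a.get("content") or ""
            self.robots += [d.strip().lower() for d in content.split(",") if d.strip()]
        elif tag == "link" and "canonical" in (a.get("rel") or "").lower().split():
            self.canonical = a.get("href")


def index_blockers(url, status, html):
    """Return the reasons a page fails the basic checks, or an empty list."""
    signals = HeadSignals()
    signals.feed(html)
    signals.close()
    problems = []
    if status != 200:
        problems.append(f"status is {status}, expected 200")
    if "noindex" in signals.robots or "none" in signals.robots:
        problems.append("meta robots blocks indexing")
    if "nosnippet" in signals.robots:
        problems.append("meta robots blocks snippets")
    if signals.canonical is None:
        problems.append("no canonical link found")
    elif signals.canonical != url:
        # Not always an error: confirm this is the intended final URL.
        problems.append(f"canonical points elsewhere: {signals.canonical}")
    return problems
```

An empty list means only that these specific signals look clean. A canonical that points elsewhere is flagged for review rather than treated as a failure, since it may be the intended final URL.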
Step 3: Test robots.txt by crawler purpose
AI-related crawlers now map to different purposes. Treat each one separately.
| System | User agent | Purpose to evaluate |
|---|---|---|
| Google Search | Googlebot | Search and Google AI feature eligibility |
| OpenAI search | OAI-SearchBot | ChatGPT search inclusion |
| OpenAI training | GPTBot | Potential model-improvement crawling |
| OpenAI user action | ChatGPT-User | User-triggered page access |
| Anthropic training | ClaudeBot | Model-development crawling |
| Anthropic retrieval/search | Claude-User, Claude-SearchBot | User requests and search relevance |
| Perplexity | PerplexityBot | Perplexity indexing/source visibility |
The goal is not to allow everything. The goal is to avoid accidental policy. A publisher may choose to block training crawlers while allowing search crawlers. That decision should be explicit.
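One way to make the policy visible is to evaluate robots.txt once per user agent. The sketch below uses Python's standard `urllib.robotparser` on an illustrative robots.txt; the file contents, the path, and the helper name `robots_matrix` are assumptions for the example. The standard-library parser applies simpler matching rules than the major crawlers do, so treat edge cases (wildcards, overlapping Allow and Disallow rules) as something to confirm by hand.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: block a training crawler, allow a search crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /private/
"""

CRAWLERS = [
    "Googlebot", "OAI-SearchBot", "GPTBot", "ChatGPT-User",
    "ClaudeBot", "Claude-User", "Claude-SearchBot", "PerplexityBot",
]


def robots_matrix(robots_txt, path, crawlers=CRAWLERS):
    """Map each crawler token to True (allowed) or False (blocked) for a path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {ua: parser.can_fetch(ua, path) for ua in crawlers}


if __name__ == "__main__":
    for ua, allowed in robots_matrix(ROBOTS_TXT, "/guides/example/").items():
        print(f"{ua:18} {'allow' if allowed else 'BLOCK'}")
```

Reading the result as a matrix makes accidental policy easy to spot: any crawler that falls through to the `*` group inherits rules nobody chose for it.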
Step 4: Check CDN, firewall, and bot-management behavior
Robots.txt is not the only gate. Hosting, security plugins, CDN rules, bot-management settings, and WAF rules can block crawlers before robots.txt ever matters.
For each priority URL, test:
    curl -I https://example.com/page/
    curl -A "Googlebot" -I https://example.com/page/
    curl -A "OAI-SearchBot" -I https://example.com/page/
    curl -A "GPTBot" -I https://example.com/page/
    curl -A "Claude-SearchBot" -I https://example.com/page/
    curl -A "PerplexityBot" -I https://example.com/page/
Then check server logs or CDN logs for:
- 403s;
- challenges;
- rate-limit responses;
- redirect loops;
- blocked user agents;
- blocked IP ranges;
- different HTML by user agent.
If the page returns 200 to a browser and 403 to a crawler user agent, the AEO problem is technical before it is editorial.
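The same comparison can be run across many URLs with a short script. This is a sketch under assumptions: the browser string is a placeholder, the helper names are hypothetical, and a spoofed user agent only exercises rules keyed on the User-Agent header. Rules keyed on IP ranges or bot verification will not show up here, which is why the log check still matters.

```python
import urllib.error
import urllib.request

# Tokens used as User-Agent values; "browser" is a placeholder string.
USER_AGENTS = {
    "browser": "Mozilla/5.0",
    "Googlebot": "Googlebot",
    "OAI-SearchBot": "OAI-SearchBot",
    "GPTBot": "GPTBot",
    "Claude-SearchBot": "Claude-SearchBot",
    "PerplexityBot": "PerplexityBot",
}


def head_status(url, user_agent, timeout=10):
    """Send a HEAD request with the given User-Agent and return the status code."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 403, 429, and similar responses arrive here


def ua_specific_blocks(statuses, baseline="browser"):
    """Return the user agents whose status differs from the baseline request."""
    expected = statuses[baseline]
    return {
        ua: code
        for ua, code in statuses.items()
        if ua != baseline and code != expected
    }


# Example usage (performs network requests, so it is left commented out):
# statuses = {name: head_status("https://example.com/page/", ua)
#             for name, ua in USER_AGENTS.items()}
# print(ua_specific_blocks(statuses))
```

Note that `urlopen` follows redirects, so the status returned is for the final URL; redirect loops surface as errors rather than status codes.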
Step 5: Confirm important content is in text
Important content should be present in crawlable text. Google specifically recommends making important content available in textual form for AI features.
Audit the rendered page and the raw HTML. The important answer should not exist only in:
- images;
- carousels;
- maps;
- JavaScript-only components;
- hidden tabs;
- third-party widgets;
- PDFs without a text page;
- decorative cards with no supporting copy.
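A quick way to test this is to search the raw HTML, before any JavaScript runs, for the sentence that carries the answer. The sketch below assumes you have already fetched the raw HTML as a string; `VisibleText` and `answer_in_raw_html` are hypothetical names, and the check ignores CSS-hidden elements, so a match is necessary evidence rather than proof.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes from raw HTML, skipping script, style, and template."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def answer_in_raw_html(html, answer):
    """Return True if the answer text appears in the page's non-script text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Normalize whitespace and case on both sides before comparing.
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return " ".join(answer.split()).lower() in text
```

If the answer appears in the rendered page but this function returns False on the raw HTML, the content probably depends on client-side rendering, an image, or a third-party widget.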
For AEO, each important section should work as a passage:
    H2: Which pages are eligible for Google AI Overviews?
    First sentence: Pages must be indexed and eligible to appear in Google Search with a snippet before they can appear as supporting links in AI Overviews or AI Mode.
    Evidence: Link to Google's AI features documentation.
    Internal link: Link to the site's AI crawler/access guide.
That is passage-ready content.
Step 6: Check internal source paths
Answer engines need source clarity, and users need navigation. Internal links should explain why the page exists.
For each priority URL, confirm it has:
- at least one link from a relevant hub;
- at least one link to a deeper methodology or guide;
- at least one link to a related glossary or concept page;
- at least one link to a tool or template if relevant;
- descriptive anchor text.
Do not rely on footer links alone. A source page should be reachable through the editorial architecture.
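The hub-link check can be sketched in a few lines. Given the HTML of a hub page, the code below lists every anchor that points at a target URL and marks whether its text looks descriptive. The list of generic anchor phrases is an assumption to adjust, the helper names are hypothetical, and the comparison looks at URL paths only, so it assumes hub and target share a host.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: phrases treated as non-descriptive anchor text.
GENERIC_ANCHORS = {"click here", "here", "read more", "learn more", "more"}


class AnchorCollector(HTMLParser):
    """Collect (absolute_url, anchor_text) pairs from a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def links_to(hub_url, hub_html, target_url):
    """Return (anchor_text, is_descriptive) for each hub link to the target."""
    collector = AnchorCollector(hub_url)
    collector.feed(hub_html)
    collector.close()
    target = urlparse(target_url).path.rstrip("/")
    return [
        (text, bool(text) and text.lower() not in GENERIC_ANCHORS)
        for url, text in collector.links
        if urlparse(url).path.rstrip("/") == target
    ]
```

An empty result for a priority URL means the hub does not link to it at all, which is the case the footer-only warning above is about.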
Step 7: Check structured data against visible content
Structured data should match visible content. Google's AI features documentation explicitly calls out structured data matching visible text.
For each page, check:
- Article schema uses the visible headline, author, and dates;
- FAQ schema reflects visible questions and answers;
- LocalBusiness schema reflects visible listing facts;
- Product or service schema does not invent unsupported claims;
- sameAs links point to real entity profiles;
- dates update when the page materially changes.
Bad schema creates false confidence. Good schema clarifies a page that already deserves to be a source.
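As one example of such a check, the sketch below compares the `headline` in a page's JSON-LD with its visible `<h1>`. It is deliberately narrow: it handles only top-level JSON-LD objects (not arrays or `@graph` wrappers), compares only the headline, and will raise if the JSON-LD is not valid JSON. The class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class SchemaAndHeadline(HTMLParser):
    """Collect JSON-LD blocks and the first <h1> text from a page."""

    def __init__(self):
        super().__init__()
        self._capture = None  # "jsonld" or "h1" while inside those elements
        self._buffer = []
        self.jsonld = []      # parsed JSON-LD values
        self.h1 = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and a.get("type") == "application/ld+json":
            self._capture, self._buffer = "jsonld", []
        elif tag == "h1" and self.h1 is None:
            self._capture, self._buffer = "h1", []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._capture == "jsonld" and tag == "script":
            self.jsonld.append(json.loads("".join(self._buffer)))
            self._capture = None
        elif self._capture == "h1" and tag == "h1":
            self.h1 = " ".join("".join(self._buffer).split())
            self._capture = None


def headline_mismatches(html):
    """Return schema headlines that differ from the visible <h1>."""
    parser = SchemaAndHeadline()
    parser.feed(html)
    parser.close()
    return [
        item["headline"]
        for item in parser.jsonld
        if isinstance(item, dict)
        and "headline" in item
        and item["headline"].strip() != parser.h1
    ]
```

The same pattern extends to author, dates, and FAQ questions, each compared against the text a visitor can actually see.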
Step 8: Log the policy decision
Every crawler policy should have a reason. Keep a simple record:
| Bot | Rule | Reason | Owner | Review date |
|---|---|---|---|---|
| OAI-SearchBot | Allow | ChatGPT search inclusion | SEO | 2026-06-14 |
| GPTBot | Disallow /private/ | Exclude private research from training use | Editorial | 2026-06-14 |
| Googlebot | Allow | Search and AI feature eligibility | Technical SEO | 2026-06-14 |
The review date matters because vendors change documentation, user agents, and product surfaces.
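If the log lives in code or a data file rather than a document, overdue reviews can be flagged automatically. This is a minimal sketch using the three rows from the table above; the `CrawlerPolicy` structure and the `needs_attention` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CrawlerPolicy:
    bot: str
    rule: str
    reason: str
    owner: str
    review_date: date


# Rows copied from the policy table above.
POLICY_LOG = [
    CrawlerPolicy("OAI-SearchBot", "Allow", "ChatGPT search inclusion",
                  "SEO", date(2026, 6, 14)),
    CrawlerPolicy("GPTBot", "Disallow /private/",
                  "Exclude private research from training use",
                  "Editorial", date(2026, 6, 14)),
    CrawlerPolicy("Googlebot", "Allow", "Search and AI feature eligibility",
                  "Technical SEO", date(2026, 6, 14)),
]


def needs_attention(policies, today):
    """Return (bot, problem) pairs for rules with no reason or a passed review date."""
    flagged = []
    for policy in policies:
        if not policy.reason.strip():
            flagged.append((policy.bot, "no reason recorded"))
        if policy.review_date < today:
            flagged.append((policy.bot, "review overdue"))
    return flagged
```

Calling `needs_attention(POLICY_LOG, date.today())` as part of a scheduled job is one way to keep the review date from being forgotten.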
Step 9: Check source-page depth before blaming access
Once access passes, check whether the page is actually worth retrieving. A technically accessible page can still be weak as a source.
Use this checklist:
- Does the page state the direct answer in the first 100 words?
- Does each major H2 answer one question or claim?
- Is there at least one source, example, table, or concrete detail near each important claim?
- Does the page explain limitations?
- Does it link to a related guide, tool, glossary page, and methodology page?
- Does the title match the prompt family?
- Does the page include the entity names an answer system would need?
For example, a crawler-access guide should not only say "check robots.txt." It should name Googlebot, OAI-SearchBot, GPTBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, and PerplexityBot. It should explain why each one is different. It should show a policy table. It should include a testing sequence.
That level of specificity is what separates a source page from a generic blog post.
Step 10: Create an access audit record
The audit should leave behind a record, not just a Slack message. Use a table like this:
| Field | Example |
|---|---|
| URL | https://example.com/tools/ai-citation-tracker/ |
| Page job | Tool page |
| Target prompt | "free AI citation tracker" |
| Browser status | 200 |
| Googlebot status | 200 |
| OAI-SearchBot status | 200 |
| GPTBot status | 403 intentional |
| Canonical | self |
| Meta robots | index, follow |
| Sitemap | yes |
| Internal link source | /tools/ |
| Issue | no methodology link |
| Next action | add internal link and rerun prompt panel |
This record is useful because it separates technical access from content quality. If the prompt panel fails later, you can rule out the obvious access blockers and focus on source quality.
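The separation between access and content can be made explicit if the record is stored as structured data. The sketch below mirrors the fields in the table above; the field names, the `intentional_blocks` set, and the `access_blockers` helper are assumptions about how such a record might be modeled, not a fixed schema.

```python
from dataclasses import dataclass


@dataclass
class AccessAuditRecord:
    url: str
    page_job: str
    target_prompt: str
    statuses: dict           # user agent -> HTTP status
    intentional_blocks: set  # user agents blocked on purpose
    canonical_is_self: bool
    meta_robots: str
    in_sitemap: bool
    internal_link_source: str
    issue: str = ""          # content or linking issue, not an access blocker
    next_action: str = ""


def access_blockers(record):
    """Return technical access problems, ignoring intentional blocks."""
    blockers = [
        f"{ua} returned {code}"
        for ua, code in record.statuses.items()
        if code != 200 and ua not in record.intentional_blocks
    ]
    if not record.canonical_is_self:
        blockers.append("canonical is not self")
    if "noindex" in record.meta_robots.lower():
        blockers.append("meta robots has noindex")
    if not record.in_sitemap:
        blockers.append("missing from sitemap")
    if not record.internal_link_source:
        blockers.append("no internal link source")
    return blockers


# The example row from the table above.
EXAMPLE = AccessAuditRecord(
    url="https://example.com/tools/ai-citation-tracker/",
    page_job="Tool page",
    target_prompt="free AI citation tracker",
    statuses={"browser": 200, "Googlebot": 200, "OAI-SearchBot": 200, "GPTBot": 403},
    intentional_blocks={"GPTBot"},
    canonical_is_self=True,
    meta_robots="index, follow",
    in_sitemap=True,
    internal_link_source="/tools/",
    issue="no methodology link",
    next_action="add internal link and rerun prompt panel",
)
```

For the example record, `access_blockers(EXAMPLE)` returns an empty list because the GPTBot 403 is marked intentional, which leaves the missing methodology link as the only open item.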
Step 11: Rerun after every infrastructure change
Access audits expire. They should be rerun after:
- CDN migration;
- security plugin changes;
- robots.txt edits;
- sitemap plugin changes;
- WordPress theme or template changes;
- JavaScript rendering changes;
- migration from blog URLs to canonical guides;
- introduction of a new tool or page type;
- major crawler documentation updates.
For a small site, a monthly audit of priority source pages is enough. For a site undergoing a rebuild, run it before and after every launch batch.
What usually fails first?
The most common failure is not a missing AI tag. It is a boring access conflict.
Common failures:
- CDN blocks unknown or AI-looking user agents;
- robots.txt blocks a directory that now contains canonical guides;
- canonical points to an old URL;
- sitemap contains noncanonical URLs;
- noindex remains after staging;
- important content is rendered only after client-side JavaScript;
- tool pages are live but not linked from the hub;
- security plugin challenges bots;
- blocked training crawler is confused with blocked search crawler.
Each failure produces the same symptom: the team thinks the content is weak, but the source path is broken.
What to do Monday morning
1. Pick 20 priority source URLs.
2. Build a crawler-purpose table for Googlebot, OAI-SearchBot, GPTBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, and PerplexityBot.
3. Test each priority URL with browser and crawler user agents.
4. Check canonical, indexability, snippets, sitemap inclusion, and internal links.
5. Move important facts into visible text.
6. Log the policy decision and review date.
Access is not the whole AEO game, but it is the first gate. Do not optimize a page that the right systems cannot fetch.