Most failed AEO work happens because teams rewrite the page before finding which gate actually failed. They see a missing citation, assume the content needs work, and start editing — without ever checking whether the bot got blocked, the page got indexed, or the right passage got retrieved.
This guide is the diagnostic model. Pages get cited only after surviving a chain of six gates: access, discovery, retrieval, passage selection, answer composition, and citation display. Each gate fails differently. Each gate has different fixes. If you don’t know which gate is broken, you’re guessing.
Once you have the model, AEO work gets simpler. You stop rewriting hopefully and start fixing specifically.
tl;dr
- Six gates determine whether a page gets cited. Access → Discovery → Retrieval → Passage selection → Answer composition → Citation display. Each can fail independently.
- Citation is not the same as ranking. A page can be indexed, mentioned, summarized, or used as background without ever being shown as the visible source.
- Diagnose before you rewrite. Confirm access, then indexability, then textual clarity, then passage usefulness, then run a prompt panel. Skipping steps wastes work.
- Mentions without citations are a different problem than no visibility at all. They need different fixes.
The mental model: six gates
Answer engines work like a pipeline, not a magic box. A practical AEO audit asks one question at each stage:
| Gate | Question | Failure symptom |
|---|---|---|
| Access | Can the system fetch the page? | No crawl → no retrieval → no citation |
| Discovery | Can the system find the URL? | Page exists but is invisible |
| Retrieval | Does the page match the prompt? | Competitors cited instead |
| Passage selection | Is the best section self-contained? | Page mentioned but weakly summarized |
| Answer composition | Does the page support a useful claim? | Generic answer, no source |
| Citation display | Does the platform expose this source? | Used or remembered, but not cited |
This model is the diagnostic structure for the rest of the guide. Each section below addresses one gate, in order. If robots.txt blocks a search bot, rewriting an H2 is theater. If the page is accessible but every claim is vague, crawler access is not the problem.
Gate 1: Access — can the system fetch the page?
Access is the first failure point because it’s the cheapest to fail and the most invisible. A page that can’t be fetched can’t be cited, no matter how well-written it is.
The major AI companies have moved beyond simple “AI bot or not.” Google, OpenAI, and Anthropic now document multiple bots per company, each doing a different job:
| Platform | Training | Search/index | User-triggered |
|---|---|---|---|
| Google | Google-Extended (some non-Search AI uses) | Googlebot | Varies |
| OpenAI | GPTBot | OAI-SearchBot | ChatGPT-User |
| Anthropic | ClaudeBot | Claude-SearchBot | Claude-User |
The practical rule: don’t talk about “AI bots” as one thing. Build a policy table by platform, user agent, and purpose, and decide each cell deliberately.
A working starter configuration that allows search visibility while blocking training:
```
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: ClaudeBot
Disallow: /
```
This isn’t a universal recommendation. It’s a template for making the decision explicit. Legal, security, licensing, and brand teams may pick different policies. The AEO point is that search visibility and training permission shouldn’t be conflated. (For deeper coverage, see the robots.txt vs llms.txt reference.)
When you check access, also check what’s actually live, not just what’s in your repo. Most sites have managed robots.txt features, CDN rules, WAF settings, and bot-management products that change what crawlers see. Fetch the file as a crawler would (`curl https://yoursite.com/robots.txt`) and compare against your intended policy.
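If you want to script that comparison, here is a minimal Python sketch (standard library only) that fetches the live file and tests each bot against the policy you intended. The site URL and the policy dict are placeholders for your own values, and `RobotFileParser` is only a rough proxy for how any given crawler actually interprets the file.

```python
# Minimal sketch: fetch the live robots.txt (as served by your CDN/WAF)
# and check it against the per-bot policy you intended.
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

SITE = "https://yoursite.com"        # assumption: replace with your domain
INTENDED_POLICY = {                  # assumption: your decided policy table
    "OAI-SearchBot": True,
    "GPTBot": False,
    "Claude-SearchBot": True,
    "Claude-User": True,
    "ClaudeBot": False,
}

# Fetch what is actually live, not what is in the repo.
req = Request(f"{SITE}/robots.txt", headers={"User-Agent": "policy-check/1.0"})
live_robots = urlopen(req, timeout=10).read().decode("utf-8", errors="replace")

parser = RobotFileParser()
parser.parse(live_robots.splitlines())

# Compare live behavior with intent for a representative URL.
for agent, should_allow in INTENDED_POLICY.items():
    actually_allows = parser.can_fetch(agent, f"{SITE}/")
    status = "OK" if actually_allows == should_allow else "MISMATCH"
    print(f"{status}: {agent} allow={actually_allows} intended={should_allow}")
```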
Gate 2: Discovery — can the system find the URL?
Discovery is the gap between “the page is technically accessible” and “the system knows the page exists.” A page allowed by a clean robots.txt is still invisible if no link points to it, no sitemap lists it, and no historic crawl found it.
Three discovery channels matter:
Internal linking. Pages without inbound links from your own site are unlikely to be discovered. New pages need links from your homepage, hub pages, or relevant pillar guides.
XML sitemaps. Sitemaps don’t guarantee crawl, but they’re the easiest signal. Make sure new URLs appear in sitemap.xml and that the sitemap is referenced in robots.txt. Submit important new pages directly via Google Search Console’s URL Inspection.
External signals. Pages that get linked, shared, or referenced externally are easier for systems to find. Brand mentions in third-party publications also help — answer engines often discover URLs from the same articles users do.
Google specifically requires that for a page to be eligible as a supporting link in AI Overviews or AI Mode, it must be indexed and eligible to appear in Search with a snippet. That’s a discoverability gate stated in Google’s own documentation. If a page isn’t indexed, no AI feature will cite it.
For non-Google engines, the bar is fuzzier — but the same logic applies. If your URL isn’t in a system’s index, fetch path, or training data, the page might as well not exist.
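A quick way to spot-check the sitemap half of this gate: the sketch below pulls sitemap.xml and reports whether your target URLs are listed. The URLs are placeholders, and the script does not follow sitemap index files, so adjust it if your sitemap is split.

```python
# Minimal sketch: confirm that the pages you care about are actually listed
# in sitemap.xml. URLs below are placeholders, not real pages.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

SITEMAP_URL = "https://yoursite.com/sitemap.xml"      # assumption
TARGET_URLS = {                                       # assumption: pages you want cited
    "https://yoursite.com/docs/api-rate-limits",
    "https://yoursite.com/security/soc-2",
}

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
tree = ET.parse(urlopen(SITEMAP_URL, timeout=10))
listed = {loc.text.strip() for loc in tree.iter(NS + "loc")}

for url in TARGET_URLS:
    print(("listed" if url in listed else "MISSING"), url)
```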
Gate 3: Retrieval — does the page match the prompt?
Retrieval is where most AEO work lives, and where most teams overinvest in headlines while underinvesting in passage match. The system has a query (the user’s prompt or a derived sub-query), candidate documents, and some scoring that picks the best passages from the best documents.
Google documents query fan-out for AI Mode and AI Overviews, where the system runs multiple related searches across subtopics and data sources to support a single response. OpenAI and Anthropic don’t publish their selection formulas. What we can infer is:
A page should match a query at the passage level, not just the title level. If the prompt is “how do I compare SOC 2 Type I and Type II for a vendor review?”, you need a section that answers that specific question. A generic “Security” page may be accessible and authoritative but still lose retrieval to a clearer explainer.
A retrievable passage usually has:
- A question-shaped heading
- A direct first sentence that answers the heading
- The entity named in the passage, not only in the title
- A concrete example or table near the claim
- A date or scope limit when the answer can change
- A link to the source-of-truth page or documentation
That’s not a ranking formula. It’s content hygiene for retrieval systems that evaluate chunks, snippets, sections, or passages — not whole pages. (For more on this, see the chunking and passage retrieval reference.)
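If you want to turn that hygiene list into a repeatable check, here is a rough Python sketch. The heuristics, thresholds, and entity name are placeholders chosen for illustration; no engine publishes a formula like this.

```python
# Minimal sketch of a passage-hygiene check, assuming sections are given as
# (heading, body) pairs. Heuristics only, not a retrieval formula.
import re

ENTITY = "Acme Analytics"   # assumption: the entity the page is about

def check_passage(heading: str, body: str) -> list[str]:
    issues = []
    if not heading.rstrip().endswith("?") and not re.match(
        r"(?i)^(how|what|when|why|which|who|can|does|is|are)\b", heading
    ):
        issues.append("heading is not question-shaped")
    first_sentence = body.strip().split(". ")[0]
    if len(first_sentence.split()) > 40:
        issues.append("first sentence does not read like a direct answer")
    if ENTITY.lower() not in body.lower():
        issues.append("entity not named inside the passage")
    if not re.search(r"\b(19|20)\d{2}\b", body):
        issues.append("no date or scope limit in the passage")
    return issues

# Example: lint one section.
print(check_passage(
    "How do SOC 2 Type I and Type II differ?",
    "Acme Analytics holds a SOC 2 Type II report covering January 1 to "
    "December 31, 2025. Type I describes controls at a point in time; "
    "Type II covers a period.",
))
```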
Gate 4: Passage selection — is the best section self-contained?
Passage selection is the gate retrieval feeds into. The system has decided your page is relevant; now it has to pick which chunk of your page to summarize or quote. If the right chunk depends on three other sections of the page to make sense, the model loses context.
A self-contained passage:
- Names the entity in the passage (not just “this works” but “Linear’s Cycles work…”)
- Includes the qualifier in the same chunk as the claim (“…for teams under 50 people”)
- Provides the example in the same section, not in a separate appendix
- Carries enough context that someone reading just this section understands what’s being said
This is where pages with otherwise good content fail. The intro paragraph defines the entity. Section 4 makes the claim. Section 7 has the qualifier. A retrieval system pulling section 4 alone gets the claim without the entity or the qualifier — and either drops the citation or describes your page incorrectly.
The fix is local clarity. Repeat the entity name when it’s load-bearing. Keep caveats next to claims. Put examples in the same H2 section as the heading they illustrate. The cost is some redundancy in human reading; the benefit is dramatically better passage-level retrieval.
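One way to catch the “section 4 loses the entity” problem before publishing: split the page into H2-level chunks and flag any chunk that never names the entity. The sketch below assumes a local Markdown export and a single load-bearing entity name; real retrieval systems chunk differently, so treat it as a rough proxy.

```python
# Minimal sketch: split a Markdown page into H2-level chunks and flag chunks
# that never name the entity.
import re

ENTITY = "Linear"   # assumption: the load-bearing entity name

def h2_chunks(markdown: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs for each '## ' section."""
    parts = re.split(r"^## +(.+)$", markdown, flags=re.MULTILINE)
    # parts = [preamble, heading1, body1, heading2, body2, ...]
    return list(zip(parts[1::2], parts[2::2]))

def flag_orphan_chunks(markdown: str) -> list[str]:
    return [
        heading
        for heading, body in h2_chunks(markdown)
        if ENTITY.lower() not in body.lower()
    ]

# Usage: print the H2 sections that would lose the entity if pulled alone.
page = open("page.md").read()     # assumption: a local Markdown export of the page
for heading in flag_orphan_chunks(page):
    print("entity missing from chunk:", heading)
```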
Gate 5: Answer composition — does the page support a useful claim?
Once a passage is selected, the model has to decide whether to use it. Sometimes it does, but it doesn’t cite — it uses your content as background and synthesizes an answer from multiple sources. Sometimes it cites, but describes your content inaccurately. Sometimes it cites correctly.
Tracking these states separately matters:
| State | Meaning | What to do |
|---|---|---|
| Not visible | Brand/page absent from answer | Check access, indexability, prompt match |
| Mentioned | Brand named but no source shown | Improve source-of-truth pages and off-site corroboration |
| Cited | Page shown as source | Improve passage accuracy and conversion path |
| Mis-cited | Wrong page or stale source cited | Consolidate, canonicalize, update internal links |
A mention without a citation isn’t automatically a failure. The brand was recognized, recommended, or described — that’s commercial value, even without a click. But it’s a different state from “cited,” and the fix is different. Mentions usually need stronger source-of-truth pages and better off-site corroboration; citations usually need clearer passage structure and accurate visible content.
Gate 6: Citation display — does the platform expose this source?
The final gate is a platform decision largely outside your control. Different engines display citations differently, sometimes showing all sources, sometimes showing none, sometimes showing only a few even when many were used.
What we know publicly:
- Google says AI features surface relevant links to help users find information and explore further.
- OpenAI says `OAI-SearchBot` is for appearing in ChatGPT search results.
- Anthropic says `Claude-SearchBot` improves search result quality.
That’s enough to set expectations but not enough to claim a universal citation algorithm. Don’t promise a client that adding a table, schema, or llms.txt file will cause citations. Promise a testable improvement in eligibility, clarity, and evidence — and verify with prompt panels after publishing.
What makes a page citable
A citable page makes one clear claim per section and gives the answer engine a reason to trust that claim. Pages that get cited tend to fall into a few archetypes:
| Page type | Why it gets cited | Example section |
|---|---|---|
| Documentation | It is the source of truth | “API rate limits by plan” |
| Methodology | It explains how data was produced | “How we collect AI visibility prompts” |
| Comparison | It resolves buyer choice | “SOC 2 Type I vs Type II” |
| Research | It contains original data | “Sample: 590M searches analyzed” |
| Policy | It controls a decision | “Crawler access policy” |
| FAQ/support | It answers a narrow task | “How to export raw citation data” |
The page also needs visible text. Google’s documentation specifically recommends making important content available in textual form for AI features. Hidden content, image-only claims, heavy client-side rendering, and vague copy all make extraction harder.
A note on schema: don’t add FAQ or HowTo schema as an AEO tactic. Google restricted FAQ rich results to gov/health sites in 2023 and tightened further in 2026. HowTo rich results are deprecated. The schema worth investing in is accurate Article markup that matches your visible content — that’s the markup that helps AI engines verify entity relationships, not trigger rich results that no longer appear.
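For illustration, here is what “Article markup that matches visible content” can look like, emitted as JSON-LD from a small Python script. Every field value is a placeholder; the only point is that headline, dates, and author should mirror what a reader actually sees on the page.

```python
# Minimal sketch: Article markup that mirrors visible page content, emitted
# as JSON-LD. Values are placeholders, not a recommendation of specific fields.
import json

article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "SOC 2 Type I vs Type II for vendor reviews",  # matches the visible H1
    "datePublished": "2026-01-15",                              # matches the visible date
    "dateModified": "2026-05-01",                               # matches the "Updated:" line
    "author": {"@type": "Organization", "name": "Example Co"},  # matches the byline
}

print('<script type="application/ld+json">')
print(json.dumps(article, indent=2))
print("</script>")
```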
How to verify a page after shipping
Verification should be a small, repeatable prompt panel — not a one-off screenshot. Use stable prompts and record the engine, mode, date, exact answer, cited URLs, and whether your target page appeared.
A starter panel:
1. What is the difference between [problem A] and [problem B]?
2. Best [category] tools for [specific user type]
3. How does [brand] handle [policy or feature]?
4. [brand] vs [competitor] for [use case]
5. How do I implement [task] in [platform]?
Run the panel three times: at publish, after indexing (2-7 days later), and again two weeks after that. Citation behavior takes time to stabilize, and engines change models on their own schedules.
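To keep those three runs comparable, log them in a structured file rather than screenshots. A minimal sketch, assuming you record results by hand or through whatever engine access you have:

```python
# Minimal sketch of a prompt-panel log. Field names mirror the guide: engine,
# mode, date, exact answer, cited URLs, visibility state, target page hit.
import csv
import os
from datetime import date

FIELDS = ["run_date", "engine", "mode", "prompt", "answer_excerpt",
          "cited_urls", "state", "target_page_cited"]

def log_result(path: str, row: dict) -> None:
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)

# Example entry; "state" is one of: not_visible, mentioned, cited, mis_cited.
log_result("panel_log.csv", {
    "run_date": date.today().isoformat(),
    "engine": "example-engine",          # assumption: whichever engine you test
    "mode": "search",
    "prompt": "SOC 2 Type I vs Type II for a vendor review",
    "answer_excerpt": "Type II covers a period of time...",
    "cited_urls": "https://yoursite.com/security/soc-2",
    "state": "cited",
    "target_page_cited": True,
})
```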
Also inspect server logs or CDN logs for relevant crawlers. If no relevant bot ever requests the page, the prompt test alone isn’t enough. If bots request the page but the answer cites a competitor, the issue is likely relevance, evidence, or authority — not access.
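A rough way to do that log check, assuming a combined-format access log with the user agent in each line (adjust the bot list and parsing to your server or CDN export):

```python
# Minimal sketch: count requests per AI-related crawler in an access log.
from collections import Counter

BOTS = ["Googlebot", "Google-Extended", "GPTBot", "OAI-SearchBot",
        "ChatGPT-User", "ClaudeBot", "Claude-SearchBot", "Claude-User"]

hits = Counter()
with open("access.log", encoding="utf-8", errors="replace") as log:  # assumption: log path
    for line in log:
        for bot in BOTS:
            if bot in line:
                hits[bot] += 1

for bot in BOTS:
    print(f"{bot}: {hits[bot]} requests")
```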
Diagnosing a mention without a citation
A mention without a citation usually means the engine recognizes the brand or concept but doesn’t expose your page as the supporting source. This needs a different fix from “no visibility at all.” Start by classifying what kind of mention you’re looking at:
| Mention type | Example | Likely fix |
|---|---|---|
| Brand-only mention | “Tools include Ahrefs, Semrush, and HubSpot” | Build citable product and methodology pages |
| Category association | “Ahrefs offers AI visibility tracking” | Strengthen source-of-truth pages and docs |
| Unsupported recommendation | “Use Brand X for enterprise teams” | Add proof, comparisons, customer evidence |
| Wrong description | “Brand X is a social listening tool” | Fix entity consistency across owned and third-party pages |
| Old positioning | Old product name or acquired brand appears | Update internal links, redirects, schema, external profiles |
Then ask: does a source page exist for the claim the engine made? If the answer says your product tracks AI citations but your product page never defines “citation,” the system may cite a help doc, review page, or competitor comparison instead. That’s not mysterious — the engine is looking for a page that states the claim more clearly than yours does.
The fix is to build a source hierarchy:
- Primary source: a product or documentation page that states the current claim.
- Supporting source: a methodology page explaining how the claim is measured.
- Proof source: a case study, dataset, benchmark, screenshot, or customer example.
- Clarifying source: an FAQ, glossary, comparison, or support article answering narrow follow-ups.
When that hierarchy exists, answer engines have several possible citation targets that agree with each other. When it doesn’t, a third-party article often becomes the cleanest source — and you lose the citation to someone who summarizes you better than you do.
Common failure modes
Five patterns I see repeatedly:
1. Blocked access. A site blocks the crawler that matters while allowing the one that doesn’t. Often this is unintentional — a copy-pasted “block AI bots” config from 2023 that pre-dates the search/training/user split.
2. Crawlable but vague. “Our platform helps teams work faster” is not a citable claim. “Our SOC 2 Type II report covers the period from January 1 to December 31, 2025” is. Vague claims technically index but lose retrieval to clearer sources.
3. Buried evidence. Original data sits behind a PDF, image, accordion, script, or gated report. Answer systems cite a third-party article that summarizes it instead. The summarizing article wins because it’s text-accessible.
4. No source hierarchy. A blog post, help article, product page, and docs page all answer the same question slightly differently. The engine picks one — and it may not be the one you wanted.
5. Treating a single test as proof. One citation result is a hypothesis, not a victory. Engines vary by prompt wording, logged-in state, location, model, mode, freshness, and retrieval path. Run the panel three times before drawing conclusions.
What an AEO-ready page includes
Use this template as a starting structure for any AEO-targeted page:
```markdown
# [Specific page title]
Updated: [Month Day, Year]
By: [Author/entity]

## tl;dr
- [Direct answer]
- [Key evidence]
- [Who this applies to]

## [Question the buyer or searcher actually asks]
[Direct answer in sentence one.]
[Concrete example.]
[Caveat or scope.]

## Evidence
[Data, docs, screenshots, methodology, source-of-truth links.]

## How to verify this
[Prompts, tests, logs, or checklist.]
```
This structure helps both humans and machines. It keeps the answer close to the evidence and gives future reviewers a way to test whether the page still deserves citation. (For the full workflow, see How to ship pages that answer engines can cite.)
What you should not infer from one citation test
You should not infer a durable rule from one test. A single test can tell you:
- The page appeared (or didn’t) for that prompt at that time
- The engine preferred certain source types
- The answer phrasing was accurate or inaccurate
- The visible citation was your page, a competitor, a third party, or none
A single test cannot prove:
- The platform always prefers that source type
- A content change caused a citation
- Schema alone changed retrieval
- A missing citation means the page wasn’t used at all
- A citation means the page is commercially effective
The output of a test should be a hypothesis and a next action, not a victory lap. If a docs page is cited twice and the product page never, the next action is to add clearer source-of-truth sections to the product page and re-run. If a competitor wins every comparison prompt, the next action may be a comparison asset, not another how-to article.
What to do Monday morning
- Pick five important pages and map each one to the six gates: access, discovery, retrieval, passage selection, answer composition, citation display. For each gate, mark green/yellow/red based on whether you have evidence the gate is clear.
- Review `robots.txt`, CDN rules, and bot-management settings for Googlebot, `OAI-SearchBot`, `Claude-SearchBot`, and user-triggered agents. Compare what’s live to what you intend.
- Rewrite the top three H2s on each page so the first sentence directly answers a real prompt — not a topic.
- Add one evidence block to every page: methodology, source links, table, example, or original data.
- Build a 20-prompt verification panel and run it every two weeks. Log mentions and citations in separate columns.
- Fix stale or wrong cited sources before publishing new pages. Cleaning up wrong citations is higher leverage than producing more content.
- When something works, change one thing at a time so you can attribute the lift.
The boring takeaway: AEO is a diagnostic discipline, not a content strategy. Pages get cited when every gate is clear. Pages don’t get cited when one gate is broken — and the broken gate is usually not the one you assumed.
Sources
- AI features and your website (Google, accessed 2026-05-07)
- Overview of OpenAI Crawlers (OpenAI)
- Does Anthropic crawl data from the web? (Anthropic)
- Article structured data (Google)
- Changes to HowTo and FAQ rich results (Google Search Central)