Guide

What Makes a Passage Extractable by Answer Engines

As of May 2026, pages get cited when a specific passage survives access, retrieval, grounding, and source-display constraints. The practical win is to brief and write sections that answer one question cleanly, keep evidence close, and expose dates, authorship, and scope in visible text.

By: Charles Morris
Published: May 19, 2026
Version: v1.0
Reading time: ~12 min

If you want a page cited, stop asking whether the whole URL is "AEO optimized" and start asking whether one section can survive being pulled out of context. As of May 2026, public docs from Google, OpenAI, and Anthropic all point toward the same practical reality: answer systems need accessible pages, retrievable text, and sourceable passages. The citation happens later.

This is the next layer after our guides on how answer engines discover, retrieve, and cite pages and how to ship pages that get cited. Those pieces explain the workflow at the page level. This one focuses on the section that actually has to travel.

TL;DR

The unit that travels through answer engines is usually closer to a passage than to a full page. Retrieval research describes selecting candidate contexts, and product docs from Google, OpenAI, and Anthropic all describe systems that fetch, rewrite, ground, or cite at that smaller unit.
Extractable passages tend to share seven traits: crawlable URL, visible answer text, question-shaped heading, immediate scope, nearby evidence, explicit dates, and metadata that matches the rendered page.
The safest workflow is to brief each section like it may be shown alone: one question, one direct answer, one proof block, one scope block, one verification step.

What is an extractable passage?

An extractable passage is a section that still makes sense when the engine sees it apart from the rest of the page. That is the right mental model because open-domain QA systems are built around passage retrieval, not just document retrieval. The Dense Passage Retrieval paper describes question answering as retrieving passages to select candidate contexts, and the original RAG paper frames provenance as an open problem that retrieval helps address.

That is why whole-page quality scores are not enough. If the answer sentence is vague, if the scope sits three scrolls lower, or if the evidence lives in a PDF the crawler never sees, the page can be "good" and still lose the citation.

A concrete page pattern to study is Google's own "AI features and your website" page. Its sections answer narrow questions directly: how AI features work, how to appear, and how to control preview behavior. Each section can be quoted or cited on its own because the answer is in visible text before the elaboration.

Use this section test before publishing:

If I copy only the heading and first 120-180 words into a doc,
does a reader still know:
- what the answer is
- who it applies to
- what date or version it reflects
- what evidence backs it

If the answer is no, the passage is not ready.
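The copy-out step of this test can be roughed out in code. The sketch below is a minimal, assumption-laden helper: the function name, the 180-word cap, and the date/version regex are illustrative choices, not part of any documented tool. It only pulls the excerpt a reviewer would paste into a doc and flags whether a year or version marker appears in it; judging the answer, the audience, and the evidence still takes a human reader.

```python
import re

# Heuristic only: a four-digit year (19xx/20xx) or a version marker such as "v1.0".
DATE_OR_VERSION = re.compile(r"\b(?:19|20)\d{2}\b|\bv\d+(?:\.\d+)*\b", re.IGNORECASE)


def excerpt_for_section_test(heading, body, max_words=180):
    """Return the heading plus the first `max_words` words of a section.

    `max_words` defaults to the upper bound of the 120-180 word window.
    """
    words = body.split()
    excerpt = " ".join(words[:max_words])
    return {
        "heading": heading,
        "excerpt": excerpt,
        "word_count": min(len(words), max_words),
        # True when the excerpt itself carries a date or version cue.
        "has_date_or_version": bool(DATE_OR_VERSION.search(excerpt)),
    }


if __name__ == "__main__":
    result = excerpt_for_section_test(
        "Do you need special markup to appear in Google AI Overviews?",
        "No. As of May 2026, Google says no special AI markup is required.",
    )
    print(result["word_count"], result["has_date_or_version"])
```

A section that fails the date check here is not necessarily weak, but it is a prompt to look at whether the scope sits too far below the answer.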

Why do answer engines retrieve passages instead of trusting the whole page?

They retrieve passages because the user usually asked a narrower question than the page title. Google says AI Overviews and AI Mode may use query fan-out across subtopics and data sources. OpenAI says ChatGPT search may rewrite a prompt into a search query. Anthropic says Claude processes multiple sources and includes citations in each web-search response.

That means the engine is not necessarily matching your H1 to the user's prompt. It may be matching one follow-up question to one answer block.

The practical implication is simple: brief pages around sub-questions, not just one parent keyword. On Google's own AI features page, the retrievable sub-questions include whether special markup is required, what makes a page eligible as a supporting link, and which controls limit preview text. If those answers were implied only through long narrative prose, a rewriting or fan-out step would have less to grab cleanly.

Real examples to inspect:

"ChatGPT search" explicitly documents prompt rewriting.
"AI features and your website" explicitly documents query fan-out.
Our piece on chunking and passage retrieval shows why section boundaries matter even before a model starts composing the answer.

Those pages are not just useful sources. They are useful models for how to write answerable sections.

Why access and indexability come before writing quality

An elegant answer block is useless if the relevant crawler cannot fetch it. Google's technical requirements say a page must not block Googlebot, must return HTTP 200, and must contain indexable content. Google also says AI Overview and AI Mode supporting links must be indexed and eligible to show with a snippet. OpenAI says publishers that allow OAI-SearchBot can receive trackable ChatGPT referrals, while noindex can stop even bare-link surfacing. Anthropic says disabling Claude-SearchBot can reduce visibility in search results and disabling Claude-User can reduce visibility for user-directed retrieval.

This is the first extraction gate:

Gate	Documented failure mode	Practical effect
Crawl access	Bot blocked in `robots.txt` or by infrastructure	Passage never fetched
Indexability	Non-200 response, private page, or non-indexable content	URL may not qualify for search-based citation
Snippet eligibility	`noindex` or restrictive preview controls	Page can lose the text surface needed for citation exposure

Real page pattern:

"Google Search technical requirements" is a short checklist with pass/fail conditions.
"Publishers and Developers – FAQ" is explicit about OAI-SearchBot and noindex.
"Does Anthropic crawl data from the web…" separates ClaudeBot, Claude-User, and Claude-SearchBot by purpose.

Copyable pre-write checklist:

Confirm the target URL is public and returns HTTP 200.
Confirm relevant content is in rendered text, not only in an image, widget, or download.
Confirm Googlebot, OAI-SearchBot, and the Anthropic bot you care about are not unintentionally blocked.
Confirm the page is allowed to expose a snippet if citation visibility matters.
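The robots.txt part of this checklist can be checked with Python's standard library. The sketch below assumes you already have the robots.txt text in hand (how you fetch it is up to you) and that the user-agent tokens match the bot names used in this guide; the function name and the sample rules are hypothetical. It covers only robots.txt rules, so infrastructure-level blocks (firewalls, CDN bot rules) and HTTP status still need a separate check.

```python
from urllib.robotparser import RobotFileParser

# Bot tokens named in this guide; adjust to the crawlers you care about.
BOTS = ("Googlebot", "OAI-SearchBot", "Claude-SearchBot", "Claude-User")


def blocked_bots(robots_txt, url, bots=BOTS):
    """Return the bots that the given robots.txt text disallows for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]


if __name__ == "__main__":
    # Hypothetical robots.txt that blocks one search bot by accident.
    sample = "\n".join([
        "User-agent: OAI-SearchBot",
        "Disallow: /",
        "",
        "User-agent: *",
        "Allow: /",
    ])
    print(blocked_bots(sample, "https://example.com/guide"))
```

An empty list means no robots.txt rule blocks those bots for that URL; it does not confirm the page is indexed or snippet-eligible.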

Why question-shaped headings and direct-answer openings travel better

They travel better because they lower the amount of inference the engine has to do. Google recommends making important content available in textual form. Schema.org's FAQ page says on-page markup helps search engines understand information on pages and provide richer results. Neither statement guarantees citation. Both support the same writing move: make the answer obvious in the rendered page.

A strong extractable block usually looks like this:

## How often should we rerun an AEO citation check?

Rerun it every two weeks, and rerun immediately after a major content,
product, or platform change.

Compare that with:

## Monitoring and iteration

Visibility can change quickly across answer engines.

The second version implies the answer. The first states it.

Real examples worth copying:

"Google Search technical requirements" uses headings that map to explicit pass conditions.
"ChatGPT search" uses plain-language sections like how search works and what information is shared.
"Enabling and using web search" opens with a direct description of when Claude invokes search and how citations appear.

If your team briefs outlines in a doc, add this rule: every H2 must be either the reader's question or the page's claim, and the first sentence under it must answer that heading directly.

How much evidence should sit next to the answer?

Enough that the passage can be trusted without forcing the engine or the reader to hunt for support elsewhere. The RAG paper's argument for explicit non-parametric memory is useful here: provenance is hard when the system has to generate from general model memory alone. The closer your evidence is to the claim, the less reconstruction the answer engine has to do.

That does not mean every paragraph needs a footnote stack. It means the answer block should carry the proof shape with it.

A practical pattern:

## Does FAQ schema guarantee AI citations?

No. Google's documentation says there are no additional technical requirements and no special AI markup required, and structured data should match visible page text.

What it can still do:
- clarify page entities and page type
- support rich result eligibility where applicable
- make answer blocks easier to interpret consistently

This is stronger than burying the limitation in the tenth paragraph. It also keeps causation honest.

Real pages to inspect:

"General structured data guidelines" repeatedly tie markup back to visible information on the page.
"Introduction to structured data markup in Google Search" makes the page-level rule explicit: the markup describes the content of that page.

The editorial rule is: put the answer, the caveat, and the evidence handle in the same section.

What metadata helps an extractable passage without replacing it?

Metadata helps when it confirms what the user can already see. It does not rescue thin visible copy. Google says structured data should match visible text. Schema.org's Article and datePublished definitions give a standard place to express authorship and publication timing. Google also says byline dates work best when the date is user-visible, labeled clearly, and reflects the page's publication or update date.

That points to a pragmatic stack:

Element	What it helps with	What it does not do
Visible `Published` or `Last updated` date	Gives engines and readers explicit freshness context	It does not prove the content is current
`datePublished` / `dateModified`	Mirrors publication metadata in a standard field	It does not override contradictory visible dates
`Article` or a more specific schema type	Clarifies the page type and properties	It does not make an unclear passage extractable
Matching structured data and visible text	Reduces ambiguity	It does not create evidence that is not on the page

Real example:

"Influence your byline dates in Google Search" tells publishers to add a visible date and label it.
"Article – Schema.org Type" and "datePublished" show the standard properties available for content pages.

Use this implementation check:

one visible date near the title or byline
matching datePublished
dateModified only when the page was meaningfully updated
author or organization clearly named in visible copy and markup
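The date items in this check can be compared mechanically. The following is a rough sketch, assuming the page embeds JSON-LD in a script tag of type application/ld+json, that each block is a single object, and that you supply the visible date string yourself in a format like "May 19, 2026" (parsing month names this way assumes an English locale). The class and function names are hypothetical.

```python
import json
from datetime import date, datetime
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def date_problems(html, visible_published, visible_format="%B %d, %Y"):
    """List contradictions between the visible date and JSON-LD dates."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    shown = datetime.strptime(visible_published, visible_format).date()
    problems = []
    for block in collector.blocks:
        data = json.loads(block)
        if not isinstance(data, dict) or "datePublished" not in data:
            continue  # Sketch skips arrays and blocks without a publish date.
        # Compare calendar dates only; the first 10 characters are YYYY-MM-DD.
        published = date.fromisoformat(data["datePublished"][:10])
        if published != shown:
            problems.append(f"datePublished {published} != visible date {shown}")
        modified = data.get("dateModified")
        if modified and date.fromisoformat(modified[:10]) < published:
            problems.append(f"dateModified {modified[:10]} is earlier than datePublished {published}")
    return problems
```

An empty result only means the dates do not contradict each other. It says nothing about whether the author is named in visible copy, which still needs a manual look.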

How should you brief a page so each section stands on its own?

Brief the section, not just the URL. Most content briefs still over-focus on title, keyword, and internal links, then leave the body to ad-lib. That is backwards for answer engines. The section is the retrievable unit.

A workable AEO brief for one section should include:

question:
direct_answer:
scope:
proof_block:
counterpoint:
date_or_version:
verification_query:

Here is a filled example using a real documented topic:

question: "Do you need special markup to appear in Google AI Overviews?"
direct_answer: "No. Google says there are no additional technical requirements and no special AI markup required."
scope: "Google Search AI features, as documented in May 2026"
proof_block:
  - "Page must be indexed and eligible for a snippet"
  - "Important content should be available in textual form"
counterpoint: "Meeting requirements does not guarantee indexing or serving"
date_or_version: "As of May 2026"
verification_query: "google ai overviews special schema required"

That brief can be handed to a writer, editor, or SME and still produce a section with extraction value.
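If briefs live in a repository or a form rather than a doc, the same fields can be enforced with a small data structure. This is a minimal sketch, assuming Python 3.9 or later; the class name and the `missing_fields` helper are hypothetical, and the field names simply mirror the template above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class SectionBrief:
    """One brief per section, mirroring the template fields."""

    question: str = ""
    direct_answer: str = ""
    scope: str = ""
    proof_block: list[str] = field(default_factory=list)
    counterpoint: str = ""
    date_or_version: str = ""
    verification_query: str = ""

    def missing_fields(self):
        """Return the names of fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


if __name__ == "__main__":
    brief = SectionBrief(
        question="Do you need special markup to appear in Google AI Overviews?",
        direct_answer="No. Google says there are no additional technical requirements and no special AI markup required.",
        scope="Google Search AI features, as documented in May 2026",
    )
    # Anything listed here should be filled before the brief goes to a writer.
    print(brief.missing_fields())
```

The point of the check is completeness, not quality: a brief can pass it and still carry a vague answer.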

Real model page:

"AI features and your website" already answers this exact question with scope, conditions, and limits.

Which page types tend to produce extractable passages most reliably?

Source-of-truth docs, methodology pages, FAQ/Q&A pages, and comparison pages usually produce them most reliably because they force the writer to answer narrower questions. That is not a guarantee of citation. It is a structural advantage.

Real patterns from the source set:

Publishers and Developers – FAQ: strong because each question has a scoped answer and an operational rule.
Does Anthropic crawl data from the web…: strong because the bot distinctions separate purpose, behavior, and consequence.
FAQ – schema.org: strong because the format itself exposes discrete answers rather than one long essay.

For product and category pages, the lesson is not "turn everything into FAQ schema." The lesson is to embed source-of-truth answer blocks inside the page. A pricing page needs an extractable section for contract minimums. A feature page needs one for prerequisites and limits. A benchmark page needs one for sample, method, and exclusions.

If a section cannot answer a narrow question without the rest of the article, it is usually too blended to travel well.

How should you verify extractability after publishing?

Verify the passage, not just the page. Search Console can tell you whether Google can crawl and index the URL. It cannot tell you whether your answer block is the one being surfaced. For that, you need a section-level test.

Use a simple panel:

URL:
Section heading:
Test query:

Checks:
- Is the answer sentence visible as plain text in the rendered HTML?
- Is the scope or date within the first screen of that section?
- Is the proof block adjacent to the answer?
- Is the page eligible for crawl and indexing?
- Does the answer engine mention or cite this exact URL, a sibling URL, or neither?

Run at least one query aimed at the section, not only at the page's top keyword. For the Google example above, the query is not just "google ai overviews". It is "google ai overviews special schema required".

A reproducible verification stack:

1. Use the URL Inspection tool or equivalent logs to confirm crawl and indexing status where available.
2. Inspect rendered HTML to confirm the section's answer is plain text.
3. Search the exact sub-question in the engine or observation workflow you use.
4. Log whether the engine cited the right URL, the wrong URL from your site, a competitor, or no source at all.

That last distinction matters. "No citation" and "wrong internal URL" are different problems.
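The logging step can be made consistent with a small classifier. The sketch below is one possible shape, assuming Python 3.9 or later and assuming you have already collected the URLs an engine cited for a test query (how you collect them depends on your tracking workflow). The function name and outcome labels are hypothetical, and "other_site" covers any third-party source, not only competitors.

```python
from urllib.parse import urlparse


def _normalize(url):
    """Reduce a URL to (host without www, path without trailing slash)."""
    parts = urlparse(url)
    host = parts.netloc.lower().removeprefix("www.")
    return host, parts.path.rstrip("/") or "/"


def classify_citation(target_url, cited_urls):
    """Label the outcome of one section-level test query."""
    target = _normalize(target_url)
    cited_hosts = set()
    for url in cited_urls:
        host, path = _normalize(url)
        if (host, path) == target:
            return "right_url"
        cited_hosts.add(host)
    if target[0] in cited_hosts:
        return "wrong_internal_url"  # Your site was cited, but a sibling page won.
    if cited_hosts:
        return "other_site"
    return "no_source"


if __name__ == "__main__":
    print(classify_citation(
        "https://example.com/guide/",
        ["https://rival.test/post", "https://example.com/blog/older-post"],
    ))
```

Keeping "wrong_internal_url" separate from "no_source" in the log is what lets you route the fix to information architecture instead of to the paragraph.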

What can explain a miss besides weak writing?

A miss can come from interface, access, or evidence problems before it comes from style. Google says AI Mode and AI Overviews may show different links because they may use different models and techniques. ChatGPT may surface links differently depending on the response format. Claude's docs are explicit about citation-rich search responses, but Anthropic also splits user retrieval and search optimization across different bots.

So do not jump from "we were not cited" to "rewrite the article" without checking:

Was the page crawlable by the relevant bot?
Was the answer in visible text?
Was the section too broad for the rewritten or fanned-out query?
Did the engine expose sources in that interface at all?
Did another page answer the narrower question more directly?

The contrarian point is that a lot of AEO advice treats citation misses as writing failures because writing is easier to change than crawl policy, information architecture, or page type. That diagnosis is often too flattering to the workflow and too harsh on the paragraph.

If you need a companion checklist for access controls, pair this article with our earlier comparison of robots.txt vs. llms.txt.

What to do Monday morning

1. Pick one high-value page and rewrite two H2 sections so the first sentence answers the heading directly in visible text.
2. Add one scope block to that page with date, version, region, or audience so the answer can survive being quoted alone.
3. Move one critical proof element next to the claim it supports instead of burying it later on the page or in a downloadable asset.
4. Audit whether the target URL is open to Googlebot, OAI-SearchBot, and the Anthropic bot that matches your use case.
5. Check that visible dates, datePublished, and dateModified do not contradict each other.
6. Rewrite your content brief template so every planned section includes question, direct_answer, scope, proof_block, and verification_query.
7. Run one section-level query per page in your tracking workflow and log whether the engine cited the exact URL, a sibling URL, or no source.

Related reading

AI Citation Audit Playbook: How to Find the Pages Answer Engines Should Cite

Entity-First AEO: How to Make a Site Understandable Before Optimizing Pages

Robots.txt Is the AEO Control Plane. Llms.txt Is Only a Map.