Extractability: a quotable paragraph leads with the answer, is 40-60 words, lives under semantic HTML, and names entities concretely
Synthesis of GEO paper + Ahrefs content-helper findings + RAG chunking literature + the Digital Bloom 2025 report:
A paragraph is extractable when it:
- Leads with the answer, not the setup
- Is self-contained (a reader landing here understands it without prior context)
- Names entities concretely (proper nouns, dates, statistics, prices)
- Cites a source with a named author or institution
- Lives in clear semantic HTML: under a meaningful <h2>/<h3>, not a carousel <div>
- Is 40-60 words (matches featured-snippet research showing 45-word paragraph snippets appear most frequently)
- Includes a statistic or a direct quotation
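Of the criteria above, the word-count range and the statistic-or-quotation requirement are the easiest to test mechanically. The following is a minimal sketch, not an established tool: the function name `extractability_report` is hypothetical, "contains a digit" is used as a rough stand-in for "includes a statistic", and whitespace splitting is assumed to be a good enough word counter.

```python
import re


def extractability_report(paragraph: str) -> dict:
    """Check a paragraph against two mechanically testable criteria.

    Assumptions (not from the source): words are whitespace-separated
    tokens, and any digit counts as evidence of a statistic.
    """
    words = paragraph.split()
    # Rough proxy: a digit anywhere suggests a number, date, or price.
    has_number = re.search(r"\d", paragraph) is not None
    # Straight double quotes or curly double quotes (U+201C / U+201D).
    has_quote = re.search(r'"[^"]+"|\u201c[^\u201d]+\u201d', paragraph) is not None
    return {
        "word_count": len(words),
        "length_ok": 40 <= len(words) <= 60,
        "has_statistic_or_quote": has_number or has_quote,
    }
```

The other criteria (leading with the answer, being self-contained, naming entities concretely, citing a named source) need human or model judgment, so this sketch leaves them out instead of guessing.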
A paragraph is invisible when it:
- Hedges ("could", "may")
- References "above" / "below" (chunkers strip context)
- Names entities vaguely ("several companies say…")
- Lives in a JS-rendered component or PDF without text fallback
- Duplicates across many pages without unique facts
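The text-only signals named above (hedging, positional references, vague entities) can be flagged with simple pattern matching. This is a rough sketch under stated assumptions: the function name `invisibility_flags` and the word lists are illustrative, "might" and the vague-quantifier pattern are additions not taken from the source, and the rendering and duplication signals are out of scope because they depend on the page, not the paragraph.

```python
import re

# "could" and "may" come from the note; "might" is an assumed addition.
HEDGES = {"could", "may", "might"}
POSITIONAL = {"above", "below"}
# Assumed pattern for vague entity references such as "several companies".
VAGUE = re.compile(
    r"\b(several|some|many|various)\s+(companies|experts|studies|sources)\b",
    re.IGNORECASE,
)


def invisibility_flags(paragraph: str) -> list:
    """Return the text-level invisibility signals found in a paragraph."""
    # Lowercased word set; note this cannot tell the verb "may" from the month "May".
    tokens = set(re.findall(r"[a-z']+", paragraph.lower()))
    flags = []
    if tokens & HEDGES:
        flags.append("hedge")
    if tokens & POSITIONAL:
        flags.append("positional_reference")
    if VAGUE.search(paragraph):
        flags.append("vague_entity")
    return flags
```

A flag is a prompt for review, not a verdict: a hedge can be the accurate way to state an uncertain finding, so a flagged paragraph may still be the right wording.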
Sources: Princeton GEO paper (Aggarwal et al., KDD '24), the foundational generative engine optimization study; Digital Bloom 2025 AI Citation & LLM Visibility Report; averi.ai LLM-Optimized Content Structures guide; HiChunk / W-RAC / hierarchical segmentation studies (arXiv 2024-2026).
Confidence: Industry-consensus.