Extractability: a quotable paragraph leads with the answer, is 40-60 words, lives under semantic HTML, and names entities concretely

Synthesis of GEO paper + Ahrefs content-helper findings + RAG chunking literature + the Digital Bloom 2025 report:

A paragraph is extractable when it:

  1. Leads with the answer, not the setup
  2. Is self-contained (a reader landing here understands it without prior context)
  3. Names entities concretely (proper nouns, dates, statistics, prices)
  4. Cites a source with a named author or institution
  5. Lives in clear semantic HTML — under a meaningful <h2>/<h3>, not a carousel <div>
  6. Is 40-60 words (matches featured-snippet research showing 45-word paragraph snippets appear most frequently)
  7. Includes a statistic or a direct quotation
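
The mechanically checkable criteria above (6 and 7) can be approximated in code. The following is a minimal sketch, not a validated scorer: the function names are hypothetical, "statistic" is assumed to mean "contains a digit", and "direct quotation" is assumed to mean "contains a span in double quotes". Criteria 1 to 5 need human or model judgment and are not covered.

```python
import re

# Range taken from criterion 6 above.
MIN_WORDS, MAX_WORDS = 40, 60


def word_count(paragraph: str) -> int:
    """Count whitespace-separated tokens."""
    return len(paragraph.split())


def has_statistic_or_quote(paragraph: str) -> bool:
    """Rough proxy for criterion 7 (assumption: a digit or a quoted span)."""
    has_number = re.search(r"\d", paragraph) is not None
    has_quote = re.search(r'["\u201c][^"\u201d]+["\u201d]', paragraph) is not None
    return has_number or has_quote


def check_extractable(paragraph: str) -> dict:
    """Report the length and statistic/quotation checks for one paragraph."""
    n = word_count(paragraph)
    return {
        "word_count": n,
        "length_ok": MIN_WORDS <= n <= MAX_WORDS,
        "has_statistic_or_quote": has_statistic_or_quote(paragraph),
    }
```

A paragraph that fails `length_ok` is not necessarily bad; the check only flags it for a second look.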

A paragraph is invisible when it:

  1. Hedges ("could", "may")
  2. References "above" / "below" (chunkers strip context)
  3. Names entities vaguely ("several companies say…")
  4. Lives in a JS-rendered component or PDF without text fallback
  5. Duplicates across many pages without unique facts
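
The text-level invisibility signals (hedges, positional references, vague entities) lend themselves to a simple lint pass. This is a sketch under stated assumptions: `invisibility_flags` is a hypothetical helper, the hedge and positional word lists contain only the examples named in this note, and the vague-entity pattern is an illustrative guess that extends "several companies" to a few similar phrasings. Rendering and duplication problems cannot be detected from paragraph text alone and are out of scope.

```python
import re

# Only the examples named in this note; extend for real use.
HEDGES = {"could", "may"}
POSITIONAL = {"above", "below"}

# Assumption: an illustrative pattern generalizing "several companies say...".
VAGUE = re.compile(
    r"\b(several|some|many)\s+(companies|experts|people|studies)\b",
    re.IGNORECASE,
)


def invisibility_flags(paragraph: str) -> list[str]:
    """Return the names of the text-level invisibility signals found."""
    words = set(re.findall(r"[a-z']+", paragraph.lower()))
    flags = []
    if words & HEDGES:
        flags.append("hedge")
    if words & POSITIONAL:
        flags.append("positional_reference")
    if VAGUE.search(paragraph):
        flags.append("vague_entity")
    return flags
```

Word matching is case-insensitive, so the month "May" would be flagged as a hedge; a production version would need to handle that case.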

Sources: Princeton GEO paper (Aggarwal et al., KDD '24), the foundational generative engine optimization study; Digital Bloom 2025 AI Citation & LLM Visibility Report; averi.ai LLM-Optimized Content Structures guide; HiChunk / W-RAC / hierarchical segmentation studies (arXiv 2024-2026).

Confidence: Industry-consensus.

Used in: RULE: Lead paragraphs with the direct answer. Aim for 40–60 words. Make every paragraph self-contained.