Extractability: a quotable paragraph leads with the answer, is 40-60 words, lives under semantic HTML, and names entities concretely

reference · Scope: business · Status: current

Created 2026-05-22

Synthesis of GEO paper + Ahrefs content-helper findings + RAG chunking literature + the Digital Bloom 2025 report:

A paragraph is extractable when it:

Leads with the answer, not the setup
Is self-contained (a reader landing here understands it without prior context)
Names entities concretely (proper nouns, dates, statistics, prices)
Cites a source with a named author or institution
Lives in clear semantic HTML — under a meaningful <h2>/<h3>, not a carousel <div>
Is 40-60 words (matches featured-snippet research showing 45-word paragraph snippets appear most frequently)
Includes a statistic or a direct quotation
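
The measurable criteria in the list above (length, presence of a statistic or quotation) can be approximated with a simple heuristic. The following is a minimal sketch, not a method taken from the cited sources: the function name `extractability_checks` and the regular expressions are assumptions made for illustration, and criteria such as self-containment or concrete entity naming are not covered because they need human or model judgment.

```python
import re


def extractability_checks(paragraph: str) -> dict[str, bool]:
    """Heuristic checks for the measurable criteria only (hypothetical helper)."""
    words = paragraph.split()
    return {
        # 40-60 words, per the length criterion in the list
        "length_40_60": 40 <= len(words) <= 60,
        # any digit counts as a candidate statistic, date, or price
        "has_number": bool(re.search(r"\d", paragraph)),
        # straight or curly double quotes enclosing at least one character
        "has_quotation": bool(re.search(r'"[^"]+"|“[^”]+”', paragraph)),
    }
```

A paragraph that fails `length_40_60` is not necessarily unquotable; the check only flags candidates for a manual edit.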

A paragraph is invisible when it:

Hedges ("could", "may")
References "above" / "below" (chunkers strip the surrounding context, so the reference points nowhere)
Names entities vaguely ("several companies say…")
Lives in a JS-rendered component or a PDF without a text fallback
Is duplicated across many pages without unique facts
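
The text-level warning signs (hedging, positional references, vague entities) lend themselves to a quick pattern scan. This is a rough sketch under stated assumptions: the name `invisibility_flags` and the word lists are hypothetical and not drawn from the cited sources. The rendering and duplication problems are left out because they depend on page-level context rather than on the paragraph text.

```python
import re

# Illustrative patterns only; the word lists are assumptions, not from the cited sources.
# Known limitation: "may" also matches the month name "May".
INVISIBILITY_PATTERNS = {
    "hedging": r"\b(could|may|might)\b",
    "positional_reference": r"\b(above|below)\b",
    "vague_entity": r"\b(several|some|many) (companies|experts|studies|sources)\b",
}


def invisibility_flags(paragraph: str) -> list[str]:
    """Return the names of the text-level warning signs found in the paragraph."""
    return [
        name
        for name, pattern in INVISIBILITY_PATTERNS.items()
        if re.search(pattern, paragraph, flags=re.IGNORECASE)
    ]
```

For example, `invisibility_flags("Several companies say this could help, as shown above.")` reports all three flags, while a sentence that names its source and figure directly reports none.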

Sources: Princeton GEO paper (Aggarwal et al., KDD '24), the foundational generative engine optimization study; Digital Bloom 2025 AI Citation & LLM Visibility Report; averi.ai LLM-Optimized Content Structures guide; HiChunk / W-RAC / hierarchical segmentation studies (arXiv 2024-2026).

Confidence: Industry-consensus.

Used in: [[rule-lead-with-answer-40-60-words]].
