Knowledge-base-backed website methodology
Overview
Knowledge-base-backed website methodology is the practice of treating a public-facing website as the visible tip of a structured, sourced, dated, version-controlled internal research corpus, rather than as a self-contained marketing artifact. The marketing page is the last stage of a five-stage pipeline, not the first. The same markdown files that power private research generate the public site; every claim is one click away from its citation; every artifact carries a visible dateModified.
The methodology emerged from the convergence of three patterns that previously lived in separate disciplines: the Zettelkasten / digital-garden tradition (atomic notes, dense linking, evergreen status), the documentation-as-product movement (Stripe, Twilio, Anthropic), and the citation-grade publishing infrastructure of OpenAlex, Semantic Scholar, Our World in Data, and the Stanford Encyclopedia of Philosophy. The 2025-2026 addition is that the same KB which serves human readers and AI crawlers can also be queried by the writer's own RAG-assisted AI assistant, making foundation research operationally cheaper than it was a decade ago.
The methodology is encyclopedic and neutral in register. It treats Verified / Industry-consensus / Single-source / Speculative confidence labels as first-class metadata. It assumes the public site is read by future-self, internal team, AI agents, prospects, peers, and journalists simultaneously — and it layers content to serve each audience without flattening the others. See Research brief: The knowledge-base-backed website (piece 3 of 15) for the longer-form research brief, Research brief: Research Before Pages — methodology for KB-backed websites (piece 14 of 15) for the canonical argument against starting at the marketing page, and Research brief: Structured content as a competitive advantage (piece 2 of 15) for the structural-content thesis.
Research-first workflow: stages 0–5
The sequence is the methodology. Skipping Stage 1 and starting at Stage 3 is what produces content that does not hold up.
Stage 0 — Capture. Reading inbox plus writing inbox (Matuschak terminology). Transient notes, prompts, quotes, observations. Tool: plain markdown, an Obsidian daily note, or a memo app. No discipline imposed; the only requirement is that captured material does not get lost.
Stage 1 — Foundation research (internal). Atomic concept notes, one idea each, densely linked. Every claim sourced; every source dated. Confidence label inline: Verified / Industry-consensus / Single-source / Speculative. The audience is future self, the internal team, and AI agents indexing the KB. The output is a knowledge base, not a draft.
Stage 2 — Synthesis / outline. Cluster atomic notes around a public question. Decide what stays internal (the kill-your-darlings step). Identify the single argument the article will make. The output is an annotated outline that points at notes, not paragraphs.
Stage 3 — Public article (derived). Narrative draft for prospects and peers. Shorter than the research; one argument; named sources. Confidence still visible but smoothed into prose. The article links back into the public-facing portion of the KB. The output is a versioned, dated article with a visible last updated stamp.
Stage 4 — Marketing page. Brief; problem/outcome framing. Two or three sentences of substantive claim. Each claim linked to the article that defends it. Each article linked to the underlying research. The output is a short page whose credibility is one click deep.
Stage 5 — Maintain (living document). Quarterly review: which research has new sources? Which articles need a refreshed timestamp? Which marketing pages now point at stale defenses? Changelog visible to readers and crawlers.
The Candid KB itself is the operational realization of Stage 1. Each entry is an atomic concept note. The seeder library at [[scripts/lib/kb-seed.js]] is the infrastructure. Future public Candid articles get derived from this KB; the Candid marketing site links back into it.
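As a rough illustration of what Stage 1 asks of each note, the sketch below models an atomic note in Python. The class and field names (`Claim`, `AtomicNote`, `problems`) are hypothetical and are not taken from the Candid KB schema or the seeder library; only the four confidence labels come from the stage description above.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Optional


class Confidence(Enum):
    # The four labels named in Stage 1.
    VERIFIED = "Verified"
    INDUSTRY_CONSENSUS = "Industry-consensus"
    SINGLE_SOURCE = "Single-source"
    SPECULATIVE = "Speculative"


@dataclass
class Claim:
    text: str
    source_url: str                # where the claim comes from
    source_date: Optional[date]    # when the source was published or retrieved
    confidence: Confidence


@dataclass
class AtomicNote:
    slug: str
    title: str
    claims: list[Claim] = field(default_factory=list)
    links: list[str] = field(default_factory=list)  # slugs of related notes

    def problems(self) -> list[str]:
        """Return human-readable reasons this note is not yet Stage 1 complete."""
        issues = []
        if not self.claims:
            issues.append("note makes no claims")
        for i, claim in enumerate(self.claims, start=1):
            if not claim.source_url.strip():
                issues.append(f"claim {i} has no source")
            if claim.source_date is None:
                issues.append(f"claim {i} has an undated source")
        if not self.links:
            issues.append("note links to no other note")
        return issues


# Hypothetical draft note: one claim, not yet sourced, dated, or linked.
note = AtomicNote(
    slug="example-note",
    title="Example note",
    claims=[Claim("An example claim.", "", None, Confidence.SINGLE_SOURCE)],
)
print(note.problems())
```

A note that returns an empty list from `problems()` would be ready to feed a Stage 2 outline; anything else stays in the research layer.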
Rule: foundation research before article, article before marketing page
All Candid Creative content for client work and Candid's own marketing follows the research-first sequence: Stage 1 (foundation research, atomic notes, fully sourced and confidence-labeled) → Stage 2 (synthesis outline) → Stage 3 (public article, derived) → Stage 4 (marketing page, linked to article). Never reverse the sequence; never start at Stage 4 and retrofit citations.
Why the rule holds:
- Research-heavy content written backward from a marketing brief is still backward (Research brief: Research Before Pages — methodology for KB-backed websites (piece 14 of 15)).
- The KB-as-RAG-corpus shift (2025-2026) means foundation research is now infrastructure for the writer's own AI assistant, not just a paper trail.
- Casey Newton's April 2026 retreat ([[newton-platformer-retreat-from-daily-april-2026]]) is the strongest single signal that AI commoditizes daily synthesis; original depth is the only durable advantage.
- Confidence labels work only when applied at the research stage; retrofitting them after the marketing copy is written is editorial theater, not discipline.
How to apply:
- Every client engagement starts with foundation research notes (atomic, sourced, dated, confidence-labeled) in a shared Obsidian vault or the Candid KB.
- Articles are commissioned with explicit reference to the foundation notes they derive from — not from a topic brief.
- Marketing pages link to articles; articles link to research notes; research notes link to primary sources. Credibility is one click deep, always.
- Run a Diátaxis mode-check before drafting any public article: decide whether the piece is a tutorial, a how-to, a reference, or an explanation (see Diátaxis (Daniele Procida): four documentation types — tutorials, how-to, reference, explanation). Mixing modes inside one page is the most common content failure.
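The linking rule in the list above can be checked mechanically. The following is a minimal sketch, assuming each page is recorded with a layer name and its outbound links; the `Page` structure, the layer names, the sample slugs, and `broken_chains` are illustrative inventions, not part of any Candid tooling.

```python
from dataclasses import dataclass, field

# Each layer must link to at least one page in the layer beneath it.
REQUIRED_TARGET = {
    "marketing": "article",
    "article": "research",
    "research": "source",
}


@dataclass
class Page:
    slug: str
    layer: str  # "marketing", "article", "research", or "source"
    links: list[str] = field(default_factory=list)


def broken_chains(pages: list[Page]) -> list[str]:
    """Return slugs of pages that do not link one layer down."""
    by_slug = {p.slug: p for p in pages}
    broken = []
    for page in pages:
        needed = REQUIRED_TARGET.get(page.layer)
        if needed is None:  # primary sources are the end of the chain
            continue
        targets = [by_slug[s].layer for s in page.links if s in by_slug]
        if needed not in targets:
            broken.append(page.slug)
    return broken


pages = [
    Page("home", "marketing", ["why-research-first"]),
    Page("why-research-first", "article", ["note-cadence"]),
    Page("note-cadence", "research", []),  # no primary source linked
    Page("nieman-lab-report", "source"),
]
print(broken_chains(pages))
```

Run as part of a build, a non-empty result would point at exactly the pages whose credibility is no longer one click deep.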
Cadence recommendation 2026
Realistic cadence for research-first content operations, by operator profile (2026):
| Operation type | Realistic cadence | Reference point |
|---|---|---|
| Solo, primary focus elsewhere | 1 piece/month | Bits about Money (Patrick McKenzie self-describes as "roughly monthly") |
| Solo, primary work | 1 piece/week | Construction Physics (Brian Potter's stated target: "every, every week or so") |
| Small team (2-4 people) | 1 deep piece/week + ongoing notes | Anthropic Science blog (Features + Workflows + Field Notes) |
| Quarterly long-form house | 4-6 major pieces/year | Works in Progress (shifting to bimonthly print) |
For a small service business operating research-first, the realistic answer is one public article every 2-4 weeks, with foundation-research notes drafted continuously in the background.
The trap is benchmarking against high-volume content marketing (HubSpot's 16+ posts/month threshold, discussed below) and concluding research-first "doesn't work." It works on a different axis: per-piece authority, AI citation likelihood, and conversion among readers who already trust the category. The right metric is not posts per month. It is whether the next piece written makes the previous one more valuable, or just adds another file to the same folder.
Tooling stack 2026
Recommended default stack for a small operation:
- Capture + foundation research: Obsidian (local-first, markdown; free to use, with a $50/user/year commercial license that has been optional since 2025). Bidirectional linking and graph view are the operational expression of "compounding research." Plain markdown means no vendor lock-in.
- Public knowledge base: static site built from the same markdown files — practitioners use Eleventy, Astro, Next.js, or Quartz (designed for Obsidian-to-web publishing). Gwern.net is the high-water mark (Pandoc + Hakyll). The public KB is generated from the same source files as the private research, not maintained separately.
- Drafting articles: whatever the writer already uses, with foundation notes open in a second window. Some draft directly in Obsidian; others use iA Writer, Ulysses, or Google Docs and paste back.
- Marketing pages: the existing CMS, with explicit link-outs to the KB. The KB is the substrate the marketing site rests on, not its replacement.
Why not Notion alone: strong for team collaboration and structured databases; weaker for dense-linking atomic notes. Many practitioners use Notion for operations plus Obsidian for thinking.
Why not Roam or Logseq: both credible. Roam pioneered bidirectional-link UX; Logseq is open-source and outline-first. Choose based on whether you think in pages (Obsidian) or outlines (Logseq).
The AI layer (2026 reality): Smart Connections, Copilot for Obsidian, and similar plugins let a writer query their own KB conversationally. This is the most material change since 2023: a research-first KB is now also a working RAG corpus for the writer's own AI assistant.
Non-negotiables regardless of tool:
- Plain text (markdown) under version control (Git).
- Dated, sourced atomic notes.
- Linking discipline rather than folder discipline.
- Visible last updated timestamp on every public artifact.
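One common way to expose a last-updated stamp to crawlers as well as readers is schema.org's `datePublished` and `dateModified` properties, embedded as JSON-LD. The sketch below is an assumption about how that could look on a generated page; the function names and sample values are hypothetical.

```python
import json
from datetime import date


def article_json_ld(headline: str, published: date, modified: date) -> str:
    """Build a schema.org Article JSON-LD string with explicit dates."""
    if modified < published:
        raise ValueError("dateModified cannot precede datePublished")
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "datePublished": published.isoformat(),
        "dateModified": modified.isoformat(),
    }
    return json.dumps(data, indent=2)


def visible_stamp(modified: date) -> str:
    """Human-readable stamp to render next to the title."""
    return f"Last updated {modified.isoformat()}"


snippet = article_json_ld("Example article", date(2026, 1, 5), date(2026, 3, 2))
print(snippet)
print(visible_stamp(date(2026, 3, 2)))
```

Deriving both the machine-readable and the visible stamp from the same date value keeps them from drifting apart.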
The Candid KB itself uses the Postgres + Express + server-rendered HTML variant — see [[project-candid-kb]] for that infrastructure. The Obsidian recommendation here is for client work and for any solo operator following the methodology.
Named practitioners 2026
Eight named exemplars of research-first / docs-as-product methodology, each illustrating a different facet:
- Stripe — Documentation as product, not marketing artifact. Custom tooling (Markdoc), documentation embedded in engineering job ladders, "docs are part of done." See the dedicated section below.
- Twilio — Rebuilt its docs platform on Next.js with an explicit "docs as code" philosophy: source-controlled, peer-reviewed, preview-deployed. Same Git workflow as the product code. https://twilio.com/en-us/blog/developers/new-era-for-twilio-documentation
- Anthropic Science blog + Alignment Science blog — Three-format structure: Features (specific results), Workflows (practical guides), Field Notes (interim findings published explicitly as in-progress research). The public-facing tip of a much larger internal corpus.
- Gwern.net — The high-water mark of "stable long-term essays which improve over time." Every essay has creation plus last-modified dates; essays versioned across years; design assumes decades. See [[longevity-named-examples-2026]].
- Andy Matuschak (notes.andymatuschak.org) — The exemplar of "work with the garage door up": private research notes published publicly with explicit context that they were not written for the reader. See Andy Matuschak: evergreen notes — atomic, concept-oriented, densely linked, accreting over time.
- Maggie Appleton — The digital-garden model in practice: seedling/budding/evergreen status labels on each note; clear epistemic posture; the public site is the literal substrate of the published essays.
- Bits about Money (Patrick McKenzie) — Monthly long-form business writing built on visible domain depth. Each piece anchored in primary-source familiarity, not retrofitted citations.
- Construction Physics (Brian Potter) — Weekly engineering-economics essays each anchored in primary sources (industry reports, BLS data, historical engineering texts). Notes are not public, but citation density makes the research substrate visible.
Honorable mentions: Works in Progress (bimonthly print plus online, commissioned essays with editorial fact-checking); Stratechery (Ben Thompson's daily/weekly split is a hybrid — research-heavy weekly Articles supported by quicker daily Updates); The New Yorker (not research-first in publication order but research-first in verification order — same operating principle, applied at the back of the workflow).
Stripe docs as product
Stripe treats documentation as a first-class product. Engineers' documentation contributions are counted in performance reviews. The open-source Markdoc framework backs interactive features. Former Head of Docs Dave Nunez instituted writing classes for ESL engineers and a "doc star of the week" program.
Quote (Patrick McKenzie, on Stripe): "Stripe is a celebration of the written word which happens to be incorporated in the state of Delaware."
Sources: Mintlify blog — https://mintlify.com/blog/stripe-docs; apidog.com analysis.
Confidence: Industry-consensus (multiple secondary sources, named-individual primary statements).
Stripe is the strongest commercial proof that "docs as a product" is a competitive moat. The pattern: structured + searchable + versioned + cross-linked + treated with the same engineering rigor as the codebase. Stripe — not personal digital gardens — is the lead example when arguing the methodology to commercial decision-makers.
Astro Content Collections + Zod
Astro's Content Collections (stable since Astro 2.0; Content Layer API in Astro 5, late 2024) use Zod schemas for build-time validation of Markdown/MDX frontmatter, with auto-generated TypeScript types. The Astro team's own astro.build website uses Zod-validated Content Collections for blog, integrations, showcase, and authors — with refinements that explicitly reject scraping-error strings like "Just a moment…" in titles.
Sources: https://docs.astro.build/en/guides/content-collections; deepwiki.com/withastro/astro.build.
Confidence: Verified.
Astro Content Collections — or the equivalent in any framework: gray-matter + Zod, MDX-on-Next.js, Contentlayer — is the structural mechanism that makes a KB-shaped marketing site practical. Schema validation at build time is the structural equivalent of Wikipedia's verifiability policy: it makes missing sources, dates, or confidence flags impossible to ship. This is the technical pattern Candid uses on candidcreative.ca.
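To make the build-time gate concrete, here is a framework-neutral sketch in Python rather than Zod. It is not Astro's API and not the schema used on candidcreative.ca: the field names (`source`, `dateModified`, `confidence`), the tiny `key: value` frontmatter parser, and the rejected-title list are all assumptions chosen for illustration.

```python
from datetime import date

REQUIRED_FIELDS = ("title", "source", "dateModified", "confidence")
CONFIDENCE_LABELS = {"Verified", "Industry-consensus", "Single-source", "Speculative"}
# Titles that usually indicate a scraping error rather than real content.
REJECTED_TITLE_FRAGMENTS = ("Just a moment",)


def parse_frontmatter(markdown: str) -> dict[str, str]:
    """Parse a leading '---' block of simple 'key: value' lines (no nesting)."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def validate_entry(markdown: str) -> list[str]:
    """Return validation errors; an empty list means the entry may ship."""
    fields = parse_frontmatter(markdown)
    errors = [f"missing {name}" for name in REQUIRED_FIELDS if not fields.get(name)]
    if any(bad in fields.get("title", "") for bad in REJECTED_TITLE_FRAGMENTS):
        errors.append("title looks like a scraping error")
    if fields.get("confidence") and fields["confidence"] not in CONFIDENCE_LABELS:
        errors.append("unknown confidence label")
    if fields.get("dateModified"):
        try:
            date.fromisoformat(fields["dateModified"])
        except ValueError:
            errors.append("dateModified is not an ISO date")
    return errors


entry = """---
title: Example entry
source: https://example.com/report
confidence: Verified
---
Body text.
"""
errors = validate_entry(entry)
if errors:
    print("Build would fail:", errors)
```

In a real pipeline the final step would exit with a non-zero status instead of printing, so an entry without a date or source never reaches the published site.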
OpenAlex and Semantic Scholar citation graphs
Two open scholarly knowledge graphs operate at scale:
- OpenAlex (OurResearch nonprofit) — 271.3 million works as of November 2025; launched 2022 to replace the discontinued Microsoft Academic Graph.
- Semantic Scholar (Allen Institute for AI) — over 214 million publications, 80M+ authors, and 2.4B+ citation edges in its 2023 snapshot, per the S2AG paper (arXiv:2301.10140).
Sources: Codina et al. (ResearchGate analysis) for OpenAlex; S2AG paper arXiv:2301.10140 for Semantic Scholar.
Confidence: Verified.
Open citation milestone: open-citation data crossed 1 billion public-domain citations in February 2021, marking what Hutchins called an irreversible shift away from closed commercial providers (arXiv:2106.04695).
These figures matter for the methodology because they prove "citation graphs" are mature, working infrastructure — not a science-fiction concept. The agency methodology is borrowing a pattern that already operates at 271M-node scale. A 100-entry KB applying the same structural primitives (atomic units, typed links, dated provenance) is operating on the same axis as OpenAlex, just at a different magnitude.
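At KB scale the same primitives fit in a few lines. The example below assumes notes are markdown strings keyed by slug and that links use the `[[slug]]` wikilink form seen elsewhere on this page. The function names and sample notes are hypothetical, and link typing is left out to keep the sketch short.

```python
import re
from collections import defaultdict

# Captures the link target, stopping before any "|" alias or "#" heading anchor.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)")


def outbound_links(body: str) -> list[str]:
    """Return wikilink targets in order of first appearance, without duplicates."""
    seen = []
    for match in WIKILINK.finditer(body):
        target = match.group(1).strip()
        if target and target not in seen:
            seen.append(target)
    return seen


def backlink_index(notes: dict[str, str]) -> dict[str, list[str]]:
    """Map each linked slug to the slugs of the notes that link to it."""
    index = defaultdict(list)
    for slug, body in notes.items():
        for target in outbound_links(body):
            index[target].append(slug)
    return dict(index)


notes = {
    "research-first-rule": "Depends on [[confidence-labels]] and [[cadence-2026]].",
    "cadence-2026": "See [[confidence-labels|the label taxonomy]].",
    "confidence-labels": "No outbound links yet.",
}
index = backlink_index(notes)
# Notes nothing links to: candidates for better integration into the graph.
orphans = [slug for slug in notes if slug not in index]
print(index)
print(orphans)
```

The backlink index is the small-scale counterpart of a citation graph: it answers "what depends on this note?" before the note is edited or retired.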
Our World in Data: open provenance
Our World in Data (Oxford-based) publishes under CC-BY, exposes structured CSV plus JSON metadata endpoints per indicator, and publishes its data-processing code as open source on GitHub. Every chart on the site has a downloadable underlying dataset.
Source: https://ourworldindata.org/faqs
Confidence: Verified.
Our World in Data is the strongest non-corporate KB exemplar because provenance is enforced by infrastructure. A publisher cannot publish a chart without a dataset; cannot publish a dataset without a citation; the code that processed the data is public. The same discipline applied to a marketing KB would force every claim to carry its source row.
Stanford Encyclopedia of Philosophy: permanent archives
The Stanford Encyclopedia of Philosophy maintains permanent dated "archived editions" of every entry so scholars can cite a specific snapshot in time.
Source: https://plato.stanford.edu/cite.html
Confidence: Verified (primary).
Most CMS-backed marketing sites overwrite history on every edit. A KB-backed site can preserve dated versions — either via Git history surfaced as URLs, via Docusaurus-style version snapshots, or via a dateModified plus change log per entry. The Stanford pattern is the strongest example of citation-grade content discipline.
The Candid KB's posture is aligned: the schema captures created_at, updated_at, and per-update history via trigger (audit table). Slug renames preserve old aliases. Soft-delete only. The infrastructure for SEP-style citation discipline is already in place; what's needed is the editorial habit.
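The audit-trigger pattern described above can be tried out locally. This sketch uses Python's built-in `sqlite3` module as a stand-in for Postgres, so the trigger syntax differs from what a Postgres deployment would use. Table and column names other than `created_at` and `updated_at` are assumptions, not the Candid KB schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE entries (
        slug       TEXT PRIMARY KEY,
        body       TEXT NOT NULL,
        created_at TEXT NOT NULL,
        updated_at TEXT NOT NULL,
        deleted_at TEXT              -- soft delete: set a timestamp, never DELETE
    );
    CREATE TABLE entries_history (
        slug        TEXT NOT NULL,
        body        TEXT NOT NULL,
        valid_until TEXT NOT NULL    -- when this version was superseded
    );
    -- On every body change, copy the previous version into the audit table.
    CREATE TRIGGER entries_audit AFTER UPDATE OF body ON entries
    BEGIN
        INSERT INTO entries_history (slug, body, valid_until)
        VALUES (OLD.slug, OLD.body, NEW.updated_at);
    END;
    """
)

conn.execute(
    "INSERT INTO entries VALUES (?, ?, ?, ?, NULL)",
    ("example-entry", "First version.", "2026-01-05", "2026-01-05"),
)
conn.execute(
    "UPDATE entries SET body = ?, updated_at = ? WHERE slug = ?",
    ("Second version.", "2026-03-02", "example-entry"),
)

history = conn.execute(
    "SELECT body, valid_until FROM entries_history WHERE slug = ?",
    ("example-entry",),
).fetchall()
print(history)
```

Because the history rows carry dates, each superseded version could in principle be served at its own dated URL, which is the SEP-style archived edition in miniature.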
Newton / Platformer retreat from daily cadence (April 2026)
In April 2026, Casey Newton publicly retreated from Platformer's daily cadence with the framing: "More scoops, less aggregation and analysis."
Source: Nieman Lab, April 2026.
Confidence: Verified.
This is the strongest single data point against the daily-cadence model. Newton is the named exemplar of credible daily tech newsletter cadence post-Stratechery. His explicit reason for the shift is AI commoditizing daily synthesis — the work that powered the model has been undercut by GPT, Claude, and Gemini doing equivalent aggregation in seconds.
The implication for clients of a research-first agency: the "ship daily" doctrine that worked for Godin, Levine, and Stratechery in the 2010s through early 2020s has a 2026 expiration. The replacement is not "ship less" — it is ship work AI engines cannot easily replicate: original research, named sources, dated claims, structured depth. Pairs with Dan Taylor (SE Land, Jan 13 2026, n=107,352): CWV is a gate for AI citation, not a growth lever and [[profound-680m-citations-perplexity-citation-behavior]].
HubSpot 2015 volume study: 16+ posts / 3.5× traffic
HubSpot's 2015 study of 13,500+ customers found that "companies that published 16+ blog posts per month got almost 3.5× more inbound traffic than companies that published between 0-4 monthly posts."
Source: HubSpot 2015 inbound study (widely re-circulated since).
Confidence: Industry-consensus on direction; dated dataset — almost certainly does not survive 2024-2026 Google E-E-A-T updates intact in magnitude. Use as steel-man material, not current benchmark.
Companion 2025 data point: Stratabeat's B2B SaaS data — "websites that published 9+ blog posts per month saw a 20.1% increase in monthly organic traffic." Same direction, single-source, more recent.
The honest reading: volume-correlation data points one direction; the AI-citation and trust dimensions point the other. Research-first competes on the trust/citation axis. Both can be true: a research-first SMB writing 1 article every 2-4 weeks captures different value than a content factory writing 16+ per month.
HubSpot historical-optimization study (Pamela Vaughan): +106% on refreshed posts
Quote (HubSpot, Pamela Vaughan historical-optimization study):
"We've increased the number of monthly organic search views of old posts we've optimized by an average of 106%."
Source: https://blog.hubspot.com/marketing/historical-blog-seo-conversion-optimization
Confidence: Verified (HubSpot first-party).
The compounding case: the same content that is "dying" (Patrick Stox, Ahrefs: "Every piece of content you've ever published is slowly dying") can be resurrected with structured refresh. This is what separates a foundation-first site (where the asset compounds via refresh) from a rebuild-every-3-years site (where every refresh discards the asset). Pairs with [[rule-publish-and-last-updated-dates-mandatory]] and Seer Interactive (Oct 2025): 65% of AI bot hits target content under 1 year old; 89% under 3 years.
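A refresh routine needs a way to decide what to revisit first. The following is one possible sketch, assuming each article records the date it was last modified. The 365-day window echoes the Seer figure above but is an arbitrary default, and the function name and sample slugs are hypothetical.

```python
from datetime import date


def refresh_queue(articles: dict[str, date], today: date, max_age_days: int = 365) -> list[str]:
    """Return slugs last modified more than max_age_days ago, oldest first."""
    stale = [
        (modified, slug)
        for slug, modified in articles.items()
        if (today - modified).days > max_age_days
    ]
    return [slug for modified, slug in sorted(stale)]


articles = {
    "page-speed-as-a-moat": date(2024, 11, 1),
    "built-to-last": date(2025, 9, 15),
    "case-against-page-builders": date(2023, 6, 30),
}
queue = refresh_queue(articles, today=date(2026, 4, 1))
print(queue)
```

Sorting oldest first puts the content most at risk of decay at the top of the quarterly review.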
RAG: when it pays off (50-page threshold)
Retrieval-augmented generation on a marketing site only pays off above approximately 50 pages of substantive content AND when visitor questions do not map cleanly to navigation.
Threshold conditions for RAG payoff:
- 50+ pages of substantive content (knowledge base, docs, case studies).
- User questions do not map cleanly to navigation (complex services, regulated industries).
- Client has high support-ticket volume that could be deflected.
When traditional search wins:
- Under 50 pages.
- Strongly structured content (products with SKUs, locations, services with names).
- Budget under $5k for the search feature alone.
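The thresholds above amount to a small decision rule. The sketch below encodes one reading of them. How the conditions combine (page count and navigation mismatch both required, budget acting as a veto) is an interpretation of the two lists, and the function and parameter names are hypothetical. Support-ticket volume is left out because no numeric threshold is given for it.

```python
def recommend_search(
    substantive_pages: int,
    questions_map_to_navigation: bool,
    strongly_structured: bool,
    search_budget_usd: float,
) -> str:
    """Return "rag" or "traditional" using the thresholds listed above."""
    if substantive_pages < 50:
        return "traditional"
    if strongly_structured or questions_map_to_navigation:
        return "traditional"
    if search_budget_usd < 5000:
        return "traditional"
    return "rag"


print(recommend_search(30, False, False, 20000))   # small site
print(recommend_search(120, False, False, 12000))  # large site, messy questions
print(recommend_search(120, False, False, 3000))   # large site, small budget
```

Ordering the checks from cheapest to verify (page count) to hardest (budget fit) mirrors how the question tends to come up in a scoping conversation.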
Default RAG stack for SMB:
- Embedding DB: pgvector in Postgres (Supabase, Neon, or self-hosted). Per Encore's pgvector benchmarks: "pgvector handles millions of vectors with HNSW indexing. Benchmarks show query times under 20ms at 1M vectors with recall rates above 95%."
- Don't reach for Pinecone, Weaviate, or Turbopuffer for SMB work — they are real wins above 10M vectors or with extreme multi-tenant requirements, neither of which apply to local-service businesses.
- Embedding model: OpenAI text-embedding-3-small (1536 dim, cheap) or Voyage AI for higher quality.
- LLM for generation: Claude Haiku for cost, Sonnet for quality.
- Frontend: server-rendered chat island in Astro, streaming response via Cloudflare Worker.
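The retrieval step at the centre of that stack is a nearest-neighbour search over embeddings. The toy sketch below does the ranking in plain Python with hand-written three-dimensional vectors, so it runs without a database or an API key. In the stack above the vectors would come from an embedding model and the search would run inside Postgres. All names and values here are illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], chunks: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunks,
        key=lambda chunk_id: cosine_similarity(query, chunks[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings; real ones have hundreds or thousands of dimensions.
chunks = {
    "pricing-faq": [0.9, 0.1, 0.0],
    "service-areas": [0.1, 0.9, 0.1],
    "warranty-terms": [0.7, 0.2, 0.4],
}
context_ids = top_k([1.0, 0.0, 0.0], chunks)
print(context_ids)
```

The selected chunks would then be passed to the generation model as context, along with the visitor's question.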
For traditional search (the preferred default below the threshold): Pagefind (free, runs at build time, sub-300kB index for most SMB sites), Orama (in-browser, no backend), MiniSearch (lightweight JS), Algolia (when typo tolerance or analytics matter), Meilisearch (self-hosted, faceted).
Audience layering: foundation → article → marketing
The three layers of a research-first content operation, side by side:
| Dimension | Foundation Research | Public Article | Marketing Page |
|---|---|---|---|
| Primary audience | Future self, internal team, AI agents indexing the KB | Prospects evaluating expertise; peers; journalists | Prospects making a decision |
| Implied secondary | None — written as if no one is watching | Skim-readers, AI engines, search crawlers | Sales conversations, AI shopping assistants |
| Length | Atomic notes 50-500 words each; clusters 5,000+ | 1,200-3,500 words | 200-600 words |
| Tone | Telegraphic, technical, hedged with confidence labels | Narrative, opinionated, conversational | Direct, outcome-framed |
| Source treatment | Every claim cited inline with URL + date + confidence | Named sources in prose; key citations linked; confidence smoothed | One or two anchor citations; rest deferred to linked article |
| Update behavior | Continuous; notes evolve | Versioned with visible last updated stamp | Rewritten when underlying article changes materially |
| Failure mode if skipped | Articles repeat received wisdom; no compounding | Marketing pages assert without defending | Articles get traffic but no conversion or trust |
| Closest analog | A researcher's lab notebook; a Zettelkasten | A New Yorker feature; Bits about Money | Stripe docs landing page |
The Candid KB occupies the Foundation Research column. The Candid public site (candidcreative.ca) occupies the Marketing Page column. The Article column is the missing layer — the place where Candid writing for prospects gets derived from KB research. Filling that layer is the architectural argument that closes the methodology.
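The Length row of the table can double as a lint rule. The following sketch assumes the ranges apply to whitespace-separated word counts. The layer keys and function name are hypothetical, and atomic notes are checked individually rather than as clusters.

```python
# Word-count ranges taken from the Length row of the table above.
LENGTH_RANGES = {
    "foundation_note": (50, 500),
    "public_article": (1200, 3500),
    "marketing_page": (200, 600),
}


def length_check(layer: str, text: str) -> str:
    """Report whether a draft's word count fits the range for its layer."""
    low, high = LENGTH_RANGES[layer]
    words = len(text.split())
    if words < low:
        return f"{words} words: short for {layer} (expected {low}-{high})"
    if words > high:
        return f"{words} words: long for {layer} (expected {low}-{high})"
    return f"{words} words: within range for {layer}"


draft = "word " * 750  # placeholder text standing in for a real draft
print(length_check("marketing_page", draft))
```

A marketing page that trips the upper bound is often an article in disguise, which is a prompt to move the argument one layer down and link to it.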
Foundation roadmap: the 15-piece closure
The 15-piece research-brief roadmap forms a system, not a list. End to end:
- Pieces 1-4 (strategic frame): Brief 2 (Research brief: What makes a marketing site do something (piece on brochure vs platform)) sets the negative case — most marketing sites do nothing. Brief 3 (Research brief: The knowledge-base-backed website (piece 3 of 15)) plus Brief 1 (Research brief: Structured content as a competitive advantage (piece 2 of 15)) set the positive case — the site as platform, not pamphlet. Brief 6 (Research brief: Information architecture for service businesses with multiple verticals (piece 6 of 15)) — including the 2026 SEO/GEO/Local Search Playbook for Kitchener-Waterloo — defines the competitive environment.
- Pieces 5-7 (infrastructure layer): Brief 4 (Research brief: Owning your stack — why agency-managed platforms cost more than they save (piece 4 of 15)) on stack ownership; Brief 6 (Research brief: Information architecture for service businesses with multiple verticals (piece 6 of 15)) on IA; Brief 9 (Research brief: Page Speed as a Moat — why CWV separates the agencies from the freelancers (piece 9 of 15)) on page speed. The engineering substrate. Without these, the discipline of pieces 8-15 has no durable home.
- Pieces 8-9 (build philosophy): Brief 5 (Research brief: Built to Last — why most SMB sites rebuild every 3-4 years (piece 5 of 15)) plus Brief 10 (Research brief: The Case Against Page Builders (piece 10 of 15)) defend platform longevity — the discipline only pays off over time, on infrastructure that survives.
- Pieces 10-11 (content asset): Brief 12 (Research brief: The Dataset is the Product — when a service business should own its data (piece 12 of 15)) plus Brief 11 (Research brief: Public data as a private moat — building proprietary intelligence from government open data (piece 11 of 15)) make the case that structured, sourced data is the durable competitive asset of a modern services business.
- Pieces 12-14 (operationalize the asset): including Brief 14 (Research brief: Research Before Pages — methodology for KB-backed websites (piece 14 of 15)) — the research-first methodology.
- Piece 15 (the closing piece): the editorial layer that turns all of the above from a content-strategy story into a credibility story. Every piece in the roadmap makes claims; this piece is the one that says how those claims must be made.
The throughline
A credible SMB website in 2026 is a structured, owned, durable platform whose content is verifiable by both human readers and AI engines.
Each of the 15 pieces is one wall of that house. Piece 15 is the foundation under all of them — which is why it is piece 15 rather than piece 1: it can only be specified once the rest of the system is described, but it underpins the rest in practice.
What this means in operation
- Every public article derives from the KB, using the [[confidence-label-taxonomy-7-label-2026]].
- Every client engagement starts at Stage 1 of the research-first workflow above.
- Every claim in client deliverables follows [[what-to-source-checklist]].
- Every published artifact carries a dateModified and is subject to [[rule-visible-last-updated-stamp-on-public-artifact]].
- Every correction follows [[retraction-correction-playbook]].
- The Candid public site is the demonstration of the methodology: the marketing for the agency IS the operationalization of the playbook.
Related frameworks
The methodology pulls together several established frameworks that originated outside marketing:
- Matuschak evergreen notes — See Andy Matuschak: evergreen notes — atomic, concept-oriented, densely linked, accreting over time. Notes are written to be permanent (refactored over time, not appended), atomic (one idea per note), densely linked, and concept-oriented rather than topic-oriented. The Foundation Research column of the layering table is operationally the same artifact as a Matuschak-style evergreen note.
- Diátaxis framework (Procida) — See Diátaxis (Daniele Procida): four documentation types — tutorials, how-to, reference, explanation. Four documentation modes — tutorial, how-to, reference, explanation — each with a distinct user need, voice, and structure. The pre-draft mode check is the single highest-leverage editorial discipline in the methodology; mixing modes inside one page is the most common failure pattern in marketing-site content.
- Stripe docs-as-product — Documentation contributions in engineering performance reviews; Markdoc as in-house tooling; structured plus versioned plus cross-linked plus engineered to the same standard as the code. The commercial proof case.
- OpenAlex and Semantic Scholar citation graphs — Proof that typed, atomic, link-rich knowledge graphs operate at 271M-node scale.
- Our World in Data — Proof that provenance can be enforced by infrastructure rather than goodwill.
- Stanford Encyclopedia of Philosophy — Proof that permanent dated editions are achievable and citable.
Sources and confidence
- Research-first workflow (Stages 0–5) — internal framework; synthesises Matuschak's writing-inbox terminology, the Zettelkasten tradition, and standard documentation-pipeline conventions. Confidence: Internal-framework / Industry-consensus on each component.
- Cadence reference points — Bits about Money (Patrick McKenzie self-described as "roughly monthly"); Construction Physics (Brian Potter, "every, every week or so"); Anthropic Science blog (Features / Workflows / Field Notes structure); Works in Progress (bimonthly print). Confidence: Verified (primary stated cadences).
- Tooling stack — Obsidian pricing and licensing (verified); Quartz designed for Obsidian-to-web (verified); Gwern.net on Pandoc + Hakyll (verified); Smart Connections / Copilot for Obsidian as the 2025-2026 RAG layer (verified). Confidence: Verified.
- Stripe docs — Mintlify blog https://mintlify.com/blog/stripe-docs; apidog.com analysis; Patrick McKenzie quote. Confidence: Industry-consensus (multiple secondary sources; named-individual primary statements).
- Astro Content Collections — https://docs.astro.build/en/guides/content-collections; deepwiki.com/withastro/astro.build for the astro.build site implementation. Confidence: Verified.
- OpenAlex — Codina et al. (ResearchGate analysis); 271.3M works as of November 2025; launched 2022 to replace the discontinued Microsoft Academic Graph. Confidence: Verified.
- Semantic Scholar — S2AG paper, arXiv:2301.10140; 214M+ publications, 80M+ authors, 2.4B+ citation edges in 2023 snapshot. Confidence: Verified.
- Open citation milestone — Hutchins, arXiv:2106.04695; 1 billion public-domain citations crossed February 2021. Confidence: Verified.
- Our World in Data — https://ourworldindata.org/faqs; CC-BY licensing, per-indicator JSON/CSV endpoints, GitHub-published processing code. Confidence: Verified.
- Stanford Encyclopedia of Philosophy — https://plato.stanford.edu/cite.html; permanent dated archived editions. Confidence: Verified (primary).
- Casey Newton / Platformer retreat — Nieman Lab, April 2026; "More scoops, less aggregation and analysis." Confidence: Verified.
- HubSpot 2015 volume study — 13,500+ customers; 16+ posts per month → "almost 3.5×" inbound traffic vs. 0–4 posts. Confidence: Industry-consensus on direction; dated dataset; use as steel-man material, not current benchmark.
- Stratabeat 2025 companion — 9+ posts per month → 20.1% organic traffic lift, B2B SaaS sample. Confidence: Single-source.
- HubSpot historical-optimization study (Pamela Vaughan) — https://blog.hubspot.com/marketing/historical-blog-seo-conversion-optimization; average 106% lift in monthly organic search views on optimized old posts. Confidence: Verified (HubSpot first-party).
- Patrick Stox / Ahrefs content-decay framing — "Every piece of content you've ever published is slowly dying." Confidence: Industry-consensus.
- RAG threshold and stack — Encore's pgvector benchmarks: "pgvector handles millions of vectors with HNSW indexing. Benchmarks show query times under 20ms at 1M vectors with recall rates above 95%." Confidence: Verified (vendor-published benchmark); 50-page payoff threshold is Internal-framework / Industry-consensus.
- Twilio docs-as-code — https://twilio.com/en-us/blog/developers/new-era-for-twilio-documentation. Confidence: Verified (primary).
For the longer-form research underlying this page, see Research brief: The knowledge-base-backed website (piece 3 of 15), Research brief: Research Before Pages — methodology for KB-backed websites (piece 14 of 15), and Research brief: Structured content as a competitive advantage (piece 2 of 15). For the two governing frameworks the methodology depends on, see Andy Matuschak: evergreen notes — atomic, concept-oriented, densely linked, accreting over time and Diátaxis (Daniele Procida): four documentation types — tutorials, how-to, reference, explanation.