Knowledge-base-backed website methodology

reference · Scope: marketing-site · Status: current

Created 2026-06-11 · Updated 2026-06-25

Overview

Knowledge-base-backed website methodology is the practice of treating a public-facing website as the visible tip of a structured, sourced, dated, version-controlled internal research corpus, rather than as a self-contained marketing artifact. The marketing page is the last production stage of the pipeline (Stage 4 of Stages 0–5, followed only by maintenance), not the first. The same markdown files that power private research generate the public site; every claim is one click away from its citation; every artifact carries a visible dateModified.

The methodology emerged from the convergence of three patterns that previously lived in separate disciplines: the Zettelkasten / digital-garden tradition (atomic notes, dense linking, evergreen status), the documentation-as-product movement (Stripe, Twilio, Anthropic), and the citation-grade publishing infrastructure of OpenAlex, Semantic Scholar, Our World in Data, and the Stanford Encyclopedia of Philosophy. The 2025-2026 addition is that the same KB that serves human readers and AI crawlers can also be queried by the writer's own retrieval-augmented (RAG) assistant, making foundation research operationally cheaper than it was a decade ago.

The methodology is encyclopedic and neutral in register. It treats Verified / Industry-consensus / Single-source / Speculative confidence labels as first-class metadata. It assumes the public site is read by the writer's future self, the internal team, AI agents, prospects, peers, and journalists simultaneously, and it layers content to serve each audience without flattening the others.

Research-first workflow: stages 0–5

The sequence is the methodology. Skipping Stage 1 and starting at Stage 3 is what produces content that does not hold up.

Stage 0 — Capture. Reading inbox plus writing inbox (Matuschak terminology). Transient notes, prompts, quotes, observations. Tool: plain markdown, an Obsidian daily note, or a memo app. No discipline imposed; the only requirement is that captured material does not get lost.

Stage 1 — Foundation research (internal). Atomic concept notes, one idea each, densely linked. Every claim sourced; every source dated. Confidence label inline: Verified / Industry-consensus / Single-source / Speculative. The audience is future self, the internal team, and AI agents indexing the KB. The output is a knowledge base, not a draft.

Stage 2 — Synthesis / outline. Cluster atomic notes around a public question. Decide what stays internal (the kill-your-darlings step). Identify the single argument the article will make. The output is an annotated outline that points at notes, not paragraphs.

Stage 3 — Public article (derived). Narrative draft for prospects and peers. Shorter than the research; one argument; named sources. Confidence still visible but smoothed into prose. The article links back into the public-facing portion of the KB. The output is a versioned, dated article with a visible last updated stamp.

Stage 4 — Marketing page. Brief; problem/outcome framing. Two or three sentences of substantive claim. Each claim linked to the article that defends it. Each article linked to the underlying research. The output is a short page whose credibility is one click deep.

Stage 5 — Maintain (living document). Quarterly review: which research has new sources? Which articles need a refreshed timestamp? Which marketing pages now point at stale defenses? Changelog visible to readers and crawlers.
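The quarterly review in Stage 5 can be sketched as a staleness check. This is a minimal illustration rather than part of the methodology's tooling: the Artifact record, its field names, the sample slugs and dates, and the 90-day window are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Artifact:
    """Hypothetical record for a note, article, or marketing page."""
    slug: str
    kind: str            # "note", "article", or "page" (illustrative values)
    date_modified: date


def due_for_review(artifacts, today, max_age_days=90):
    """Return artifacts last modified before the review window, oldest first."""
    cutoff = today - timedelta(days=max_age_days)
    stale = [a for a in artifacts if a.date_modified < cutoff]
    # Oldest first, so the most neglected material is reviewed first.
    return sorted(stale, key=lambda a: a.date_modified)


# Illustrative data only; slugs and dates are invented for the sketch.
artifacts = [
    Artifact("rag-threshold", "note", date(2026, 1, 10)),
    Artifact("docs-as-product", "article", date(2026, 6, 1)),
    Artifact("services", "page", date(2025, 11, 3)),
]

for item in due_for_review(artifacts, today=date(2026, 6, 25)):
    print(item.slug, item.kind, item.date_modified.isoformat())
```

The check only flags candidates. Deciding whether a flagged note has new sources, or whether a marketing page now points at a stale defense, remains an editorial judgment.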

Each entry in a foundation KB is an atomic concept note; a small seeder library that ingests markdown files into the KB store is the infrastructure layer that makes Stage 1 operationally sustainable.
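A seeder of this kind might look like the following sketch. It assumes flat key: value frontmatter and uses SQLite from the Python standard library as a stand-in KB store. The table layout, field names, and the parse_note and seed helpers are hypothetical, not a description of any particular library.

```python
import sqlite3

CONFIDENCE_LABELS = {"Verified", "Industry-consensus", "Single-source", "Speculative"}


def parse_note(text):
    """Split a markdown note into a frontmatter dict and a body.

    Handles only flat `key: value` frontmatter between two `---` lines;
    a real seeder would likely use a YAML parser instead.
    """
    lines = [line.rstrip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "---":
        raise ValueError("note has no frontmatter block")
    end = lines.index("---", 1)  # raises ValueError if the block is unclosed
    meta = {}
    for line in lines[1:end]:
        key, _, value = line.partition(":")  # split on the first colon only
        meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:]).strip()


def seed(conn, notes):
    """Insert or update one row per note; reject unknown confidence labels."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS notes (
               slug TEXT PRIMARY KEY, title TEXT NOT NULL, source TEXT NOT NULL,
               source_date TEXT NOT NULL, confidence TEXT NOT NULL, body TEXT NOT NULL
           )"""
    )
    for slug, text in notes.items():
        meta, body = parse_note(text)
        if meta.get("confidence") not in CONFIDENCE_LABELS:
            raise ValueError(f"{slug}: unknown confidence label {meta.get('confidence')!r}")
        # Upsert so re-running the seeder after an edit updates the row in place.
        conn.execute(
            "INSERT OR REPLACE INTO notes VALUES (?, ?, ?, ?, ?, ?)",
            (slug, meta["title"], meta["source"], meta["source_date"],
             meta["confidence"], body),
        )
    conn.commit()


# Example note; the source_date value is an illustrative access date.
NOTE = """---
title: SEP keeps permanent dated archives
source: https://plato.stanford.edu/cite.html
source_date: 2026-06-11
confidence: Verified
---
The Stanford Encyclopedia of Philosophy maintains dated archived editions of every entry.
"""

conn = sqlite3.connect(":memory:")
seed(conn, {"sep-archives": NOTE})
print(conn.execute("SELECT slug, confidence FROM notes").fetchall())
```

Rejecting unknown confidence labels at ingest time keeps the label taxonomy enforced at the research stage rather than retrofitted later.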

Rule: foundation research before article, article before marketing page

The research-first sequence applies end-to-end: Stage 1 (foundation research, atomic notes, fully sourced and confidence-labeled) → Stage 2 (synthesis outline) → Stage 3 (public article, derived) → Stage 4 (marketing page, linked to article). Never reverse the sequence; never start at Stage 4 and retrofit citations.

Why the rule holds:

Research-heavy content written backward from a marketing brief is still backward.
The KB-as-RAG-corpus shift (2025-2026) means foundation research is now infrastructure for the writer's own AI assistant, not just a paper trail.
Casey Newton's April 2026 retreat from daily cadence is the strongest single signal that AI commoditizes daily synthesis — original depth is the only durable advantage.
Confidence labels work only when applied at the research stage; retrofitting them after the marketing copy is written is editorial theater, not discipline.

How to apply:

Every engagement starts with foundation research notes (atomic, sourced, dated, confidence-labeled) in a shared Obsidian vault or a comparable KB store.
Articles are commissioned with explicit reference to the foundation notes they derive from — not from a topic brief.
Marketing pages link to articles; articles link to research notes; research notes link to primary sources. Credibility is one click deep, always.
Diátaxis mode-check (Daniele Procida's four-mode framework — tutorial / how-to / reference / explanation) before drafting any public article. Mixing modes inside one page is the most common content failure.
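The linking rule above (marketing pages link to articles, articles to research notes, notes to primary sources) lends itself to an automated check. The sketch below assumes a hypothetical in-memory site map; the slugs, the layer names, and the broken_chains helper are illustrative only.

```python
# Hypothetical link graph: slug -> (layer, outbound link targets).
SITE = {
    "services": ("page", ["why-docs-matter"]),
    "why-docs-matter": ("article", ["stripe-docs-note"]),
    "stripe-docs-note": ("note", ["https://mintlify.com/blog/stripe-docs"]),
    "pricing": ("page", []),
}

# Each layer must link to at least one target in the layer that defends it.
REQUIRED_TARGET = {"page": "article", "article": "note", "note": "source"}


def layer_of(target, site):
    """Classify a link target; URLs outside the site map count as primary sources."""
    if target in site:
        return site[target][0]
    return "source" if target.startswith(("http://", "https://")) else "unknown"


def broken_chains(site):
    """Return slugs that lack a link to the layer that defends them."""
    missing = []
    for slug, (layer, links) in site.items():
        needed = REQUIRED_TARGET[layer]
        if not any(layer_of(target, site) == needed for target in links):
            missing.append(slug)
    return missing


print(broken_chains(SITE))
```

In this example the pricing page makes claims with no article behind it, so it is the one entry reported.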

Cadence recommendation 2026

Realistic cadence for research-first content operations, by operator profile (2026):

Operation type	Realistic cadence	Reference point
Solo, primary focus elsewhere	1 piece/month	Bits about Money (Patrick McKenzie self-describes as "roughly monthly")
Solo, primary work	1 piece/week	Construction Physics (Brian Potter's stated target: "every, every week or so")
Small team (2-4 people)	1 deep piece/week + ongoing notes	Anthropic Science blog (Features + Workflows + Field Notes)
Quarterly long-form house	4-6 major pieces/year	Works in Progress (shifting to bimonthly print)

For a small service business operating research-first, the realistic answer is one public article every 2-4 weeks, with foundation-research notes drafted continuously in the background.

The trap is benchmarking against high-volume content marketing (HubSpot's 16+ posts/month threshold, discussed below) and concluding research-first "doesn't work." It works on a different axis: per-piece authority, AI citation likelihood, and conversion among readers who already trust the category. The right metric is not posts per month. It is whether the next piece written makes the previous one more valuable, or just adds another file to the same folder.

Tooling stack 2026

Recommended default stack for a small operation:

Capture + foundation research: Obsidian (local-first, markdown, free for personal and commercial use; the $50/user/year commercial license has been optional since February 2025). Bidirectional linking and graph view are the operational expression of "compounding research." Plain markdown means no vendor lock-in.
Public knowledge base: static site built from the same markdown files — practitioners use Eleventy, Astro, Next.js, or Quartz (designed for Obsidian-to-web publishing). Gwern.net is the high-water mark (Pandoc + Hakyll). The public KB is generated from the same source files as the private research, not maintained separately.
Drafting articles: whatever the writer already uses, with foundation notes open in a second window. Some draft directly in Obsidian; others use iA Writer, Ulysses, or Google Docs and paste back.
Marketing pages: the existing CMS, with explicit link-outs to the KB. The KB is the substrate the marketing site rests on, not its replacement.

Why not Notion alone: strong for team collaboration and structured databases; weaker for dense-linking atomic notes. Many practitioners use Notion for operations plus Obsidian for thinking.

Why not Roam or Logseq: both credible. Roam pioneered bidirectional-link UX; Logseq is open-source and outline-first. Choose based on whether you think in pages (Obsidian) or outlines (Logseq).

The AI layer (2026 reality): Smart Connections, Copilot for Obsidian, and similar plugins let a writer query their own KB conversationally. This is the most material change since 2023: a research-first KB is now also a working RAG corpus for the writer's own AI assistant.

Non-negotiables regardless of tool:

Plain text (markdown) under version control (Git).
Dated, sourced atomic notes.
Linking discipline rather than folder discipline.
Visible last updated timestamp on every public artifact.

A Postgres + Express + server-rendered HTML variant is one viable alternative to the Obsidian-to-static-site path; the Obsidian recommendation here is the default for solo operators following the methodology.

Named practitioners 2026

Eight named exemplars of research-first / docs-as-product methodology, each illustrating a different facet:

Stripe — Documentation as product, not marketing artifact. Custom tooling (Markdoc), documentation embedded in engineering job ladders, "docs are part of done." See the dedicated section below.
Twilio — Rebuilt its docs platform on Next.js with an explicit "docs as code" philosophy: source-controlled, peer-reviewed, preview-deployed. Same Git workflow as the product code. https://twilio.com/en-us/blog/developers/new-era-for-twilio-documentation
Anthropic Science blog + Alignment Science blog — Three-format structure: Features (specific results), Workflows (practical guides), Field Notes (interim findings published explicitly as in-progress research). The public-facing tip of a much larger internal corpus.
Gwern.net — The high-water mark of "stable long-term essays which improve over time." Every essay has creation plus last-modified dates; essays versioned across years; design assumes decades.
Andy Matuschak (notes.andymatuschak.org) — The exemplar of "work with the garage door up": private research notes published publicly with explicit context that they were not written for the reader. Originator of the "evergreen notes" formulation — notes written to be permanent, atomic, densely linked, and concept-oriented rather than topic-oriented.
Maggie Appleton — The digital-garden model in practice: seedling/budding/evergreen status labels on each note; clear epistemic posture; the public site is the literal substrate of the published essays.
Bits about Money (Patrick McKenzie) — Monthly long-form business writing built on visible domain depth. Each piece anchored in primary-source familiarity, not retrofitted citations.
Construction Physics (Brian Potter) — Weekly engineering-economics essays each anchored in primary sources (industry reports, BLS data, historical engineering texts). Notes are not public, but citation density makes the research substrate visible.

Honorable mentions: Works in Progress (bimonthly print plus online, commissioned essays with editorial fact-checking); Stratechery (Ben Thompson's daily/weekly split is a hybrid — research-heavy weekly Articles supported by quicker daily Updates); The New Yorker (not research-first in publication order but research-first in verification order — same operating principle, applied at the back of the workflow).

Stripe docs as product

Stripe treats documentation as a first-class product. Engineers' documentation contributions are counted in performance reviews. The open-source Markdoc framework backs interactive features. Former Head of Docs Dave Nunez instituted writing classes for ESL engineers and a "doc star of the week" program.

Quote (Patrick McKenzie, on Stripe): "Stripe is a celebration of the written word which happens to be incorporated in the state of Delaware."

Sources: Mintlify blog — https://mintlify.com/blog/stripe-docs; apidog.com analysis.

Confidence: Industry-consensus (multiple secondary sources, named-individual primary statements).

Stripe is the strongest commercial proof that "docs as a product" is a competitive moat. The pattern: structured + searchable + versioned + cross-linked + treated with the same engineering rigor as the codebase. Stripe — not personal digital gardens — is the lead example when arguing the methodology to commercial decision-makers.

Astro Content Collections + Zod

Astro's Content Collections (stable since Astro 2.0; Content Layer API in Astro 5, late 2024) use Zod schemas for build-time validation of Markdown/MDX frontmatter, with auto-generated TypeScript types. The Astro team's own astro.build website uses Zod-validated Content Collections for blog, integrations, showcase, and authors — with refinements that explicitly reject scraping-error strings like "Just a moment…" in titles.

Sources: https://docs.astro.build/en/guides/content-collections; deepwiki.com/withastro/astro.build.

Confidence: Verified.

Astro Content Collections — or the equivalent in any framework: gray-matter + Zod, MDX-on-Next.js, Contentlayer — is the structural mechanism that makes a KB-shaped marketing site practical. Schema validation at build time is the structural equivalent of Wikipedia's verifiability policy: it makes missing sources, dates, or confidence flags impossible to ship.
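The build-time gate can be illustrated without any framework. The sketch below is a plain-Python analogue of a schema check, not Zod or Astro code. The required field names and the validate_entry and build helpers are assumptions; the "Just a moment" rejection mirrors the astro.build refinement described above.

```python
from datetime import date

CONFIDENCE_LABELS = ("Verified", "Industry-consensus", "Single-source", "Speculative")
SCRAPE_ERROR_MARKERS = ("Just a moment",)  # mirrors the refinement described above


def validate_entry(meta):
    """Return a list of problems; an empty list means the entry may ship.

    Assumes frontmatter values are already parsed into strings and lists.
    """
    problems = []
    for field in ("title", "sources", "dateModified", "confidence"):
        if not meta.get(field):
            problems.append(f"missing {field}")
    if any(marker in meta.get("title", "") for marker in SCRAPE_ERROR_MARKERS):
        problems.append("title looks like a scraping error")
    if meta.get("confidence") and meta["confidence"] not in CONFIDENCE_LABELS:
        problems.append(f"unknown confidence label: {meta['confidence']}")
    if meta.get("dateModified"):
        try:
            date.fromisoformat(meta["dateModified"])
        except ValueError:
            problems.append("dateModified is not an ISO date")
    return problems


def build(entries):
    """Fail the whole build if any entry is invalid, as a schema check would."""
    failures = {}
    for slug, meta in entries.items():
        problems = validate_entry(meta)
        if problems:
            failures[slug] = problems
    if failures:
        raise ValueError(f"build failed: {failures}")
    return sorted(entries)


good = {
    "title": "Stripe docs as product",
    "sources": ["https://mintlify.com/blog/stripe-docs"],
    "dateModified": "2026-06-25",
    "confidence": "Industry-consensus",
}
bad = {"title": "Just a moment...", "sources": [], "dateModified": "June 2026",
       "confidence": "Probably"}

print(validate_entry(good))
print(validate_entry(bad))
```

The design point is that validation runs before publication and stops the build, so an entry without sources, a date, or a recognized confidence label never reaches the public site.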

OpenAlex and Semantic Scholar citation graphs

Two open scholarly knowledge graphs operate at scale:

OpenAlex (OurResearch nonprofit) — 271.3 million works as of November 2025; launched 2022 to replace the discontinued Microsoft Academic Graph.
Semantic Scholar (Allen Institute for AI) — over 214 million publications, 80M+ authors, 2.4B+ citation edges in its 2023 snapshot per S2AG paper (arXiv:2301.10140).

Sources: Codina et al. (ResearchGate analysis) for OpenAlex; S2AG paper arXiv:2301.10140 for Semantic Scholar.

Confidence: Verified.

Open citation milestone: open-citation data crossed 1 billion public-domain citations in February 2021, marking what Hutchins called an irreversible shift away from closed commercial providers (arXiv:2106.04695).

These figures matter for the methodology because they prove "citation graphs" are mature, working infrastructure — not a science-fiction concept. The agency methodology is borrowing a pattern that already operates at 271M-node scale. A 100-entry KB applying the same structural primitives (atomic units, typed links, dated provenance) is operating on the same axis as OpenAlex, just at a different magnitude.

Our World in Data: open provenance

Our World in Data (Oxford-based) publishes under CC-BY, exposes structured CSV plus JSON metadata endpoints per indicator, and publishes its data-processing code as open source on GitHub. Every chart on the site has a downloadable underlying dataset.

Source: https://ourworldindata.org/faqs

Confidence: Verified.

Our World in Data is the strongest non-corporate KB exemplar because provenance is enforced by infrastructure. A publisher cannot publish a chart without a dataset; cannot publish a dataset without a citation; the code that processed the data is public. The same discipline applied to a marketing KB would force every claim to carry its source row.

Stanford Encyclopedia of Philosophy: permanent archives

The Stanford Encyclopedia of Philosophy maintains permanent dated "archived editions" of every entry so scholars can cite a specific snapshot in time.

Source: https://plato.stanford.edu/cite.html

Confidence: Verified (primary).

Most CMS-backed marketing sites overwrite history on every edit. A KB-backed site can preserve dated versions — either via Git history surfaced as URLs, via Docusaurus-style version snapshots, or via a dateModified plus change log per entry. The Stanford pattern is the strongest example of citation-grade content discipline.

A KB store can align with this posture at the schema level: capture created_at, updated_at, and per-update history via an audit trigger; preserve old aliases on slug renames; soft-delete only. The infrastructure for SEP-style citation discipline is achievable; what is harder is the editorial habit.
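The schema-level posture can be sketched as follows. SQLite from the Python standard library is used here only so the example runs anywhere; a Postgres store would express the same idea with trigger functions. Apart from created_at and updated_at, every table, column, and trigger name is a hypothetical choice for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE entries (
    id INTEGER PRIMARY KEY,
    slug TEXT NOT NULL UNIQUE,
    body TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    updated_at TEXT NOT NULL DEFAULT (datetime('now')),
    deleted_at TEXT              -- soft delete: rows are flagged, never removed
);
CREATE TABLE entry_history (
    entry_id INTEGER NOT NULL REFERENCES entries(id),
    old_slug TEXT NOT NULL,
    old_body TEXT NOT NULL,
    changed_at TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE TABLE slug_aliases (
    alias TEXT PRIMARY KEY,
    entry_id INTEGER NOT NULL REFERENCES entries(id)
);
-- Audit trigger: snapshot the previous version on every content change.
CREATE TRIGGER entries_audit AFTER UPDATE OF slug, body ON entries
BEGIN
    INSERT INTO entry_history (entry_id, old_slug, old_body)
    VALUES (OLD.id, OLD.slug, OLD.body);
    UPDATE entries SET updated_at = datetime('now') WHERE id = OLD.id;
END;
-- Keep the old slug resolvable after a rename.
CREATE TRIGGER entries_alias AFTER UPDATE OF slug ON entries
WHEN OLD.slug <> NEW.slug
BEGIN
    INSERT OR IGNORE INTO slug_aliases (alias, entry_id) VALUES (OLD.slug, OLD.id);
END;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO entries (slug, body) VALUES ('rag-threshold', 'v1')")
# Rename and edit: the old version goes to history, the old slug to aliases.
conn.execute("UPDATE entries SET slug = 'rag-payoff', body = 'v2' WHERE slug = 'rag-threshold'")
# Soft delete: flag the row instead of removing it.
conn.execute("UPDATE entries SET deleted_at = datetime('now') WHERE slug = 'rag-payoff'")

print(conn.execute("SELECT old_slug, old_body FROM entry_history").fetchall())
print(conn.execute("SELECT alias FROM slug_aliases").fetchall())
print(conn.execute("SELECT COUNT(*) FROM entries").fetchone()[0])
```

Because the history table holds every prior version with a timestamp, a dated snapshot URL in the SEP style could be served from it, although that routing layer is outside this sketch.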

Newton / Platformer retreat from daily cadence (April 2026)

In April 2026, Casey Newton publicly retreated from Platformer's daily cadence with the framing: "More scoops, less aggregation and analysis."

Source: Nieman Lab, April 2026.

Confidence: Verified.

This is the strongest single data point against the daily-cadence model. Newton is the named exemplar of credible daily tech newsletter cadence post-Stratechery. His explicit reason for the shift is AI commoditizing daily synthesis — the work that powered the model has been undercut by GPT, Claude, and Gemini doing equivalent aggregation in seconds.

The implication for any research-first operation: the "ship daily" doctrine that worked for Godin, Levine, and Stratechery in the 2010s through early 2020s has a 2026 expiration. The replacement is not "ship less" — it is ship work AI engines cannot easily replicate: original research, named sources, dated claims, structured depth.

HubSpot 2015 volume study: 16+ posts / 3.5× traffic

HubSpot's 2015 study of 13,500+ customers found that "companies that published 16+ blog posts per month got almost 3.5× more inbound traffic than companies that published between 0-4 monthly posts."

Source: HubSpot 2015 inbound study (widely re-circulated since).

Confidence: Industry-consensus on direction. The dataset is dated, and the magnitude almost certainly does not survive the 2024-2026 Google E-E-A-T updates intact. Use as steel-man material, not as a current benchmark.

Companion 2025 data point: Stratabeat's B2B SaaS data — "websites that published 9+ blog posts per month saw a 20.1% increase in monthly organic traffic." Same direction, single-source, more recent.

The honest reading: volume-correlation data points one direction; the AI-citation and trust dimensions point the other. Research-first competes on the trust/citation axis. Both can be true: a research-first SMB writing 1 article every 2-4 weeks captures different value than a content factory writing 16+ per month.

HubSpot historical-optimization study (Pamela Vaughan): +106% on refreshed posts

Quote (HubSpot, Pamela Vaughan historical-optimization study):

"We've increased the number of monthly organic search views of old posts we've optimized by an average of 106%."

Source: https://blog.hubspot.com/marketing/historical-blog-seo-conversion-optimization

Confidence: Verified (HubSpot first-party).

The compounding case: the same content that is "dying" (Patrick Stox, Ahrefs: "Every piece of content you've ever published is slowly dying") can be resurrected with a structured refresh. This is what separates a foundation-first site (where the asset compounds via refresh) from a rebuild-every-3-years site (where every refresh discards the asset). It also reinforces the rule that publish-date and last-updated-date are mandatory metadata on every public artifact.

RAG: when it pays off (50-page threshold)

Retrieval-augmented generation on a marketing site only pays off above approximately 50 pages of substantive content AND when visitor questions do not map cleanly to navigation.

Threshold conditions for RAG payoff:

50+ pages of substantive content (knowledge base, docs, case studies).
User questions do not map cleanly to navigation (complex services, regulated industries).
Client has high support-ticket volume that could be deflected.

When traditional search wins:

Under 50 pages.
Strongly structured content (products with SKUs, locations, services with names).
Project scope leaves no engineering room for a vector-search pipeline alongside the broader build.
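The two lists above can be read as a decision rule. The sketch below is one possible encoding, under stated assumptions: page count and navigation mismatch are treated as required (per the opening sentence of this section), structured content and missing engineering room as vetoes, and support-ticket volume as a supporting signal rather than a hard requirement. The SiteProfile fields, the function name, and the sample profiles are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SiteProfile:
    """Hypothetical summary of a site, used only for this sketch."""
    substantive_pages: int
    questions_map_to_navigation: bool
    strongly_structured_content: bool
    high_support_ticket_volume: bool
    engineering_room_for_vector_pipeline: bool


def recommend_search(profile, page_threshold=50):
    """Return ("rag" or "traditional", list of reasons)."""
    blockers = []
    if profile.substantive_pages < page_threshold:
        blockers.append(f"under {page_threshold} pages")
    if profile.questions_map_to_navigation:
        blockers.append("questions map cleanly to navigation")
    if profile.strongly_structured_content:
        blockers.append("strongly structured content")
    if not profile.engineering_room_for_vector_pipeline:
        blockers.append("no engineering room for a vector-search pipeline")
    if blockers:
        return "traditional", blockers
    reasons = [f"{profile.substantive_pages} pages",
               "questions do not map to navigation"]
    if profile.high_support_ticket_volume:
        reasons.append("support tickets could be deflected")
    return "rag", reasons


# Illustrative profiles, not measurements from real sites.
complex_service_site = SiteProfile(120, False, False, True, True)
small_local_site = SiteProfile(18, True, True, False, False)

print(recommend_search(complex_service_site))
print(recommend_search(small_local_site))
```

Any single blocker sends the recommendation to traditional search, which matches the stated default below the threshold.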

Default RAG stack for SMB:

Embedding DB: pgvector in Postgres (Supabase, Neon, or self-hosted). Per Encore's pgvector benchmarks: "pgvector handles millions of vectors with HNSW indexing. Benchmarks show query times under 20ms at 1M vectors with recall rates above 95%."
Don't reach for Pinecone, Weaviate, or Turbopuffer for SMB work: they are real wins above 10M vectors or with extreme multi-tenant requirements, neither of which applies to local-service businesses.
Embedding model: OpenAI text-embedding-3-small (1536 dim, cheap) or Voyage AI for higher quality.
LLM for generation: Claude Haiku for cost, Sonnet for quality.
Frontend: server-rendered chat island in Astro, streaming response via Cloudflare Worker.
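For readers unfamiliar with what the embedding database in this stack does at query time, the retrieval step reduces to ranking stored chunks by vector similarity to the question. The sketch below does an exact brute-force cosine ranking in plain Python. It is a stand-in for illustration only: pgvector exposes cosine distance through its `<=>` operator and uses HNSW for approximate search at scale, and the toy three-dimensional vectors and slugs here are invented in place of real 1536-dimension embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=2):
    """Exact nearest-neighbour search over (slug, vector) pairs."""
    scored = [(cosine_similarity(query_vec, vec), slug) for slug, vec in chunks]
    scored.sort(reverse=True)  # highest similarity first
    return [slug for _, slug in scored[:k]]


# Toy vectors standing in for real embeddings of content chunks.
CHUNKS = [
    ("pricing-faq", [0.9, 0.1, 0.0]),
    ("service-areas", [0.1, 0.9, 0.1]),
    ("warranty-terms", [0.7, 0.2, 0.3]),
]

print(top_k([1.0, 0.0, 0.2], CHUNKS))
```

The retrieved chunks are then passed to the generation model as context; that step is omitted here.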

For traditional search (the preferred default below the threshold): Pagefind (free, runs at build time, sub-300kB index for most SMB sites), Orama (in-browser, no backend), MiniSearch (lightweight JS), Algolia (when typo tolerance or analytics matter), Meilisearch (self-hosted, faceted).

Audience layering: foundation → article → marketing

The three layers of a research-first content operation, side by side:

Dimension	Foundation Research	Public Article	Marketing Page
Primary audience	Future self, internal team, AI agents indexing the KB	Prospects evaluating expertise; peers; journalists	Prospects making a decision
Implied secondary	None — written as if no one is watching	Skim-readers, AI engines, search crawlers	Sales conversations, AI shopping assistants
Length	Atomic notes 50-500 words each; clusters 5,000+	1,200-3,500 words	200-600 words
Tone	Telegraphic, technical, hedged with confidence labels	Narrative, opinionated, conversational	Direct, outcome-framed
Source treatment	Every claim cited inline with URL + date + confidence	Named sources in prose; key citations linked; confidence smoothed	One or two anchor citations; rest deferred to linked article
Update behavior	Continuous; notes evolve	Versioned with visible `last updated`	Rewritten when underlying article changes materially
Failure mode if skipped	Articles repeat received wisdom; no compounding	Marketing pages assert without defending	Articles get traffic but no conversion or trust
Closest analog	A researcher's lab notebook; a Zettelkasten	A New Yorker feature; Bits about Money	Stripe docs landing page

A KB store occupies the Foundation Research column; the public marketing site occupies the Marketing Page column. The Article column is the layer where prospect-facing writing gets derived from KB research — its presence or absence is the structural argument that closes the methodology.

Foundation roadmap: the 15-piece closure

The 15-piece research-brief roadmap forms a system, not a list. End to end:

Pieces 1-4 (strategic frame): the negative case (most marketing sites do nothing), the positive case (the site as platform, not pamphlet), and Agency methodology for small-business website projects, including the 2026 SEO/GEO/Local Search Playbook for Kitchener-Waterloo, which together define the competitive environment.
Pieces 5-7 (infrastructure layer): three pieces from Agency methodology for small-business website projects, covering stack ownership, IA, and page speed. These are the engineering substrate. Without them, the discipline of pieces 8-15 has no durable home.
Pieces 8-9 (build philosophy): two pieces from Agency methodology for small-business website projects that defend platform longevity. The discipline only pays off over time, on infrastructure that survives.
Pieces 10-11 (content asset): two pieces from Data infrastructure on a small-business budget that make the case that structured, sourced data is the durable competitive asset of a modern services business.
Pieces 12-14 (operationalize the asset): the research-first methodology itself — the workflow stages described above.
Piece 15 (the closing piece): the editorial layer that turns all of the above from a content-strategy story into a credibility story. Every piece in the roadmap makes claims; this piece is the one that says how those claims must be made.

The throughline

A credible SMB website in 2026 is a structured, owned, durable platform whose content is verifiable by both human readers and AI engines.

If that platform is a house, pieces 1-14 are its walls and piece 15 is the foundation under all of them. It is numbered 15 rather than 1 because it can only be specified once the rest of the system is described, even though it underpins the rest in practice.

What this means in operation

Every public article derives from the KB, using a published confidence-label taxonomy applied at the research stage.
Every engagement starts at Stage 1 of the research-first workflow above.
Every claim in deliverables follows a what-to-source checklist applied at the foundation-research stage.
Every published artifact carries a dateModified and a visible last-updated stamp.
Every correction follows a documented retraction-and-correction playbook with visible change history.
The public site is the demonstration of the methodology — the marketing surface IS the operationalization of the playbook.
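The dateModified requirement can be made concrete with structured metadata. The sketch below assumes the dateModified referred to here is the schema.org property of that name and emits Article JSON-LD alongside a human-readable stamp. The helper names and the example.com URL are placeholders; the dates reuse this entry's own created and updated dates.

```python
import json
from datetime import date


def article_jsonld(headline, url, published, modified):
    """Build schema.org Article metadata with explicit publication and modification dates."""
    if modified < published:
        raise ValueError("dateModified cannot precede datePublished")
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "url": url,
        "datePublished": published.isoformat(),
        "dateModified": modified.isoformat(),
    }


def visible_stamp(modified):
    """Human-readable counterpart of dateModified, shown on the page itself."""
    return f"Last updated {modified.isoformat()}"


meta = article_jsonld(
    "Knowledge-base-backed website methodology",
    "https://example.com/kb/methodology",  # placeholder URL
    published=date(2026, 6, 11),
    modified=date(2026, 6, 25),
)
print(json.dumps(meta, indent=2))
print(visible_stamp(date(2026, 6, 25)))
```

Generating the machine-readable date and the visible stamp from the same value keeps the two from drifting apart.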

Related frameworks

The methodology pulls together several established frameworks that originated outside marketing:

Matuschak evergreen notes — Notes are written to be permanent (refactored over time, not appended), atomic (one idea per note), densely linked, and concept-oriented rather than topic-oriented. The Foundation Research column of the layering table is operationally the same artifact as a Matuschak-style evergreen note.
Diátaxis framework (Procida) — Four documentation modes — tutorial, how-to, reference, explanation — each with a distinct user need, voice, and structure. The pre-draft mode check is the single highest-leverage editorial discipline in the methodology; mixing modes inside one page is the most common failure pattern in marketing-site content.
Stripe docs-as-product — Documentation contributions in engineering performance reviews; Markdoc as in-house tooling; structured plus versioned plus cross-linked plus engineered to the same standard as the code. The commercial proof case.
OpenAlex and Semantic Scholar citation graphs — Proof that typed, atomic, link-rich knowledge graphs operate at 271M-node scale.
Our World in Data — Proof that provenance can be enforced by infrastructure rather than goodwill.
Stanford Encyclopedia of Philosophy — Proof that permanent dated editions are achievable and citable.

Sources and confidence

Research-first workflow (Stages 0–5) — internal framework; synthesises Matuschak's writing-inbox terminology, the Zettelkasten tradition, and standard documentation-pipeline conventions. Confidence: Internal-framework / Industry-consensus on each component.
Cadence reference points — Bits about Money (Patrick McKenzie self-described as "roughly monthly"); Construction Physics (Brian Potter, "every, every week or so"); Anthropic Science blog (Features / Workflows / Field Notes structure); Works in Progress (bimonthly print). Confidence: Verified (primary stated cadences).
Tooling stack — Obsidian pricing and licensing (verified); Quartz designed for Obsidian-to-web (verified); Gwern.net on Pandoc + Hakyll (verified); Smart Connections / Copilot for Obsidian as the 2025-2026 RAG layer (verified). Confidence: Verified.
Stripe docs — Mintlify blog https://mintlify.com/blog/stripe-docs; apidog.com analysis; Patrick McKenzie quote. Confidence: Industry-consensus (multiple secondary sources; named-individual primary statements).
Astro Content Collections — https://docs.astro.build/en/guides/content-collections; deepwiki.com/withastro/astro.build for the astro.build site implementation. Confidence: Verified.
OpenAlex — Codina et al. (ResearchGate analysis); 271.3M works as of November 2025; launched 2022 to replace the discontinued Microsoft Academic Graph. Confidence: Verified.
Semantic Scholar — S2AG paper, arXiv:2301.10140; 214M+ publications, 80M+ authors, 2.4B+ citation edges in 2023 snapshot. Confidence: Verified.
Open citation milestone — Hutchins, arXiv:2106.04695; 1 billion public-domain citations crossed February 2021. Confidence: Verified.
Our World in Data — https://ourworldindata.org/faqs; CC-BY licensing, per-indicator JSON/CSV endpoints, GitHub-published processing code. Confidence: Verified.
Stanford Encyclopedia of Philosophy — https://plato.stanford.edu/cite.html; permanent dated archived editions. Confidence: Verified (primary).
Casey Newton / Platformer retreat — Nieman Lab, April 2026; "More scoops, less aggregation and analysis." Confidence: Verified.
HubSpot 2015 volume study — 13,500+ customers; 16+ posts per month → "almost 3.5×" inbound traffic vs. 0–4 posts. Confidence: Industry-consensus on direction; dated dataset; use as steel-man material, not current benchmark.
Stratabeat 2025 companion — 9+ posts per month → 20.1% organic traffic lift, B2B SaaS sample. Confidence: Single-source.
HubSpot historical-optimization study (Pamela Vaughan) — https://blog.hubspot.com/marketing/historical-blog-seo-conversion-optimization; average 106% lift in monthly organic search views on optimized old posts. Confidence: Verified (HubSpot first-party).
Patrick Stox / Ahrefs content-decay framing — "Every piece of content you've ever published is slowly dying." Confidence: Industry-consensus.
RAG threshold and stack — Encore's pgvector benchmarks: "pgvector handles millions of vectors with HNSW indexing. Benchmarks show query times under 20ms at 1M vectors with recall rates above 95%." Confidence: Verified (vendor-published benchmark); 50-page payoff threshold is Internal-framework / Industry-consensus.
Twilio docs-as-code — https://twilio.com/en-us/blog/developers/new-era-for-twilio-documentation. Confidence: Verified (primary).