Research brief: Public data as a private moat — building proprietary intelligence from government open data (piece 11 of 15)
Status: Research brief, not a finished article. Compiled May 2026.
Thesis
Free is not the moat; clean is. OGL-Canada and U.S. public-domain status (17 USC §105) give every business identical rights to the raw data. The durable advantage comes from normalization, time-series accumulation, and provenance tracking. The gap between legally free and operationally usable is where small teams can compete with much larger incumbents that underinvest in cleaning.
The canonical demonstrations: Zillow (110M-home "living database" built on Census/ACS and county assessor records), ATTOM Data (500M+ transactions across 2,690+ counties, 20-step normalization), Cherre (a property knowledge graph fusing public and vendor data, powering management of $3.3T AUM), FlightAware (FAA + ANSP feeds from 45 countries + 30,000+ user-hosted ADS-B receivers + Aireon), The Climate Corporation (NOAA/NWS/USGS/NRCS/NASA data; now a Bayer subsidiary).
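The three levers named in the thesis (normalization, time-series accumulation, provenance tracking) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any vendor's pipeline: the raw field names ("Region ID", "Pop (000s)"), the source label "example-agency", and the in-memory store are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Provenance:
    source: str          # publishing agency or feed name (hypothetical label)
    licence: str         # licence the raw record was released under
    retrieved_on: date   # when this snapshot was pulled
    raw_sha256: str      # hash of the untouched payload, for later audits


@dataclass
class SnapshotStore:
    """Append-only history: one row per (key, retrieval date)."""
    rows: dict = field(default_factory=dict)

    def append(self, key, as_of, values, prov):
        # setdefault keeps the first write, so re-ingesting the same day is a no-op
        self.rows.setdefault((key, as_of), {"values": values, "prov": prov})

    def history(self, key):
        items = [(d, r["values"]) for (k, d), r in self.rows.items() if k == key]
        return sorted(items, key=lambda item: item[0])


def normalize(raw: dict) -> tuple[str, dict]:
    """Map one raw record onto a fixed schema (raw field names are assumed)."""
    key = raw["Region ID"].strip().upper()
    values = {
        "region_id": key,
        # the assumed feed reports thousands as a comma-formatted string
        "population": int(raw["Pop (000s)"].replace(",", "")) * 1000,
    }
    return key, values


def ingest(store, raw, source, licence, retrieved_on):
    payload = json.dumps(raw, sort_keys=True).encode("utf-8")
    prov = Provenance(source, licence, retrieved_on, hashlib.sha256(payload).hexdigest())
    key, values = normalize(raw)
    store.append(key, retrieved_on, values, prov)
    return key


store = SnapshotStore()
ingest(store, {"Region ID": " on-35 ", "Pop (000s)": "1,234"},
       "example-agency", "OGL-Canada", date(2026, 4, 1))
ingest(store, {"Region ID": "ON-35", "Pop (000s)": "1,240"},
       "example-agency", "OGL-Canada", date(2026, 5, 1))
```

Calling `store.history("ON-35")` returns both monthly snapshots in date order. The accumulated history and the per-row provenance are the parts a competitor cannot download after the fact, even though each raw record is free.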
What changed in 2026
- Legal: The Ninth Circuit's hiQ v LinkedIn (Apr 2022) indicates that scraping publicly accessible data likely does not violate the CFAA, and Meta v Bright Data (N.D. Cal. Jan 2024) held that Facebook/Instagram terms do not bar logged-off scraping of public data. But hiQ's $500K judgment for User Agreement breach (logged-in scraping and fake accounts) shows the contract trap remains.
- Infrastructure: A 1-3 person operation can run a real ELT stack (DuckDB local + MotherDuck free tier + dbt + GitHub Actions) for under $50/month if it stays within free tiers. MotherDuck's Business tier reportedly moved from $100 to $250/month between Dec 2025 and Feb 2026, so the cheap managed-OLAP window may be closing.
- AI citation: Perplexity averages 21.87 citations per response (Qwairy Q3 2025) and favors pages with "visible statistics and proprietary data, named sources with verifiable methodology." Normalized open-data dashboards supply all three: statistics, proprietary data, and named sources with a verifiable methodology.
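The infrastructure point rests on simple arithmetic, worked here as a rough check. The prices are the reported figures from this brief and have not been confirmed against MotherDuck's own pricing page.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


BUDGET_CEILING = 50.0   # "under $50/month" target from this brief
BUSINESS_OLD = 100.0    # reported Business tier price, Dec 2025
BUSINESS_NEW = 250.0    # reported Business tier price, Feb 2026

increase_pct = pct_change(BUSINESS_OLD, BUSINESS_NEW)    # 150.0
multiple_of_budget = BUSINESS_NEW / BUDGET_CEILING       # 5.0
annual_delta = (BUSINESS_NEW - BUSINESS_OLD) * 12        # 1800.0
```

On these numbers, one step up from the free tier costs five times the entire monthly budget, which is why the sub-$50 figure holds only while usage stays inside free tiers.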
Honest caveats
- Cherre's "200+ datasets" figure is from a third-party profile; Cherre's own platform copy says "50-plus additional data sources." Use the verified $3.3T AUM figure instead.
- Our World in Data's 89M visitor figure dates from 2021; no current comparable figure was located.
- The MotherDuck pricing change is sourced from independent technical blogs; MotherDuck did not announce it publicly. Verify before quoting in a deliverable.
- The hiQ precedent is binding only within the Ninth Circuit; consult counsel for any other jurisdiction.
- Defamation/accuracy risk is real when republishing claims about identifiable persons/businesses from open data — aggregate to neighbourhood/postal-code level, reproduce the StatCan "as is" disclaimer, carry a public errata policy.
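The aggregation step in the last caveat can be sketched as follows. This is one possible approach, assuming records carry a Canadian postal code and that the forward sortation area (its first three characters) is a coarse enough unit; the field names are hypothetical, and the disclaimer string is a placeholder to be replaced with the source's actual wording.

```python
from collections import defaultdict

# Placeholder only: paste the licence's real "as is" wording here.
DISCLAIMER = "<source 'as is' disclaimer text>"


def fsa(postal_code: str) -> str:
    """First three characters of a Canadian postal code (forward sortation area)."""
    return postal_code.replace(" ", "").upper()[:3]


def aggregate_by_fsa(records, value_field):
    """Collapse per-entity rows into per-area counts and means.

    Identifying fields (names, addresses) never reach the output.
    """
    totals = defaultdict(lambda: {"count": 0, "total": 0.0})
    for rec in records:
        bucket = totals[fsa(rec["postal_code"])]
        bucket["count"] += 1
        bucket["total"] += rec[value_field]
    return {
        area: {
            "count": b["count"],
            "mean": b["total"] / b["count"],
            "disclaimer": DISCLAIMER,
        }
        for area, b in sorted(totals.items())
    }


rows = [
    {"business_name": "Example A", "postal_code": "N2L 3G1", "score": 10.0},
    {"business_name": "Example B", "postal_code": "n2l3g5", "score": 20.0},
    {"business_name": "Example C", "postal_code": "N2H 1A1", "score": 30.0},
]
summary = aggregate_by_fsa(rows, "score")
```

Areas with very few records can still point to one entity, so a real product would likely also need a minimum-count threshold before publishing a cell; that threshold is a judgment call not covered by this brief.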
Related
- reference OGL-Canada v2.0: worldwide royalty-free perpetual licence for commercial use
- reference Statistics Canada Open Licence: explicitly permits "use, reproduce, publish, freely distribute, or sell value-added products"
- reference ECCC/MSC Open Data: free anonymous access to weather/climate/water via OGC-compliant GeoMet APIs
- reference Canada Energy Regulator: pipeline throughput/capacity/tolls + Market Snapshots, all under OGL-Canada
- reference Ontario Open Data Catalogue: 2,948 datasets under OGL-Ontario v1.0
- reference U.S. EIA APIv2: free registered-key access to petroleum/electricity/gas/coal/STEO/AEO data; WPSR releases 10:30 AM ET Wednesdays
- reference NOAA/NWS: information on NWS web servers is in the public domain — no attribution required, provided "as is"
- reference International open-data licences (2026): UK OGL v3, OECD CC BY 4.0, Eurostat, World Bank
- reference hiQ v LinkedIn (9th Cir. Apr 2022): scraping publicly accessible data likely doesn't violate CFAA — but hiQ still settled for $500K
- reference Meta v Bright Data (Jan 2024, N.D. Cal.): Facebook/Instagram terms don't bar logged-off scraping of public data
- reference Zillow: 110M-home "living database" built on Census/ACS + 3,000 county assessors + USPS + MLS feeds
- reference ATTOM Data: 500M+ real estate/loan transactions, 2,690+ counties, 20-step Enterprise Data Management Program
- reference Cherre: property knowledge graph powering management of $3.3T AUM globally
- reference FlightAware: FAA + 45-country ANSP feeds + 30,000+ user-hosted ADS-B receivers + Aireon — the crowdsourced moat
- reference The Climate Corporation (Bayer Crop Science since 2018): field-level overlay on NOAA + NWS + USGS + NRCS + NASA
- reference Local Logic (Montreal): 100B+ data points — "largest location dataset in real estate"
- reference HelloSafe: "Canada's leading insurance/financial comparison platform" — built on StatCan + OSFI data
- reference Carfax: from 10,000 records faxed in 1986 to 35B+ records across 151,000+ sources; part of S&P Global Mobility since 2022
- reference MotherDuck pricing 2026: Lite ($25/mo) removed; Business moved to $250/mo between Dec 2025 and Feb 2026
- reference Reference: open-data ingestion stack for a 1-3 person SMB operation (2026) — under $50/mo realistic
- reference Profound (Aug 2024-Jun 2025, 680M citations): only 11% domain overlap between ChatGPT and Perplexity; 13.7% between AI Overviews and AI Mode
- reference Reference: compliance-grade attribution checklist by open-data source
- reference Reference: underexploited Canadian open data by industry — highest-leverage starts for KW SMB clients
- rule RULE: Build Candid client data products on official open-data feeds — never on scraped sources
- rule RULE: Every Candid data product carries source attribution per the [[attribution-checklist-by-source]]. Mis-attribution terminates the licence.
- reference Research brief: Structured content as a competitive advantage (piece 2 of 15)
- reference Research brief: What makes a marketing site do something (piece on brochure vs platform)
- reference Research brief: Built to Last — why most SMB sites rebuild every 3-4 years (piece 5 of 15)
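The EIA entry in the list above notes that WPSR releases at 10:30 AM ET on Wednesdays. A poller could compute its next wake-up time as sketched below. This assumes the input is a naive datetime already expressed in Eastern Time, and it does not model any schedule shifts (for example around holidays); the function name is illustrative.

```python
from datetime import datetime, time, timedelta

WEDNESDAY = 2                 # datetime.weekday(): Monday is 0
RELEASE_TIME = time(10, 30)   # 10:30 AM ET, per the EIA entry above


def next_wpsr_release(now_et: datetime) -> datetime:
    """Next Wednesday 10:30 strictly after now_et (naive Eastern Time)."""
    days_ahead = (WEDNESDAY - now_et.weekday()) % 7
    candidate = datetime.combine(now_et.date() + timedelta(days=days_ahead), RELEASE_TIME)
    if candidate <= now_et:
        # already at or past this week's release; roll to next week
        candidate += timedelta(days=7)
    return candidate


upcoming = next_wpsr_release(datetime(2026, 5, 4, 9, 0))   # a Monday morning
```

Scheduling the pull just after the release, rather than polling all day, keeps a small pipeline within free-tier compute limits.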