{"id":1182,"slug":"open-data-as-competitive-moat","title":"Open data as competitive moat","kind":"reference","scope":"marketing-site","status":"current","audiences":["kevin","claude-code","smb-owner","candid-team"],"topics":["open-data","data-infrastructure"],"reference_body":"## Overview\n\nOpen data is the public, royalty-free, government-published baseline on which a large share of the modern data economy is built. Public records become competitive moats not because the data is rare — by definition it is not — but because the work between download and queryable is substantial, and very few SMBs do it. The canonical demonstrations include Carfax (1984 → 35 billion records, sold to S&P Global Mobility in 2022), Zillow (110 million homes on a backbone of Census, ACS, and ~3,000 county assessors), ATTOM (500 million transactions, 2,690+ counties, normalized through a 20-step Enterprise Data Management Program), Cherre (a property knowledge graph powering $3.3 trillion AUM globally), The Climate Corporation (NOAA + NWS + USGS + NRCS + NASA fused with field-level grower data, now a Bayer Crop Science subsidiary), and FlightAware (FAA feeds plus a worldwide network of over 30,000 user-hosted ADS-B receivers).\n\nThe pattern is consistent across verticals: the raw inputs are open to every competitor; the pipeline — canonical entity IDs, normalized schemas, time-series accumulation, provenance, attribution — is what compounds. 
This page consolidates the licensing landscape (OGL-Canada v2 perpetual royalty-free commercial use, Ontario Open Data Catalogue 2,948 datasets, Statistics Canada Open Licence, ECCC/MSC Data Server End-use Licence, EIA APIv2, CER pipeline data, NOAA/NWS public domain, plus UK OGL v3, OECD CC BY 4.0, Eurostat, and World Bank), the practical 2026 ingestion stack a 1-3 person SMB can run for under $25/month, an industry data map for five service verticals, a 5-stage data maturity curve, worked examples of operators that turned public data into product, and the post-2022 scraping legal landscape (hiQ v LinkedIn under the CFAA; Meta v Bright Data on logged-off scraping).\n\nThe companion [[research-brief-public-data-private-moat]] is the long-form research basis. The licensing primer is at [[ogl-canada-v2-perpetual-royalty-free-commercial]]. The depth mandate here is that every dataset name, licence term, dataset size, source URL, and confidence label survives verbatim.\n\n## Open data ingestion stack for SMB (2026)\n\nFor a 1-3 person team handling millions (not billions) of rows in 2026, a realistic stack runs $5-25 per month. 
The recommended layers:\n\n| Layer | Recommendation | Monthly cost at small scale |\n|---|---|---|\n| Storage | Backblaze B2 or Cloudflare R2 (Parquet files) | $1-10 |\n| Compute / query | DuckDB locally + MotherDuck free tier (10 GB, 10 CU-hours) | $0 |\n| Scheduling | GitHub Actions (2,000 free Linux min/mo private; unlimited free for public) OR cron on a $5 VPS | $0-5 |\n| Transformation | dbt-core (free) | $0 |\n| Orchestration (optional) | Dagster OSS or Prefect free tier | $0 |\n| Visualization | Observable Framework (free) / Datawrapper (free tier) / Metabase OSS | $0-10 |\n| Total realistic minimum | | $5-25/month |\n\nThe Python-only minimum viable pipeline for an agency just starting:\n\n- `requests` / `httpx` to hit StatCan WDS, EIA APIv2, MSC GeoMet\n- `polars` or `duckdb` for transformation\n- `parquet` files in object storage\n- Cron + `dbt run` against DuckDB nightly\n- Quarto or Observable for the public-facing layer\n\nWatch-outs for 2026: MotherDuck's Business tier moved from $100 to $250 per month ([[motherduck-pricing-changes-2026-business-tier]]). Operators scaling past the free tier should evaluate ClickHouse Cloud or self-hosted DuckDB before committing. For workloads exceeding 100 million rows, BigQuery's per-query pricing often wins; for small scale, DuckDB and MotherDuck win on simplicity. For Postgres-native teams, TimescaleDB on Hetzner ($10-20/month) handles time-series gracefully without learning new tools.\n\n## Modern data stack on a budget (2026)\n\nFor a Canadian service business at $1M-$10M revenue, 5-50 staff, currently living in QuickBooks plus spreadsheets plus a half-used CRM, the architecture pattern is layered:\n\n1. **System of record per data type.** Don't replace QuickBooks or the vertical SaaS; declare them authoritative for their domains.\n2. **Operational database** — managed Postgres (Supabase / Neon / Render / Crunchy Bridge): **$20-50/month**.\n3. **Lightweight ingestion** — dlt (OSS) or Airbyte/Fivetran (managed). 
Pull from QuickBooks, vertical SaaS, Stripe, Mailchimp, GA: **$0-$200/month**.\n4. **Transformations** — dbt-core (free) running locally or in GitHub Actions, or dbt Cloud Developer ($100/month).\n5. **Analytics engine** — DuckDB (free, in-process) or BigQuery (pay-per-query, often under $30/month at SMB scale).\n6. **Dashboards** — Evidence.dev (free, code-based, version-controlled), or Metabase (OSS if self-hosted; ~$85/month managed). Avoid Tableau and Power BI at this stage.\n7. **Customer-facing data tools (optional)** — thin internal-only Next.js or Retool app (~$10/user/month).\n\nRealistic cost: **C$100-C$500/month** in software, **20-40 hours** initial setup, **4-8 hours/month** ongoing maintenance. That is roughly the cost of one underused enterprise CRM seat.\n\nWhat this unlocks:\n\n- Single customer view across QuickBooks + vertical SaaS + email + payments\n- Cohort retention analysis the vertical SaaS cannot do\n- Marketing attribution joining ad spend to closed revenue\n- Daily-ops dashboards (Lovett Services-style — see [[research-brief-dataset-is-the-product]])\n- Clean export path for the next system, whatever it is\n- Better privacy posture: answer a PIPEDA access request or Law 25 portability request without opening five SaaS tools\n\nWhen NOT to build this: sub-$500K revenue; solo operator; owner planning to sell or retire within 18 months; pre-product-market-fit business model (the schema will change weekly and break everything); no one in the business — including agency partners — comfortable with SQL.\n\n## Underexploited Canadian open data, by industry\n\nFor Candid Creative's Kitchener-Waterloo SMB client base, the highest-leverage starting points are:\n\n1. **StatCan Building Permits** — approximately 2,400 municipalities, 95% of the population, monthly with a roughly 6-week lag. The March 2026 release was published May 19, 2026 (<https://www150.statcan.gc.ca/n1/daily-quotidien/260519/dq260519b-eng.htm>). 
Relevant to construction, real estate, lending, and insurance clients.\n2. **Ontario Open Data Catalogue** — 2,948 datasets ([[ontario-open-data-catalogue-2948-datasets]]). Relevant to local government, civic-tech, and B2G clients.\n3. **EIA APIv2 + CER pipeline data** — relevant to energy clients and manufacturers with energy exposure ([[eia-apiv2-free-petroleum-electricity]], [[cer-pipeline-throughput-ogl-canada]]).\n4. **MSC GeoMet** — relevant to insurance, agriculture, outdoor services, and construction scheduling ([[eccc-msc-open-data-end-use-licence]]).\n5. **StatCan Labour Force Survey** — released monthly at 7:00 AM ET. Relevant to recruiting, HR-tech, and professional services.\n\nOther industries with underexploited Canadian and U.S. open data:\n\n| Industry | Named dataset | Cadence |\n|---|---|---|\n| **Agriculture** | USDA NASS QuickStats | Updated each weekday |\n| | StatCan Field Crop Reporting Series (32-10-0359-01) | Monthly/seasonal |\n| **Construction** | StatCan Building Permits | Monthly, ~6-week lag |\n| | U.S. Census Building Permits Survey | Monthly |\n| **Fuel / Energy** | EIA APIv2 (Petroleum, Electricity, NatGas) | Weekly/monthly |\n| | Canada Energy Regulator — pipeline throughput | Monthly |\n| | NRCan Comprehensive Energy Use Database | Annual |\n| **Environmental services** | MSC Open Data (weather/climate/water) | Real-time via AMQP |\n| | EPA Envirofacts | Daily |\n| **Transportation** | StatCan Table 23-10-0308-01 (vehicle registrations) | Annual |\n| **Healthcare** | CIHI public reports | Quarterly/annual |\n| | CMS public use files (US) | Varies |\n| **Forestry** | National Forestry Database | Annual |\n\nThe no-scraping discipline is strategic, not just legal. Scraped pipelines break on every website redesign; OGL-Canada feeds do not. 
See [[rule-build-on-official-open-data-not-scraping]].\n\n## Industry data map: five service verticals\n\nEach vertical generates operational data, typically strands a subset, and could unlock specific decisions if the data were owned at Stage 3 or above.\n\n### Home services (HVAC, plumbing, electrical, landscaping)\n\n- **Generated:** service tickets, equipment serial/install dates, recurring maintenance cadences, customer property characteristics, technician notes, parts used, photos, seasonal demand.\n- **Typically stranded:** equipment registries in technician memory; weather-correlated demand patterns; \"this customer always calls in June\" patterns.\n- **Unlockable:** proactive replacement outreach (12-year-old units); weather-triggered staffing; dynamic emergency pricing; membership pricing reflecting actual visit cost.\n- **Dominant SaaS:** ServiceTitan, Jobber, Housecall Pro, FieldEdge.\n\n### Professional services (legal, accounting, consulting)\n\n- **Generated:** time entries, matter/engagement records, document templates, conflict-check data, client comms, billing realization, write-offs.\n- **Typically stranded:** which engagements were profitable and why; partner-level realization by client/matter type; template reuse efficiency.\n- **Unlockable:** refuse unprofitable engagement types; price fixed-fee work based on actual hours-by-matter-type history; identify payment-delay patterns. 
Powers the 64% flat-fee-billing trend in mid-sized firms ([[clio-legal-trends-2024-2025-ai-adoption-79-to-93]]).\n- **Dominant SaaS:** Clio (legal), Karbon (accounting), MyCase, PracticePanther.\n\n### Distribution (fuel, building supply, food service)\n\n- **Generated:** delivery schedules, tank/SKU registries, customer-specific contract terms, price-per-gallon by date, route data, will-call vs automatic, seasonal load curves.\n- **Typically stranded:** true per-customer margin (route cost rarely allocated); which customers cost more to serve than they pay; weather × tank-fill correlation; emergency-delivery frequency by customer.\n- **Unlockable:** reprice or fire unprofitable accounts; route optimization; predictive replenishment; targeted contract upsell.\n- **Dominant SaaS:** Cargas Energy, ADD Systems, P-Dispatch (fuel); ECI, DDI System (building supply).\n\n### Trades (roofing, construction, automotive)\n\n- **Generated:** job estimates, change orders, materials cost vs estimate, labor hours by job type, warranty claims, supplier pricing.\n- **Typically stranded:** bid-win-rate by estimator; over/under on materials by project type; warranty cost as true % of revenue.\n- **Unlockable:** estimator-specific feedback loops; supplier negotiation backed by purchase history; warranty-reserve accuracy.\n- **Dominant SaaS:** Procore, Buildertrend, Tekmetric (auto), Shopmonkey.\n\n### Healthcare-adjacent (clinics, vets, dental)\n\n- **Generated:** patient/animal records, treatment plans, recall reminders, prescription history, no-show patterns, insurance reimbursement timing.\n- **Typically stranded:** no-show predictability; recall effectiveness by channel; patient lifetime value by acquisition source.\n- **Unlockable:** predict no-show risk and double-book; optimize recall channel; identify referral sources producing stickiest patients.\n- **Dominant SaaS:** Dentrix, Open Dental, eClinicalWorks, ezyVet.\n- **PHIPA caveat for Ontario:** healthcare data is governed by 
PHIPA, not PIPEDA — a stricter consent regime.\n\n## Data maturity curve: five stages\n\nA 5-stage spine from \"data exists in the owner's head\" to \"data is the product\":\n\n| Stage | What it looks like | Business profile | What unlocks moving up |\n|---|---|---|---|\n| **1. Stranded** | Data in owner's head, paper files, text threads, unsorted Gmail. Nothing systematic. | Owner-operator, under $500K, ≤5 customers/week, no employees. | Forcing function: first hire, or owner can't remember a recurring customer's last service date. |\n| **2. Captured but unstructured** | QuickBooks holds financials. Google Sheet tracks customers. Calendly handles scheduling. Comms in Gmail. Nothing queryable across silos. | Most $500K-$3M service businesses — the **modal stage**. | Pain of doing month-end manually; a competitor wins a customer the business *had* but didn't follow up on. |\n| **3. Queryable in one place** | Real CRM or vertical SaaS is system of record. Data exportable to CSV and joinable. Reports still manual but possible. | $1M-$10M; 5-40 employees; one dominant operational pattern. | Owner starts asking \"what's our gross margin by customer segment?\" and can't answer in under a day. |\n| **4. Surfaced in daily decisions** | Dashboards exist. At-risk customer alerts fire. Pricing references history. Forecasting uses real seasonal curves. Team operates *from* the data. | $5M+; multi-location, multi-product, or multi-segment. | Investment in an analyst, fractional data person, or tightly-scoped internal tool. |\n| **5. Data as product** | Dataset is sold, licensed, or used as wedge into adjacent markets (Carfax, Zillow). Or it powers structurally cheaper unit economics. | Rare for SMBs. Achievable for industry-leading regional businesses, roll-ups, or unique data positions (e.g., the only fuel distributor with 5 years of weather-correlated tank-fill data in a region). | Strategic decision: is the data a defensible asset or just operational byproduct? 
|\n\nThe honest 2026 read is that the majority of service businesses sit at Stage 2, and most do not need to move past Stage 3. The argument is about recognizing when an operator should move up, not that every operator must. The Carfax 38-year arc (1984-2022, sold to S&P Global Mobility — [[carfax-1984-10000-records-fax-to-35-billion]]) is the canonical Stage 1 → Stage 5 trajectory. The structural decisions that get you there — canonical entity IDs, normalized schema, time-series accumulation, proper provenance — improve operational value at every stage along the way.\n\n## Canadian open-data sources\n\n### Ontario Open Data Catalogue\n\nThe Ontario Open Data Catalogue contains **2,948 datasets** published under the Open Government Licence — Ontario v1.0. Sources: <https://data.ontario.ca/> and <https://www.ontario.ca/page/open-government-licence-ontario>. Confidence: Verified. Required attribution: *\"Contains information licensed under the Open Government Licence – Ontario.\"* Canada has at least nine substantively different Open Government Licences in active use — federal, Ontario, BC, Alberta, plus municipal (Toronto, Guelph, Winnipeg, Niagara Region, York Region). Attribution and reuse terms differ subtly; the OpenStreetMap Foundation maintains a compatibility list. For Candid KW-region work, the practical stack is OGL-Canada + OGL-Ontario + occasional municipal feeds.\n\n### Statistics Canada Open Licence\n\nThe Statistics Canada Open Licence explicitly permits operators to *\"use, reproduce, publish, freely distribute, or sell value-added products.\"* Source: <https://www.statcan.gc.ca/en/terms-conditions/open-licence>. Confidence: Verified (primary). The required source line is `Source: Statistics Canada, name of product, reference date.` A truncated form (`Statistics Canada, reference year`) plus a linked reference list is acceptable per the StatCan FAQ. 
Key infrastructure: Statistics Canada operates a **Web Data Service (WDS) REST API** with daily updates at 8:30 AM Eastern. \"The Daily\" — StatCan's official release bulletin — publishes at 8:30 AM Eastern every working day; Labour Force Survey and CPI release at 7:00 AM. The Building Permits Survey covers approximately 2,400 municipalities (95% of population); the March 2026 release was published May 19, 2026. This is the highest-leverage Canadian open dataset for construction, real-estate, lending, and insurance use cases.\n\n### ECCC / MSC open data\n\nEnvironment and Climate Change Canada and the Meteorological Service of Canada (MSC) publish weather, climate, and water datasets via OGC-compliant APIs through MSC GeoMet — *\"anonymous and free of charge.\"* Sources: <https://eccc-msc.github.io/open-data/licence/readme_en/> and <https://eccc-msc.github.io/open-data/msc-geomet/readme_en/>. Confidence: Verified. Required attribution: *\"Contains information licenced under the Data Server End-use Licence of Environment and Climate Change Canada.\"* Note that third-party data inside MSC products may carry separate terms — check per-dataset metadata. Weather data drives insurance, agriculture, outdoor services, and construction scheduling. The MSC AMQP feed delivers near-real-time updates, buildable into alerting products. Pairs with the NOAA/NWS public-domain default for the U.S. side.\n\n### Canada Energy Regulator\n\nThe Canada Energy Regulator publishes pipeline throughput and capacity (Keystone, Trans Mountain, NGTL, Mainline, Foothills, Alliance, Westcoast, etc.), tolls, electricity exports/imports, commodity tracking, Market Snapshots, and Energy Futures projections — all under OGL-Canada. Source: <https://www.cer-rec.gc.ca/en/data-analysis/> and <https://open.canada.ca/data/en/dataset/dc343c43-a592-4a27-8ee7-c77df56afb34> (record last updated 2026-03-07). Confidence: Verified. Pairs with EIA APIv2 for U.S. 
equivalents — together they enable continental energy market intelligence for industrial, manufacturing, and distribution clients with energy exposure. Companion sources: Natural Resources Canada (NRCan) National Energy Use Database (oee.nrcan.gc.ca) provides sectoral energy market overviews; the Remote Communities Energy Database (atlas.gc.ca/rced-bdece/) offers CSV downloads.\n\nThe federal baseline licence is OGL-Canada v2.0 — perpetual, royalty-free, commercial use permitted. See [[ogl-canada-v2-perpetual-royalty-free-commercial]] for the licensing primer.\n\n## United States open-data sources\n\n### NOAA / NWS public-domain default\n\n*\"The information on National Weather Service Web servers and Web sites is in the public domain.\"* Source: <https://www.weather.gov/disclaimer>. Confidence: Verified. The important \"as is\" clause: NWS data is provided as-is, with all warranties of merchantability and fitness-for-purpose disclaimed. **Reproduce this disclaimer in any derivative product.** Legal frame: U.S. federal government works are public domain under 17 USC §105 — but this applies only to federal works; state and local government works carry their own (often more restrictive) terms. The NOAA name and visual identifier remain trademarked; operators cannot imply endorsement. NOAA Coast Survey (2025+) adopted Creative Commons licensing for external-source data — CC0 default, CC-BY 4.0 with attribution. The federal public-domain default still holds for NOAA-authored works.\n\n### EIA APIv2\n\nThe U.S. Energy Information Administration's **APIv2** is free with a registered key. Coverage: petroleum, electricity, natural gas, coal, nuclear outages, SEDS (State Energy Data System), STEO (Short-Term Energy Outlook), AEO (Annual Energy Outlook), IEO (International Energy Outlook). Predictable release schedule: the **Weekly Petroleum Status Report (WPSR)** tables in CSV/XLS release at **10:30 AM ET Wednesdays**; full PDF/HTML after 1:00 PM. 
Sources: <https://www.eia.gov/opendata/> and <https://www.eia.gov/petroleum/supply/weekly/schedule.php>. Confidence: Verified. EIA terms: *\"EIA data is provided free of charge and should be used in compliance with our Copyrights and Reuse Policy.\"* No required attribution string, but best-practice citation is agency + dataset + retrieval date. The API has rate throttling. Predictable timing equals predictable product. EIA WPSR at 10:30 ET, StatCan Daily at 8:30 ET, MSC near-real-time — these are the cadence anchors for any alerting or dashboard product.\n\n## International open-data licences (2026)\n\nThe international open-data licensing landscape, 2026:\n\n**UK Open Government Licence v3.0:**\n\n- Attribution: *\"Contains public sector information licensed under the Open Government Licence v3.0.\"*\n- <https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/>\n\n**OECD:**\n\n- *\"You can extract from, download, copy, adapt, print, distribute, share and embed Data.\"*\n- **CC BY 4.0** adopted as default for most written content as of **1 July 2024**.\n- <https://www.oecd.org/en/about/oecd-open-by-default-policy.html>\n\n**Eurostat:**\n\n- *\"Reuse is authorised provided the source is acknowledged.\"*\n- Some maps/images carry separate terms per Commission Decision 2011/833/EU.\n- <https://ec.europa.eu/eurostat/help/copyright-notice>\n\n**World Bank Open Data:**\n\n- Default: **CC BY 4.0**.\n- Prescribed attribution: *\"The World Bank: Dataset name: Data source (if known).\"*\n- Microdata and third-party-sourced indicators often have stricter terms.\n- <https://data.worldbank.org/summary-terms-of-use>\n\nThe pattern: CC BY 4.0 is increasingly the international default. 
Attribution is cheap; mis-attribution can terminate the licence.\n\n## Attribution checklist by source\n\nThe compliance-grade attribution matrix — required strings plus limits plus gotchas per source:\n\n| Source | Required attribution | Limits / gotchas |\n|---|---|---|\n| **OGL-Canada v2.0** | \"Contains information licensed under the Open Government Licence – Canada.\" Link the licence URL. | Cannot use government crests or logos; cannot imply endorsement; personal info + trademarks excluded. |\n| **Statistics Canada Open Licence** | \"Source: Statistics Canada, [name of product], [reference date].\" Truncated form acceptable with linked reference list. | Cannot suggest StatCan endorsement or that the operator has private info about identifiable persons or businesses. Postal products are NOT under this licence. |\n| **ECCC Data Server End-use Licence v2.1** | \"Contains information licenced under the Data Server End-use Licence of Environment and Climate Change Canada.\" | Third-party data inside MSC products may carry separate terms — check per-dataset metadata. |\n| **OGL-Ontario v1.0** | \"Contains information licensed under the Open Government Licence – Ontario.\" | Same exclusions as federal. |\n| **U.S. federal works (17 USC §105)** | No legal requirement; best practice: agency + dataset + retrieval date. | Applies to FEDERAL works only — state and local government has separate terms. |\n| **NOAA / NWS** | Acknowledge NOAA; repeat \"as is\" disclaimer in derivatives. | NOAA name and visual identifier trademarked; cannot imply endorsement. |\n| **EIA** | Cite EIA + dataset + retrieval date. | Free API requires registered key; throttling enforced. |\n| **UK OGL v3.0** | \"Contains public sector information licensed under the Open Government Licence v3.0.\" | Excludes trademarks, third-party rights, personal data. |\n| **OECD** | Source: OECD + dataset + retrieval date. CC BY 4.0 default since July 1, 2024. 
| Some legacy or non-OECD-authored content remains under different terms. |\n| **Eurostat** | \"Source: Eurostat.\" | Some maps and images carry separate terms per Commission Decision 2011/833/EU. |\n| **World Bank Open Data** | \"The World Bank: [Dataset name]: [Data source if known].\" CC BY 4.0 default. | Microdata + third-party-sourced indicators often have stricter terms. |\n\nCommon compliance failures to avoid:\n\n- Using government logos or wordmarks on dashboards (uniformly prohibited).\n- Implying endorsement by visual proximity (\"Our partners: StatCan, EIA\").\n- Republishing data under a private copyright notice without acknowledging the public source.\n- Linking individual people to aggregate data (e.g., \"this neighborhood's average household income\" plus a named address).\n\n## Worked examples: operators who built products on open data\n\n### ATTOM Data — 500 million transactions, 2,690+ counties\n\nATTOM warehouses *\"more than 500 million real estate and loan transactions in over 2,690 counties,\"* normalized via a **20-step Enterprise Data Management Program**. Coverage: 99% of the U.S. population across 3,000+ counties; **70 billion+ rows; 9,000+ discrete attributes**. Sources: <https://www.attomdata.com/data/transactions-mortgage-data/recorder-data/> and <https://www.attomdata.com/data/>. Confidence: Verified. ATTOM customers — Roofstock, Offerpad, HomeFinder — build downstream products on the normalized public-records base. The persistent ATTOM ID across datasets is the entity-resolution moat — every property gets one canonical identifier whether the source spelled the address differently in two counties. The ATTOM 20-step EDMP is the canonical demonstration that the moat is the pipeline, not the records. Every business in the U.S. 
has equal legal access to the same county assessor data; ATTOM's value is what happens between download and queryable.\n\n### Carfax — 1984 to 35 billion records\n\nCARFAX was founded in Columbia, Missouri in 1984 by Ewin Barnett III and Robert Daniel Clark. Its first dealer-market vehicle history report (1986) was built on a database of exactly **10,000 records distributed via fax**. Today: **35+ billion records from 151,000+ sources** including motor-vehicle departments for all 50 U.S. states and 10 Canadian provinces. Carfax was sold to **S&P Global Mobility in 2022**. Source: <https://en.wikipedia.org/wiki/Carfax,_Inc.>. Confidence: Verified. The canonical \"operational byproduct becomes the product\" story. Carfax's dealer (B2B) business eventually surpassed its consumer (B2C) business — the database that started as a sales-aid for one industry became the asset. B2B flip case study: 3Pillar Global — <https://www.3pillarglobal.com/insights/case-studies/carfax-b2b-solutions/>. The 1984 → 2022 arc is 38 years. Most service businesses will not live long enough to mature a database to product status — but the structural decisions that make it possible (canonical entity IDs, normalized schema, time-series accumulation, proper provenance) are the same architectural decisions that improve operational value at every stage along the way. The dataset moat is a side effect of doing the operational work properly.\n\n### Zillow — built on the administrative-data backbone\n\nZillow's database is *\"built on a backbone of administrative data\"* — Census, ACS, approximately 3,000 county tax assessments, and sales records. The Zestimate model sits on top of public records, not beside them. Sources: <https://apps.bea.gov/fesac/meetings/2016-06-10/Rao-Presentation-The-Zillow-Experience.pdf> and <https://www.zillow.com/tech/public-data-challenges/>. Confidence: Verified. 
Practitioner reference from Zillow's own engineering blog:\n\n- **Address Validation Service** runs assessor records against a GIS table of approximately 500,000 city/state/zip/county permutations to catch upstream errors before they reach the front end.\n- **FillRate** per field per county tracks data completeness.\n- **Transaction Latency** = Median (Transaction Received Date − Transaction Recorded Date), the median delay between a transaction being recorded and being received, and the cleanest \"speed-of-data\" metric in public real-estate engineering.\n\nThe pattern: Zillow is fundamentally an open-data company. The MLS feed is value-added; the public-records cleaning is the moat. Note: Zillow's ZTRAX dataset was discontinued in 2023, but the technical writing remains the best public account of how a major operator handles public-records cleaning.\n\n### The Climate Corporation — NOAA + NWS + USGS + NRCS + NASA\n\nThe Climate Corporation uses NOAA, NWS, USGS, NRCS, and NASA data as the foundation for field-level weather and soil overlay sold to growers. Source: U.S. CIO open-government data report — <https://www.cio.gov/assets/resources/sofit/02.03.sofit.open.govt.open.data.pdf>. Confidence: Verified. Corporate history: founded as a standalone company; acquired by Monsanto in 2013 (approximately $1B); now a **Bayer Crop Science subsidiary** following Bayer's acquisition of Monsanto in June 2018. (Often miscited as \"Bayer Climate\" — the brand is The Climate Corporation.) The pattern: federal weather + soil + agricultural data + field-level user data + ML models = a product sold at field-level granularity.\n\n### FlightAware — 30,000+ crowdsourced ADS-B receivers\n\nFlightAware fuses FAA feeds + 45-country ANSP feeds + *\"a worldwide network of over 30,000 terrestrial ADS-B receivers\"* + Aireon space-based ADS-B. Source: <https://www.flightaware.com/about/datasources/>. Confidence: Verified. The canonical \"free data + scale\" play. 
FlightAware started as a single FAA feed; the moat became the user-hosted receiver network — a thing FAA-data alone cannot replicate. For Candid SMB clients in fields where customers or community could host instrumentation (weather stations, air-quality sensors, traffic counts), the FlightAware pattern is the model. **ADS-B Exchange** (24,000+ active receivers, ultra-low-latency positions updated every 500 ms — <https://www.adsbexchange.com/>) demonstrates a transparent alternative to FlightAware's receiver network. The same data, different community contracts.\n\n### Local Logic — 100 billion data points (Canadian)\n\nLocal Logic, Montreal-based, claims *\"100B+ data points\"* forming *\"the largest location dataset in real estate.\"* Source: <https://locallogic.co/our-data/>. Confidence: Verified. Canadian relevance: data sources are largely public — StatCan census, municipal geospatial, transit, schools — plus normalized scoring layers. Customers include MLS portals, developers, and real-estate platforms. Demonstrates the Canadian variant of the ATTOM pattern: public records + entity resolution + scoring methodology = a product.\n\n### Cherre — property knowledge graph powering $3.3T AUM\n\nCherre *\"powers the management of $3.3 trillion AUM globally with its proprietary Universal Data Model, Semantic Data Layer, and Knowledge Graph.\"* Source: <https://cherre.com/>. Confidence: Verified. Honest caveat: a third-party startup profile claims Cherre integrates \"200+ third-party datasets\"; Cherre's own platform copy references \"50-plus additional data sources\" for AI entity resolution specifically. Use the verified $3.3T AUM figure as the safer headline. Verify the 200+ claim against original Cherre material before publication. The \"data union\" pattern: Cherre is the clearest articulation — an integration layer that fuses public records, vendor data, and internal data into one property graph. 
The graph is not the raw data (anyone can buy it); the graph is the **entity resolution** across sources. For real-estate-adjacent Candid clients (mortgage brokers, contractors, property managers), this is the canonical \"what the leverage layer looks like\" reference.\n\n### HelloSafe — StatCan + OSFI barometers\n\nHelloSafe positions itself as *\"Canada's leading platform for comparing insurance and personal financial products,\"* citing **StatCan and OSFI** in its market barometers. Source: <https://hellosafe.ca/en/>. Confidence: Verified. The \"public data → marketing dashboard → backlinks/AI citations → conversions\" pattern. HelloSafe's barometers are not the product; the comparison platform is. But the StatCan/OSFI-fed dashboards are the trust signal, AI-citation surface, and backlink magnet that markets the product. For Candid SMB clients in insurance, finance, and lending verticals, the playbook is: publish a normalized open-data dashboard on a topic customers care about, get cited as a source, capture demand at the bottom of the funnel.\n\n## Scraping legal landscape (post-2022)\n\n### hiQ v LinkedIn (9th Cir., April 2022)\n\nThe Ninth Circuit (April 18, 2022) reaffirmed that scraping publicly accessible data **likely does not violate the CFAA's \"without authorization\" provision**. Source: <https://www.jenner.com/en/news-insights/publications/client-alert-data-scraping-in-hiq-v-linkedin-the-ninth-circuit-reaffirms-narrow-interpretation-of-cfaa>. Confidence: Verified. The cautionary tale: in December 2022, hiQ stipulated to a **$500,000 judgment and permanent injunction** — found liable for **breach of LinkedIn's User Agreement** (logged-in scraping plus use of fake accounts). Source: <https://www.privacyworld.blog/2022/12/linkedins-data-scraping-battle-with-hiq-labs-ends-with-proposed-judgment/>. 
The bifurcation:\n\n- **Logged-off scraping of public data** — generally permissible under the CFAA.\n- **Logged-in scraping, or any access that involves accepting the ToS** — contract-based liability persists.\n\nFor Candid use: open-data feeds (StatCan, EIA, MSC) sidestep both the CFAA question and the contract trap. **OGL-Canada is \"perpetual\"; ToS can change overnight.**\n\n### Meta v Bright Data (N.D. Cal., January 2024)\n\nJudge Edward Chen (N.D. Cal., January 23, 2024): *\"The Facebook and Instagram Terms do not bar logged-off scraping of public data.\"* Source: <https://www.fbm.com/publications/major-decision-affects-law-of-scraping-and-online-data-collection-meta-platforms-v-bright-data/>. Then: February 26, 2024 — **Meta dropped the remaining tortious-interference claim and waived its right to appeal** the Bright Data summary judgment. <https://techcrunch.com/2024/02/26/meta-drops-lawsuit-against-web-scraping-firm-bright-data-that-sold-millions-of-instagram-records/>. Confidence: Verified. Importance: the ruling extends the hiQ logic from the CFAA question to breach of contract, holding that Meta's terms did not reach logged-off access to public data. But: this is a district-court (N.D. Cal.) ruling that no circuit court has reviewed, and because Meta waived its appeal none will; treat it as persuasive rather than binding authority and consult counsel, especially outside California.\n\n## Operating rules\n\n### Build on official open-data feeds, not scraped sources\n\nWhen building data products, public dashboards, or \"free tools\" for Candid Creative clients, always use **official open-data feeds with explicit licences** — never scraped sources, even when the scraping would be legally defensible. Why:\n\n- **Durability:** OGL-Canada is \"perpetual\" — see [[ogl-canada-v2-perpetual-royalty-free-commercial]]. Scraped pipelines break on every website redesign.\n- **Legal certainty:** hiQ v LinkedIn and Meta v Bright Data protect logged-off scraping (under the CFAA and under Meta's terms, respectively), but hiQ still stipulated to a $500,000 judgment on contract grounds. 
Open-data feeds have no equivalent contract trap.\n- **Operational hygiene:** A predictable release cadence (StatCan 8:30 ET daily, EIA WPSR 10:30 ET Wednesdays, MSC near-real-time) lets you build alerts and automations a scraper cannot reliably support.\n- **AI citation strategy:** Per Profound's 680M-citation study ([[profound-680m-citations-perplexity-citation-behavior]]), Perplexity favors *\"named sources with verifiable methodology.\"* \"We pulled this from the official StatCan WDS API\" earns citations a scraper never does.\n\nHow to apply:\n\n- Engagement scoping: when a client asks for a public dashboard, first map the underlying data to one or more OGL-Canada / Statistics Canada / ECCC MSC / EIA / NOAA sources.\n- If only scraping would work: name the trade explicitly in the proposal (legal + maintenance + AI-citation costs).\n- The attribution checklist is non-negotiable on every dashboard.\n- The infrastructure stack is the default — total realistic cost $5-25/month at the low end.\n\n### Attribution discipline on every data product\n\nEvery Candid Creative-built data product (public dashboard, calculator, downloadable report, embedded widget) **carries source attribution per the licence terms of the data it uses**. Missing or wrong attribution **automatically terminates the licence**. Why:\n\n- OGL-Canada, OGL-Ontario, OECD CC BY 4.0, Eurostat, World Bank, ECCC MSC End-use Licence, UK OGL v3 — every major open-data licence requires attribution; failure to comply terminates the licence.\n- Attribution is the trust signal AI engines look for — see [[profound-680m-citations-perplexity-citation-behavior]]. 
A clean methodology page with named sources and verifiable retrieval dates is exactly what gets cited.\n- Mis-attribution risks: (a) accusation of implying government endorsement (explicitly prohibited); (b) loss of the licence; (c) reputational damage to the client.\n\nHow to apply:\n\n- Every page: a \"Sources\" section with the required attribution string per the attribution checklist above.\n- Every chart: source line beneath the chart in the format the licence demands (StatCan format is the strictest reference).\n- Every downloaded CSV: includes a `_SOURCES.txt` file with the attribution strings + retrieval dates.\n- **Never:** use government crests, logos, or visual identifiers; never suggest endorsement; never republish under a Candid-only copyright notice.\n- For mixed-licence dashboards (e.g., StatCan + EIA + MSC on one page), attribute each source separately under the chart that uses it.\n\n## Sources and confidence\n\nAll atomic claims absorbed into this page were originally tagged with confidence labels in the source entries. Summary:\n\n- **OGL-Canada v2.0 — perpetual, royalty-free, commercial use permitted.** Verified. See [[ogl-canada-v2-perpetual-royalty-free-commercial]].\n- **Ontario Open Data Catalogue — 2,948 datasets under OGL-Ontario v1.0.** Verified. <https://data.ontario.ca/>; <https://www.ontario.ca/page/open-government-licence-ontario>.\n- **Statistics Canada Open Licence — permits sale of value-added products; WDS REST API daily 8:30 AM ET.** Verified (primary). <https://www.statcan.gc.ca/en/terms-conditions/open-licence>.\n- **ECCC / MSC GeoMet — anonymous, free; Data Server End-use Licence v2.1.** Verified. <https://eccc-msc.github.io/open-data/licence/readme_en/>.\n- **NOAA / NWS — public domain under 17 USC §105; \"as is\" disclaimer required in derivatives.** Verified. <https://www.weather.gov/disclaimer>.\n- **EIA APIv2 — free with registered key; WPSR 10:30 AM ET Wednesdays.** Verified. 
<https://www.eia.gov/opendata/>; <https://www.eia.gov/petroleum/supply/weekly/schedule.php>.\n- **CER pipeline data — OGL-Canada; record last updated 2026-03-07.** Verified. <https://www.cer-rec.gc.ca/en/data-analysis/>; <https://open.canada.ca/data/en/dataset/dc343c43-a592-4a27-8ee7-c77df56afb34>.\n- **UK OGL v3.0; OECD CC BY 4.0 (default 1 July 2024); Eurostat; World Bank CC BY 4.0.** Verified. <https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/>; <https://www.oecd.org/en/about/oecd-open-by-default-policy.html>; <https://ec.europa.eu/eurostat/help/copyright-notice>; <https://data.worldbank.org/summary-terms-of-use>.\n- **ATTOM — 500M+ transactions, 2,690+ counties, 70B+ rows, 9,000+ attributes, 20-step EDMP.** Verified. <https://www.attomdata.com/data/transactions-mortgage-data/recorder-data/>; <https://www.attomdata.com/data/>.\n- **Carfax — founded 1984 Columbia MO; 1986 first report on 10,000 fax records; 35B+ records, 151,000+ sources; sold to S&P Global Mobility 2022.** Verified. <https://en.wikipedia.org/wiki/Carfax,_Inc.>; <https://www.3pillarglobal.com/insights/case-studies/carfax-b2b-solutions/>.\n- **Zillow — backbone of administrative data: Census, ACS, ~3,000 county assessors; Address Validation Service against ~500,000 permutations; ZTRAX discontinued 2023.** Verified. <https://apps.bea.gov/fesac/meetings/2016-06-10/Rao-Presentation-The-Zillow-Experience.pdf>; <https://www.zillow.com/tech/public-data-challenges/>.\n- **The Climate Corporation — NOAA + NWS + USGS + NRCS + NASA; Bayer Crop Science subsidiary since June 2018.** Verified. <https://www.cio.gov/assets/resources/sofit/02.03.sofit.open.govt.open.data.pdf>.\n- **FlightAware — FAA + 45-country ANSP + 30,000+ ADS-B receivers + Aireon; ADS-B Exchange 24,000+ receivers, 500 ms updates.** Verified. 
<https://www.flightaware.com/about/datasources/>; <https://www.adsbexchange.com/>.\n- **Local Logic — 100B+ data points, \"largest location dataset in real estate,\" Montreal.** Verified. <https://locallogic.co/our-data/>.\n- **Cherre — $3.3T AUM via Universal Data Model + Semantic Data Layer + Knowledge Graph; \"50-plus additional data sources\" (own copy) vs \"200+\" (third-party startup profile — verify before publication).** Verified for $3.3T AUM; caveat on the 200+ figure. <https://cherre.com/>.\n- **HelloSafe — Canada's leading insurance/financial comparison platform; cites StatCan + OSFI.** Verified. <https://hellosafe.ca/en/>.\n- **hiQ v LinkedIn (9th Cir., April 18, 2022) — scraping public data likely does not violate CFAA; hiQ stipulated $500K judgment December 2022 on contract grounds.** Verified. <https://www.jenner.com/en/news-insights/publications/client-alert-data-scraping-in-hiq-v-linkedin-the-ninth-circuit-reaffirms-narrow-interpretation-of-cfaa>; <https://www.privacyworld.blog/2022/12/linkedins-data-scraping-battle-with-hiq-labs-ends-with-proposed-judgment/>.\n- **Meta v Bright Data (N.D. Cal., January 23, 2024) — Facebook/Instagram terms don't bar logged-off scraping; Meta dropped remaining claim and waived appeal February 26, 2024.** Verified; district-court (N.D. Cal.) ruling, never reviewed by a circuit court because Meta waived its appeal. <https://www.fbm.com/publications/major-decision-affects-law-of-scraping-and-online-data-collection-meta-platforms-v-bright-data/>; <https://techcrunch.com/2024/02/26/meta-drops-lawsuit-against-web-scraping-firm-bright-data-that-sold-millions-of-instagram-records/>.\n- **StatCan Building Permits — ~2,400 municipalities (95% population), monthly with ~6-week lag; March 2026 release published May 19, 2026.** Verified. <https://www150.statcan.gc.ca/n1/daily-quotidien/260519/dq260519b-eng.htm>.\n\nConfidence labels survive verbatim from the original atomic entries. 
Where third-party claims are used (e.g., the Cherre \"200+ datasets\" figure), the caveat is preserved. The full long-form research basis is [[research-brief-public-data-private-moat]].","rationale_body":"Consolidated topic page absorbing 25 atomic source entries per KB-CONSOLIDATION-PLAN.md (2026-06-11).","metadata":{"kb_role":"topic","word_count":5345,"last_updated":"2026-06-11","absorbed_count":25},"links":{"outgoing":[],"incoming":[]},"created_at":"2026-06-11T13:50:19.508Z","updated_at":"2026-06-11T13:50:19.508Z"}