Reference: open-data ingestion stack for a 1-3 person SMB operation (2026) — under $50/mo realistic
Created 2026-05-22
Recommended stack for a 1-3 person team handling millions (not billions) of rows in 2026:
| Layer | Recommendation | Monthly cost @ small scale |
|---|---|---|
| Storage | Backblaze B2 or Cloudflare R2 (Parquet files) | $1-10 |
| Compute / query | DuckDB locally + MotherDuck free tier (10GB, 10 CU-hours) | $0 |
| Scheduling | GitHub Actions (2,000 free Linux min/mo private; unlimited free for public) OR cron on a $5 VPS | $0-5 |
| Transformation | dbt-core (free) | $0 |
| Orchestration (optional) | Dagster OSS or Prefect free tier | $0 |
| Visualization | Observable Framework (free) / Datawrapper (free tier) / Metabase OSS | $0-10 |
| Total realistic minimum | | $5-25/month |
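As a rough sanity check on the Scheduling row, the sketch below estimates how many GitHub Actions minutes a scheduled job consumes per month against the 2,000-minute private-repo allowance. The function names and the 400-second job duration are hypothetical. Rounding each run up to a whole minute is an assumption about how usage is billed, so verify it against current GitHub billing documentation.

```python
import math

FREE_PRIVATE_MINUTES = 2_000  # free Linux minutes/month for private repos, per the table


def monthly_minutes(job_seconds: float, runs_per_day: int = 1, days: int = 31) -> int:
    """Estimate billed minutes per month for one scheduled job.

    Assumption: each run is rounded up to the next whole minute.
    """
    billed_per_run = math.ceil(job_seconds / 60)
    return billed_per_run * runs_per_day * days


def fits_free_tier(job_seconds: float, runs_per_day: int = 1, days: int = 31) -> bool:
    """Return True if the estimated usage stays within the free allowance."""
    return monthly_minutes(job_seconds, runs_per_day, days) <= FREE_PRIVATE_MINUTES
```

Under these assumptions a nightly job of about 400 seconds uses a small fraction of the allowance, while running the same job hourly would exceed it, which is where the $5 VPS with cron becomes the cheaper option.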
Python-only minimum viable pipeline (for an agency just starting):
- `requests`/`httpx` to hit StatCan WDS, EIA APIv2, MSC GeoMet
- `polars` or `duckdb` for transformation
- `parquet` files in object storage
- Cron + `dbt run` against DuckDB nightly
- Quarto or Observable for the public-facing layer
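The transform-and-store steps in the list above could look roughly like the following sketch. It is illustrative, not a definitive implementation. The payload shape (`status`, `object`, `vectorId`, `vectorDataPoint`, `refPer`, `value`) is modeled on a StatCan WDS-style vector response and should be treated as an assumption to check against the API documentation. The helper names, the `obs` table, and the sample values are hypothetical. Fetching with `requests` or `httpx` is left out so the sketch runs offline.

```python
from typing import Any

Row = tuple[int, str, float]  # (vector_id, ref_period, value)


def flatten_vector_payload(payload: list[dict[str, Any]]) -> list[Row]:
    """Flatten an assumed WDS-style payload into (vector_id, ref_period, value) rows."""
    rows: list[Row] = []
    for item in payload:
        if item.get("status") != "SUCCESS":
            continue  # skip failed vectors rather than abort the nightly run
        obj = item["object"]
        for point in obj.get("vectorDataPoint", []):
            if point.get("value") is None:
                continue  # suppressed or missing observation
            rows.append(
                (int(obj["vectorId"]), str(point["refPer"]), float(point["value"]))
            )
    return rows


def write_parquet(rows: list[Row], path: str) -> None:
    """Load rows into an in-memory DuckDB table and export them as Parquet."""
    import duckdb  # third-party; imported here so flattening works without it

    con = duckdb.connect()  # in-memory database
    con.execute(
        "CREATE TABLE obs (vector_id BIGINT, ref_period VARCHAR, value DOUBLE)"
    )
    if rows:
        con.executemany("INSERT INTO obs VALUES (?, ?, ?)", rows)
    safe_path = path.replace("'", "''")  # escape quotes in the file name
    con.execute(
        "COPY (SELECT vector_id, CAST(ref_period AS DATE) AS ref_period, value "
        f"FROM obs) TO '{safe_path}' (FORMAT PARQUET)"
    )
    con.close()


# Hypothetical sample data for illustration only.
SAMPLE_PAYLOAD = [
    {
        "status": "SUCCESS",
        "object": {
            "vectorId": 1,
            "vectorDataPoint": [
                {"refPer": "2024-01-01", "value": 10.5},
                {"refPer": "2024-02-01", "value": 11},
                {"refPer": "2024-03-01", "value": None},
            ],
        },
    },
    {"status": "FAILED", "object": {}},
]
```

A nightly job would call something like `write_parquet(flatten_vector_payload(payload), "obs.parquet")` and then upload the file to object storage. Keeping the flattening step as a pure function makes it easy to unit-test without network access or DuckDB installed.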
Watch-outs (2026):
- MotherDuck's 2026 pricing removed the Lite tier ($25/mo) and moved the Business tier from $100 to $250/month (the Business change landed between Dec 2025 and Feb 2026). If scaling past the free tier, evaluate ClickHouse Cloud or self-hosted DuckDB before committing.
- For >100M-row workloads, BigQuery per-query pricing often wins; for small scale, DuckDB/MotherDuck wins on simplicity.
- For Postgres-native teams, TimescaleDB on Hetzner ($10-20/month) handles time-series gracefully without learning new tools.
Referenced by (3)
- rule RULE: Build Candid client data products on official open-data feeds — never on scraped sources depends-on
- reference Research brief: Public data as a private moat — building proprietary intelligence from government open data (piece 11 of 15) relates-to
- reference Reference: minimum viable data stack for a $1M-$10M Canadian service business (2026, C$100-C$500/month) relates-to