From 56dde9cd88d636bf68d35df3b81faabb450a5e76 Mon Sep 17 00:00:00 2001
From: Richie <richie@tensordesign.com.au>
Date: Fri, 24 Apr 2026 10:29:40 +1000
Subject: [PATCH] Add ONBOARDING.md for handoff context

---
 ONBOARDING.md | 125 ++++++++++++++++++++++++++++++++++++++++++++++++++
 README.md     |  12 +++--
 2 files changed, 132 insertions(+), 5 deletions(-)
 create mode 100644 ONBOARDING.md

diff --git a/ONBOARDING.md b/ONBOARDING.md
new file mode 100644
index 0000000..f8f42b1
--- /dev/null
+++ b/ONBOARDING.md
@@ -0,0 +1,125 @@
+# Onboarding
+
+Welcome — this doc is the human-facing context. See [README.md](README.md) for
+the repo tour and [n8n/PROCESS.md](n8n/PROCESS.md) for the authoritative
+end-to-end flow.
+
+## What this project is for
+
+Funeral Arranger is a consumer platform that helps Australian families compare
+funeral directors and arrange services online. To be useful on day one, it
+needs broad coverage of providers — not just the ones who've signed up.
+
+This pipeline solves that. It populates the platform with every funeral
+director we can find in Australia, pulling from public registers and the
+providers' own websites. All auto-discovered providers stay **hidden and
+unverified** until an admin manually reviews them; signed-up partners
+(`verified = 1`) are authoritative and the pipeline never touches them.
+
+The goal is a long tail of "listed" and "estimated" providers that motivates
+real sign-ups — a provider who sees their competitor with full pricing on our
+site has an incentive to claim and upgrade their own listing.
+
+## Your first hour
+
+1. Read [n8n/PROCESS.md](n8n/PROCESS.md) end to end. It's the single most
+   useful doc in the repo.
+2. Poke at the included dev database:
+   ```bash
+   sqlite3 database/providers.db
+   .tables
+   SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier;
+   SELECT title, website, listing_tier FROM funeral_brand WHERE listing_tier='priced' LIMIT 10;
+   SELECT p.title, p.funeral_type, COUNT(pi.id) AS line_items
+     FROM package p LEFT JOIN package_inclusion pi ON pi.package_id = p.id
+     GROUP BY p.id ORDER BY line_items DESC LIMIT 10;
+   ```
+3. Skim `crawlers/PIPELINE.md` for module-level detail.
+4. Open one of the n8n workflow JSONs (e.g.
+   `n8n/workflows/3_daily_enrichment.json`) to see the graph shape. You don't
+   need n8n running to read them.
+
+## Ground rules
+
+- **Never write to `funeral_brand.verified` or `funeral_brand.hidden`.**
+  Flipping those flags belongs to the admin review flow, which is a separate
+  frontend project. If something in the pipeline needs to signal "this is
+  ready to review", use `enrichment_status = 'complete'` and a good
+  `listing_tier`, not these columns.
+- **Don't use Gathered Here as a source of truth.** It's a competitor.
+  `crawlers/crawl_gathered_here.py` is historical tooling kept for reference —
+  it's not part of the active workflows. Enrichment must come from the
+  provider's own website or regulatory disclosure PDFs.
+- **`listing_tier` is derived, not authored.** `compute_tiers.py` recomputes
+  it after every enrichment pass from package/inclusion data. Don't set it
+  manually.
+- **The pipeline owns only a subset of columns.** The mapping is in
+  [n8n/PROCESS.md](n8n/PROCESS.md) under "Columns the pipeline writes vs
+  leaves alone". Branding, `funeral_home_id`, and admin flags are all
+  out of scope.
+
+## API keys and costs
+
+You will need your own keys — none are included.
+
+| Service | Required? | Cost | Where |
+|---|---|---|---|
+| Serper.dev | Yes, for website discovery | 2,500 free queries/mo, then $1 per 1K queries | https://serper.dev |
+| ABR (ABN lookup) | Optional | Free (register for a GUID) | https://abr.business.gov.au/Tools/WebServices |
+| Anthropic (Claude Haiku) | Only for AI pricing extraction | ~$2 for a full 1,463-provider pass | https://console.anthropic.com |
+
+Without Anthropic, Phase A of Workflow 3 still runs — pricing page text gets
+saved into `source_record` for manual inspection. You just don't get the
+structured `package` / `package_inclusion` output.
+
+```bash
+cp crawlers/config.example.json crawlers/config.json
+# edit crawlers/config.json
+```
+
+## What's in play
+
+**Working and tested:**
+- Three source crawlers (VIC Register, Funerals Australia, NFDA)
+- Cross-source dedup (fuzzy name + postcode + ABN)
+- Website discovery (Serper + DDG fallback + URL guessing)
+- ABN validation
+- Pricing page discovery and text extraction
+- Four n8n workflow JSONs
+
+**Open work (any of these is a reasonable pickup):**
+- Google Places integration for location rating / `place_id` / richer address
+- Playwright-based enrichment for JS-rendered sites (enrichment currently fails
+  on the ~37% of sites that server-side render nothing useful)
+- Admin review UI — needs design input, not just code
+- Stale-package cleanup in the monthly refresh workflow (new packages get
+  inserted alongside old; no tombstoning yet)
+- PDF pricing extraction for InvoCare-style sites (PDF links are captured in
+  `source_record.raw_data.pdf_links` but not yet parsed)
+
+None of these are urgent. Ask before starting on the admin UI — it's coupled
+to the main platform and needs alignment.
+
+## How to get unblocked
+
+- **Technical questions:** ping Richie (contact below). Same-day turnaround on
+  weekdays AEST.
+- **Access to the main platform / staging env:** not set up yet — this repo
+  is self-contained and doesn't need it.
+- **Access to production data beyond `providers.db`:** ask. We can export
+  anonymised slices if you need them.
+
+## Workflow conventions
+
+- Branch naming: `feature/<short-name>` or `fix/<short-name>`.
+- PRs welcome on Gitea; no required reviewers but please ping Richie before
+  merging anything that changes schema or touches n8n workflow JSON.
+- Keep commits focused. If you need to restructure, do it in its own commit
+  before the functional change.
+- No force-pushes to `main`.
+
+## Contact
+
+Richie — richie@tensordesign.com.au
+
+Gitea: https://git.tensordesign.com.au/richie/Provider-Crawl
diff --git a/README.md b/README.md
index 841cfc8..c6988e0 100644
--- a/README.md
+++ b/README.md
@@ -13,14 +13,16 @@ n8n/         n8n workflows that orchestrate the crawlers on a schedule
 database/    SQLite + Postgres schemas, a seed dump, and a pre-populated dev DB
 ```
 
-Three documents explain how it works, in increasing depth:
+Four documents explain how it works, in increasing depth:
 
-1. **[n8n/PROCESS.md](n8n/PROCESS.md)** — Plain-English end-to-end walkthrough
+1. **[ONBOARDING.md](ONBOARDING.md)** — Human context: what the project is
+   for, ground rules, open work, how to get unblocked. **Start here if you're
+   new to the project.**
+2. **[n8n/PROCESS.md](n8n/PROCESS.md)** — Plain-English end-to-end walkthrough
    of the four workflows and how their output maps to database tables.
-   **Start here.**
-2. **[crawlers/PIPELINE.md](crawlers/PIPELINE.md)** — Architecture of the
+3. **[crawlers/PIPELINE.md](crawlers/PIPELINE.md)** — Architecture of the
    Python modules, source-by-source notes, listing-tier logic.
-3. **[n8n/README.md](n8n/README.md)** — How to stand up n8n locally with
+4. **[n8n/README.md](n8n/README.md)** — How to stand up n8n locally with
    Docker and import the workflow JSONs.
 
 ## Quick start (local, no n8n)