Add ONBOARDING.md for handoff context

2026-04-24 10:29:40 +10:00
parent cc91427789
commit 56dde9cd88
2 changed files with 132 additions and 5 deletions
--- a/ONBOARDING.md
+++ b/ONBOARDING.md
@@ -0,0 +1,125 @@
+# Onboarding
+
+Welcome — this doc is the human-facing context. See [README.md](README.md) for
+the repo tour and [n8n/PROCESS.md](n8n/PROCESS.md) for the authoritative
+end-to-end flow.
+
+## What this project is for
+
+Funeral Arranger is a consumer platform that helps Australian families compare
+funeral directors and arrange services online. To be useful on day one, it
+needs broad coverage of providers — not just the ones who've signed up.
+
+This pipeline solves that. It populates the platform with every funeral
+director we can find in Australia, pulling from public registers and the
+providers' own websites. All auto-discovered providers stay **hidden and
+unverified** until an admin manually reviews them; signed-up partners
+(`verified = 1`) are authoritative and the pipeline never touches them.
+
+The goal is a long tail of "listed" and "estimated" providers that motivates
+real sign-ups — a provider who sees their competitor with full pricing on our
+site has an incentive to claim and upgrade their own listing.
+
+## Your first hour
+
+1. Read [n8n/PROCESS.md](n8n/PROCESS.md) end to end. It's the single most
+   useful doc in the repo.
+2. Poke at the included dev database:
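+   ```bash
+   # Open the bundled dev database (relative path, so run from the repo root).
+   sqlite3 database/providers.db
+   -- Everything below is typed at the sqlite> prompt.
+   .tables
+   -- Providers per listing tier
+   SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier;
+   -- Sample of providers in the 'priced' tier
+   SELECT title, website, listing_tier FROM funeral_brand WHERE listing_tier='priced' LIMIT 10;
+   -- Packages with the most inclusion line items
+   SELECT p.title, p.funeral_type, COUNT(pi.id) AS line_items
+     FROM package p LEFT JOIN package_inclusion pi ON pi.package_id = p.id
+     GROUP BY p.id ORDER BY line_items DESC LIMIT 10;
+   ```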
+3. Skim `crawlers/PIPELINE.md` for module-level detail.
+4. Open one of the n8n workflow JSONs (e.g.
+   `n8n/workflows/3_daily_enrichment.json`) to see the graph shape. You don't
+   need n8n running to read them.
+
+## Ground rules
+
+- **Never write to `funeral_brand.verified` or `funeral_brand.hidden`.**
+  Flipping those flags belongs to the admin review flow, which is a separate
+  frontend project. If something in the pipeline needs to signal "this is
+  ready to review", use `enrichment_status = 'complete'` and the resulting
+  `listing_tier`, not these columns.
+- **Don't use Gathered Here as a source of truth.** It's a competitor.
+  `crawlers/crawl_gathered_here.py` is historical tooling kept for reference —
+  it's not part of the active workflows. Enrichment must come from the
+  provider's own website or regulatory disclosure PDFs.
+- **`listing_tier` is derived, not authored.** `compute_tiers.py` recomputes
+  it after every enrichment pass from package/inclusion data. Don't set it
+  manually.
+- **The pipeline owns only a subset of columns.** The mapping is in
+  [n8n/PROCESS.md](n8n/PROCESS.md) under "Columns the pipeline writes vs
+  leaves alone". Branding, `funeral_home_id`, and admin flags are all
+  out-of-scope.
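+
+One way to make the first and last rules hard to break is to route every
+pipeline write through a single guarded helper. The sketch below is
+illustrative only: `update_brand`, the `PIPELINE_COLUMNS` allowlist, and the
+`id` primary key are assumptions, not code from this repo. The real column
+list lives in [n8n/PROCESS.md](n8n/PROCESS.md).
+
+```python
+import sqlite3
+
+# Hypothetical allowlist; the authoritative mapping is in n8n/PROCESS.md.
+PIPELINE_COLUMNS = {"website", "enrichment_status"}
+
+
+def update_brand(conn: sqlite3.Connection, brand_id: int, **fields) -> int:
+    """Update pipeline-owned columns on an unverified brand.
+
+    Returns the number of rows changed (0 if the brand is verified).
+    """
+    not_allowed = set(fields) - PIPELINE_COLUMNS
+    if not_allowed:
+        # Covers `verified`, `hidden`, and anything else outside the allowlist.
+        raise ValueError(f"pipeline may not write: {sorted(not_allowed)}")
+    if not fields:
+        return 0
+    # Column names are safe to interpolate because they passed the allowlist.
+    assignments = ", ".join(f"{col} = ?" for col in fields)
+    cur = conn.execute(
+        f"UPDATE funeral_brand SET {assignments} "
+        "WHERE id = ? AND verified IS NOT 1",  # never touch signed-up partners
+        (*fields.values(), brand_id),
+    )
+    return cur.rowcount
+```
+
+Calling `update_brand(conn, 42, enrichment_status="complete")` changes nothing
+if brand 42 is verified, and passing `verified=1` or `hidden=0` raises before
+any SQL runs.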
+
+## API keys and costs
+
+You will need your own keys — none are included.
+
+| Service | Required? | Cost | Where |
+|---|---|---|---|
+| Serper.dev | Yes, for website discovery | 2,500 free queries/mo, then $1 per 1K queries | https://serper.dev |
+| ABR (ABN lookup) | Optional | Free (register for a GUID) | https://abr.business.gov.au/Tools/WebServices |
+| Anthropic (Claude Haiku) | Only for AI pricing extraction | ~$2 for a full 1,463-provider pass | https://console.anthropic.com |
+
+Without Anthropic, Phase A of Workflow 3 still runs — pricing page text gets
+saved into `source_record` for manual inspection. You just don't get the
+structured `package` / `package_inclusion` output.
+
+```bash
+cp crawlers/config.example.json crawlers/config.json
+# edit crawlers/config.json
+```
+
+## What's in play
+
+**Working and tested:**
+- Three source crawlers (VIC Register, Funerals Australia, NFDA)
+- Cross-source dedup (fuzzy name + postcode + ABN)
+- Website discovery (Serper + DDG fallback + URL guessing)
+- ABN validation
+- Pricing page discovery and text extraction
+- Four n8n workflow JSONs
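+
+For orientation on the ABN validation item: an ABN can be sanity-checked
+offline with its standard checksum before any ABR lookup. This is a sketch of
+that well-known algorithm, not necessarily how the crawlers implement it;
+`is_valid_abn` is a hypothetical name.
+
+```python
+# Standard ABN weighting factors, one per digit.
+ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)
+
+
+def is_valid_abn(raw: str) -> bool:
+    """Return True if `raw` is an 11-digit ABN with a valid checksum."""
+    compact = raw.replace(" ", "")
+    if len(compact) != 11 or not (compact.isascii() and compact.isdigit()):
+        return False
+    digits = [int(ch) for ch in compact]
+    digits[0] -= 1  # the algorithm subtracts 1 from the leading digit
+    total = sum(d * w for d, w in zip(digits, ABN_WEIGHTS))
+    return total % 89 == 0
+
+
+print(is_valid_abn("51 824 753 556"))  # True
+print(is_valid_abn("51 824 753 557"))  # False
+```
+
+A passing checksum only shows the number is well-formed; whether it belongs to
+the provider in question still needs the ABR lookup.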
+
+**Open work (any of these is a reasonable pickup):**
+- Google Places integration for location rating / `place_id` / richer address
+- Playwright-based enrichment for JS-rendered sites (enrichment currently
+  fails on the ~37% of sites whose server-rendered HTML has nothing useful)
+- Admin review UI — needs design input, not just code
+- Stale-package cleanup in the monthly refresh workflow (new packages get
+  inserted alongside old; no tombstoning yet)
+- PDF pricing extraction for InvoCare-style sites (PDF links are captured in
+  `source_record.raw_data.pdf_links` but not yet parsed)
+
+None of these are urgent. Ask before starting on the admin UI — it's coupled
+to the main platform and needs alignment.
+
+## How to get unblocked
+
+- **Technical questions:** ping Richie (contact below). Same-day turnaround on
+  weekdays AEST.
+- **Access to the main platform / staging env:** not set up yet — this repo
+  is self-contained and doesn't need it.
+- **Access to production data beyond `providers.db`:** ask. We can export
+  anonymised slices if you need them.
+
+## Workflow conventions
+
+- Branch naming: `feature/<short-name>` or `fix/<short-name>`.
+- PRs welcome on Gitea; no required reviewers but please ping Richie before
+  merging anything that changes schema or touches n8n workflow JSON.
+- Keep commits focused. If you need to restructure, do it in its own commit
+  before the functional change.
+- No force-pushes to `main`.
+
+## Contact
+
+Richie — richie@tensordesign.com.au
+
+Gitea: https://git.tensordesign.com.au/richie/Provider-Crawl