Add ONBOARDING.md for handoff context

Richie
2026-04-24 10:29:40 +10:00
parent cc91427789
commit 56dde9cd88
2 changed files with 132 additions and 5 deletions

ONBOARDING.md (new file, 125 lines added)

@@ -0,0 +1,125 @@
# Onboarding
Welcome — this doc is the human-facing context. See [README.md](README.md) for
the repo tour and [n8n/PROCESS.md](n8n/PROCESS.md) for the authoritative
end-to-end flow.
## What this project is for
Funeral Arranger is a consumer platform that helps Australian families compare
funeral directors and arrange services online. To be useful on day one, it
needs broad coverage of providers, not just the ones who've signed up.

This pipeline solves that. It populates the platform with every funeral
director we can find in Australia, pulling from public registers and the
providers' own websites. All auto-discovered providers stay **hidden and
unverified** until an admin manually reviews them; signed-up partners
(`verified = 1`) are authoritative and the pipeline never touches them.

The goal is a long tail of "listed" and "estimated" providers that motivates
real sign-ups: a provider who sees their competitor with full pricing on our
site has an incentive to claim and upgrade their own listing.
## Your first hour
1. Read [n8n/PROCESS.md](n8n/PROCESS.md) end to end. It's the single most
useful doc in the repo.
2. Poke at the included dev database:
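```bash
# Runs the queries non-interactively; drop the heredoc to explore at the sqlite> prompt.
sqlite3 database/providers.db <<'SQL'
.tables
-- How many brands sit in each listing tier
SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier;
-- A sample of brands that already have pricing
SELECT title, website, listing_tier FROM funeral_brand WHERE listing_tier='priced' LIMIT 10;
-- Packages with the most line items
SELECT p.title, p.funeral_type, COUNT(pi.id) AS line_items
FROM package p LEFT JOIN package_inclusion pi ON pi.package_id = p.id
GROUP BY p.id ORDER BY line_items DESC LIMIT 10;
SQL
```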
3. Skim `crawlers/PIPELINE.md` for module-level detail.
4. Open one of the n8n workflow JSONs (e.g.
`n8n/workflows/3_daily_enrichment.json`) to see the graph shape. You don't
need n8n running to read them.
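
If you'd rather see the graph as a list of edges than scroll through raw JSON, something like the sketch below may help. It assumes the standard n8n export layout (a top-level `connections` object keyed by source node name); the `workflow_edges` helper and the sample node names are made up for illustration and are not part of this repo.

```python
import json


def workflow_edges(workflow: dict) -> list[tuple[str, str]]:
    """Return (source, target) node-name pairs from an n8n workflow export."""
    edges = []
    for source, outputs in workflow.get("connections", {}).items():
        for branches in outputs.values():      # keyed by output type, e.g. "main"
            for branch in branches:            # one list per output index
                for link in branch or []:      # unused outputs can be null
                    edges.append((source, link["node"]))
    return edges


# Tiny inline stand-in for a real export; these node names are hypothetical.
sample = json.loads("""
{
  "nodes": [
    {"name": "Schedule", "type": "n8n-nodes-base.scheduleTrigger"},
    {"name": "Fetch pending", "type": "n8n-nodes-base.executeCommand"},
    {"name": "Enrich", "type": "n8n-nodes-base.executeCommand"}
  ],
  "connections": {
    "Schedule": {"main": [[{"node": "Fetch pending", "type": "main", "index": 0}]]},
    "Fetch pending": {"main": [[{"node": "Enrich", "type": "main", "index": 0}]]}
  }
}
""")

for src, dst in workflow_edges(sample):
    print(f"{src} -> {dst}")
```

To try it on a real workflow, replace `sample` with the result of `json.load()` on a file such as `n8n/workflows/3_daily_enrichment.json`.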
## Ground rules
- **Never write to `funeral_brand.verified` or `funeral_brand.hidden`.**
Those flags are flipped only by the admin review flow, which is a separate
frontend project. If something in the pipeline needs to signal "this is
ready to review", rely on `enrichment_status = 'complete'` and the computed
`listing_tier`, not these columns.
- **Don't use Gathered Here as a source of truth.** It's a competitor.
`crawlers/crawl_gathered_here.py` is historical tooling kept for reference —
it's not part of the active workflows. Enrichment must come from the
provider's own website or regulatory disclosure PDFs.
- **`listing_tier` is derived, not authored.** `compute_tiers.py` recomputes
it after every enrichment pass from package/inclusion data. Don't set it
manually.
- **The pipeline owns only a subset of columns.** The mapping is in
[n8n/PROCESS.md](n8n/PROCESS.md) under "Columns the pipeline writes vs
leaves alone". Branding, `funeral_home_id`, and admin flags are all
out-of-scope.
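
One way to keep the first rule hard to break is to put the guard in the SQL itself. The sketch below is illustrative only: `mark_enriched`, the `id` column, the `'pending'` default, and the in-memory table layout are assumptions, not the real schema (check `database/providers.db` for that). It shows an update that sets `enrichment_status` and nothing else, and that skips verified partners entirely.

```python
import sqlite3


def mark_enriched(conn: sqlite3.Connection, brand_id: int) -> int:
    """Flag one brand as enriched. Returns the number of rows changed (0 or 1).

    The statement never mentions `verified` or `hidden` in its SET clause,
    and the WHERE clause leaves signed-up partners (verified = 1) alone.
    """
    cur = conn.execute(
        "UPDATE funeral_brand SET enrichment_status = 'complete' "
        "WHERE id = ? AND verified = 0",
        (brand_id,),
    )
    return cur.rowcount


# Throwaway in-memory table; the column set is a guess for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE funeral_brand ("
    " id INTEGER PRIMARY KEY, title TEXT,"
    " verified INTEGER NOT NULL DEFAULT 0,"
    " hidden INTEGER NOT NULL DEFAULT 1,"
    " enrichment_status TEXT DEFAULT 'pending')"
)
conn.executemany(
    "INSERT INTO funeral_brand (id, title, verified, hidden) VALUES (?, ?, ?, ?)",
    [(1, "Auto-discovered Funerals", 0, 1), (2, "Signed-up Partner", 1, 0)],
)

changed_auto = mark_enriched(conn, 1)      # auto-discovered row is updated
changed_partner = mark_enriched(conn, 2)   # verified partner is skipped
print(changed_auto, changed_partner)
```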
## API keys and costs
You will need your own keys; none are included.

| Service | Required? | Cost | Where |
|---|---|---|---|
| Serper.dev | Yes, for website discovery | 2,500 free queries/mo, then $1 per 1,000 | https://serper.dev |
| ABR (ABN lookup) | Optional | Free (register for a GUID) | https://abr.business.gov.au/Tools/WebServices |
| Anthropic (Claude Haiku) | Only for AI pricing extraction | ~$2 for a full 1,463-provider pass | https://console.anthropic.com |

Without Anthropic, Phase A of Workflow 3 still runs: pricing page text gets
saved into `source_record` for manual inspection. You just don't get the
structured `package` / `package_inclusion` output.
```bash
cp crawlers/config.example.json crawlers/config.json
# edit crawlers/config.json
```
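
For a rough feel of what a website-discovery pass costs on Serper, the arithmetic can be sketched as below. The free allowance and the per-1,000 price come from the table above; the number of queries per provider is an assumption (the pipeline's real query count isn't documented here), and the sketch assumes the whole pass lands in a single billing month with simple linear pricing.

```python
FREE_QUERIES_PER_MONTH = 2_500   # free allowance quoted in the table above
USD_PER_1000_QUERIES = 1.00      # paid rate quoted in the table above


def serper_cost_usd(providers: int, queries_per_provider: int) -> float:
    """Estimate the Serper bill for one discovery pass within a single month."""
    total_queries = providers * queries_per_provider
    billable = max(0, total_queries - FREE_QUERIES_PER_MONTH)
    return billable / 1000 * USD_PER_1000_QUERIES


# queries_per_provider is a guess; adjust it to match what the crawler really does.
one_query_each = serper_cost_usd(1_463, 1)     # stays inside the free allowance
three_queries_each = serper_cost_usd(1_463, 3)
print(one_query_each, three_queries_each)
```

At one query per provider, a full 1,463-provider pass fits inside the free allowance; retries and fallback searches are what push a run into paid territory.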
## What's in play
**Working and tested:**
- Three source crawlers (VIC Register, Funerals Australia, NFDA)
- Cross-source dedup (fuzzy name + postcode + ABN)
- Website discovery (Serper + DDG fallback + URL guessing)
- ABN validation
- Pricing page discovery and text extraction
- Four n8n workflow JSONs

**Open work (any of these is a reasonable pickup):**
- Google Places integration for location rating / `place_id` / richer address
- Playwright-based enrichment for JS-rendered sites (enrichment currently fails
on the ~37% of sites whose server-rendered HTML contains nothing useful)
- Admin review UI — needs design input, not just code
- Stale-package cleanup in the monthly refresh workflow (new packages get
inserted alongside old; no tombstoning yet)
- PDF pricing extraction for InvoCare-style sites (PDF links are captured in
`source_record.raw_data.pdf_links` but not yet parsed)

None of these are urgent. Ask before starting on the admin UI; it's coupled
to the main platform and needs alignment.
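
If "fuzzy name + postcode + ABN" dedup is new to you, the sketch below shows the general idea using only the standard library. It is not the repo's implementation: the `normalise` and `same_provider` helpers, the record keys, the suffix list, and the 0.85 threshold are all assumptions chosen for illustration. See `crawlers/PIPELINE.md` for how the real matcher behaves.

```python
import re
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lowercase, strip punctuation and common company suffixes, squeeze spaces."""
    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    name = re.sub(r"\b(pty|ltd|limited)\b", " ", name)
    return " ".join(name.split())


def same_provider(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Guess whether two source records describe the same funeral director."""
    # Assumption: a shared ABN is treated as decisive.
    if a.get("abn") and a.get("abn") == b.get("abn"):
        return True
    # Otherwise require the same postcode plus a close name match.
    if a.get("postcode") != b.get("postcode"):
        return False
    ratio = SequenceMatcher(None, normalise(a["name"]), normalise(b["name"])).ratio()
    return ratio >= threshold


# Made-up records from two hypothetical sources.
vic = {"name": "Smith & Sons Funerals Pty Ltd", "postcode": "3000", "abn": None}
nfda = {"name": "Smith and Sons Funerals", "postcode": "3000", "abn": "12345678901"}
other = {"name": "Jones Funeral Services", "postcode": "3000", "abn": None}

print(same_provider(vic, nfda))
print(same_provider(vic, other))
```

Requiring a postcode match before the fuzzy comparison keeps same-named businesses in different suburbs apart, at the cost of missing a provider whose sources disagree on postcode.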
## How to get unblocked
- **Technical questions:** ping Richie (contact below). Same-day turnaround on
weekdays AEST.
- **Access to the main platform / staging env:** not set up yet — this repo
is self-contained and doesn't need it.
- **Access to production data beyond `providers.db`:** ask. We can export
anonymised slices if you need them.
## Workflow conventions
- Branch naming: `feature/<short-name>` or `fix/<short-name>`.
- PRs welcome on Gitea; no required reviewers but please ping Richie before
merging anything that changes schema or touches n8n workflow JSON.
- Keep commits focused. If you need to restructure, do it in its own commit
before the functional change.
- No force-pushes to `main`.
## Contact
Richie — richie@tensordesign.com.au
Gitea: https://git.tensordesign.com.au/richie/Provider-Crawl