From 56dde9cd88d636bf68d35df3b81faabb450a5e76 Mon Sep 17 00:00:00 2001
From: Richie
Date: Fri, 24 Apr 2026 10:29:40 +1000
Subject: [PATCH] Add ONBOARDING.md for handoff context

---
 ONBOARDING.md | 125 ++++++++++++++++++++++++++++++++++++++++++++++++++
 README.md     |  12 +++--
 2 files changed, 132 insertions(+), 5 deletions(-)
 create mode 100644 ONBOARDING.md

diff --git a/ONBOARDING.md b/ONBOARDING.md
new file mode 100644
index 0000000..f8f42b1
--- /dev/null
+++ b/ONBOARDING.md
@@ -0,0 +1,125 @@
+# Onboarding
+
+Welcome — this doc is the human-facing context. See [README.md](README.md) for
+the repo tour and [n8n/PROCESS.md](n8n/PROCESS.md) for the authoritative
+end-to-end flow.
+
+## What this project is for
+
+Funeral Arranger is a consumer platform that helps Australian families compare
+funeral directors and arrange services online. To be useful on day one, it
+needs broad coverage of providers — not just the ones who've signed up.
+
+This pipeline solves that. It populates the platform with every funeral
+director we can find in Australia, pulling from public registers and the
+providers' own websites. All auto-discovered providers stay **hidden and
+unverified** until an admin manually reviews them; signed-up partners
+(`verified = 1`) are authoritative and the pipeline never touches them.
+
+The goal is a long tail of "listed" and "estimated" providers that motivates
+real sign-ups — a provider who sees their competitor with full pricing on our
+site has an incentive to claim and upgrade their own listing.
+
+## Your first hour
+
+1. Read [n8n/PROCESS.md](n8n/PROCESS.md) end to end. It's the single most
+   useful doc in the repo.
+2. Poke at the included dev database:
+   ```bash
+   sqlite3 database/providers.db
+   .tables
+   SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier;
+   SELECT title, website, listing_tier FROM funeral_brand WHERE listing_tier='priced' LIMIT 10;
+   SELECT p.title, p.funeral_type, COUNT(pi.id) AS line_items
+   FROM package p LEFT JOIN package_inclusion pi ON pi.package_id = p.id
+   GROUP BY p.id ORDER BY line_items DESC LIMIT 10;
+   ```
+3. Skim `crawlers/PIPELINE.md` for module-level detail.
+4. Open one of the n8n workflow JSONs (e.g.
+   `n8n/workflows/3_daily_enrichment.json`) to see the graph shape. You don't
+   need n8n running to read them.
+
+## Ground rules
+
+- **Never write to `funeral_brand.verified` or `funeral_brand.hidden`.**
+  Those flip states belong to the admin review flow, which is a separate
+  frontend project. If something in the pipeline needs to signal "this is
+  ready to review", use `enrichment_status = 'complete'` and a good
+  `listing_tier`, not these columns.
+- **Don't use Gathered Here as a source of truth.** It's a competitor.
+  `crawlers/crawl_gathered_here.py` is historical tooling kept for reference —
+  it's not part of the active workflows. Enrichment must come from the
+  provider's own website or regulatory disclosure PDFs.
+- **`listing_tier` is derived, not authored.** `compute_tiers.py` recomputes
+  it after every enrichment pass from package/inclusion data. Don't set it
+  manually.
+- **Pipeline only owns a subset of columns.** The mapping is in
+  [n8n/PROCESS.md](n8n/PROCESS.md) under "Columns the pipeline writes vs
+  leaves alone". Branding, `funeral_home_id`, and admin flags are all
+  out-of-scope.
+
+## API keys and costs
+
+You will need your own keys — none are included.
+
+| Service | Required? | Cost | Where |
+|---|---|---|---|
+| Serper.dev | Yes, for website discovery | 2,500 free/mo, then $1/1K | https://serper.dev |
+| ABR (ABN lookup) | Optional, free | Free (register for a GUID) | https://abr.business.gov.au/Tools/WebServices |
+| Anthropic (Claude Haiku) | Only for AI pricing extraction | ~$2 for a full 1,463-provider pass | https://console.anthropic.com |
+
+Without Anthropic, Phase A of Workflow 3 still runs — pricing page text gets
+saved into `source_record` for manual inspection. You just don't get the
+structured `package` / `package_inclusion` output.
+
+```bash
+cp crawlers/config.example.json crawlers/config.json
+# edit crawlers/config.json
+```
+
+## What's in play
+
+**Working and tested:**
+- Three source crawlers (VIC Register, Funerals Australia, NFDA)
+- Cross-source dedup (fuzzy name + postcode + ABN)
+- Website discovery (Serper + DDG fallback + URL guessing)
+- ABN validation
+- Pricing page discovery and text extraction
+- Four n8n workflow JSONs
+
+**Open work (any of these is a reasonable pickup):**
+- Google Places integration for location rating / `place_id` / richer address
+- Playwright-based enrichment for JS-rendered sites (currently fails on ~37%
+  of sites that SSR nothing useful)
+- Admin review UI — needs design input, not just code
+- Stale-package cleanup in the monthly refresh workflow (new packages get
+  inserted alongside old; no tombstoning yet)
+- PDF pricing extraction for InvoCare-style sites (PDF links are captured in
+  `source_record.raw_data.pdf_links` but not yet parsed)
+
+None of these are urgent. Ask before starting on the admin UI — it's coupled
+to the main platform and needs alignment.
+
+## How to get unblocked
+
+- **Technical questions:** ping Richie (contact below). Same-day turnaround on
+  weekdays AEST.
+- **Access to the main platform / staging env:** not set up yet — this repo
+  is self-contained and doesn't need it.
+- **Access to production data beyond `providers.db`:** ask. We can export
+  anonymised slices if you need them.
+
+## Workflow conventions
+
+- Branch naming: `feature/` or `fix/`.
+- PRs welcome on Gitea; no required reviewers but please ping Richie before
+  merging anything that changes schema or touches n8n workflow JSON.
+- Keep commits focused. If you need to restructure, do it in its own commit
+  before the functional change.
+- No force-pushes to `main`.
+
+## Contact
+
+Richie — richie@tensordesign.com.au
+
+Gitea: https://git.tensordesign.com.au/richie/Provider-Crawl
diff --git a/README.md b/README.md
index 841cfc8..c6988e0 100644
--- a/README.md
+++ b/README.md
@@ -13,14 +13,16 @@ n8n/ n8n workflows that orchestrate the crawlers on a schedule
 database/ SQLite + Postgres schemas, a seed dump, and a pre-populated dev DB
 ```
 
-Three documents explain how it works, in increasing depth:
+Four documents explain how it works, in increasing depth:
 
-1. **[n8n/PROCESS.md](n8n/PROCESS.md)** — Plain-English end-to-end walkthrough
+1. **[ONBOARDING.md](ONBOARDING.md)** — Human context: what the project is
+   for, ground rules, open work, how to get unblocked. **Start here if you're
+   new to the project.**
+2. **[n8n/PROCESS.md](n8n/PROCESS.md)** — Plain-English end-to-end walkthrough
    of the four workflows and how their output maps to database tables.
-   **Start here.**
-2. **[crawlers/PIPELINE.md](crawlers/PIPELINE.md)** — Architecture of the
+3. **[crawlers/PIPELINE.md](crawlers/PIPELINE.md)** — Architecture of the
    Python modules, source-by-source notes, listing-tier logic.
-3. **[n8n/README.md](n8n/README.md)** — How to stand up n8n locally with
+4. **[n8n/README.md](n8n/README.md)** — How to stand up n8n locally with
    Docker and import the workflow JSONs.
 
 ## Quick start (local, no n8n)