Add ONBOARDING.md for handoff context
125
ONBOARDING.md
Normal file
@@ -0,0 +1,125 @@
# Onboarding

Welcome. This doc is the human-facing context. See [README.md](README.md) for
the repo tour and [n8n/PROCESS.md](n8n/PROCESS.md) for the authoritative
end-to-end flow.

## What this project is for

Funeral Arranger is a consumer platform that helps Australian families compare
funeral directors and arrange services online. To be useful on day one, it
needs broad coverage of providers, not just the ones who've signed up.

This pipeline solves that. It populates the platform with every funeral
director we can find in Australia, pulling from public registers and the
providers' own websites. All auto-discovered providers stay **hidden and
unverified** until an admin manually reviews them; signed-up partners
(`verified = 1`) are authoritative and the pipeline never touches them.

The goal is a long tail of "listed" and "estimated" providers that motivates
real sign-ups: a provider who sees their competitor with full pricing on our
site has an incentive to claim and upgrade their own listing.

## Your first hour

1. Read [n8n/PROCESS.md](n8n/PROCESS.md) end to end. It's the single most
   useful doc in the repo.
2. Poke at the included dev database:
   ```bash
   # Feed the queries to sqlite3 through a heredoc so the block runs as-is in bash.
   sqlite3 database/providers.db <<'SQL'
   .tables
   -- How many brands sit in each listing tier
   SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier;
   -- A sample of brands that already have pricing
   SELECT title, website, listing_tier FROM funeral_brand WHERE listing_tier='priced' LIMIT 10;
   -- Packages with the most inclusion line items
   SELECT p.title, p.funeral_type, COUNT(pi.id) AS line_items
   FROM package p LEFT JOIN package_inclusion pi ON pi.package_id = p.id
   GROUP BY p.id ORDER BY line_items DESC LIMIT 10;
   SQL
   ```
3. Skim `crawlers/PIPELINE.md` for module-level detail.
4. Open one of the n8n workflow JSONs (e.g.
   `n8n/workflows/3_daily_enrichment.json`) to see the graph shape. You don't
   need n8n running to read them.

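For step 4, a workflow export can also be read programmatically without n8n running. The sketch below assumes the usual n8n export layout: a top-level `nodes` list plus a `connections` map keyed by source node name. The helper name `summarize_workflow` and the sample node names are illustrative and not taken from this repo.

```python
import json


def summarize_workflow(workflow: dict) -> list[tuple[str, str]]:
    """Return (source, target) edges from an n8n-style workflow export.

    Assumes `connections` maps a source node name to output types
    (usually "main"), each holding one list of targets per output slot.
    """
    edges = []
    for source, outputs in workflow.get("connections", {}).items():
        for slots in outputs.values():
            for slot in slots:
                # An unconnected output slot may be null or empty.
                for target in slot or []:
                    edges.append((source, target["node"]))
    return edges


# Hypothetical miniature workflow, shaped like an n8n export.
sample = json.loads("""
{
  "nodes": [
    {"name": "Schedule", "type": "n8n-nodes-base.scheduleTrigger"},
    {"name": "Run enrichment", "type": "n8n-nodes-base.executeCommand"},
    {"name": "Recompute tiers", "type": "n8n-nodes-base.executeCommand"}
  ],
  "connections": {
    "Schedule": {"main": [[{"node": "Run enrichment", "type": "main", "index": 0}]]},
    "Run enrichment": {"main": [[{"node": "Recompute tiers", "type": "main", "index": 0}]]}
  }
}
""")

for src, dst in summarize_workflow(sample):
    print(f"{src} -> {dst}")
```

To try it on a real file, load `n8n/workflows/3_daily_enrichment.json` with `json.load` in place of `sample`.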
## Ground rules

- **Never write to `funeral_brand.verified` or `funeral_brand.hidden`.**
  Flipping those flags belongs to the admin review flow, which is a separate
  frontend project. If something in the pipeline needs to signal "this is
  ready to review", use `enrichment_status = 'complete'` and a good
  `listing_tier`, not these columns.
- **Don't use Gathered Here as a source of truth.** It's a competitor.
  `crawlers/crawl_gathered_here.py` is historical tooling kept for reference;
  it's not part of the active workflows. Enrichment must come from the
  provider's own website or regulatory disclosure PDFs.
- **`listing_tier` is derived, not authored.** `compute_tiers.py` recomputes
  it after every enrichment pass from package/inclusion data. Don't set it
  manually.
- **The pipeline owns only a subset of columns.** The mapping is in
  [n8n/PROCESS.md](n8n/PROCESS.md) under "Columns the pipeline writes vs
  leaves alone". Branding, `funeral_home_id`, and admin flags are all
  out of scope.

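The first and last rules can be enforced mechanically at the point where the pipeline writes. The sketch below is one possible guard, not the repo's implementation. It assumes a `funeral_brand` table with `id`, `website`, `enrichment_status`, `verified`, and `hidden` columns, and that unverified rows carry `verified = 0`. The allow-list `PIPELINE_COLUMNS`, the helper `pipeline_update`, and the `'pending'` status value are hypothetical; the authoritative column mapping is the one in n8n/PROCESS.md.

```python
import sqlite3

# Hypothetical allow-list; the real mapping lives in n8n/PROCESS.md.
PIPELINE_COLUMNS = {"website", "enrichment_status"}


def pipeline_update(conn, brand_id, changes):
    """Update pipeline-owned columns on an unverified brand only.

    Returns the number of rows changed (0 if the brand is verified).
    """
    forbidden = set(changes) - PIPELINE_COLUMNS
    if forbidden:
        raise ValueError(f"pipeline may not write: {sorted(forbidden)}")
    if not changes:
        return 0
    # Column names are safe to interpolate because they passed the allow-list.
    assignments = ", ".join(f"{col} = ?" for col in changes)
    cur = conn.execute(
        f"UPDATE funeral_brand SET {assignments} "
        "WHERE id = ? AND verified = 0",
        [*changes.values(), brand_id],
    )
    return cur.rowcount


# Throwaway in-memory table standing in for the dev database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE funeral_brand (id INTEGER PRIMARY KEY, website TEXT, "
    "enrichment_status TEXT, verified INTEGER, hidden INTEGER)"
)
conn.executemany(
    "INSERT INTO funeral_brand VALUES (?, ?, ?, ?, ?)",
    [(1, None, "pending", 0, 1), (2, "https://partner.example", "pending", 1, 0)],
)

print(pipeline_update(conn, 1, {"enrichment_status": "complete"}))  # unverified row
print(pipeline_update(conn, 2, {"enrichment_status": "complete"}))  # verified row, left alone
```

Putting the `verified = 0` condition in the `WHERE` clause means a partner row is skipped even if a caller passes its id by mistake, and the allow-list check fails loudly on any attempt to touch `verified` or `hidden`.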
## API keys and costs

You will need your own keys; none are included.

| Service | Required? | Cost | Where |
|---|---|---|---|
| Serper.dev | Yes, for website discovery | 2,500 free queries/mo, then $1 per 1K | https://serper.dev |
| ABR (ABN lookup) | Optional | Free (register for a GUID) | https://abr.business.gov.au/Tools/WebServices |
| Anthropic (Claude Haiku) | Only for AI pricing extraction | ~$2 for a full 1,463-provider pass | https://console.anthropic.com |

Without Anthropic, Phase A of Workflow 3 still runs: pricing page text gets
saved into `source_record` for manual inspection. You just don't get the
structured `package` / `package_inclusion` output.

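As a rough guide to the Serper line in the table, the sketch below estimates a month's spend from a query count. It assumes overage is billed pro rata at $1 per 1,000 queries beyond the 2,500 free ones, and the number of queries each provider needs is a guess rather than a measured figure, so check current Serper pricing before budgeting.

```python
def serper_cost_usd(queries: int, free_quota: int = 2_500, usd_per_1k: float = 1.0) -> float:
    """Estimated cost of `queries` searches in one month, billed pro rata."""
    billable = max(0, queries - free_quota)
    return billable * usd_per_1k / 1_000


# Assumption: one query per provider. A 1,463-provider pass then fits in the free tier.
print(serper_cost_usd(1_463))
# Assumption: three queries per provider (a hypothetical retry-heavy run).
print(serper_cost_usd(3 * 1_463))
```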
```bash
cp crawlers/config.example.json crawlers/config.json
# edit crawlers/config.json
```

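Before kicking off a run it can help to confirm which services are actually configured. The sketch below assumes `crawlers/config.json` is a flat JSON object. The key names `serper_api_key`, `abr_guid`, and `anthropic_api_key` are hypothetical, so match them to whatever `config.example.json` really uses.

```python
import json
from pathlib import Path

# Hypothetical key names; check crawlers/config.example.json for the real ones.
SERVICES = {
    "serper_api_key": "website discovery (required)",
    "abr_guid": "ABN validation (optional)",
    "anthropic_api_key": "AI pricing extraction (optional)",
}


def missing_services(config: dict) -> list[str]:
    """Return descriptions of services whose key is absent or blank."""
    return [
        label
        for key, label in SERVICES.items()
        if not str(config.get(key) or "").strip()
    ]


def load_config(path: str = "crawlers/config.json") -> dict:
    """Read the config file, or return an empty dict if it does not exist yet."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


for label in missing_services(load_config()):
    print(f"not configured: {label}")
```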
## What's in play

**Working and tested:**
- Three source crawlers (VIC Register, Funerals Australia, NFDA)
- Cross-source dedup (fuzzy name + postcode + ABN)
- Website discovery (Serper + DDG fallback + URL guessing)
- ABN validation
- Pricing page discovery and text extraction
- Four n8n workflow JSONs

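To make the cross-source dedup item concrete, here is a minimal sketch of matching on fuzzy name plus postcode plus ABN. It uses `difflib.SequenceMatcher` from the standard library as a stand-in. The repo's actual matcher, thresholds, and field names may differ; the `Provider` record, the suffix list, and the 0.85 cutoff are assumptions for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Provider:
    name: str
    postcode: str
    abn: Optional[str] = None


def normalise(name: str) -> str:
    """Lowercase, replace punctuation with spaces, drop common filler words."""
    words = "".join(c if c.isalnum() else " " for c in name.lower()).split()
    return " ".join(w for w in words if w not in {"pty", "ltd", "the"})


def is_duplicate(a: Provider, b: Provider, threshold: float = 0.85) -> bool:
    """Decide whether two records from different sources are the same provider.

    When both records carry an ABN, the ABN alone decides. Otherwise the
    postcodes must match and the normalised names must be similar enough.
    """
    if a.abn and b.abn:
        return a.abn == b.abn
    if a.postcode != b.postcode:
        return False
    ratio = SequenceMatcher(None, normalise(a.name), normalise(b.name)).ratio()
    return ratio >= threshold


# Hypothetical records, as two different registers might list one business.
vic = Provider("Smith & Sons Funerals Pty Ltd", "3000")
nfda = Provider("Smith and Sons Funerals", "3000")
print(is_duplicate(vic, nfda))
```

Letting a matching or conflicting ABN short-circuit the name comparison is a design choice in this sketch: it trusts the register identifier over spelling, at the cost of treating one business that trades under two ABNs as two providers.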
**Open work (any of these is a reasonable pickup):**
- Google Places integration for location rating / `place_id` / richer address
- Playwright-based enrichment for JS-rendered sites (enrichment currently
  fails on the ~37% of sites that server-side render nothing useful)
- Admin review UI (needs design input, not just code)
- Stale-package cleanup in the monthly refresh workflow (new packages get
  inserted alongside old; no tombstoning yet)
- PDF pricing extraction for InvoCare-style sites (PDF links are captured in
  `source_record.raw_data.pdf_links` but not yet parsed)

None of these are urgent. Ask before starting on the admin UI; it's coupled
to the main platform and needs alignment.

## How to get unblocked

- **Technical questions:** ping Richie (contact below). Same-day turnaround on
  weekdays AEST.
- **Access to the main platform / staging env:** not set up yet; this repo
  is self-contained and doesn't need it.
- **Access to production data beyond `providers.db`:** ask. We can export
  anonymised slices if you need them.

## Workflow conventions

- Branch naming: `feature/<short-name>` or `fix/<short-name>`.
- PRs are welcome on Gitea. There are no required reviewers, but please ping
  Richie before merging anything that changes schema or touches n8n workflow
  JSON.
- Keep commits focused. If you need to restructure, do it in its own commit
  before the functional change.
- No force-pushes to `main`.

## Contact

Richie: richie@tensordesign.com.au

Gitea: https://git.tensordesign.com.au/richie/Provider-Crawl