# Onboarding

Welcome. This doc is the human-facing context. See [README.md](README.md) for
the repo tour and [n8n/PROCESS.md](n8n/PROCESS.md) for the authoritative
end-to-end flow.

## What this project is for

Funeral Arranger is a consumer platform that helps Australian families compare
funeral directors and arrange services online. To be useful on day one, it
needs broad coverage of providers, not just the ones who've signed up.

This pipeline solves that. It populates the platform with every funeral
director we can find in Australia, pulling from public registers and the
providers' own websites. All auto-discovered providers stay **hidden and
unverified** until an admin manually reviews them; signed-up partners
(`verified = 1`) are authoritative and the pipeline never touches them.

The goal is a long tail of "listed" and "estimated" providers that motivates
real sign-ups: a provider who sees their competitor with full pricing on our
site has an incentive to claim and upgrade their own listing.

## Your first hour

1. Read [n8n/PROCESS.md](n8n/PROCESS.md) end to end. It's the single most
   useful doc in the repo.
2. Poke at the included dev database:
   ```bash
   # Feed the commands to sqlite3 through a heredoc so the block runs as pasted.
   sqlite3 database/providers.db <<'SQL'
   .tables
   -- How many brands sit in each listing tier?
   SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier;
   -- A sample of brands that already have pricing.
   SELECT title, website, listing_tier FROM funeral_brand WHERE listing_tier='priced' LIMIT 10;
   -- Packages with the most line items.
   SELECT p.title, p.funeral_type, COUNT(pi.id) AS line_items
   FROM package p LEFT JOIN package_inclusion pi ON pi.package_id = p.id
   GROUP BY p.id ORDER BY line_items DESC LIMIT 10;
   SQL
   ```
3. Skim `crawlers/PIPELINE.md` for module-level detail.
4. Open one of the n8n workflow JSONs (e.g.
   `n8n/workflows/3_daily_enrichment.json`) to see the graph shape. You don't
   need n8n running to read them.

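If you would rather explore from Python, the same tier breakdown can be sketched with the standard-library `sqlite3` module. Opening the file through a `mode=ro` URI is a precaution, not a repo convention: it makes accidental writes fail. The helper name `tier_counts` is illustrative; the table and column names come from the queries in step 2.

```python
import sqlite3
from pathlib import Path


def tier_counts(db_path: str) -> dict:
    """Return {listing_tier: row_count} for funeral_brand, opened read-only."""
    # as_uri() needs an absolute path; mode=ro makes any write raise an error.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)
```

Calling `tier_counts("database/providers.db")` from the repo root should give the same counts as the first `SELECT` in step 2.
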
## Ground rules

- **Never write to `funeral_brand.verified` or `funeral_brand.hidden`.**
  Those flip states belong to the admin review flow, which is a separate
  frontend project. If something in the pipeline needs to signal "this is
  ready to review", rely on `enrichment_status = 'complete'` and the computed
  `listing_tier`, not these columns.
- **Don't use Gathered Here as a source of truth.** It's a competitor.
  `crawlers/crawl_gathered_here.py` is historical tooling kept for reference;
  it's not part of the active workflows. Enrichment must come from the
  provider's own website or regulatory disclosure PDFs.
- **`listing_tier` is derived, not authored.** `compute_tiers.py` recomputes
  it after every enrichment pass from package/inclusion data. Don't set it
  manually.
- **The pipeline owns only a subset of columns.** The mapping is in
  [n8n/PROCESS.md](n8n/PROCESS.md) under "Columns the pipeline writes vs
  leaves alone". Branding, `funeral_home_id`, and admin flags are all
  out of scope.

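One way to make these rules hard to break by accident is to route pipeline writes through a helper that rejects protected columns and skips verified rows. This is a hypothetical sketch, not code from the repo: `PROTECTED_COLUMNS`, `ALLOWED_COLUMNS` and `update_brand` are illustrative names, the allow-list is a placeholder for the real mapping in n8n/PROCESS.md, and it assumes `enrichment_status` lives on `funeral_brand`.

```python
import sqlite3

# Columns the ground rules say the pipeline must not write directly.
PROTECTED_COLUMNS = {"verified", "hidden", "listing_tier", "funeral_home_id"}
# Placeholder allow-list; the real one is the mapping in n8n/PROCESS.md.
ALLOWED_COLUMNS = {"website", "enrichment_status"}


def update_brand(conn: sqlite3.Connection, brand_id: int, **fields) -> int:
    """Update pipeline-owned columns on one unverified funeral_brand row.

    Returns the number of rows changed (0 if the row is verified or missing).
    """
    blocked = PROTECTED_COLUMNS & fields.keys()
    if blocked:
        raise ValueError(f"pipeline must not write: {sorted(blocked)}")
    unknown = fields.keys() - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"not in the allow-list: {sorted(unknown)}")
    if not fields:
        return 0
    # Column names were checked against the allow-list above, so only the
    # values need to be bound as parameters.
    assignments = ", ".join(f"{col} = ?" for col in fields)
    cur = conn.execute(
        # "IS NOT 1" also matches NULL, so only signed-up partners are skipped.
        f"UPDATE funeral_brand SET {assignments} WHERE id = ? AND verified IS NOT 1",
        (*fields.values(), brand_id),
    )
    return cur.rowcount
```

With this shape, `update_brand(conn, 42, enrichment_status="complete")` changes an unverified row, returns 0 for a verified one, and raises `ValueError` if a caller passes `hidden=0`.
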
## API keys and costs

You will need your own keys; none are included.

| Service | Required? | Cost | Where to get a key |
|---|---|---|---|
| Serper.dev | Yes, for website discovery | 2,500 free queries/month, then $1 per 1,000 | https://serper.dev |
| ABR (ABN lookup) | Optional | Free (register for a GUID) | https://abr.business.gov.au/Tools/WebServices |
| Anthropic (Claude Haiku) | Only for AI pricing extraction | ~$2 for a full 1,463-provider pass | https://console.anthropic.com |

Without an Anthropic key, Phase A of Workflow 3 still runs: pricing page text
gets saved into `source_record` for manual inspection. You just don't get the
structured `package` / `package_inclusion` output.

Copy the example config and add your keys:

```bash
cp crawlers/config.example.json crawlers/config.json
# edit crawlers/config.json
```

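The Serper pricing in the table translates into a simple estimate. The sketch below assumes the free allowance is used first and that billing is pro-rated per query; both are simplifications, and `serper_cost_usd` is an illustrative name, not a function in the repo. How many queries the pipeline issues per provider is also an assumption here.

```python
FREE_QUERIES_PER_MONTH = 2_500  # from the cost table
USD_PER_1000_QUERIES = 1.00     # from the cost table


def serper_cost_usd(queries: int) -> float:
    """Estimated Serper spend for one month's queries, pro-rated per query."""
    billable = max(0, queries - FREE_QUERIES_PER_MONTH)
    return billable * USD_PER_1000_QUERIES / 1000


print(serper_cost_usd(1_463))  # 0.0: one query per provider fits in the free tier
print(serper_cost_usd(4_389))  # three queries per provider exceeds it
```

At one query per provider, a full 1,463-provider pass stays inside the free allowance, so the Anthropic extraction pass is the main cost.
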
## What's in play

**Working and tested:**
- Three source crawlers (VIC Register, Funerals Australia, NFDA)
- Cross-source dedup (fuzzy name + postcode + ABN)
- Website discovery (Serper + DDG fallback + URL guessing)
- ABN validation
- Pricing page discovery and text extraction
- Four n8n workflow JSONs

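For orientation, here is a minimal sketch of two ideas from the list above. The ABN check is the standard published checksum (subtract 1 from the first digit, apply the weights, test divisibility by 89). The matching rule is an assumption for illustration only: the repo's dedup may weigh name, postcode and ABN differently, and `normalise_name`, `same_provider` and the 0.85 threshold are hypothetical.

```python
import re
from difflib import SequenceMatcher

ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def is_valid_abn(abn: str) -> bool:
    """Checksum-only validation; it does not confirm the ABN is registered."""
    compact = re.sub(r"[\s-]", "", abn)
    if not re.fullmatch(r"[0-9]{11}", compact):
        return False
    digits = [int(ch) for ch in compact]
    digits[0] -= 1
    return sum(d * w for d, w in zip(digits, ABN_WEIGHTS)) % 89 == 0


def normalise_name(name: str) -> str:
    """Lower-case and drop punctuation and common company suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in {"pty", "ltd", "limited"})


def same_provider(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Duplicate if valid ABNs match; otherwise same postcode and similar name."""
    abn_a, abn_b = a.get("abn"), b.get("abn")
    if abn_a and abn_b and is_valid_abn(abn_a) and is_valid_abn(abn_b):
        return re.sub(r"\D", "", abn_a) == re.sub(r"\D", "", abn_b)
    if a.get("postcode") != b.get("postcode"):
        return False
    ratio = SequenceMatcher(
        None, normalise_name(a["name"]), normalise_name(b["name"])
    ).ratio()
    return ratio >= threshold
```

For example, `is_valid_abn("51 824 753 556")` returns `True`, and two records named "Smith & Sons Funerals Pty Ltd" and "Smith and Sons Funerals" in the same postcode are treated as one provider. Letting a pair of valid ABNs override the name comparison is a design choice of this sketch, not a documented rule.
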
**Open work (any of these is a reasonable pickup):**
- Google Places integration for location rating / `place_id` / richer address
- Playwright-based enrichment for JS-rendered sites (enrichment currently
  fails on the ~37% of sites whose server-rendered HTML contains nothing useful)
- Admin review UI (needs design input, not just code)
- Stale-package cleanup in the monthly refresh workflow (new packages get
  inserted alongside old ones; no tombstoning yet)
- PDF pricing extraction for InvoCare-style sites (PDF links are captured in
  `source_record.raw_data.pdf_links` but not yet parsed)

None of these are urgent. Ask before starting on the admin UI: it's coupled
to the main platform and needs alignment.

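As a starting point for the PDF pricing item in the open-work list, the captured links can be listed before any parsing is attempted. This sketch assumes `source_record.raw_data` is a JSON text column with a top-level `pdf_links` array, as the path in that item suggests; the `id` column and the helper name `pending_pdf_links` are assumptions.

```python
import json
import sqlite3


def pending_pdf_links(conn: sqlite3.Connection) -> list:
    """Return (source_record id, pdf url) pairs found in raw_data.pdf_links."""
    pairs = []
    query = "SELECT id, raw_data FROM source_record ORDER BY id"
    for record_id, raw in conn.execute(query):
        try:
            data = json.loads(raw) if raw else {}
        except json.JSONDecodeError:
            continue  # skip rows whose raw_data is not valid JSON
        if not isinstance(data, dict):
            continue
        for url in data.get("pdf_links") or []:
            pairs.append((record_id, url))
    return pairs
```

The result is a plain worklist; fetching and parsing the PDFs is the part that is still open.
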
## How to get unblocked

- **Technical questions:** ping Richie (contact below). Same-day turnaround on
  weekdays AEST.
- **Access to the main platform / staging env:** not set up yet, but this repo
  is self-contained and doesn't need it.
- **Access to production data beyond `providers.db`:** ask. We can export
  anonymised slices if you need them.

## Workflow conventions

- Branch naming: `feature/<short-name>` or `fix/<short-name>`.
- PRs welcome on Gitea; no required reviewers but please ping Richie before
  merging anything that changes schema or touches n8n workflow JSON.
- Keep commits focused. If you need to restructure, do it in its own commit
  before the functional change.
- No force-pushes to `main`.

## Contact

Richie: richie@tensordesign.com.au

Gitea: https://git.tensordesign.com.au/richie/Provider-Crawl