Onboarding
Welcome — this doc is the human-facing context. See README.md for the repo tour and n8n/PROCESS.md for the authoritative end-to-end flow.
What this project is for
Funeral Arranger is a consumer platform that helps Australian families compare funeral directors and arrange services online. To be useful on day one, it needs broad coverage of providers — not just the ones who've signed up.
This pipeline solves that. It populates the platform with every funeral
director we can find in Australia, pulling from public registers and the
providers' own websites. All auto-discovered providers stay hidden and
unverified until an admin manually reviews them; signed-up partners
(verified = 1) are authoritative and the pipeline never touches them.
The goal is a long tail of "listed" and "estimated" providers that motivates real sign-ups — a provider who sees their competitor with full pricing on our site has an incentive to claim and upgrade their own listing.
Your first hour
- Read n8n/PROCESS.md end to end. It's the single most useful doc in the repo.
- Poke at the included dev database:
    sqlite3 database/providers.db
    .tables
    SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier;
    SELECT title, website, listing_tier FROM funeral_brand WHERE listing_tier='priced' LIMIT 10;
    SELECT p.title, p.funeral_type, COUNT(pi.id) AS line_items
    FROM package p LEFT JOIN package_inclusion pi ON pi.package_id = p.id
    GROUP BY p.id ORDER BY line_items DESC LIMIT 10;
- Skim crawlers/PIPELINE.md for module-level detail.
- Open one of the n8n workflow JSONs (e.g. n8n/workflows/3_daily_enrichment.json) to see the graph shape. You don't need n8n running to read them.
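If you would rather explore from Python than the sqlite3 shell, the tier count query above can be wrapped as in the sketch below. The helper names `open_readonly` and `tier_counts` are made up for illustration and are not part of the repo; only the table and column names come from the queries above.

```python
import sqlite3


def open_readonly(path):
    """Open a SQLite file read-only so exploration cannot modify it."""
    # The file: URI with mode=ro is standard SQLite URI syntax.
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def tier_counts(conn):
    """Return {listing_tier: number of brands}, same query as the shell example."""
    rows = conn.execute(
        "SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier"
    ).fetchall()
    return {tier: count for tier, count in rows}
```

Typical use would be `tier_counts(open_readonly("database/providers.db"))`, run from the repo root.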
Ground rules
- Never write to funeral_brand.verified or funeral_brand.hidden. Those flip states belong to the admin review flow, which is a separate frontend project. If something in the pipeline needs to signal "this is ready to review", use enrichment_status = 'complete' and a good listing_tier, not these columns.
- Don't use Gathered Here as a source of truth. It's a competitor. crawlers/crawl_gathered_here.py is historical tooling kept for reference; it's not part of the active workflows. Enrichment must come from the provider's own website or regulatory disclosure PDFs.
- listing_tier is derived, not authored. compute_tiers.py recomputes it after every enrichment pass from package/inclusion data. Don't set it manually.
- Pipeline only owns a subset of columns. The mapping is in n8n/PROCESS.md under "Columns the pipeline writes vs leaves alone". Branding, funeral_home_id, and admin flags are all out-of-scope.
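One way to make the column rules hard to break by accident is a small guard in front of every funeral_brand update. This is only a sketch: `build_brand_update`, the `PROTECTED_COLUMNS` set, and the `id` primary-key column are assumptions for illustration, and the authoritative column list is the one in n8n/PROCESS.md. listing_tier is left out of the set because compute_tiers.py legitimately writes it.

```python
# Columns named above as off-limits to the pipeline (illustrative subset;
# n8n/PROCESS.md has the authoritative mapping).
PROTECTED_COLUMNS = frozenset({"verified", "hidden", "funeral_home_id"})


def build_brand_update(brand_id, changes):
    """Build a parameterised UPDATE for funeral_brand, refusing protected columns.

    `changes` maps column name -> new value. Column names must come from
    trusted code, never from user input, because they are placed in the SQL text.
    """
    blocked = PROTECTED_COLUMNS.intersection(changes)
    if blocked:
        raise ValueError(f"pipeline must not write: {sorted(blocked)}")
    if not changes:
        raise ValueError("no columns to update")
    assignments = ", ".join(f"{col} = ?" for col in changes)
    # Assumes funeral_brand has an integer primary key called id.
    sql = f"UPDATE funeral_brand SET {assignments} WHERE id = ?"
    return sql, [*changes.values(), brand_id]
```

Calling `build_brand_update(7, {"enrichment_status": "complete"})` returns the SQL and parameters to pass to `cursor.execute`, while `build_brand_update(7, {"verified": 1})` raises `ValueError`.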
API keys and costs
You will need your own keys — none are included.
| Service | Required? | Cost | Where |
|---|---|---|---|
| Serper.dev | Yes, for website discovery | 2,500 free/mo, then $1/1K | https://serper.dev |
| ABR (ABN lookup) | Optional | Free (register for a GUID) | https://abr.business.gov.au/Tools/WebServices |
| Anthropic (Claude Haiku) | Only for AI pricing extraction | ~$2 for a full 1,463-provider pass | https://console.anthropic.com |
Without Anthropic, Phase A of Workflow 3 still runs — pricing page text gets
saved into source_record for manual inspection. You just don't get the
structured package / package_inclusion output.
cp crawlers/config.example.json crawlers/config.json
# edit crawlers/config.json
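For budgeting, the Serper pricing in the table above can be turned into a rough estimate. The sketch assumes billing is pro rata per query beyond the free quota; check Serper's current pricing, since it may be sold in blocks. The function name is made up for illustration.

```python
def serper_monthly_cost(queries, free_quota=2500, usd_per_1k=1.0):
    """Estimate monthly Serper spend in USD for a given number of searches.

    Defaults mirror the table: 2,500 free searches per month, then $1 per 1,000.
    Assumes pro-rata billing for the overage.
    """
    billable = max(0, queries - free_quota)
    return billable / 1000 * usd_per_1k
```

At one search per provider, a full 1,463-provider pass stays inside the free quota (`serper_monthly_cost(1463)` is `0.0`), while 10,000 searches in a month would come to about $7.50. The real pipeline may issue more than one search per provider, so treat this as a lower bound.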
What's in play
Working and tested:
- Three source crawlers (VIC Register, Funerals Australia, NFDA)
- Cross-source dedup (fuzzy name + postcode + ABN)
- Website discovery (Serper + DDG fallback + URL guessing)
- ABN validation
- Pricing page discovery and text extraction
- Four n8n workflow JSONs
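To give a feel for what ABN validation and fuzzy dedup involve, here is a minimal sketch. The ABN check is the standard published checksum (subtract 1 from the first digit, apply the weights, test divisibility by 89). Everything else is an assumption: the record shape (`name`, `postcode`, `abn` keys), the 0.85 threshold, and the matching order are illustrative. See crawlers/PIPELINE.md for what the repo's modules actually do.

```python
from difflib import SequenceMatcher

ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def is_valid_abn(abn):
    """Check an ABN against the standard modulus-89 checksum."""
    compact = abn.replace(" ", "")
    if len(compact) != 11 or not (compact.isascii() and compact.isdigit()):
        return False
    digits = [int(c) for c in compact]
    digits[0] -= 1  # the algorithm subtracts 1 from the leading digit
    return sum(d * w for d, w in zip(digits, ABN_WEIGHTS)) % 89 == 0


def normalise(name):
    """Lower-case, unify '&' with 'and', and collapse whitespace."""
    return " ".join(name.lower().replace("&", "and").split())


def same_provider(a, b, threshold=0.85):
    """Guess whether two source records describe the same provider.

    Hypothetical rule: two valid ABNs decide the match outright; otherwise
    require an identical postcode and a similar normalised name.
    """
    abn_a, abn_b = a.get("abn"), b.get("abn")
    if abn_a and abn_b and is_valid_abn(abn_a) and is_valid_abn(abn_b):
        return abn_a.replace(" ", "") == abn_b.replace(" ", "")
    if not a.get("postcode") or a.get("postcode") != b.get("postcode"):
        return False
    ratio = SequenceMatcher(None, normalise(a["name"]), normalise(b["name"])).ratio()
    return ratio >= threshold
```

A real matcher would also need to handle chains that share one ABN across several branches, which this sketch would merge into a single provider.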
Open work (any of these is a reasonable pickup):
- Google Places integration for location rating / place_id / richer address
- Playwright-based enrichment for JS-rendered sites (currently fails on ~37% of sites that SSR nothing useful)
- Admin review UI — needs design input, not just code
- Stale-package cleanup in the monthly refresh workflow (new packages get inserted alongside old; no tombstoning yet)
- PDF pricing extraction for InvoCare-style sites (PDF links are captured in source_record.raw_data.pdf_links but not yet parsed)
None of these are urgent. Ask before starting on the admin UI — it's coupled to the main platform and needs alignment.
How to get unblocked
- Technical questions: ping Richie (contact below). Same-day turnaround on weekdays AEST.
- Access to the main platform / staging env: not set up yet — this repo is self-contained and doesn't need it.
- Access to production data beyond providers.db: ask. We can export anonymised slices if you need them.
Workflow conventions
- Branch naming: feature/<short-name> or fix/<short-name>.
- PRs welcome on Gitea; no required reviewers but please ping Richie before merging anything that changes schema or touches n8n workflow JSON.
- Keep commits focused. If you need to restructure, do it in its own commit before the functional change.
- No force-pushes to main.
Contact
Richie — richie@tensordesign.com.au
Gitea: https://git.tensordesign.com.au/richie/Provider-Crawl