Provider-Crawl/ONBOARDING.md
2026-04-24 10:29:40 +10:00

Onboarding

Welcome — this doc is the human-facing context. See README.md for the repo tour and n8n/PROCESS.md for the authoritative end-to-end flow.

What this project is for

Funeral Arranger is a consumer platform that helps Australian families compare funeral directors and arrange services online. To be useful on day one, it needs broad coverage of providers — not just the ones who've signed up.

This pipeline solves that. It populates the platform with every funeral director we can find in Australia, pulling from public registers and the providers' own websites. All auto-discovered providers stay hidden and unverified until an admin manually reviews them; signed-up partners (verified = 1) are authoritative and the pipeline never touches them.

The goal is a long tail of "listed" and "estimated" providers that motivates real sign-ups — a provider who sees their competitor with full pricing on our site has an incentive to claim and upgrade their own listing.

Your first hour

  1. Read n8n/PROCESS.md end to end. It's the single most useful doc in the repo.
  2. Poke at the included dev database:
    sqlite3 database/providers.db
    .tables
    SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier;
    SELECT title, website, listing_tier FROM funeral_brand WHERE listing_tier='priced' LIMIT 10;
    SELECT p.title, p.funeral_type, COUNT(pi.id) AS line_items
      FROM package p LEFT JOIN package_inclusion pi ON pi.package_id = p.id
      GROUP BY p.id ORDER BY line_items DESC LIMIT 10;
    
  3. Skim crawlers/PIPELINE.md for module-level detail.
  4. Open one of the n8n workflow JSONs (e.g. n8n/workflows/3_daily_enrichment.json) to see the graph shape. You don't need n8n running to read them.
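If you would rather explore from Python than the sqlite3 shell, the first query in step 2 can be sketched as below. This is a minimal sketch, not repo code: the helper names open_read_only and tier_counts are made up for illustration, and only the funeral_brand table and listing_tier column come from the queries above. Opening the file with mode=ro means an exploratory session cannot modify the dev database.

```python
import sqlite3


def open_read_only(path="database/providers.db"):
    """Open the SQLite file read-only so exploratory queries cannot write."""
    # The file: URI form with mode=ro is standard sqlite3 behaviour.
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def tier_counts(conn):
    """Return {listing_tier: row count}, mirroring the GROUP BY query in step 2."""
    rows = conn.execute(
        "SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier"
    )
    return dict(rows.fetchall())


# Usage, from the repo root (hypothetical session):
#   conn = open_read_only()
#   print(tier_counts(conn))
#   conn.close()
```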

Ground rules

  • Never write to funeral_brand.verified or funeral_brand.hidden. Those state changes belong to the admin review flow, which is a separate frontend project. If something in the pipeline needs to signal "this is ready to review", rely on enrichment_status = 'complete' together with the computed listing_tier, not these columns.
  • Don't use Gathered Here as a source of truth. It's a competitor. crawlers/crawl_gathered_here.py is historical tooling kept for reference — it's not part of the active workflows. Enrichment must come from the provider's own website or regulatory disclosure PDFs.
  • listing_tier is derived, not authored. compute_tiers.py recomputes it after every enrichment pass from package/inclusion data. Don't set it manually.
  • The pipeline owns only a subset of columns. The mapping is in n8n/PROCESS.md under "Columns the pipeline writes vs leaves alone". Branding, funeral_home_id, and admin flags are all out of scope.
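One way to make the first, third, and fourth rules hard to break by accident is to route pipeline writes through a single guarded helper. The sketch below is an assumption about how that could look, not how the repo does it: update_brand and PROTECTED_COLUMNS are hypothetical names, funeral_brand.id is assumed to be the primary key, and the authoritative column list lives in n8n/PROCESS.md.

```python
import sqlite3

# Columns named in the ground rules as off-limits to enrichment code.
# listing_tier is included because only compute_tiers.py should derive it.
PROTECTED_COLUMNS = {"verified", "hidden", "listing_tier", "funeral_home_id"}


def update_brand(conn, brand_id, fields):
    """Update pipeline-owned columns on one unverified funeral_brand row.

    Raises ValueError if a protected column is requested. Returns the number
    of rows changed, which is 0 for verified (signed-up) partners.
    """
    blocked = PROTECTED_COLUMNS.intersection(fields)
    if blocked:
        raise ValueError(f"pipeline must not write: {sorted(blocked)}")
    if not fields:
        return 0
    for column in fields:
        # Column names are interpolated into SQL, so insist on plain identifiers.
        if not column.isidentifier():
            raise ValueError(f"bad column name: {column!r}")
    assignments = ", ".join(f"{column} = ?" for column in fields)
    cursor = conn.execute(
        # "id" as the key column is an assumption about the schema.
        f"UPDATE funeral_brand SET {assignments} WHERE id = ? AND verified = 0",
        (*fields.values(), brand_id),
    )
    return cursor.rowcount
```

The verified = 0 filter in the WHERE clause means a signed-up partner row is left untouched even if its id is passed in by mistake.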

API keys and costs

You will need your own keys — none are included.

Service                  | Required?                      | Cost                                | Where
Serper.dev               | Yes, for website discovery     | 2,500 free/mo, then $1/1K           | https://serper.dev
ABR (ABN lookup)         | Optional, free                 | Free (register for a GUID)          | https://abr.business.gov.au/Tools/WebServices
Anthropic (Claude Haiku) | Only for AI pricing extraction | ~$2 for a full 1,463-provider pass  | https://console.anthropic.com
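For rough budgeting, the figures above can be turned into a small estimator. This is a back-of-envelope sketch under stated assumptions: Serper usage is billed pro rata per query past the free allowance (real billing may be in blocks), Anthropic cost scales linearly with the number of providers, and the function names are hypothetical.

```python
SERPER_FREE_QUERIES_PER_MONTH = 2_500
SERPER_USD_PER_1K = 1.00
HAIKU_USD_PER_FULL_PASS = 2.00      # "~$2" in the table, so approximate
PROVIDERS_PER_FULL_PASS = 1_463


def serper_cost_usd(queries_this_month):
    """Estimated Serper spend: free allowance first, then $1 per 1,000 queries."""
    billable = max(0, queries_this_month - SERPER_FREE_QUERIES_PER_MONTH)
    return billable / 1_000 * SERPER_USD_PER_1K


def haiku_cost_usd(providers):
    """Estimated Anthropic spend, assuming cost is linear in provider count."""
    return providers / PROVIDERS_PER_FULL_PASS * HAIKU_USD_PER_FULL_PASS
```

If website discovery issues about one Serper query per provider (an assumption), a single full pass of 1,463 providers stays inside the free allowance.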

Without Anthropic, Phase A of Workflow 3 still runs — pricing page text gets saved into source_record for manual inspection. You just don't get the structured package / package_inclusion output.

Put your keys in crawlers/config.json, starting from the example file:

    cp crawlers/config.example.json crawlers/config.json
    # edit crawlers/config.json
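The optional-key behaviour described above can be sketched as a feature check on the loaded config. The key names serper_api_key, abr_guid, and anthropic_api_key are assumptions for illustration; check crawlers/config.example.json for the real ones. The function names are hypothetical too.

```python
import json
from pathlib import Path


def load_config(path="crawlers/config.json"):
    """Read the JSON config copied from crawlers/config.example.json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def enabled_features(config):
    """Map each optional service to whether a non-empty key is configured.

    The key names below are hypothetical placeholders, not the repo's schema.
    """
    return {
        "serper_website_discovery": bool(config.get("serper_api_key")),
        "abr_abn_lookup": bool(config.get("abr_guid")),
        # Without this key, only the text-saving Phase A would run.
        "ai_pricing_extraction": bool(config.get("anthropic_api_key")),
    }
```

A caller could log this dictionary at startup so it is obvious which stages will be skipped in a given run.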

What's in play

Working and tested:

  • Three source crawlers (VIC Register, Funerals Australia, NFDA)
  • Cross-source dedup (fuzzy name + postcode + ABN)
  • Website discovery (Serper + DDG fallback + URL guessing)
  • ABN validation
  • Pricing page discovery and text extraction
  • Four n8n workflow JSONs
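To give a feel for the ABN validation and cross-source dedup items, here is a minimal sketch. The checksum is the standard published ABN algorithm (subtract 1 from the first digit, apply the weights, test divisibility by 89). Whether the repo validates this way or via the ABR web service is not stated here. The matching rule, the 0.85 threshold, the record keys, and the function names are all assumptions; see crawlers/PIPELINE.md for the real logic.

```python
import re
from difflib import SequenceMatcher

ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def clean_abn(abn):
    """Strip whitespace from an ABN string; None becomes an empty string."""
    return re.sub(r"\s+", "", abn or "")


def abn_is_valid(abn):
    """Standard ABN checksum: 11 digits, weighted sum divisible by 89."""
    digits_text = clean_abn(abn)
    if not re.fullmatch(r"[0-9]{11}", digits_text):
        return False
    digits = [int(ch) for ch in digits_text]
    digits[0] -= 1  # the algorithm subtracts 1 from the leading digit
    return sum(w * d for w, d in zip(ABN_WEIGHTS, digits)) % 89 == 0


def normalise_name(name):
    """Lowercase, expand '&', drop punctuation and company suffixes."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9 ]+", " ", name)
    name = re.sub(r"\b(pty|ltd)\b", " ", name)
    return " ".join(name.split())


def same_provider(a, b, name_threshold=0.85):
    """Hypothetical match rule: two different valid ABNs veto a match;
    otherwise require an equal postcode and a fuzzy name match."""
    abn_a, abn_b = clean_abn(a.get("abn")), clean_abn(b.get("abn"))
    if abn_is_valid(abn_a) and abn_is_valid(abn_b) and abn_a != abn_b:
        return False
    if a.get("postcode") != b.get("postcode"):
        return False
    ratio = SequenceMatcher(
        None, normalise_name(a["title"]), normalise_name(b["title"])
    ).ratio()
    return ratio >= name_threshold
```

Using the ABN only as a veto is a deliberate choice in this sketch: several brands can trade under one ABN, so a shared ABN alone is not treated as proof of a duplicate.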

Open work (any of these is a reasonable pickup):

  • Google Places integration for location rating / place_id / richer address
  • Playwright-based enrichment for JS-rendered sites (enrichment currently fails on ~37% of sites, namely those whose server-rendered HTML contains nothing useful)
  • Admin review UI — needs design input, not just code
  • Stale-package cleanup in the monthly refresh workflow (new packages get inserted alongside old; no tombstoning yet)
  • PDF pricing extraction for InvoCare-style sites (PDF links are captured in source_record.raw_data.pdf_links but not yet parsed)

None of these are urgent. Ask before starting on the admin UI — it's coupled to the main platform and needs alignment.
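As a possible starting point for the PDF pricing item, the captured links can be listed before any parsing is attempted. This sketch assumes source_record has an id column and that raw_data is stored as JSON text with a pdf_links array, which is how the path source_record.raw_data.pdf_links reads; the actual storage format and the function name are not confirmed by this doc.

```python
import json
import sqlite3


def captured_pdf_links(conn):
    """Return (source_record id, url) pairs for every captured PDF link."""
    links = []
    query = (
        "SELECT id, raw_data FROM source_record "
        "WHERE raw_data IS NOT NULL ORDER BY id"
    )
    for record_id, raw in conn.execute(query):
        try:
            data = json.loads(raw)
        except (TypeError, ValueError):
            continue  # skip rows whose raw_data is not valid JSON
        if not isinstance(data, dict):
            continue
        for url in data.get("pdf_links") or []:
            links.append((record_id, url))
    return links
```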

How to get unblocked

  • Technical questions: ping Richie (contact below). Same-day turnaround on weekdays AEST.
  • Access to the main platform / staging env: not set up yet — this repo is self-contained and doesn't need it.
  • Access to production data beyond providers.db: ask. We can export anonymised slices if you need them.

Workflow conventions

  • Branch naming: feature/<short-name> or fix/<short-name>.
  • PRs are welcome on Gitea. There are no required reviewers, but please ping Richie before merging anything that changes the schema or touches the n8n workflow JSON.
  • Keep commits focused. If you need to restructure, do it in its own commit before the functional change.
  • No force-pushes to main.

Contact

Richie — richie@tensordesign.com.au

Gitea: https://git.tensordesign.com.au/richie/Provider-Crawl