# Onboarding

Welcome. This doc is the human-facing context. See [README.md](README.md) for the repo tour and [n8n/PROCESS.md](n8n/PROCESS.md) for the authoritative end-to-end flow.

## What this project is for

Funeral Arranger is a consumer platform that helps Australian families compare funeral directors and arrange services online. To be useful on day one, it needs broad coverage of providers, not just the ones who've signed up.

This pipeline solves that. It populates the platform with every funeral director we can find in Australia, pulling from public registers and the providers' own websites. All auto-discovered providers stay **hidden and unverified** until an admin manually reviews them; signed-up partners (`verified = 1`) are authoritative and the pipeline never touches them.

The goal is a long tail of "listed" and "estimated" providers that motivates real sign-ups: a provider who sees their competitor with full pricing on our site has an incentive to claim and upgrade their own listing.

## Your first hour

1. Read [n8n/PROCESS.md](n8n/PROCESS.md) end to end. It's the single most useful doc in the repo.
2. Poke at the included dev database:

   ```bash
   sqlite3 database/providers.db
   .tables
   SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier;
   SELECT title, website, listing_tier FROM funeral_brand WHERE listing_tier='priced' LIMIT 10;
   SELECT p.title, p.funeral_type, COUNT(pi.id) AS line_items
     FROM package p
     LEFT JOIN package_inclusion pi ON pi.package_id = p.id
     GROUP BY p.id
     ORDER BY line_items DESC
     LIMIT 10;
   ```

3. Skim `crawlers/PIPELINE.md` for module-level detail.
4. Open one of the n8n workflow JSONs (e.g. `n8n/workflows/3_daily_enrichment.json`) to see the graph shape. You don't need n8n running to read them.

## Ground rules

- **Never write to `funeral_brand.verified` or `funeral_brand.hidden`.** Those flip states belong to the admin review flow, which is a separate frontend project.
  If something in the pipeline needs to signal "this is ready to review", use `enrichment_status = 'complete'` and a good `listing_tier`, not these columns.
- **Don't use Gathered Here as a source of truth.** It's a competitor. `crawlers/crawl_gathered_here.py` is historical tooling kept for reference; it's not part of the active workflows. Enrichment must come from the provider's own website or regulatory disclosure PDFs.
- **`listing_tier` is derived, not authored.** `compute_tiers.py` recomputes it after every enrichment pass from package/inclusion data. Don't set it manually.
- **Pipeline only owns a subset of columns.** The mapping is in [n8n/PROCESS.md](n8n/PROCESS.md) under "Columns the pipeline writes vs leaves alone". Branding, `funeral_home_id`, and admin flags are all out-of-scope.

## API keys and costs

You will need your own keys; none are included.

| Service | Required? | Cost | Where |
|---|---|---|---|
| Serper.dev | Yes, for website discovery | 2,500 free/mo, then $1/1K | https://serper.dev |
| ABR (ABN lookup) | Optional, free | Free (register for a GUID) | https://abr.business.gov.au/Tools/WebServices |
| Anthropic (Claude Haiku) | Only for AI pricing extraction | ~$2 for a full 1,463-provider pass | https://console.anthropic.com |

Without Anthropic, Phase A of Workflow 3 still runs: pricing page text gets saved into `source_record` for manual inspection. You just don't get the structured `package` / `package_inclusion` output.
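
As a rough illustration of the table and the fallback behaviour described above, the sketch below reports which capabilities a given set of keys unlocks and estimates Serper spend for a month. It is not code from this repo: the config key names (`serper_api_key`, `abr_guid`, `anthropic_api_key`) and both function names are hypothetical, so check `crawlers/config.example.json` for the real field names. The cost function assumes simple linear billing at the table's rate once the free quota is used up.

```python
from typing import Mapping

# Hypothetical key names, chosen for illustration only.
# The real names live in crawlers/config.example.json.
SERPER_KEY = "serper_api_key"
ABR_KEY = "abr_guid"
ANTHROPIC_KEY = "anthropic_api_key"


def available_features(config: Mapping[str, object]) -> dict[str, bool]:
    """Report which pipeline capabilities the configured keys unlock."""

    def has(key: str) -> bool:
        # A key counts as configured only if it is a non-blank string.
        value = config.get(key, "")
        return isinstance(value, str) and bool(value.strip())

    return {
        "website_discovery_serper": has(SERPER_KEY),
        "abn_lookup": has(ABR_KEY),
        # Phase A of Workflow 3 saves pricing page text and does not
        # depend on the Anthropic key.
        "pricing_text_capture": True,
        # Structured package / package_inclusion rows need Anthropic.
        "structured_pricing_extraction": has(ANTHROPIC_KEY),
    }


def serper_monthly_cost(
    queries: int, free_quota: int = 2500, usd_per_1k: float = 1.0
) -> float:
    """Estimate Serper spend in USD, assuming linear billing past the free quota."""
    billable = max(0, queries - free_quota)
    return billable / 1000 * usd_per_1k


if __name__ == "__main__":
    # Example: Serper configured, ABR and Anthropic left blank.
    features = available_features({SERPER_KEY: "example-key", ANTHROPIC_KEY: ""})
    for name, enabled in features.items():
        print(f"{name}: {'on' if enabled else 'off'}")
    # One query per provider for a 1,463-provider pass fits in the free quota.
    print(serper_monthly_cost(1463))
```

Under these assumptions a single discovery pass over 1,463 providers stays inside the free Serper quota, provided it uses one query per provider; retries or multiple queries per provider would change that.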

```bash
cp crawlers/config.example.json crawlers/config.json
# edit crawlers/config.json
```

## What's in play

**Working and tested:**

- Three source crawlers (VIC Register, Funerals Australia, NFDA)
- Cross-source dedup (fuzzy name + postcode + ABN)
- Website discovery (Serper + DDG fallback + URL guessing)
- ABN validation
- Pricing page discovery and text extraction
- Four n8n workflow JSONs

**Open work (any of these is a reasonable pickup):**

- Google Places integration for location rating / `place_id` / richer address
- Playwright-based enrichment for JS-rendered sites (currently fails on ~37% of sites that SSR nothing useful)
- Admin review UI: needs design input, not just code
- Stale-package cleanup in the monthly refresh workflow (new packages get inserted alongside old; no tombstoning yet)
- PDF pricing extraction for InvoCare-style sites (PDF links are captured in `source_record.raw_data.pdf_links` but not yet parsed)

None of these are urgent. Ask before starting on the admin UI; it's coupled to the main platform and needs alignment.

## How to get unblocked

- **Technical questions:** ping Richie (contact below). Same-day turnaround on weekdays AEST.
- **Access to the main platform / staging env:** not set up yet; this repo is self-contained and doesn't need it.
- **Access to production data beyond `providers.db`:** ask. We can export anonymised slices if you need them.

## Workflow conventions

- Branch naming: `feature/` or `fix/`.
- PRs welcome on Gitea; no required reviewers but please ping Richie before merging anything that changes schema or touches n8n workflow JSON.
- Keep commits focused. If you need to restructure, do it in its own commit before the functional change.
- No force-pushes to `main`.

## Contact

Richie: richie@tensordesign.com.au

Gitea: https://git.tensordesign.com.au/richie/Provider-Crawl