Initial commit: funeral provider discovery pipeline

- Python crawlers for VIC Register, Funerals Australia, NFDA
- n8n workflows for scheduled discovery and enrichment
- SQLite schema and seeded dev database (1,463 providers)
- End-to-end process documentation in n8n/PROCESS.md
2026-04-24 10:27:08 +10:00
commit cc91427789
30 changed files with 4706 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,98 @@
+# Provider Crawl
+
+Automated pipeline for discovering Australian funeral directors from public
+registries, finding their websites, and extracting pricing data. Feeds the
+Funeral Arranger platform with a seed of unverified providers that an admin
+reviews before publishing.
+
+## What's in this repo
+
+```
+crawlers/    Python modules — one per data source + shared pipeline utilities
+n8n/         n8n workflows that orchestrate the crawlers on a schedule
+database/    SQLite + Postgres schemas, a seed dump, and a pre-populated dev DB
+```
+
+Three documents explain how it works, in increasing depth:
+
+1. **[n8n/PROCESS.md](n8n/PROCESS.md)** — Plain-English end-to-end walkthrough
+   of the four workflows and how their output maps to database tables.
+   **Start here.**
+2. **[crawlers/PIPELINE.md](crawlers/PIPELINE.md)** — Architecture of the
+   Python modules, source-by-source notes, listing-tier logic.
+3. **[n8n/README.md](n8n/README.md)** — How to stand up n8n locally with
+   Docker and import the workflow JSONs.
+
+## Quick start (local, no n8n)
+
+```bash
+# 1. Install Python deps (requests, beautifulsoup4, rapidfuzz, pdfplumber)
+python3 -m venv .venv && source .venv/bin/activate
+pip install requests beautifulsoup4 rapidfuzz pdfplumber
+
+# 2. Configure API keys
+cp crawlers/config.example.json crawlers/config.json
+# Edit crawlers/config.json and add your Serper key (free 2,500/month from serper.dev)
+# Optionally add an Anthropic API key for AI pricing extraction
+
+# 3. Inspect the included dev database (1,463 providers, 121 with pricing)
+sqlite3 database/providers.db ".tables"
+sqlite3 database/providers.db "SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier"
+
+# 4. Or start fresh and run the pipeline end-to-end
+rm database/providers.db
+sqlite3 database/providers.db < database/schema_sqlite.sql
+cd crawlers
+./run_overnight.sh       # equivalent to workflows 1–3 back to back
+```
+
+## API keys needed
+
+| Service | Purpose | Cost | Where |
+|---|---|---|---|
+| Serper.dev | Website discovery via Google search | 2,500 free/mo | https://serper.dev |
+| ABR (Australian Business Register) | ABN validation | Free (GUID required) | https://abr.business.gov.au/Tools/WebServices |
+| Anthropic | AI pricing extraction (Claude Haiku) | ~$2 / full run | https://console.anthropic.com |
+
+Only the Serper key is required to run the pipeline end-to-end. The Anthropic
+key is needed only for the AI extraction step in Workflow 3; if that step is
+skipped, the Python crawl still writes pricing text to `source_record` for
+manual review.
+
+## Database
+
+The included `database/providers.db` is a SQLite snapshot taken after the last
+overnight run: 1,463 unique providers, 994 with websites, 121 with pricing.
+
+Schema files:
+
+- `database/schema_sqlite.sql` — SQLite schema, used for local dev
+- `database/schema.sql` — Postgres schema (production target)
+- `database/seed_verified.sql` — seed data for partner (`verified = 1`) providers
+- `database/PROVIDER-SCHEMA-SPEC.md` — schema commentary and design notes
+- `database/IMAGE-MAPPING.md` — asset conventions for verified providers
+
+Per `n8n/PROCESS.md`, the pipeline only writes to a defined subset of columns —
+`verified`, `hidden`, branding, and `funeral_home_id` are admin-only.
+
+## Status
+
+What's built and working:
+
+- All three source crawlers (VIC Register, Funerals Australia, NFDA)
+- Cross-source deduplication with fuzzy name/postcode/ABN matching
+- Website discovery via Serper + DDG + URL guessing
+- ABN lookup via ABR
+- Website enrichment: meta descriptions, pricing page discovery, PDF detection
+- Listing tier computation
+- Four n8n workflow JSONs ready to import
+
+What's not yet built (open to pick up):
+
+- Google Places API integration for richer location data (ratings, place_id)
+- Playwright-based enrichment for JS-rendered sites (~37% of sites)
+- Admin review UI for approving hidden providers before they go live
+- Stale package cleanup in the monthly refresh
+
+## Contact
+
+Richie — richie@tensordesign.com.au
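
The README's Status section lists cross-source deduplication with fuzzy name/postcode/ABN matching. A minimal sketch of what such a match rule might look like follows. It assumes records are plain dicts with hypothetical `name`, `postcode`, and `abn` keys, and it uses the standard library's `difflib.SequenceMatcher` as a stand-in for `rapidfuzz`. The stopword list, the 0.85 threshold, and the function names are illustrative assumptions, not taken from the crawlers in this commit.

```python
import re
from difflib import SequenceMatcher

# Words ignored when comparing names (assumed list, for illustration only).
STOPWORDS = {"pty", "ltd", "the", "and"}


def normalise_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop stopwords."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(w for w in cleaned.split() if w not in STOPWORDS)


def is_same_provider(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Decide whether two source records describe the same provider.

    Hypothetical rule: an exact ABN match is conclusive; otherwise the
    postcodes must match and the normalised names must be similar.
    """
    if a.get("abn") and a.get("abn") == b.get("abn"):
        return True
    if not a.get("postcode") or a.get("postcode") != b.get("postcode"):
        return False
    ratio = SequenceMatcher(
        None, normalise_name(a["name"]), normalise_name(b["name"])
    ).ratio()
    return ratio >= threshold
```

Treating the ABN as conclusive and the postcode as a hard gate keeps the fuzzy comparison away from same-named businesses in different suburbs. A real implementation would likely need to handle multi-location brands and missing postcodes more carefully.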