Initial commit: funeral provider discovery pipeline

- Python crawlers for VIC Register, Funerals Australia, NFDA
- n8n workflows for scheduled discovery and enrichment
- SQLite schema and seeded dev database (1,463 providers)
- End-to-end process documentation in n8n/PROCESS.md
2026-04-24 10:27:08 +10:00
commit cc91427789
30 changed files with 4706 additions and 0 deletions
--- a/crawlers/PIPELINE.md
+++ b/crawlers/PIPELINE.md
@@ -0,0 +1,215 @@
+# Provider Discovery & Enrichment Pipeline
+
+## Architecture: Multi-Step Enrichment
+
+The pipeline builds provider profiles progressively, never relying on
+competitor data. Each step adds richer detail from more authoritative sources.
+
+```
+STEP 1: DISCOVER               STEP 2: FIND WEBSITE           STEP 3: ENRICH
+─────────────────               ────────────────────           ──────────────
+
+VIC Register ─────┐                                           ┌─ Fetch homepage
+NFDA Directory ───┼─▶ Basic     Google Places API ──┐         │  Find /pricing page
+Funerals AU ──────┘   Provider  ABN Lookup ─────────┼─▶ URL ──┤  Download PDFs
+                      Record    Search engines ─────┘         │  AI extract packages
+                                                              └─▶ Structured data
+                      name      website URL                      description
+                      address   Google rating                    packages[]
+                      phone     Google reviews                   inclusions[]
+                      email     place_id                         pricing
+                      state     ABN (validated)
+```
+
+## Step 1: Discovery (DONE — all modules built and tested)
+
+Sources:
+- VIC Consumer Affairs Register (796 records, VIC only) → `crawl_vic_register.py`
+- Funerals Australia AJAX API (997 records, national) → `crawl_funerals_australia.py`
+- NFDA WPSL API (209 records, national) → `crawl_nfda.py`
+
+Orchestrator: `crawl_all.py`
+Deduplication: `dedup.py` (fuzzy name + postcode + ABN matching)
+
+Output: ~1,463 unique providers with basic contact info, deduplicated from
+2,002 raw records across the three sources (796 + 997 + 209).
+Stored in: `funeral_brand` and `location` tables in `database/providers.db`.
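+
+A minimal sketch of how fuzzy name + postcode + ABN matching could work. This
+is not the code in `dedup.py`; the field names (`name`, `postcode`, `abn`),
+the suffix list, and the 0.85 threshold are assumptions for illustration.
+
+```python
+import re
+from difflib import SequenceMatcher
+
+
+def normalize_name(name: str) -> str:
+    """Lowercase, drop punctuation and common company suffixes."""
+    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
+    name = re.sub(r"\b(pty|ltd|limited)\b", " ", name)
+    return " ".join(name.split())
+
+
+def is_duplicate(a: dict, b: dict, threshold: float = 0.85) -> bool:
+    """Return True if two provider records likely describe the same business."""
+    # A shared ABN is treated as a definitive match.
+    if a.get("abn") and a.get("abn") == b.get("abn"):
+        return True
+    # Otherwise require the same postcode plus a close name match.
+    if not a.get("postcode") or a.get("postcode") != b.get("postcode"):
+        return False
+    ratio = SequenceMatcher(
+        None, normalize_name(a["name"]), normalize_name(b["name"])
+    ).ratio()
+    return ratio >= threshold
+```
+
+Checking the ABN first means two records with differently spelled names still
+merge when both sources carry the same ABN.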
+
+## Step 2: Website Discovery (DONE — module built and tested)
+
+Module: `discover_websites.py`
+Test result: 50% success rate on the initial batch (DDG search + URL guessing).
+Adding the Google Places API (2e) should raise the hit rate.
+
+For each provider that lacks a website URL:
+
+### 2a. Serper.dev — Google search API (PRIMARY)
+- Input: "{business name} {suburb} {state}"
+- Returns: Google organic search results as JSON (title, link, snippet)
+- Cost: **2,500 free queries** (no credit card needed), then $1/1K
+- At one query per provider, all 1,463 providers fit within the free quota ($0)
+- Filters out directories/aggregators, validates first result
+- Module: `discover_websites.py` with `search_serper()`
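+
+The query format and directory filtering above can be sketched as follows.
+This is an illustration, not the body of `search_serper()`: it assumes the
+organic results have already been fetched as a list of dicts with a `link`
+key, and the blocklist domains are assumed spellings of the directories named
+in 2f.
+
+```python
+from urllib.parse import urlparse
+
+# Assumed blocklist; the real list used by discover_websites.py is not shown.
+DIRECTORY_DOMAINS = {"yellowpages.com.au", "truelocal.com.au"}
+
+
+def build_query(name: str, suburb: str, state: str) -> str:
+    """Build the "{business name} {suburb} {state}" search string."""
+    return f"{name} {suburb} {state}"
+
+
+def first_non_directory(results: list) -> "str | None":
+    """Return the first result link that is not a known directory site."""
+    for item in results:
+        host = (urlparse(item.get("link", "")).hostname or "").lower()
+        if host.startswith("www."):
+            host = host[4:]
+        if not host:
+            continue
+        # Skip the directory itself and any of its subdomains.
+        if any(host == d or host.endswith("." + d) for d in DIRECTORY_DOMAINS):
+            continue
+        return item["link"]
+    return None
+```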
+
+### 2b. DuckDuckGo lite (FALLBACK)
+- Free, no API key, but aggressive rate limiting
+- Used when Serper key not configured or quota exhausted
+- Module: `discover_websites.py` with `search_ddg()`
+
+### 2c. URL pattern guessing (SUPPLEMENTARY)
+- Generates candidate domains from business name (e.g. smithfunerals.com.au)
+- HTTP HEAD to check if live, then validate content
+- Module: `discover_websites.py` with `guess_urls()`
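+
+A sketch of candidate generation, assuming a small stopword list and three
+TLDs (both hypothetical; `guess_urls()` may use different rules). Each
+candidate would then be checked with an HTTP HEAD request as described above.
+
+```python
+import re
+
+STOPWORDS = {"pty", "ltd", "the", "and"}  # assumed
+TLDS = (".com.au", ".au", ".com")         # assumed
+
+
+def candidate_domains(business_name: str) -> list:
+    """Generate plausible website URLs from a business name."""
+    words = [
+        w for w in re.findall(r"[a-z0-9]+", business_name.lower())
+        if w not in STOPWORDS
+    ]
+    if not words:
+        return []
+    # Try the words joined directly and joined with hyphens.
+    stems = ["".join(words), "-".join(words)]
+    urls = []
+    for stem in stems:
+        for tld in TLDS:
+            url = f"https://www.{stem}{tld}"
+            if url not in urls:
+                urls.append(url)
+    return urls
+```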
+
+### 2d. ABN Lookup — Australian Business Register (ENRICHMENT)
+- Input: business name + state
+- Returns: ABN, entity status, registered state/postcode
+- Cost: **FREE** (government API, requires GUID registration)
+- Validates business is active, gives strongest dedup key
+- Does NOT return website URLs
+- Module: `lookup_abn.py`
+- Register for GUID: https://abr.business.gov.au/Tools/WebServices
+
+### 2e. Google Places API (OPTIONAL PREMIUM)
+- Input: "{business name}, {suburb} {state}"
+- Returns: website, rating, review count, place_id, formatted phone
+- Cost: 1,000 free/month (Enterprise tier), then ~$25/1K
+- Best data quality but most expensive
+- Not yet implemented — add when budget allows
+
+### 2f. URL validation
+- Fetch discovered URL, verify it loads
+- Check page title/content mentions the business name
+- Reject generic directories (yellowpages, truelocal, etc.)
+- Mark confidence level: confirmed / probable / unverified
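+
+One way the three confidence levels could be assigned, as a sketch only. The
+rule used here (all name tokens in the title means confirmed, at least half
+found anywhere means probable) is an assumption, not the documented behaviour
+of `discover_websites.py`.
+
+```python
+import re
+
+
+def classify_confidence(business_name: str, page_title: str, page_text: str) -> str:
+    """Return 'confirmed', 'probable', or 'unverified' for a fetched page."""
+    # Ignore very short tokens such as "of" or "&".
+    tokens = [
+        t for t in re.findall(r"[a-z0-9]+", business_name.lower()) if len(t) > 2
+    ]
+    if not tokens:
+        return "unverified"
+    title = page_title.lower()
+    anywhere = title + " " + page_text.lower()
+    if all(t in title for t in tokens):
+        return "confirmed"
+    hits = sum(t in anywhere for t in tokens)
+    if hits * 2 >= len(tokens):
+        return "probable"
+    return "unverified"
+```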
+
+## Step 3: Website Enrichment (DONE — module built and tested)
+
+Module: `enrich_websites.py`
+- Finds pricing pages via 20+ URL patterns + link following
+- Extracts description from meta tags
+- Extracts contact info (phone, email, address)
+- Stores cleaned pricing page text for AI extraction
+- Detects PDF links for PDF-based pricing extraction
+
+For each provider with a confirmed website:
+
+### 3a. Homepage crawl
+- Fetch homepage HTML
+- Extract: description/about text, contact details
+- Look for links to pricing/services pages
+
+### 3b. Pricing page discovery
+Try common URL patterns, such as:
+  /pricing, /prices, /packages, /services, /our-services,
+  /funeral-costs, /funeral-packages, /service-options,
+  /price-list, /transparency
+
+Also:
+- Parse sitemap.xml if available
+- Follow links containing "pric", "packag", "cost", "service"
+- Check for PDF links on pricing pages
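+
+The link-following rule can be sketched with the standard library alone. This
+version matches the keywords against the URL path only (matching link text as
+well is a possible extension), and the function name is hypothetical rather
+than taken from `enrich_websites.py`.
+
+```python
+from html.parser import HTMLParser
+from urllib.parse import urljoin, urlparse
+
+KEYWORDS = ("pric", "packag", "cost", "service")  # stems listed above
+
+
+class LinkCollector(HTMLParser):
+    """Collect the href of every <a> tag in a page."""
+
+    def __init__(self):
+        super().__init__()
+        self.hrefs = []
+
+    def handle_starttag(self, tag, attrs):
+        if tag == "a":
+            href = dict(attrs).get("href")
+            if href:
+                self.hrefs.append(href)
+
+
+def pricing_candidates(base_url: str, html: str) -> list:
+    """Return absolute http(s) URLs whose path contains a pricing keyword."""
+    parser = LinkCollector()
+    parser.feed(html)
+    found = []
+    for href in parser.hrefs:
+        url = urljoin(base_url, href)
+        parts = urlparse(url)
+        if parts.scheme not in ("http", "https"):
+            continue  # skip mailto:, tel:, javascript: links
+        if any(k in parts.path.lower() for k in KEYWORDS) and url not in found:
+            found.append(url)
+    return found
+```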
+
+### 3c. AI extraction (Claude Haiku)
+- Send pricing page HTML to Haiku
+- Extract: package names, funeral types, prices, inclusions
+- Map to known inclusion types where possible
+- Return confidence score
+
+### 3d. PDF extraction (for InvoCare-type sites)
+- Download compliance PDFs
+- Extract text (pdftotext or similar)
+- Send to Haiku for structured extraction
+- ~25% of sites are PDF-only for pricing
+
+## Listing Tiers
+
+Providers are assigned a `listing_tier` based on data quality. Computed
+automatically by `compute_tiers.py` after each enrichment run.
+
+| Tier | Label | Criteria | Display |
+|------|-------|----------|---------|
+| `verified` | Full partner | `verified = true` (signed up) | Full branding, packages, arrangements |
+| `priced` | Full pricing | 2+ packages with itemized inclusion prices | Package comparison, line-item detail |
+| `estimated` | Some pricing | At least 1 package with a total price | Package prices shown, "Contact for details" on breakdowns |
+| `listed` | Contact only | Name + location + phone, no pricing | "Contact for pricing" CTA, upgrade prompt |
+
+Each tier below `verified` motivates the provider to sign up:
+- `listed` → "Publish your pricing to attract more families"
+- `estimated` → "Add detailed breakdowns to stand out"
+- `priced` → "Sign up to enable online arrangements"
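+
+The tier criteria in the table can be read as a first-match rule from the top
+tier down. The sketch below assumes a provider dict with `verified`,
+`packages`, `total_price`, and per-inclusion `price` fields; those names are
+hypothetical and may not match the schema used by `compute_tiers.py`.
+
+```python
+def compute_tier(provider: dict) -> str:
+    """Map a provider record to a listing tier, highest tier first."""
+    if provider.get("verified"):
+        return "verified"
+    packages = provider.get("packages", [])
+    # A package counts as itemized if any inclusion carries its own price.
+    itemized = [
+        p for p in packages
+        if any(i.get("price") is not None for i in p.get("inclusions", []))
+    ]
+    if len(itemized) >= 2:
+        return "priced"
+    if any(p.get("total_price") is not None for p in packages):
+        return "estimated"
+    return "listed"
+```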
+
+## Enrichment Status Flow
+
+```
+pending ──▶ website_found ──▶ partial ──▶ complete
+   │              │               │
+   └──▶ no_website_found    failed (retry later)
+```
+
+## n8n Workflow Design
+
+### Workflow 1: Weekly Discovery
+Cron → Run all source crawlers → Dedup into DB → Queue new providers
+
+### Workflow 2: Daily Website Discovery
+Cron → Fetch providers with no website → Google Places lookup
+     → ABN lookup → Search fallback → Update DB
+
+### Workflow 3: Daily Enrichment
+Cron → Fetch providers with website but no packages
+     → Crawl website → AI extract → Update DB
+
+### Workflow 4: Monthly Re-check
+Cron → Re-crawl enriched providers → Update pricing if changed
+
+---
+
+## Module Inventory
+
+| Module | Purpose | n8n Workflow |
+|--------|---------|-------------|
+| `base.py` | Shared HTTP, DB, normalization utils | Used by all |
+| `crawl_vic_register.py` | VIC government register (796 records) | Workflow 1 |
+| `crawl_funerals_australia.py` | Funerals Australia API (997 records) | Workflow 1 |
+| `crawl_nfda.py` | NFDA directory API (209 records) | Workflow 1 |
+| `crawl_all.py` | Orchestrates all source crawlers | Workflow 1 |
+| `dedup.py` | Cross-source dedup & merge engine | Workflow 1 |
+| `discover_websites.py` | Find provider websites (Serper/DDG/guess) | Workflow 2 |
+| `lookup_abn.py` | ABN validation via ABR API (free) | Workflow 2 |
+| `enrich_websites.py` | Crawl provider sites, find pricing pages | Workflow 3 |
+| `compute_tiers.py` | Compute listing_tier from data quality | After enrichment |
+| `config.example.json` | API key template | — |
+
+## API Keys Required
+
+| Service | Key | Cost | Register |
+|---------|-----|------|----------|
+| Serper.dev | `serper_api_key` | 2,500 free, then $1/1K | https://serper.dev |
+| ABR (ABN Lookup) | `abr_guid` | Free | https://abr.business.gov.au/Tools/WebServices |
+| Anthropic (Haiku) | `anthropic_api_key` | ~$2/full run | https://console.anthropic.com |
+
+## Quick Start
+
+```bash
+# 1. Configure API keys
+cp config.example.json config.json
+# Edit config.json with your keys
+
+# 2. Reset database
+cd ../database
+sqlite3 providers.db < schema_sqlite.sql
+
+# 3. Run full discovery pipeline
+cd ../crawlers
+python3 crawl_all.py          # Step 1: Discover from registries
+python3 dedup.py              # Deduplicate across sources
+python3 lookup_abn.py         # Step 2d: Get ABNs (free)
+python3 discover_websites.py  # Steps 2a-2c: Find websites
+python3 enrich_websites.py    # Step 3: Crawl for pricing
+python3 compute_tiers.py      # Assign listing tiers
+
+# Test mode (limited records)
+python3 crawl_all.py --test
+python3 discover_websites.py --limit=10 --state=VIC
+python3 enrich_websites.py --limit=5
+```