Initial commit: funeral provider discovery pipeline
- Python crawlers for VIC Register, Funerals Australia, NFDA
- n8n workflows for scheduled discovery and enrichment
- SQLite schema and seeded dev database (1,463 providers)
- End-to-end process documentation in n8n/PROCESS.md
crawlers/PIPELINE.md
@@ -0,0 +1,215 @@
# Provider Discovery & Enrichment Pipeline

## Architecture: Multi-Step Enrichment

The pipeline builds provider profiles progressively, never relying on
competitor data. Each step adds richer detail from more authoritative sources.

```
STEP 1: DISCOVER                 STEP 2: FIND WEBSITE          STEP 3: ENRICH
─────────────────                ────────────────────          ──────────────

VIC Register ─────┐                                            ┌─ Fetch homepage
NFDA Directory ───┼─▶ Basic      Google Places API ──┐         │  Find /pricing page
Funerals AU ──────┘   Provider   ABN Lookup ─────────┼─▶ URL ──┤  Download PDFs
                      Record     Search engines ─────┘         │  AI extract packages
                                                               └─▶ Structured data

name                             website URL                   description
address                          Google rating                 packages[]
phone                            Google reviews                inclusions[]
email                            place_id                      pricing
state                            ABN (validated)
```

## Step 1: Discovery (DONE: all modules built and tested)

Sources:
- VIC Consumer Affairs Register (796 records, VIC only) → `crawl_vic_register.py`
- Funerals Australia AJAX API (997 records, national) → `crawl_funerals_australia.py`
- NFDA WPSL API (209 records, national) → `crawl_nfda.py`

Modules:
- Orchestrator: `crawl_all.py`
- Deduplication: `dedup.py` (fuzzy name + postcode + ABN matching)

Output: ~1,463 unique providers with basic contact info (from 2,002 raw
records across the three sources), stored in the `funeral_brand` and
`location` tables in `database/providers.db`.

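The matching rule described for `dedup.py` (fuzzy name + postcode + ABN) could look roughly like the sketch below. This is not the module's actual code: the record fields (`name`, `postcode`, `abn`), the stop-word list, and the 0.85 threshold are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stop words dropped before comparing names.
STOP_WORDS = {"pty", "ltd", "the", "and"}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation and drop common company suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(w for w in cleaned.split() if w not in STOP_WORDS)


def is_duplicate(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Return True if two provider records probably describe the same business."""
    # A shared ABN is treated as the strongest signal.
    if a.get("abn") and a.get("abn") == b.get("abn"):
        return True
    # Otherwise require the same postcode plus a close name match.
    if not a.get("postcode") or a.get("postcode") != b.get("postcode"):
        return False
    ratio = SequenceMatcher(
        None, normalize_name(a["name"]), normalize_name(b["name"])
    ).ratio()
    return ratio >= threshold
```

Requiring a postcode match before comparing names keeps similarly named businesses in different suburbs apart; whether that is too strict for multi-branch brands depends on how the real data looks.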
## Step 2: Website Discovery (DONE: module built and tested)

Module: `discover_websites.py`

Test result: 50% success rate on the initial batch (DDG search + URL guessing).
The hit rate can likely be improved with the Google Places API.

For each provider that lacks a website URL:

### 2a. Serper.dev: Google search API (PRIMARY)
- Input: "{business name} {suburb} {state}"
- Returns: Google organic search results as JSON (title, link, snippet)
- Cost: **2,500 free queries** (no credit card needed), then $1/1K
- Covers all 1,463 providers within the free quota at one query per provider
- Filters out directories/aggregators, then validates the first remaining result
- Module: `discover_websites.py` with `search_serper()`

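As a rough illustration of the query format and the directory filter listed above, the sketch below picks the first organic result that is not a known directory. It assumes the results arrive as a list of dicts with a `link` key, as described in the list; the blocklist contents and both function names are hypothetical, not taken from `search_serper()`.

```python
from urllib.parse import urlparse

# Hypothetical blocklist; the real module's list is not shown in this document.
DIRECTORY_DOMAINS = {"yellowpages.com.au", "truelocal.com.au", "facebook.com"}


def build_query(name: str, suburb: str, state: str) -> str:
    """Build the search string in the "{business name} {suburb} {state}" format."""
    return f"{name} {suburb} {state}"


def first_non_directory(organic: list):
    """Return the first result link whose host is not a known directory."""
    for result in organic:
        host = urlparse(result.get("link", "")).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if not host:
            continue  # skip results without a usable URL
        if any(host == d or host.endswith("." + d) for d in DIRECTORY_DOMAINS):
            continue  # skip directories and aggregators
        return result["link"]
    return None
```

The returned link would still need to pass the URL validation in step 2f before being stored.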
### 2b. DuckDuckGo lite (FALLBACK)
- Free, no API key, but aggressive rate limiting
- Used when Serper key not configured or quota exhausted
- Module: `discover_websites.py` with `search_ddg()`

### 2c. URL pattern guessing (SUPPLEMENTARY)
- Generates candidate domains from business name (e.g. smithfunerals.com.au)
- Sends an HTTP HEAD request to check that each candidate is live, then validates the content
- Module: `discover_websites.py` with `guess_urls()`

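Candidate generation along these lines might look like the following sketch. The TLD list, the dropped suffixes, and the `candidate_domains` name are assumptions for illustration; `guess_urls()` may use different rules. The HTTP HEAD check is left out so the example stays offline.

```python
import re

# Hypothetical company suffixes that rarely appear in domain names.
DROP_WORDS = {"pty", "ltd"}


def candidate_domains(business_name: str, tlds=(".com.au", ".net.au", ".com")) -> list:
    """Generate plausible domains from a business name, most likely first."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", business_name.lower())
    words = [w for w in cleaned.split() if w not in DROP_WORDS]
    if not words:
        return []
    # Try the words run together, then hyphenated.
    stems = ["".join(words), "-".join(words)]
    candidates = [stem + tld for stem in stems for tld in tlds]
    # dict.fromkeys removes duplicates while keeping order.
    return list(dict.fromkeys(candidates))
```

For example, `candidate_domains("Smith Funerals")` starts with `smithfunerals.com.au`, matching the example in the list above.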
### 2d. ABN Lookup: Australian Business Register (ENRICHMENT)
- Input: business name + state
- Returns: ABN, entity status, registered state/postcode
- Cost: **FREE** (government API, requires GUID registration)
- Validates business is active, gives strongest dedup key
- Does NOT return website URLs
- Module: `lookup_abn.py`
- Register for GUID: https://abr.business.gov.au/Tools/WebServices

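Before an ABN is used as a dedup key, it can be sanity-checked offline with the standard ABN checksum (subtract 1 from the first digit, apply the fixed weights, and test divisibility by 89). The sketch below is an optional pre-check and an assumption on our part; this document does not say whether `lookup_abn.py` does this, and it does not replace the ABR lookup, which is what confirms the entity is active.

```python
# Standard ABN weighting factors, one per digit.
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def abn_checksum_ok(abn: str) -> bool:
    """Return True if the string contains 11 digits that pass the ABN checksum."""
    digits = [int(c) for c in abn if c.isdigit()]
    if len(digits) != 11:
        return False
    digits[0] -= 1  # the algorithm subtracts 1 from the leading digit
    total = sum(d * w for d, w in zip(digits, ABN_WEIGHTS))
    return total % 89 == 0
```

Spaces and other separators are ignored, so both `51824753556` and `51 824 753 556` are accepted.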
### 2e. Google Places API (OPTIONAL PREMIUM)
- Input: "{business name}, {suburb} {state}"
- Returns: website, rating, review count, place_id, formatted phone
- Cost: 1,000 free/month (Enterprise tier), then ~$25/1K
- Best data quality but most expensive
- Not yet implemented; add when budget allows

### 2f. URL validation
- Fetch discovered URL, verify it loads
- Check page title/content mentions the business name
- Reject generic directories (yellowpages, truelocal, etc.)
- Mark confidence level: confirmed / probable / unverified

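The document names the three confidence levels but not the rules behind them. The sketch below shows one possible mapping, assuming `confirmed` means the full business name appears in the page title and `probable` means at least half of the name's words appear in the page text. Both thresholds and the `confidence` function name are hypothetical.

```python
import re


def _words(text: str) -> list:
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def confidence(business_name: str, page_title: str, page_text: str) -> str:
    """Classify how well a fetched page matches the business name."""
    name_words = _words(business_name)
    if not name_words:
        return "unverified"
    # Full name found in the title: strongest evidence.
    if " ".join(name_words) in " ".join(_words(page_title)):
        return "confirmed"
    # At least half of the name's words found in the body: weaker evidence.
    body = set(_words(page_text))
    hits = sum(1 for w in name_words if w in body)
    if hits / len(name_words) >= 0.5:
        return "probable"
    return "unverified"
```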
## Step 3: Website Enrichment (DONE: module built and tested)

Module: `enrich_websites.py`
- Finds pricing pages via 20+ URL patterns + link following
- Extracts description from meta tags
- Extracts contact info (phone, email, address)
- Stores cleaned pricing page text for AI extraction
- Detects PDF links for PDF-based pricing extraction

For each provider with a confirmed website:

### 3a. Homepage crawl
- Fetch homepage HTML
- Extract: description/about text, contact details
- Look for links to pricing/services pages

### 3b. Pricing page discovery
Try common URL patterns:
`/pricing`, `/prices`, `/packages`, `/services`, `/our-services`,
`/funeral-costs`, `/funeral-packages`, `/service-options`,
`/price-list`, `/transparency`

Also:
- Parse sitemap.xml if available
- Follow links containing "pric", "packag", "cost", "service"
- Check for PDF links on pricing pages

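The link-following rule can be sketched with the standard library's `html.parser`. The keyword fragments come from the list above; the class and function names are hypothetical, and the same-host restriction is an assumption added to keep the crawl on the provider's own site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

KEYWORDS = ("pric", "packag", "cost", "service")


class _LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> element with an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def pricing_links(base_url: str, html: str) -> list:
    """Return same-host links whose URL or text contains a pricing keyword."""
    parser = _LinkCollector()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    found = []
    for href, text in parser.links:
        url = urljoin(base_url, href)
        if urlparse(url).netloc != base_host:
            continue  # assumption: ignore links that leave the provider's site
        haystack = (href + " " + text).lower()
        if any(k in haystack for k in KEYWORDS) and url not in found:
            found.append(url)
    return found
```

Links ending in `.pdf` are returned like any other match, so a caller could route them to the PDF extraction step instead of the HTML one.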
### 3c. AI extraction (Claude Haiku)
- Send pricing page HTML to Haiku
- Extract: package names, funeral types, prices, inclusions
- Map to known inclusion types where possible
- Return confidence score

### 3d. PDF extraction (for InvoCare-type sites)
- Download compliance PDFs
- Extract text (pdftotext or similar)
- Send to Haiku for structured extraction
- ~25% of sites are PDF-only for pricing

## Listing Tiers

Providers are assigned a `listing_tier` based on data quality. Computed
automatically by `compute_tiers.py` after each enrichment run.

| Tier | Label | Criteria | Display |
|------|-------|----------|---------|
| `verified` | Full partner | `verified = true` (signed up) | Full branding, packages, arrangements |
| `priced` | Full pricing | 2+ packages with itemized inclusion prices | Package comparison, line-item detail |
| `estimated` | Some pricing | At least 1 package with a total price | Package prices shown, "Contact for details" on breakdowns |
| `listed` | Contact only | Name + location + phone, no pricing | "Contact for pricing" CTA, upgrade prompt |

Each tier below `verified` motivates the provider to sign up:
- `listed` → "Publish your pricing to attract more families"
- `estimated` → "Add detailed breakdowns to stand out"
- `priced` → "Sign up to enable online arrangements"

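The criteria in the tier table can be read as a top-down cascade. The sketch below assumes a provider is a dict with a `verified` flag and a `packages` list, where each package may carry a `total_price` and a list of `inclusions` with optional `price` values. That shape is hypothetical and may not match the real schema or `compute_tiers.py`.

```python
def compute_tier(provider: dict) -> str:
    """Return the listing tier for a provider, checking the strictest tier first."""
    if provider.get("verified"):
        return "verified"
    packages = provider.get("packages", [])
    # A package counts as itemized if at least one inclusion has its own price.
    itemized = [
        p for p in packages
        if any(i.get("price") is not None for i in p.get("inclusions", []))
    ]
    if len(itemized) >= 2:
        return "priced"
    if any(p.get("total_price") is not None for p in packages):
        return "estimated"
    return "listed"
```

Checking the tiers from strictest to loosest means a provider always lands in the highest tier it qualifies for.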
## Enrichment Status Flow

```
pending ──▶ website_found ──▶ partial ──▶ complete
   │              │               │
   └──▶ no_website_found      failed (retry later)
```

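One way to enforce this flow in code is a small transition table. The sketch below encodes one reading of the diagram: `failed` is reachable from `website_found` and `partial`, and a retry resumes from `website_found`. Those two points are assumptions, since the diagram does not spell them out, and the `advance` helper is hypothetical.

```python
# Allowed next states for each enrichment status (one reading of the diagram).
ALLOWED = {
    "pending": {"website_found", "no_website_found"},
    "website_found": {"partial", "failed"},
    "partial": {"complete", "failed"},
    "failed": {"website_found"},  # assumption: retries resume from here
    "no_website_found": set(),
    "complete": set(),
}


def advance(current: str, new: str) -> str:
    """Return the new status, or raise ValueError if the transition is not allowed."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new
```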
## n8n Workflow Design

### Workflow 1: Weekly Discovery
Cron → Run all source crawlers → Dedup into DB → Queue new providers

### Workflow 2: Daily Website Discovery
Cron → Fetch providers with no website → Google Places lookup
→ ABN lookup → Search fallback → Update DB

### Workflow 3: Daily Enrichment
Cron → Fetch providers with website but no packages
→ Crawl website → AI extract → Update DB

### Workflow 4: Monthly Re-check
Cron → Re-crawl enriched providers → Update pricing if changed

---

## Module Inventory

| Module | Purpose | n8n Workflow |
|--------|---------|-------------|
| `base.py` | Shared HTTP, DB, normalization utils | Used by all |
| `crawl_vic_register.py` | VIC government register (796 records) | Workflow 1 |
| `crawl_funerals_australia.py` | Funerals Australia API (997 records) | Workflow 1 |
| `crawl_nfda.py` | NFDA directory API (209 records) | Workflow 1 |
| `crawl_all.py` | Orchestrates all source crawlers | Workflow 1 |
| `dedup.py` | Cross-source dedup & merge engine | Workflow 1 |
| `discover_websites.py` | Find provider websites (Serper/DDG/guess) | Workflow 2 |
| `lookup_abn.py` | ABN validation via ABR API (free) | Workflow 2 |
| `enrich_websites.py` | Crawl provider sites, find pricing pages | Workflow 3 |
| `compute_tiers.py` | Compute listing_tier from data quality | After enrichment |
| `config.example.json` | API key template | n/a |

## API Keys Required

| Service | Key | Cost | Register |
|---------|-----|------|----------|
| Serper.dev | `serper_api_key` | 2,500 free, then $1/1K | https://serper.dev |
| ABR (ABN Lookup) | `abr_guid` | Free | https://abr.business.gov.au/Tools/WebServices |
| Anthropic (Haiku) | `anthropic_api_key` | ~$2/full run | https://console.anthropic.com |

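A loader for these keys might look like the sketch below. It assumes `config.json` is a flat JSON object keyed by the names in the table, which this document does not confirm, and the function names are hypothetical. Missing keys become `None`, so the search step can fall back to DuckDuckGo when no Serper key is configured.

```python
import json
from pathlib import Path

KNOWN_KEYS = ("serper_api_key", "abr_guid", "anthropic_api_key")


def load_config(path: str = "config.json") -> dict:
    """Read the API keys, returning None for any key that is missing or empty."""
    config_file = Path(path)
    raw = json.loads(config_file.read_text()) if config_file.exists() else {}
    return {key: (raw.get(key) or None) for key in KNOWN_KEYS}


def choose_search_backend(config: dict) -> str:
    """Prefer Serper when a key is present, otherwise fall back to DuckDuckGo."""
    return "serper" if config.get("serper_api_key") else "ddg"
```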
## Quick Start

```bash
# 1. Configure API keys
cp config.example.json config.json
# Edit config.json with your keys

# 2. Reset database
cd ../database
sqlite3 providers.db < schema_sqlite.sql

# 3. Run full discovery pipeline
cd ../crawlers
python3 crawl_all.py          # Step 1: Discover from registries
python3 dedup.py              # Deduplicate across sources
python3 lookup_abn.py         # Step 2d: Look up ABNs (free)
python3 discover_websites.py  # Step 2a-2c: Find websites
python3 enrich_websites.py    # Step 3: Crawl for pricing
python3 compute_tiers.py      # Assign listing tiers

# Test mode (limited records)
python3 crawl_all.py --test
python3 discover_websites.py --limit=10 --state=VIC
python3 enrich_websites.py --limit=5
```