Provider Discovery & Enrichment Pipeline
Architecture: Multi-Step Enrichment
The pipeline builds provider profiles progressively, never relying on competitor data. Each step adds richer detail from more authoritative sources.
STEP 1: DISCOVER                   STEP 2: FIND WEBSITE          STEP 3: ENRICH
─────────────────                  ────────────────────          ──────────────
VIC Register ─────┐                                              ┌─ Fetch homepage
NFDA Directory ───┼─▶ Basic        Google Places API ──┐         │  Find /pricing page
Funerals AU ──────┘   Provider     ABN Lookup ─────────┼─▶ URL ──┤  Download PDFs
                      Record       Search engines ─────┘         │  AI extract packages
                                                                 └─▶ Structured data

name                               website URL                   description
address                            Google rating                 packages[]
phone                              Google reviews                inclusions[]
email                              place_id                      pricing
state                              ABN (validated)
Step 1: Discovery (DONE — all modules built and tested)
Sources:
- VIC Consumer Affairs Register (796 records, VIC only) → crawl_vic_register.py
- Funerals Australia AJAX API (997 records, national) → crawl_funerals_australia.py
- NFDA WPSL API (209 records, national) → crawl_nfda.py
Orchestrator: crawl_all.py
Deduplication: dedup.py (fuzzy name + postcode + ABN matching)
Output: ~1,463 unique providers with basic contact info.
Stored in: funeral_brand + location tables in database/providers.db.
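The fuzzy name + postcode + ABN matching could be sketched roughly as follows. This is a minimal illustration, not the actual logic in dedup.py: the record shape (`name`, `postcode`, `abn` keys), the list of ignored words, and the 0.85 similarity threshold are all assumptions.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes and filler words to ignore when comparing names
IGNORED_WORDS = {"pty", "ltd", "limited", "the"}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop ignored words."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in IGNORED_WORDS)


def is_duplicate(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Decide whether two provider records describe the same business."""
    # When both records carry an ABN, treat it as the decisive key
    if a.get("abn") and b.get("abn"):
        return a["abn"] == b["abn"]
    # Otherwise require the same postcode plus a close name match
    if a.get("postcode") != b.get("postcode"):
        return False
    ratio = SequenceMatcher(
        None, normalize_name(a["name"]), normalize_name(b["name"])
    ).ratio()
    return ratio >= threshold
```

Checking the ABN first means two records with different ABNs are never merged, even if their names look alike.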
Step 2: Website Discovery (DONE — module built and tested)
Module: discover_websites.py
Test result: 50% success rate on the initial batch (DDG search + URL guessing).
The hit rate could be improved by adding the Google Places API (see 2e).
For each provider that lacks a website URL:
2a. Serper.dev — Google search API (PRIMARY)
- Input: "{business name} {suburb} {state}"
- Returns: Google organic search results as JSON (title, link, snippet)
- Cost: 2,500 free queries (no credit card needed), then $1 per 1,000 queries
- The free quota covers all 1,463 providers at no cost
- Filters out directories/aggregators, validates first result
- Module: discover_websites.py with search_serper()
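The query building and directory filtering described above might look like the sketch below. It is not the code in discover_websites.py. The endpoint URL, the `X-API-KEY` header, and the `organic` response key reflect my understanding of Serper's public API and should be checked against its documentation. The blocklist and all function names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlparse

SERPER_URL = "https://google.serper.dev/search"  # assumed endpoint, verify before use
# Hypothetical blocklist of directory/aggregator domains
DIRECTORY_DOMAINS = ("yellowpages.com.au", "truelocal.com.au", "facebook.com")


def build_query(name: str, suburb: str, state: str) -> str:
    """Build the "{business name} {suburb} {state}" search string."""
    return " ".join(p.strip() for p in (name, suburb, state) if p and p.strip())


def is_directory(url: str) -> bool:
    """True when the URL's host is a blocklisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in DIRECTORY_DOMAINS)


def first_candidate(response: dict):
    """Return the first organic link that is not a known directory, or None."""
    for item in response.get("organic", []):
        link = item.get("link", "")
        if link and not is_directory(link):
            return link
    return None


def fetch_serper(query: str, api_key: str) -> dict:
    """POST the query to Serper and return the decoded JSON response."""
    request = urllib.request.Request(
        SERPER_URL,
        data=json.dumps({"q": query}).encode("utf-8"),
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=15) as resp:
        return json.load(resp)
```

Keeping the filtering separate from the network call lets the filter be tested against saved responses without spending quota.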
2b. DuckDuckGo lite (FALLBACK)
- Free, no API key, but aggressive rate limiting
- Used when Serper key not configured or quota exhausted
- Module: discover_websites.py with search_ddg()
2c. URL pattern guessing (SUPPLEMENTARY)
- Generates candidate domains from business name (e.g. smithfunerals.com.au)
- HTTP HEAD to check if live, then validate content
- Module: discover_websites.py with guess_urls()
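As an illustration of the pattern guessing above, candidate domains can be derived from the business name and probed with an HTTP HEAD request. This sketch is not the real guess_urls(): the TLD list, the ignored words, and both function names are assumptions.

```python
import re
import urllib.request

# Hypothetical choices: which words to drop and which TLDs to try
IGNORED_WORDS = {"pty", "ltd"}
DEFAULT_TLDS = (".com.au", ".com", ".net.au")


def candidate_domains(business_name: str, tlds=DEFAULT_TLDS) -> list:
    """Generate joined and hyphenated domain guesses, e.g. smithfunerals.com.au."""
    words = re.sub(r"[^a-z0-9 ]", "", business_name.lower()).split()
    words = [w for w in words if w not in IGNORED_WORDS]
    if not words:
        return []
    candidates = []
    for stem in ("".join(words), "-".join(words)):
        for tld in tlds:
            domain = stem + tld
            if domain not in candidates:  # single-word names yield identical stems
                candidates.append(domain)
    return candidates


def is_live(domain: str, timeout: float = 5.0) -> bool:
    """Send an HTTP HEAD request and report whether the site answers without error."""
    request = urllib.request.Request("https://" + domain, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status < 400
    except OSError:  # covers URLError, HTTPError, and timeouts
        return False
```

A live domain still needs the content check from 2f, since a guessed domain may belong to an unrelated business.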
2d. ABN Lookup — Australian Business Register (ENRICHMENT)
- Input: business name + state
- Returns: ABN, entity status, registered state/postcode
- Cost: FREE (government API, requires GUID registration)
- Validates business is active, gives strongest dedup key
- Does NOT return website URLs
- Module: lookup_abn.py
- Register for GUID: https://abr.business.gov.au/Tools/WebServices
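Before an ABN is used as a dedup key, it can also be checked locally with the standard ABN checksum (weighted digit sum modulo 89). The sketch below is an optional complement to the ABR lookup and is not claimed to be part of lookup_abn.py. The function name is hypothetical. The checksum only confirms the number is well formed, not that the entity is active.

```python
import re

# Standard ABN weighting factors, one per digit
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def has_valid_abn_checksum(abn: str) -> bool:
    """Check the 11-digit ABN checksum, ignoring spaces."""
    compact = re.sub(r"\s+", "", abn)
    if not re.fullmatch(r"[0-9]{11}", compact):
        return False
    digits = [int(c) for c in compact]
    digits[0] -= 1  # the algorithm subtracts 1 from the first digit
    total = sum(d * w for d, w in zip(digits, ABN_WEIGHTS))
    return total % 89 == 0
```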
2e. Google Places API (OPTIONAL PREMIUM)
- Input: "{business name}, {suburb} {state}"
- Returns: website, rating, review count, place_id, formatted phone
- Cost: 1,000 free/month (Enterprise tier), then ~$25/1K
- Best data quality but most expensive
- Not yet implemented — add when budget allows
2f. URL validation
- Fetch discovered URL, verify it loads
- Check page title/content mentions the business name
- Reject generic directories (yellowpages, truelocal, etc.)
- Mark confidence level: confirmed / probable / unverified
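One possible way to assign the three confidence levels is sketched below. The rule itself is an assumption rather than the module's actual behaviour: the name in the page title gives confirmed, the name in the body text gives probable, and anything else is unverified. The full domain names in the blocklist are also assumed, and directory hits return None.

```python
import re
from urllib.parse import urlparse

# Assumed full domains for the directories named above
DIRECTORY_DOMAINS = ("yellowpages.com.au", "truelocal.com.au")


def classify_url(business_name: str, url: str, page_title: str, page_text: str):
    """Return 'confirmed', 'probable', 'unverified', or None for a rejected directory."""
    host = (urlparse(url).hostname or "").lower()
    if any(host == d or host.endswith("." + d) for d in DIRECTORY_DOMAINS):
        return None
    tokens = re.findall(r"[a-z0-9]+", business_name.lower())
    if not tokens:
        return "unverified"
    title, text = page_title.lower(), page_text.lower()
    if all(t in title for t in tokens):
        return "confirmed"
    if all(t in text for t in tokens):
        return "probable"
    return "unverified"
```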
Step 3: Website Enrichment (DONE — module built and tested)
Module: enrich_websites.py
- Finds pricing pages via 20+ URL patterns + link following
- Extracts description from meta tags
- Extracts contact info (phone, email, address)
- Stores cleaned pricing page text for AI extraction
- Detects PDF links for PDF-based pricing extraction
For each provider with a confirmed website:
3a. Homepage crawl
- Fetch homepage HTML
- Extract: description/about text, contact details
- Look for links to pricing/services pages
3b. Pricing page discovery
Try common URL patterns: /pricing, /prices, /packages, /services, /our-services, /funeral-costs, /funeral-packages, /service-options, /price-list, /transparency
Also:
- Parse sitemap.xml if available
- Follow links containing "pric", "packag", "cost", "service"
- Check for PDF links on pricing pages
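The pattern probing and link following above could be sketched with the standard library alone. The path list and keywords are taken from this section. The class and function names are hypothetical and not taken from enrich_websites.py.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

PRICING_PATHS = (
    "/pricing", "/prices", "/packages", "/services", "/our-services",
    "/funeral-costs", "/funeral-packages", "/service-options",
    "/price-list", "/transparency",
)
LINK_KEYWORDS = ("pric", "packag", "cost", "service")


def candidate_pricing_urls(base_url: str) -> list:
    """Expand the common URL patterns against a provider's base URL."""
    return [urljoin(base_url, path) for path in PRICING_PATHS]


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def pricing_links(base_url: str, html: str) -> list:
    """Return absolute URLs of links whose href contains a pricing keyword."""
    parser = LinkCollector()
    parser.feed(html)
    return [
        urljoin(base_url, href)
        for href in parser.hrefs
        if any(keyword in href.lower() for keyword in LINK_KEYWORDS)
    ]
```

Matching on the href alone misses links whose keyword appears only in the anchor text, so a fuller version would probably inspect both.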
3c. AI extraction (Claude Haiku)
- Send pricing page HTML to Haiku
- Extract: package names, funeral types, prices, inclusions
- Map to known inclusion types where possible
- Return confidence score
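The model's reply still has to be turned into structured records. The sketch below assumes the prompt asks for a JSON object with `packages` and `confidence` keys. That schema, the `Package` fields, and the function name are hypothetical, and the sample values in the usage note are invented for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Package:
    name: str
    funeral_type: Optional[str] = None
    price: Optional[float] = None
    inclusions: List[str] = field(default_factory=list)


def parse_extraction(reply_text: str) -> Tuple[List[Package], float]:
    """Parse a JSON object embedded in the reply into packages and a confidence score."""
    # Tolerate prose around the JSON by slicing from the first "{" to the last "}"
    start, end = reply_text.find("{"), reply_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply_text[start:end + 1])
    packages = [
        Package(
            name=str(item["name"]),
            funeral_type=item.get("funeral_type"),
            price=float(item["price"]) if item.get("price") is not None else None,
            inclusions=[str(i) for i in item.get("inclusions", [])],
        )
        for item in data.get("packages", [])
    ]
    # Clamp the score to [0, 1] in case the model returns something out of range
    confidence = min(max(float(data.get("confidence", 0.0)), 0.0), 1.0)
    return packages, confidence
```

For example, a reply containing `{"packages": [{"name": "Essential Cremation", "price": 2200}], "confidence": 0.8}` would yield one `Package` with `price == 2200.0` and a confidence of `0.8`.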
3d. PDF extraction (for InvoCare-type sites)
- Download compliance PDFs
- Extract text (pdftotext or similar)
- Send to Haiku for structured extraction
- ~25% of sites are PDF-only for pricing
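For the text extraction step, a thin wrapper around the pdftotext command-line tool (part of poppler-utils) might look like this. It assumes pdftotext is installed and on PATH. The function names are hypothetical. `-layout` and `-` (write to stdout) are standard pdftotext options.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path: str) -> list:
    """Build the pdftotext invocation that writes layout-preserving text to stdout."""
    return ["pdftotext", "-layout", pdf_path, "-"]


def pdf_to_text(pdf_path: str, timeout: float = 60.0) -> str:
    """Run pdftotext on a downloaded PDF and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH (install poppler-utils)")
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
        timeout=timeout,
    )
    return result.stdout
```

The `-layout` option keeps price columns roughly aligned, which may make the text easier for the model to interpret.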
Listing Tiers
Providers are assigned a listing_tier based on data quality. The tier is computed automatically by compute_tiers.py after each enrichment run.
| Tier | Label | Criteria | Display |
|---|---|---|---|
| verified | Full partner | verified = true (signed up) | Full branding, packages, arrangements |
| priced | Full pricing | 2+ packages with itemized inclusion prices | Package comparison, line-item detail |
| estimated | Some pricing | At least 1 package with a total price | Package prices shown, "Contact for details" on breakdowns |
| listed | Contact only | Name + location + phone, no pricing | "Contact for pricing" CTA, upgrade prompt |
Each tier below verified motivates the provider to sign up:
- listed → "Publish your pricing to attract more families"
- estimated → "Add detailed breakdowns to stand out"
- priced → "Sign up to enable online arrangements"
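The tier criteria in the table translate into a short decision function. This is a sketch under an assumed record shape and does not reproduce compute_tiers.py: a provider dict with a `verified` flag and a `packages` list, where each package has an optional `total_price` and a list of `inclusions` with optional `price` values.

```python
def compute_tier(provider: dict) -> str:
    """Map a provider record to verified / priced / estimated / listed."""
    if provider.get("verified"):
        return "verified"
    packages = provider.get("packages", [])
    # A package counts as itemized when at least one inclusion carries a price
    itemized = [
        p for p in packages
        if any(i.get("price") is not None for i in p.get("inclusions", []))
    ]
    if len(itemized) >= 2:
        return "priced"
    if any(p.get("total_price") is not None for p in packages):
        return "estimated"
    return "listed"
```

The checks run from the highest tier down, so a provider always receives the best tier its data supports.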
Enrichment Status Flow
pending ──▶ website_found ──▶ partial ──▶ complete
│ │ │
└──▶ no_website_found failed (retry later)
N8N Workflow Design
Workflow 1: Weekly Discovery
Cron → Run all source crawlers → Dedup into DB → Queue new providers
Workflow 2: Daily Website Discovery
Cron → Fetch providers with no website → Google Places lookup → ABN lookup → Search fallback → Update DB
Workflow 3: Daily Enrichment
Cron → Fetch providers with website but no packages → Crawl website → AI extract → Update DB
Workflow 4: Monthly Re-check
Cron → Re-crawl enriched providers → Update pricing if changed
Module Inventory
| Module | Purpose | N8N Workflow |
|---|---|---|
| base.py | Shared HTTP, DB, normalization utils | Used by all |
| crawl_vic_register.py | VIC government register (796 records) | Workflow 1 |
| crawl_funerals_australia.py | Funerals Australia API (997 records) | Workflow 1 |
| crawl_nfda.py | NFDA directory API (209 records) | Workflow 1 |
| crawl_all.py | Orchestrates all source crawlers | Workflow 1 |
| dedup.py | Cross-source dedup & merge engine | Workflow 1 |
| discover_websites.py | Find provider websites (Serper/DDG/guess) | Workflow 2 |
| lookup_abn.py | ABN validation via ABR API (free) | Workflow 2 |
| enrich_websites.py | Crawl provider sites, find pricing pages | Workflow 3 |
| compute_tiers.py | Compute listing_tier from data quality | After enrichment |
| config.example.json | API key template | n/a |
API Keys Required
| Service | Key | Cost | Register |
|---|---|---|---|
| Serper.dev | serper_api_key | 2,500 free, then $1/1K | https://serper.dev |
| ABR (ABN Lookup) | abr_guid | Free | https://abr.business.gov.au/Tools/WebServices |
| Anthropic (Haiku) | anthropic_api_key | ~$2/full run | https://console.anthropic.com |
Quick Start
# 1. Configure API keys
cp config.example.json config.json
# Edit config.json with your keys
# 2. Reset database
cd ../database
sqlite3 providers.db < schema_sqlite.sql
# 3. Run full discovery pipeline
cd ../crawlers
python3 crawl_all.py # Step 1: Discover from registries
python3 dedup.py # Deduplicate across sources
python3 lookup_abn.py # Step 2d: Get ABNs (free)
python3 discover_websites.py # Steps 2a-2c: Find websites
python3 enrich_websites.py # Step 3: Crawl for pricing
python3 compute_tiers.py # Assign listing tiers
# Test mode (limited records)
python3 crawl_all.py --test
python3 discover_websites.py --limit=10 --state=VIC
python3 enrich_websites.py --limit=5