Provider-Crawl/crawlers/PIPELINE.md
Richie cc91427789 Initial commit: funeral provider discovery pipeline
Python crawlers for VIC Register, Funerals Australia, NFDA
n8n workflows for scheduled discovery and enrichment
SQLite schema and seeded dev database (1,463 providers)
End-to-end process documentation in n8n/PROCESS.md
2026-04-24 10:27:08 +10:00

Provider Discovery & Enrichment Pipeline

Architecture: Multi-Step Enrichment

The pipeline builds provider profiles progressively, never relying on competitor data. Each step adds richer detail from more authoritative sources.

STEP 1: DISCOVER               STEP 2: FIND WEBSITE           STEP 3: ENRICH
─────────────────               ────────────────────           ──────────────

VIC Register ─────┐                                           ┌─ Fetch homepage
NFDA Directory ───┼─▶ Basic     Google Places API ──┐         │  Find /pricing page
Funerals AU ──────┘   Provider  ABN Lookup ─────────┼─▶ URL ──┤  Download PDFs
                      Record    Search engines ─────┘         │  AI extract packages
                                                              └─▶ Structured data
                      name      website URL                      description
                      address   Google rating                    packages[]
                      phone     Google reviews                   inclusions[]
                      email     place_id                         pricing
                      state     ABN (validated)

Step 1: Discovery (DONE — all modules built and tested)

Sources:

  • VIC Consumer Affairs Register (796 records, VIC only) → crawl_vic_register.py
  • Funerals Australia AJAX API (997 records, national) → crawl_funerals_australia.py
  • NFDA WPSL API (209 records, national) → crawl_nfda.py

Orchestrator: crawl_all.py

Deduplication: dedup.py (fuzzy name + postcode + ABN matching)

Output: ~1,463 unique providers with basic contact info, deduplicated from the 2,002 raw records listed above.

Stored in: the funeral_brand and location tables in database/providers.db.
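
The matching rule named above (fuzzy name + postcode + ABN) could look roughly like the sketch below. This is not the code in dedup.py: the noise-word list, the record keys (`name`, `postcode`, `abn`), and the 0.85 threshold are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of words that carry no identity; not taken from dedup.py.
NOISE_WORDS = {"pty", "ltd", "funeral", "funerals", "services", "directors", "the", "and"}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop generic business words."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in NOISE_WORDS)


def is_duplicate(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Return True when two provider records likely describe the same business."""
    # A shared ABN is treated as the strongest key and short-circuits the check.
    if a.get("abn") and a.get("abn") == b.get("abn"):
        return True
    # Otherwise require the same postcode plus a close fuzzy name match.
    if not a.get("postcode") or a.get("postcode") != b.get("postcode"):
        return False
    name_a = normalize_name(a.get("name", ""))
    name_b = normalize_name(b.get("name", ""))
    if not name_a or not name_b:
        return False  # nothing distinctive left to compare
    return SequenceMatcher(None, name_a, name_b).ratio() >= threshold
```

Checking the ABN first means two records with differently spelled names still merge once lookup_abn.py has filled in the same ABN for both.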

Step 2: Website Discovery (DONE — module built and tested)

Module: discover_websites.py

Test result: 50% success rate on the initial batch, using DDG search and URL guessing only. The Google Places API (2e) could raise the hit rate.

For each provider that lacks a website URL:

2a. Serper.dev — Google search API (PRIMARY)

  • Input: "{business name} {suburb} {state}"
  • Returns: Google organic search results as JSON (title, link, snippet)
  • Cost: 2,500 free queries (no credit card needed), then $1 per 1,000 queries
  • The free quota covers one full pass over all 1,463 providers at no cost
  • Filters out directories/aggregators, validates first result
  • Module: discover_websites.py with search_serper()
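
A minimal sketch of the Serper step, assuming the public endpoint `https://google.serper.dev/search` with an `X-API-KEY` header and an `organic` array in the response; check these against Serper's current documentation before relying on them. The function names and the blocklist are hypothetical and are not the real `search_serper()` in discover_websites.py.

```python
import json
import urllib.request
from typing import List, Optional
from urllib.parse import urlparse

# Hypothetical blocklist; the real list in discover_websites.py may differ.
DIRECTORY_DOMAINS = {"yellowpages.com.au", "truelocal.com.au"}


def build_query(name: str, suburb: str, state: str) -> str:
    """Build the "{business name} {suburb} {state}" query string."""
    return f"{name} {suburb} {state}"


def first_non_directory(organic: List[dict]) -> Optional[str]:
    """Return the first result link whose host is not a known directory."""
    for result in organic:
        host = urlparse(result.get("link", "")).netloc.lower()
        if not host:
            continue
        if any(host == d or host.endswith("." + d) for d in DIRECTORY_DOMAINS):
            continue
        return result["link"]
    return None


def search_serper_sketch(query: str, api_key: str) -> List[dict]:
    """POST the query to Serper and return the organic results (assumed shape)."""
    request = urllib.request.Request(
        "https://google.serper.dev/search",
        data=json.dumps({"q": query}).encode("utf-8"),
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=15) as response:
        return json.load(response).get("organic", [])
```

Keeping the directory filter separate from the network call lets it be tested offline and reused for the DuckDuckGo fallback.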

2b. DuckDuckGo lite (FALLBACK)

  • Free, no API key, but aggressive rate limiting
  • Used when Serper key not configured or quota exhausted
  • Module: discover_websites.py with search_ddg()

2c. URL pattern guessing (SUPPLEMENTARY)

  • Generates candidate domains from business name (e.g. smithfunerals.com.au)
  • HTTP HEAD to check if live, then validate content
  • Module: discover_websites.py with guess_urls()
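
Candidate generation and the HEAD check might be sketched as follows. The suffix list, the two TLDs tried, and the function names are assumptions for illustration; the real `guess_urls()` may generate other patterns.

```python
import http.client
import re
import urllib.request
from typing import List

# Hypothetical set of words dropped before building the domain stem.
DROP_WORDS = {"pty", "ltd", "the"}


def candidate_domains(name: str) -> List[str]:
    """Generate plausible domains, e.g. "Smith Funerals" -> smithfunerals.com.au."""
    words = [w for w in re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
             if w not in DROP_WORDS]
    if not words:
        return []
    stems = ["".join(words), "-".join(words)]
    domains = [stem + tld for stem in stems for tld in (".com.au", ".com")]
    # dict.fromkeys removes duplicates (single-word names) while keeping order.
    return list(dict.fromkeys(domains))


def is_live(domain: str, timeout: float = 5.0) -> bool:
    """Send an HTTP HEAD request; treat any network or protocol error as not live."""
    request = urllib.request.Request(f"https://{domain}", method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, http.client.HTTPException):
        return False
```

A live domain still needs the content check from step 2f, since a guessed domain can belong to an unrelated business with a similar name.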

2d. ABN Lookup — Australian Business Register (ENRICHMENT)

  • Input: business name + state
  • Returns: ABN, entity status, registered state/postcode
  • Cost: FREE (government API, requires GUID registration)
  • Validates business is active, gives strongest dedup key
  • Does NOT return website URLs
  • Module: lookup_abn.py
  • Register for GUID: https://abr.business.gov.au/Tools/WebServices

2e. Google Places API (OPTIONAL PREMIUM)

  • Input: "{business name}, {suburb} {state}"
  • Returns: website, rating, review count, place_id, formatted phone
  • Cost: 1,000 free/month (Enterprise tier), then ~$25/1K
  • Best data quality but most expensive
  • Not yet implemented — add when budget allows

2f. URL validation

  • Fetch discovered URL, verify it loads
  • Check page title/content mentions the business name
  • Reject generic directories (yellowpages, truelocal, etc.)
  • Mark confidence level: confirmed / probable / unverified
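
One way to turn these checks into a confidence label is sketched below. The thresholds are assumptions: here "confirmed" means every word of the business name appears in the page title, and "probable" means at least half of them appear in the page text. The pipeline's actual rules are not documented here.

```python
import re
from typing import Optional, Set
from urllib.parse import urlparse

# The two directories named above; extend as needed.
DIRECTORY_HOSTS = ("yellowpages.com.au", "truelocal.com.au")


def _tokens(text: str) -> Set[str]:
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify_url(url: str, business_name: str,
                 page_title: str, page_text: str) -> Optional[str]:
    """Return confirmed / probable / unverified, or None for a rejected directory."""
    host = urlparse(url).netloc.lower()
    if any(host == d or host.endswith("." + d) for d in DIRECTORY_HOSTS):
        return None
    name_tokens = _tokens(business_name)
    if not name_tokens:
        return "unverified"
    if name_tokens <= _tokens(page_title):
        return "confirmed"
    overlap = len(name_tokens & _tokens(page_text)) / len(name_tokens)
    return "probable" if overlap >= 0.5 else "unverified"
```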

Step 3: Website Enrichment (DONE — module built and tested)

Module: enrich_websites.py

  • Finds pricing pages via 20+ URL patterns + link following
  • Extracts description from meta tags
  • Extracts contact info (phone, email, address)
  • Stores cleaned pricing page text for AI extraction
  • Detects PDF links for PDF-based pricing extraction

For each provider with a confirmed website:

3a. Homepage crawl

  • Fetch homepage HTML
  • Extract: description/about text, contact details
  • Look for links to pricing/services pages

3b. Pricing page discovery

Try common URL patterns: /pricing, /prices, /packages, /services, /our-services, /funeral-costs, /funeral-packages, /service-options, /price-list, /transparency

Also:

  • Parse sitemap.xml if available
  • Follow links containing "pric", "packag", "cost", "service"
  • Check for PDF links on pricing pages
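
The link-following rule can be sketched with the standard library alone. This is an illustration rather than the code in enrich_websites.py: it matches the four keyword stems above against each link's href and anchor text, and it separates PDF targets from ordinary pages.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.parse import urljoin, urlparse

KEYWORDS = ("pric", "packag", "cost", "service")


class _LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs for every <a href=...> element."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_pricing_links(html: str, base_url: str) -> Tuple[List[str], List[str]]:
    """Return (page URLs, PDF URLs) for links that mention a pricing keyword."""
    collector = _LinkCollector()
    collector.feed(html)
    pages: List[str] = []
    pdfs: List[str] = []
    for href, text in collector.links:
        if not any(k in f"{href} {text}".lower() for k in KEYWORDS):
            continue
        absolute = urljoin(base_url, href)
        target = pdfs if urlparse(absolute).path.lower().endswith(".pdf") else pages
        if absolute not in target:
            target.append(absolute)
    return pages, pdfs
```

Matching on stems such as "pric" catches "pricing", "prices", and "price-list" with one rule, at the cost of some false positives that later extraction has to tolerate.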

3c. AI extraction (Claude Haiku)

  • Send pricing page HTML to Haiku
  • Extract: package names, funeral types, prices, inclusions
  • Map to known inclusion types where possible
  • Return confidence score
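
The model call itself is omitted here, but the handling of its output can be sketched. The JSON shape below (a `packages` list plus a `confidence` number) is a hypothetical response format, not one this document specifies; the prompt would need to request it explicitly.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Package:
    name: str
    funeral_type: Optional[str] = None
    price: Optional[float] = None
    inclusions: List[str] = field(default_factory=list)


def parse_extraction(raw_json: str) -> Tuple[List[Package], float]:
    """Parse a model response (assumed JSON shape) into packages and a confidence."""
    data = json.loads(raw_json)
    packages: List[Package] = []
    for item in data.get("packages", []):
        if not item.get("name"):
            continue  # skip entries the model could not name
        price = item.get("price")
        packages.append(Package(
            name=item["name"].strip(),
            funeral_type=item.get("funeral_type"),
            price=float(price) if price is not None else None,
            inclusions=list(item.get("inclusions", [])),
        ))
    # Clamp the reported confidence into the range [0, 1].
    confidence = min(max(float(data.get("confidence", 0.0)), 0.0), 1.0)
    return packages, confidence
```

Validating and clamping at this boundary keeps malformed model output from reaching the database.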

3d. PDF extraction (for InvoCare-type sites)

  • Download compliance PDFs
  • Extract text (pdftotext or similar)
  • Send to Haiku for structured extraction
  • ~25% of sites are PDF-only for pricing

Listing Tiers

Providers are assigned a listing_tier based on data quality. Computed automatically by compute_tiers.py after each enrichment run.

| Tier | Label | Criteria | Display |
|------|-------|----------|---------|
| verified | Full partner | verified = true (signed up) | Full branding, packages, arrangements |
| priced | Full pricing | 2+ packages with itemized inclusion prices | Package comparison, line-item detail |
| estimated | Some pricing | At least 1 package with a total price | Package prices shown, "Contact for details" on breakdowns |
| listed | Contact only | Name + location + phone, no pricing | "Contact for pricing" CTA, upgrade prompt |
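
The criteria above translate into a short top-down check. The sketch below is not compute_tiers.py: the dataclass fields are hypothetical stand-ins for the real database columns, and returning None for a provider that lacks even basic contact details is an assumption, since the criteria do not cover that case.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PackageInfo:
    total_price: Optional[float] = None
    itemized_prices: List[float] = field(default_factory=list)


@dataclass
class ProviderInfo:
    name: str = ""
    location: str = ""
    phone: str = ""
    verified: bool = False
    packages: List[PackageInfo] = field(default_factory=list)


def compute_tier(provider: ProviderInfo) -> Optional[str]:
    """Return the highest tier whose criteria the provider meets."""
    if provider.verified:
        return "verified"
    itemized = sum(1 for pkg in provider.packages if pkg.itemized_prices)
    if itemized >= 2:
        return "priced"
    if any(pkg.total_price is not None for pkg in provider.packages):
        return "estimated"
    if provider.name and provider.location and provider.phone:
        return "listed"
    return None  # assumption: too little data to list at all
```

Checking tiers from highest to lowest means a provider that satisfies several criteria always receives the best tier it qualifies for.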

Each tier below verified shows an upgrade prompt that nudges the provider toward signing up:

  • listed → "Publish your pricing to attract more families"
  • estimated → "Add detailed breakdowns to stand out"
  • priced → "Sign up to enable online arrangements"

Enrichment Status Flow

pending ──▶ website_found ──▶ partial ──▶ complete
   │              │               │
   └──▶ no_website_found    failed (retry later)

n8n Workflow Design

Workflow 1: Weekly Discovery

Cron → Run all source crawlers → Dedup into DB → Queue new providers

Workflow 2: Daily Website Discovery

Cron → Fetch providers with no website → ABN lookup → Serper search → DDG / URL-guess fallback → Update DB (Google Places lookup to be added once 2e is implemented)

Workflow 3: Daily Enrichment

Cron → Fetch providers with website but no packages → Crawl website → AI extract → Update DB

Workflow 4: Monthly Re-check

Cron → Re-crawl enriched providers → Update pricing if changed


Module Inventory

| Module | Purpose | n8n Workflow |
|--------|---------|--------------|
| base.py | Shared HTTP, DB, normalization utils | Used by all |
| crawl_vic_register.py | VIC government register (796 records) | Workflow 1 |
| crawl_funerals_australia.py | Funerals Australia API (997 records) | Workflow 1 |
| crawl_nfda.py | NFDA directory API (209 records) | Workflow 1 |
| crawl_all.py | Orchestrates all source crawlers | Workflow 1 |
| dedup.py | Cross-source dedup & merge engine | Workflow 1 |
| discover_websites.py | Find provider websites (Serper/DDG/guess) | Workflow 2 |
| lookup_abn.py | ABN validation via ABR API (free) | Workflow 2 |
| enrich_websites.py | Crawl provider sites, find pricing pages | Workflow 3 |
| compute_tiers.py | Compute listing_tier from data quality | After enrichment |
| config.example.json | API key template | |

API Keys Required

| Service | Key | Cost | Register |
|---------|-----|------|----------|
| Serper.dev | serper_api_key | 2,500 free, then $1/1K | https://serper.dev |
| ABR (ABN Lookup) | abr_guid | Free | https://abr.business.gov.au/Tools/WebServices |
| Anthropic (Haiku) | anthropic_api_key | ~$2/full run | https://console.anthropic.com |

Quick Start

# 1. Configure API keys
cp config.example.json config.json
# Edit config.json with your keys

# 2. Reset database
cd ../database
sqlite3 providers.db < schema_sqlite.sql

# 3. Run full discovery pipeline
cd ../crawlers
python3 crawl_all.py          # Step 1: Discover from registries
python3 dedup.py              # Deduplicate across sources
python3 lookup_abn.py         # Step 2d: Get ABNs (free)
python3 discover_websites.py  # Steps 2a-2c: Find websites
python3 enrich_websites.py    # Step 3: Crawl for pricing
python3 compute_tiers.py      # Assign listing tiers

# Test mode (limited records)
python3 crawl_all.py --test
python3 discover_websites.py --limit=10 --state=VIC
python3 enrich_websites.py --limit=5