Initial commit: funeral provider discovery pipeline

- Python crawlers for VIC Register, Funerals Australia, NFDA
- n8n workflows for scheduled discovery and enrichment
- SQLite schema and seeded dev database (1,463 providers)
- End-to-end process documentation in n8n/PROCESS.md
2026-04-24 10:27:08 +10:00
commit cc91427789
30 changed files with 4706 additions and 0 deletions
--- a/crawlers/PIPELINE.md
+++ b/crawlers/PIPELINE.md
@@ -0,0 +1,215 @@
+# Provider Discovery & Enrichment Pipeline
+
+## Architecture: Multi-Step Enrichment
+
+The pipeline builds provider profiles progressively, never relying on
+competitor data. Each step adds richer detail from more authoritative sources.
+
+```
+STEP 1: DISCOVER               STEP 2: FIND WEBSITE           STEP 3: ENRICH
+─────────────────               ────────────────────           ──────────────
+
+VIC Register ─────┐                                           ┌─ Fetch homepage
+NFDA Directory ───┼─▶ Basic     Google Places API ──┐         │  Find /pricing page
+Funerals AU ──────┘   Provider  ABN Lookup ─────────┼─▶ URL ──┤  Download PDFs
+                      Record    Search engines ─────┘         │  AI extract packages
+                                                              └─▶ Structured data
+                      name      website URL                      description
+                      address   Google rating                    packages[]
+                      phone     Google reviews                   inclusions[]
+                      email     place_id                         pricing
+                      state     ABN (validated)
+```
+
+## Step 1: Discovery (DONE — all modules built and tested)
+
+Sources:
+- VIC Consumer Affairs Register (796 records, VIC only) → `crawl_vic_register.py`
+- Funerals Australia AJAX API (997 records, national) → `crawl_funerals_australia.py`
+- NFDA WPSL API (209 records, national) → `crawl_nfda.py`
+
+Orchestrator: `crawl_all.py`
+Deduplication: `dedup.py` (fuzzy name + postcode + ABN matching)
+
+Output: ~1,463 unique providers with basic contact info.
+Stored in: funeral_brand + location tables in `database/providers.db`.
+
+## Step 2: Website Discovery (DONE — module built and tested)
+
+Module: `discover_websites.py`
+Test result: 50% success rate on initial batch (DDG search + URL guessing)
+Can be improved with Google Places API for higher hit rate.
+
+For each provider that lacks a website URL:
+
+### 2a. Serper.dev — Google search API (PRIMARY)
+- Input: "{business name} {suburb} {state}"
+- Returns: Google organic search results as JSON (title, link, snippet)
+- Cost: **2,500 free queries** (no CC needed), then $1/1K
+- The free quota covers all 1,463 providers at no cost
+- Filters out directories/aggregators, validates first result
+- Module: `discover_websites.py` with `search_serper()`
+
+### 2b. DuckDuckGo lite (FALLBACK)
+- Free, no API key, but aggressive rate limiting
+- Used when Serper key not configured or quota exhausted
+- Module: `discover_websites.py` with `search_ddg()`
+
+### 2c. URL pattern guessing (SUPPLEMENTARY)
+- Generates candidate domains from business name (e.g. smithfunerals.com.au)
+- HTTP HEAD to check if live, then validate content
+- Module: `discover_websites.py` with `guess_urls()`
+
+### 2d. ABN Lookup — Australian Business Register (ENRICHMENT)
+- Input: business name + state
+- Returns: ABN, entity status, registered state/postcode
+- Cost: **FREE** (government API, requires GUID registration)
+- Validates business is active, gives strongest dedup key
+- Does NOT return website URLs
+- Module: `lookup_abn.py`
+- Register for GUID: https://abr.business.gov.au/Tools/WebServices
+
+### 2e. Google Places API (OPTIONAL PREMIUM)
+- Input: "{business name}, {suburb} {state}"
+- Returns: website, rating, review count, place_id, formatted phone
+- Cost: 1,000 free/month (Enterprise tier), then ~$25/1K
+- Best data quality but most expensive
+- Not yet implemented — add when budget allows
+
+### 2f. URL validation
+- Fetch discovered URL, verify it loads
+- Check page title/content mentions the business name
+- Reject generic directories (yellowpages, truelocal, etc.)
+- Mark confidence level: confirmed / probable / unverified
+
+## Step 3: Website Enrichment (DONE — module built and tested)
+
+Module: `enrich_websites.py`
+- Finds pricing pages via 20+ URL patterns + link following
+- Extracts description from meta tags
+- Extracts contact info (phone, email, address)
+- Stores cleaned pricing page text for AI extraction
+- Detects PDF links for PDF-based pricing extraction
+
+For each provider with a confirmed website:
+
+### 3a. Homepage crawl
+- Fetch homepage HTML
+- Extract: description/about text, contact details
+- Look for links to pricing/services pages
+
+### 3b. Pricing page discovery
+Try common URL patterns:
+  /pricing, /prices, /packages, /services, /our-services,
+  /funeral-costs, /funeral-packages, /service-options,
+  /price-list, /transparency
+
+Also:
+- Parse sitemap.xml if available
+- Follow links containing "pric", "packag", "cost", "service"
+- Check for PDF links on pricing pages
+
+### 3c. AI extraction (Claude Haiku)
+- Send pricing page HTML to Haiku
+- Extract: package names, funeral types, prices, inclusions
+- Map to known inclusion types where possible
+- Return confidence score
+
+### 3d. PDF extraction (for InvoCare-type sites)
+- Download compliance PDFs
+- Extract text (pdftotext or similar)
+- Send to Haiku for structured extraction
+- ~25% of sites are PDF-only for pricing
+
+## Listing Tiers
+
+Providers are assigned a `listing_tier` based on data quality. Computed
+automatically by `compute_tiers.py` after each enrichment run.
+
+| Tier | Label | Criteria | Display |
+|------|-------|----------|---------|
+| `verified` | Full partner | `verified = true` (signed up) | Full branding, packages, arrangements |
+| `priced` | Full pricing | 2+ packages with itemized inclusion prices | Package comparison, line-item detail |
+| `estimated` | Some pricing | At least 1 package with a total price | Package prices shown, "Contact for details" on breakdowns |
+| `listed` | Contact only | Name + location + phone, no pricing | "Contact for pricing" CTA, upgrade prompt |
+
+Each tier below `verified` motivates the provider to sign up:
+- `listed` → "Publish your pricing to attract more families"
+- `estimated` → "Add detailed breakdowns to stand out"
+- `priced` → "Sign up to enable online arrangements"
+
+## Enrichment Status Flow
+
+```
+pending ──▶ website_found ──▶ partial ──▶ complete
+   │              │               │
+   └──▶ no_website_found    failed (retry later)
+```
+
+## N8N Workflow Design
+
+### Workflow 1: Weekly Discovery
+Cron → Run all source crawlers → Dedup into DB → Queue new providers
+
+### Workflow 2: Daily Website Discovery
+Cron → Fetch providers with no website → Serper search
+     → DDG / URL-guess fallback → ABN lookup → Update DB
+     (Google Places lookup to be added once implemented, see 2e)
+
+### Workflow 3: Daily Enrichment
+Cron → Fetch providers with website but no packages
+     → Crawl website → AI extract → Update DB
+
+### Workflow 4: Monthly Re-check
+Cron → Re-crawl enriched providers → Update pricing if changed
+
+---
+
+## Module Inventory
+
+| Module | Purpose | N8N Workflow |
+|--------|---------|-------------|
+| `base.py` | Shared HTTP, DB, normalization utils | Used by all |
+| `crawl_vic_register.py` | VIC government register (796 records) | Workflow 1 |
+| `crawl_funerals_australia.py` | Funerals Australia API (997 records) | Workflow 1 |
+| `crawl_nfda.py` | NFDA directory API (209 records) | Workflow 1 |
+| `crawl_all.py` | Orchestrates all source crawlers | Workflow 1 |
+| `dedup.py` | Cross-source dedup & merge engine | Workflow 1 |
+| `discover_websites.py` | Find provider websites (Serper/DDG/guess) | Workflow 2 |
+| `lookup_abn.py` | ABN validation via ABR API (free) | Workflow 2 |
+| `enrich_websites.py` | Crawl provider sites, find pricing pages | Workflow 3 |
+| `compute_tiers.py` | Compute listing_tier from data quality | After enrichment |
+| `config.example.json` | API key template | — |
+
+## API Keys Required
+
+| Service | Key | Cost | Register |
+|---------|-----|------|----------|
+| Serper.dev | `serper_api_key` | 2,500 free, then $1/1K | https://serper.dev |
+| ABR (ABN Lookup) | `abr_guid` | Free | https://abr.business.gov.au/Tools/WebServices |
+| Anthropic (Haiku) | `anthropic_api_key` | ~$2/full run | https://console.anthropic.com |
+
+## Quick Start
+
+```bash
+# 1. Configure API keys
+cp config.example.json config.json
+# Edit config.json with your keys
+
+# 2. Reset database
+cd ../database
+sqlite3 providers.db < schema_sqlite.sql
+
+# 3. Run full discovery pipeline
+cd ../crawlers
+python3 crawl_all.py          # Step 1: Discover from registries
+python3 dedup.py              # Deduplicate across sources
+python3 lookup_abn.py         # Step 2d: Get ABNs (free)
+python3 discover_websites.py  # Steps 2a-2c: Find websites
+python3 enrich_websites.py    # Step 3: Crawl for pricing
+python3 compute_tiers.py      # Assign listing tiers
+
+# Test mode (limited records)
+python3 crawl_all.py --test
+python3 discover_websites.py --limit=10 --state=VIC
+python3 enrich_websites.py --limit=5
+```
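Step 2c in PIPELINE.md describes generating candidate domains from a business name before probing them with HTTP HEAD. A minimal sketch of the generation half follows. The function name `candidate_domains` and the TLD list are assumptions for illustration; the real `guess_urls()` in `discover_websites.py` is not shown in this diff and may work differently.

```python
import re

# Hypothetical TLD list; the real guess_urls() may try other suffixes.
TLDS = (".com.au", ".net.au", ".com")


def candidate_domains(business_name: str) -> list[str]:
    """Generate plausible domains from a business name (illustrative only)."""
    # Lowercase, drop apostrophes, then split on anything non-alphanumeric.
    cleaned = re.sub(r"['’`]", "", business_name.lower())
    words = [w for w in re.split(r"[^a-z0-9]+", cleaned) if w]
    if not words:
        return []
    # Try the words run together and hyphenated,
    # e.g. smithfunerals and smith-funerals.
    stems = ["".join(words)]
    if len(words) > 1:
        stems.append("-".join(words))
    return [stem + tld for stem in stems for tld in TLDS]


print(candidate_domains("Smith Funerals"))
```

Each candidate would still need the liveness and content checks from step 2f before being stored, since a guessed domain can easily belong to an unrelated business.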
--- a/crawlers/base.py
+++ b/crawlers/base.py
@@ -0,0 +1,164 @@
+"""Base crawler module with shared utilities."""
+
+import gzip
+import io
+import json
+import time
+import sqlite3
+import urllib.request
+import urllib.parse
+import urllib.error
+from datetime import datetime, timezone
+from pathlib import Path
+
+DB_PATH = Path(__file__).parent.parent / "database" / "providers.db"
+CRAWL_DELAY = 1.0  # seconds between requests (courtesy)
+
+USER_AGENT = (
+    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
+    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
+)
+
+
+def fetch_url(url: str, method: str = "GET", data: dict | None = None,
+              headers: dict | None = None, timeout: int = 30) -> str:
+    """Fetch a URL and return the response body as text."""
+    hdrs = {"User-Agent": USER_AGENT}
+    if headers:
+        hdrs.update(headers)
+
+    body = None
+    if data and method == "POST":
+        body = urllib.parse.urlencode(data, doseq=True).encode("utf-8")
+        hdrs.setdefault("Content-Type", "application/x-www-form-urlencoded")
+    elif data and method == "GET":
+        url = url + "?" + urllib.parse.urlencode(data, doseq=True)
+
+    req = urllib.request.Request(url, data=body, headers=hdrs, method=method)
+    with urllib.request.urlopen(req, timeout=timeout) as resp:
+        raw = resp.read()
+        # Handle gzip-compressed responses
+        if resp.headers.get("Content-Encoding") == "gzip" or raw[:2] == b"\x1f\x8b":
+            raw = gzip.decompress(raw)
+        charset = resp.headers.get_content_charset() or "utf-8"
+        return raw.decode(charset)
+
+
+def fetch_json(url: str, method: str = "GET", data: dict | None = None,
+               headers: dict | None = None) -> dict:
+    """Fetch a URL and parse the response as JSON."""
+    text = fetch_url(url, method=method, data=data, headers=headers)
+    return json.loads(text)
+
+
+def get_db() -> sqlite3.Connection:
+    """Get a connection to the SQLite database."""
+    conn = sqlite3.connect(str(DB_PATH))
+    conn.execute("PRAGMA journal_mode=WAL")
+    conn.execute("PRAGMA foreign_keys=ON")
+    conn.row_factory = sqlite3.Row
+    return conn
+
+
+def start_crawl_log(db: sqlite3.Connection, source_name: str) -> int:
+    """Create a source_log entry and return its ID."""
+    cur = db.execute(
+        "INSERT INTO source_log (source_name) VALUES (?)",
+        (source_name,)
+    )
+    db.commit()
+    return cur.lastrowid
+
+
+def finish_crawl_log(db: sqlite3.Connection, log_id: int,
+                     found: int, new: int, updated: int, skipped: int,
+                     status: str = "completed", error: str | None = None):
+    """Update a source_log entry with results."""
+    db.execute(
+        """UPDATE source_log
+           SET run_finished_at = datetime('now'),
+               records_found = ?, records_new = ?,
+               records_updated = ?, records_skipped = ?,
+               status = ?, error_message = ?
+           WHERE id = ?""",
+        (found, new, updated, skipped, status, error, log_id)
+    )
+    db.commit()
+
+
+def store_source_record(db: sqlite3.Connection, source_name: str,
+                        source_id: str, source_url: str | None,
+                        raw_data: dict, log_id: int) -> int | None:
+    """Store a raw source record. Returns the row ID, or None if duplicate."""
+    try:
+        cur = db.execute(
+            """INSERT INTO source_record
+               (source_name, source_id, source_url, raw_data, log_id)
+               VALUES (?, ?, ?, ?, ?)""",
+            (source_name, source_id, source_url, json.dumps(raw_data), log_id)
+        )
+        db.commit()
+        return cur.lastrowid
+    except sqlite3.IntegrityError:
+        # Duplicate source_name + source_id — already have this record
+        return None
+
+
+def normalize_phone(phone: str | None) -> str | None:
+    """Basic phone normalization."""
+    if not phone:
+        return None
+    # Remove common noise
+    phone = phone.strip().replace("\xa0", " ")
+    # If multiple numbers, take the first
+    for sep in [";", "/", "|", ","]:
+        if sep in phone:
+            phone = phone.split(sep)[0].strip()
+    return phone or None
+
+
+def normalize_state(state: str | None) -> str | None:
+    """Normalize Australian state names to abbreviations."""
+    if not state:
+        return None
+    state = state.strip().upper()
+    mapping = {
+        "NEW SOUTH WALES": "NSW",
+        "VICTORIA": "VIC",
+        "QUEENSLAND": "QLD",
+        "SOUTH AUSTRALIA": "SA",
+        "WESTERN AUSTRALIA": "WA",
+        "TASMANIA": "TAS",
+        "NORTHERN TERRITORY": "NT",
+        "AUSTRALIAN CAPITAL TERRITORY": "ACT",
+        "AUSTRALIA CAPITAL TERRITORY": "ACT",
+    }
+    result = mapping.get(state, state)
+    # Only return valid Australian states
+    valid = {"NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT"}
+    return result if result in valid else None
+
+
+def generate_slug(name: str) -> str:
+    """Generate a URL-safe slug from a business name."""
+    import re
+    slug = name.lower().strip()
+    slug = re.sub(r"['’`]", "", slug)          # remove straight/curly apostrophes and backticks
+    slug = re.sub(r"[^a-z0-9]+", "-", slug)    # non-alphanum -> hyphen
+    slug = slug.strip("-")
+    return slug
+
+
+def to_intermediate(source: str, source_id: str, source_url: str | None,
+                    business: dict, locations: list[dict],
+                    packages: list[dict] | None = None) -> dict:
+    """Build the normalized intermediate format record."""
+    return {
+        "source": source,
+        "sourceId": source_id,
+        "sourceUrl": source_url,
+        "scrapedAt": datetime.now(timezone.utc).isoformat(),
+        "business": business,
+        "locations": locations,
+        "packages": packages or [],
+    }
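`store_source_record()` in base.py treats `sqlite3.IntegrityError` as "already have this record", which only works if the table carries a uniqueness constraint on `(source_name, source_id)`. The sketch below shows that behaviour against an in-memory database. The schema here is an assumption for illustration only; the real definition lives in `schema_sqlite.sql`, which is not part of this excerpt.

```python
import json
import sqlite3

# Assumed minimal schema (hypothetical); the real table has more columns.
SCHEMA = """
CREATE TABLE source_record (
    id INTEGER PRIMARY KEY,
    source_name TEXT NOT NULL,
    source_id TEXT NOT NULL,
    raw_data TEXT NOT NULL,
    UNIQUE (source_name, source_id)
);
"""


def store(db: sqlite3.Connection, source_name: str, source_id: str,
          raw_data: dict):
    """Insert a record; return its row ID, or None if it already exists."""
    try:
        cur = db.execute(
            "INSERT INTO source_record (source_name, source_id, raw_data) "
            "VALUES (?, ?, ?)",
            (source_name, source_id, json.dumps(raw_data)),
        )
        db.commit()
        return cur.lastrowid
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected a repeat of (source_name, source_id).
        return None


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
first = store(db, "nfda", "42", {"name": "Example Funerals"})
second = store(db, "nfda", "42", {"name": "Example Funerals"})
print(first, second)
```

One consequence of this design is that a re-crawl never refreshes `raw_data` for a known record; the crawlers count it as skipped instead.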
--- a/crawlers/compute_tiers.py
+++ b/crawlers/compute_tiers.py
@@ -0,0 +1,102 @@
+"""Compute listing_tier for all providers based on their data quality.
+
+Tier logic:
+  verified  — brand.verified = true (signed up to platform)
+  priced    — has 2+ packages, each with 2+ inclusions priced > 0
+  estimated — has at least one package with a priced inclusion, or a
+              package with no inclusions recorded yet
+  listed    — everything else (contact info only)
+
+Run this after enrichment to update tiers across the board.
+"""
+
+from base import get_db
+
+
+def compute_tier(db, brand_id: int, verified: bool) -> str:
+    """Compute the listing tier for a single brand."""
+    if verified:
+        return "verified"
+
+    # Check packages
+    packages = db.execute(
+        "SELECT id, title, funeral_type FROM package WHERE brand_id = ?",
+        (brand_id,)
+    ).fetchall()
+
+    if not packages:
+        return "listed"
+
+    # Count packages that have priced inclusions.
+    # NOTE: the optional/complimentary flags are fetched below but are not
+    # yet used to exclude inclusions from the count.
+    packages_with_price = 0
+    packages_with_itemized = 0
+
+    for pkg in packages:
+        inclusions = db.execute(
+            """SELECT price, optional, complimentary
+               FROM package_inclusion
+               WHERE package_id = ?""",
+            (pkg["id"],)
+        ).fetchall()
+
+        if inclusions:
+            # Has itemized inclusions with prices
+            priced_inclusions = [
+                i for i in inclusions
+                if i["price"] and float(i["price"]) > 0
+            ]
+            if len(priced_inclusions) >= 2:
+                packages_with_itemized += 1
+                packages_with_price += 1
+            elif len(priced_inclusions) >= 1:
+                packages_with_price += 1
+        else:
+            # Package exists but no inclusions — check if we stored a total
+            # price in the package description or via source data
+            # For now, a package with a funeral_type means we at least know
+            # what kind of service it is, even without breakdown
+            packages_with_price += 1
+
+    # Tier 2 (priced): 2+ packages with itemized breakdowns
+    if packages_with_itemized >= 2:
+        return "priced"
+
+    # Tier 3 (estimated): at least one package with some price
+    if packages_with_price >= 1:
+        return "estimated"
+
+    return "listed"
+
+
+def run():
+    """Recompute listing_tier for all brands."""
+    db = get_db()
+
+    brands = db.execute(
+        "SELECT id, verified FROM funeral_brand"
+    ).fetchall()
+
+    counts = {"verified": 0, "priced": 0, "estimated": 0, "listed": 0}
+
+    for brand in brands:
+        tier = compute_tier(db, brand["id"], brand["verified"])
+        db.execute(
+            "UPDATE funeral_brand SET listing_tier = ? WHERE id = ?",
+            (tier, brand["id"])
+        )
+        counts[tier] += 1
+
+    db.commit()
+
+    print("Listing Tier Distribution:")
+    print(f"  verified:  {counts['verified']:>6d}  (signed-up partners)")
+    print(f"  priced:    {counts['priced']:>6d}  (full package breakdowns)")
+    print(f"  estimated: {counts['estimated']:>6d}  (some pricing info)")
+    print(f"  listed:    {counts['listed']:>6d}  (contact info only)")
+    print(f"  TOTAL:     {sum(counts.values()):>6d}")
+
+    db.close()
+
+
+if __name__ == "__main__":
+    run()
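The tier rules in `compute_tier()` can be easier to check when separated from the database queries. The sketch below restates the same decision as a pure function over lists of inclusion prices. The name `tier_for` and the input shape are hypothetical, chosen only to make the rules testable in isolation.

```python
from typing import Optional


def tier_for(verified: bool, packages: list[list[Optional[float]]]) -> str:
    """Return the listing tier; `packages` holds one price list per package."""
    if verified:
        return "verified"
    if not packages:
        return "listed"

    with_price = 0
    itemized = 0
    for prices in packages:
        if not prices:
            # Mirrors the stopgap in compute_tier(): a package with no
            # inclusions recorded still counts toward "estimated".
            with_price += 1
            continue
        priced = [p for p in prices if p and p > 0]
        if len(priced) >= 2:
            itemized += 1
            with_price += 1
        elif len(priced) == 1:
            with_price += 1

    if itemized >= 2:
        return "priced"
    if with_price >= 1:
        return "estimated"
    return "listed"


print(tier_for(False, [[1200.0, 300.0], [900.0, 150.0]]))
print(tier_for(False, [[1200.0]]))
print(tier_for(False, [[None, 0]]))
```

Note the edge case in the last call: a package whose inclusions exist but are all unpriced ranks below a package with no inclusions at all, which may or may not be the intended behaviour.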
--- a/crawlers/config.example.json
+++ b/crawlers/config.example.json
@@ -0,0 +1,5 @@
+{
+    "serper_api_key": null,
+    "abr_guid": null,
+    "anthropic_api_key": null
+}
--- a/crawlers/crawl_all.py
+++ b/crawlers/crawl_all.py
@@ -0,0 +1,70 @@
+"""Run all source crawlers and then deduplicate into the provider database."""
+
+import sys
+import time
+from pathlib import Path
+
+from base import get_db
+
+
+def run_all(gathered_here_limit: int | None = None):
+    """Run all crawlers sequentially."""
+    print("=" * 60)
+    print("PROVIDER DISCOVERY PIPELINE")
+    print("=" * 60)
+
+    # Import crawlers
+    import crawl_nfda
+    import crawl_funerals_australia
+    import crawl_vic_register
+    import crawl_gathered_here
+
+    # Run in order: fast API sources first, then slower HTML scraping
+    print("\n--- 1/4: NFDA Directory ---")
+    crawl_nfda.run()
+
+    print("\n--- 2/4: Funerals Australia ---")
+    crawl_funerals_australia.run()
+
+    print("\n--- 3/4: VIC Consumer Affairs Register ---")
+    crawl_vic_register.run()
+
+    print("\n--- 4/4: Gathered Here ---")
+    crawl_gathered_here.run(limit=gathered_here_limit)
+
+    # Summary
+    db = get_db()
+    print("\n" + "=" * 60)
+    print("CRAWL SUMMARY")
+    print("=" * 60)
+
+    rows = db.execute(
+        """SELECT source_name,
+                  COUNT(*) as total,
+                  SUM(CASE WHEN matched_brand_id IS NOT NULL THEN 1 ELSE 0 END) as matched
+           FROM source_record
+           GROUP BY source_name"""
+    ).fetchall()
+
+    for row in rows:
+        print(f"  {row['source_name']:25s} {row['total']:5d} records "
+              f"({row['matched']} matched)")
+
+    total = db.execute("SELECT COUNT(*) as n FROM source_record").fetchone()["n"]
+    print(f"  {'TOTAL':25s} {total:5d} records")
+
+    db.close()
+
+
+if __name__ == "__main__":
+    limit = None
+    if "--test" in sys.argv:
+        limit = 10
+        print("TEST MODE: Gathered Here limited to 10 profiles")
+    elif len(sys.argv) > 1:
+        try:
+            limit = int(sys.argv[1])
+        except ValueError:
+            pass
+
+    run_all(gathered_here_limit=limit)
--- a/crawlers/crawl_funerals_australia.py
+++ b/crawlers/crawl_funerals_australia.py
@@ -0,0 +1,179 @@
+"""Crawler for the Funerals Australia (formerly AFDA) member directory.
+
+Source: https://funeralsaustralia.org.au/find-a-member/
+Method: WordPress AJAX API (POST with get_clients_list action)
+Fields: name, address (structured), phone, email, website, lat/lng, displayImage
+"""
+
+import time
+import json
+from pathlib import Path
+
+from base import (
+    fetch_url, get_db, start_crawl_log, finish_crawl_log,
+    store_source_record, normalize_phone, normalize_state,
+    generate_slug, to_intermediate, CRAWL_DELAY,
+)
+
+SOURCE_NAME = "funerals_australia"
+API_URL = "https://funeralsaustralia.org.au/wp-admin/admin-ajax.php"
+
+PAGE_SIZE = 200  # API supports up to 200 per page
+
+
+def fetch_page(offset: int = 0) -> dict:
+    """Fetch a page of all members from the Funerals Australia API.
+
+    The API returns all members when no postcode/suburb filter is given,
+    which is more reliable than geo-filtered searches.
+    """
+    form_data = {
+        "action": "get_clients_list",
+        "params[size]": str(PAGE_SIZE),
+        "params[from]": str(offset),
+        "params[forceResults]": "true",
+        "params[paginated]": "true",
+    }
+
+    text = fetch_url(API_URL, method="POST", data=form_data,
+                     headers={"X-Requested-With": "XMLHttpRequest"})
+    return json.loads(text)
+
+
+def fetch_all_members() -> list[dict]:
+    """Fetch all members via pagination."""
+    all_results = []
+    offset = 0
+
+    while True:
+        data = fetch_page(offset)
+        results = data.get("results", [])
+        total = data.get("total", 0)
+
+        if not results:
+            break
+
+        all_results.extend(results)
+        print(f"    Fetched {len(all_results)}/{total}...")
+        offset += PAGE_SIZE
+
+        if offset >= total:
+            break
+
+        time.sleep(CRAWL_DELAY)
+
+    return all_results
+
+
+def parse_address(record: dict) -> dict:
+    """Extract structured address from a Funerals Australia record."""
+    addr_list = record.get("address", [])
+    if addr_list and isinstance(addr_list, list) and len(addr_list) > 0:
+        addr = addr_list[0]
+        # Use `or ""` so a JSON null from the API does not break .strip()
+        return {
+            "line1": (addr.get("line1") or "").strip(),
+            "city": (addr.get("city") or "").strip(),
+            "state": normalize_state(addr.get("state")),
+            "postcode": (addr.get("postcode") or "").strip(),
+        }
+    return {"line1": "", "city": "", "state": None, "postcode": ""}
+
+
+def to_normalized(record: dict) -> dict:
+    """Convert a Funerals Australia record to intermediate format."""
+    addr = parse_address(record)
+    city = addr["city"]
+    if city and city == city.upper():
+        city = city.title()
+
+    lat_val = record.get("latitude")
+    lng_val = record.get("longitude")
+    try:
+        lat_val = float(lat_val) if lat_val else None
+        lng_val = float(lng_val) if lng_val else None
+    except (ValueError, TypeError):
+        lat_val = lng_val = None
+
+    website = (record.get("website") or "").strip() or None
+    if website and not website.startswith("http"):
+        website = "https://" + website
+
+    business = {
+        "name": (record.get("name") or "").strip(),
+        "abn": None,
+        "phone": normalize_phone(record.get("phone")),
+        "email": (record.get("email") or "").strip() or None,
+        "website": website,
+        "description": None,
+    }
+
+    locations = [{
+        "address": addr["line1"],
+        "suburb": city,
+        "state": addr["state"],
+        "postcode": addr["postcode"],
+        "lat": lat_val,
+        "lng": lng_val,
+        "phone": normalize_phone(record.get("phone")),
+    }]
+
+    source_id = record.get("id", "")
+    return to_intermediate(
+        source=SOURCE_NAME,
+        source_id=source_id,
+        source_url="https://funeralsaustralia.org.au/find-a-member/",
+        business=business,
+        locations=locations,
+    )
+
+
+def run():
+    """Run the full Funerals Australia crawl."""
+    db = get_db()
+    log_id = start_crawl_log(db, SOURCE_NAME)
+    print(f"[{SOURCE_NAME}] Starting crawl (log_id={log_id})")
+
+    all_records = []
+    found = 0
+    new = 0
+    skipped = 0
+
+    try:
+        print("  Fetching all members (paginated)...")
+        all_records = fetch_all_members()
+        found = len(all_records)
+        print(f"  Total members fetched: {found}")
+
+        # Store records
+        for record in all_records:
+            source_id = record.get("id", "")
+            row_id = store_source_record(
+                db, SOURCE_NAME, source_id,
+                "https://funeralsaustralia.org.au/find-a-member/",
+                record, log_id
+            )
+            if row_id:
+                normalized = to_normalized(record)
+                db.execute(
+                    "UPDATE source_record SET normalized_data = ? WHERE id = ?",
+                    (json.dumps(normalized), row_id)
+                )
+                new += 1
+            else:
+                skipped += 1
+
+        db.commit()
+        finish_crawl_log(db, log_id, found, new, 0, skipped)
+        print(f"[{SOURCE_NAME}] Done: {found} found, {new} new, {skipped} skipped")
+
+    except Exception as e:
+        finish_crawl_log(db, log_id, found, new, 0, skipped, "failed", str(e))
+        raise
+    finally:
+        db.close()
+
+    return all_records
+
+
+if __name__ == "__main__":
+    run()
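`fetch_all_members()` pages through the API by offset until the reported total is reached or an empty page comes back. The sketch below isolates that loop and runs it against a fake page fetcher, so no network access is needed. `fetch_all`, `fake_fetch_page`, and the 450-record dataset are all made up for illustration; only the `results` and `total` keys come from the crawler above.

```python
from typing import Callable


def fetch_all(fetch_page: Callable[[int], dict], page_size: int) -> list[dict]:
    """Collect every record from an offset-paginated endpoint."""
    results: list[dict] = []
    offset = 0
    while True:
        page = fetch_page(offset)
        batch = page.get("results", [])
        if not batch:
            break  # empty page: nothing more to read
        results.extend(batch)
        offset += page_size
        if offset >= page.get("total", 0):
            break  # the next offset would be past the reported total
    return results


# Fake endpoint backed by an in-memory list (hypothetical data).
RECORDS = [{"id": str(i)} for i in range(450)]
calls: list[int] = []


def fake_fetch_page(offset: int) -> dict:
    calls.append(offset)
    return {"results": RECORDS[offset:offset + 200], "total": len(RECORDS)}


members = fetch_all(fake_fetch_page, page_size=200)
print(len(members), calls)
```

Because the loop trusts `total`, a response that omits it ends the crawl after the first page; the empty-page check covers the opposite case, where `total` overstates what the API will return.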
--- a/crawlers/crawl_gathered_here.py
+++ b/crawlers/crawl_gathered_here.py
@@ -0,0 +1,362 @@
+"""Crawler for Gathered Here funeral director directory.
+
+Source: https://www.gatheredhere.com.au
+Method: XML sitemap → fetch individual profile pages → parse HTML
+Fields: name, address, coords, phone, email, website, description, pricing, reviews
+"""
+
+import re
+import time
+import json
+import xml.etree.ElementTree as ET
+from html.parser import HTMLParser
+from pathlib import Path
+
+from base import (
+    fetch_url, get_db, start_crawl_log, finish_crawl_log,
+    store_source_record, normalize_phone, normalize_state,
+    generate_slug, to_intermediate, CRAWL_DELAY,
+)
+
+SOURCE_NAME = "gathered_here"
+SITEMAP_URL = "https://www.gatheredhere.com.au/sitemap/sitemap-funerals-listings-0.xml"
+BASE_URL = "https://www.gatheredhere.com.au"
+
+
+def fetch_all_listing_urls() -> list[str]:
+    """Fetch and parse the sitemap to get all funeral director profile URLs."""
+    xml_text = fetch_url(SITEMAP_URL)
+    root = ET.fromstring(xml_text)
+    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
+
+    urls = []
+    for url_elem in root.findall("sm:url", ns):
+        loc = url_elem.find("sm:loc", ns)
+        if loc is not None and loc.text:
+            url = loc.text.strip()
+            # Only include individual profile pages (singular /funeral-director/)
+            if "/funeral-director/" in url and "/funeral-directors/" not in url:
+                urls.append(url)
+
+    return urls
+
+
+def extract_next_data(html_text: str) -> dict | None:
+    """Extract __NEXT_DATA__ JSON from a Next.js page."""
+    pattern = r'<script\s+id="__NEXT_DATA__"\s+type="application/json">(.*?)</script>'
+    match = re.search(pattern, html_text, re.DOTALL)
+    if match:
+        try:
+            return json.loads(match.group(1))
+        except json.JSONDecodeError:
+            return None
+    return None
+
+
+def extract_from_next_data(next_data: dict) -> dict | None:
+    """Extract listing data from __NEXT_DATA__ props."""
+    try:
+        props = next_data.get("props", {}).get("pageProps", {})
+
+        # Structure: singleListing.listing contains the actual data
+        single = props.get("singleListing", {})
+        if single:
+            listing = single.get("listing")
+            if listing and isinstance(listing, dict):
+                return listing
+
+        # Fallback paths
+        listing = props.get("listing") or props.get("post") or props.get("data")
+        return listing
+    except (KeyError, TypeError):
+        return None
+
+
+def extract_from_html(html_text: str, url: str) -> dict:
+    """Extract listing data from page HTML using regex patterns as fallback."""
+    data = {"url": url}
+
+    # Title
+    title_match = re.search(r'<h1[^>]*>(.*?)</h1>', html_text, re.DOTALL)
+    if title_match:
+        data["title"] = re.sub(r'<[^>]+>', '', title_match.group(1)).strip()
+
+    # Phone
+    phone_match = re.search(r'href="tel:([^"]+)"', html_text)
+    if phone_match:
+        data["phone"] = phone_match.group(1).strip()
+
+    # Email
+    email_match = re.search(r'href="mailto:([^"]+)"', html_text)
+    if email_match:
+        data["email"] = email_match.group(1).strip()
+
+    # Website
+    website_match = re.search(
+        r'<a[^>]*class="[^"]*website[^"]*"[^>]*href="([^"]+)"', html_text
+    )
+    if website_match:
+        data["website"] = website_match.group(1).strip()
+
+    # Address from structured data
+    addr_match = re.search(
+        r'"streetAddress"\s*:\s*"([^"]*)"', html_text
+    )
+    if addr_match:
+        data["address"] = addr_match.group(1)
+
+    locality_match = re.search(r'"addressLocality"\s*:\s*"([^"]*)"', html_text)
+    if locality_match:
+        data["suburb"] = locality_match.group(1)
+
+    region_match = re.search(r'"addressRegion"\s*:\s*"([^"]*)"', html_text)
+    if region_match:
+        data["state"] = region_match.group(1)
+
+    postcode_match = re.search(r'"postalCode"\s*:\s*"([^"]*)"', html_text)
+    if postcode_match:
+        data["postcode"] = postcode_match.group(1)
+
+    # Coordinates
+    lat_match = re.search(r'"latitude"\s*:\s*"?(-?[\d.]+)"?', html_text)
+    lng_match = re.search(r'"longitude"\s*:\s*"?(-?[\d.]+)"?', html_text)
+    if lat_match:
+        data["lat"] = float(lat_match.group(1))
+    if lng_match:
+        data["lng"] = float(lng_match.group(1))
+
+    return data
+
+
+def extract_pricing(listing_data: dict) -> dict:
+    """Extract pricing from listing meta fields."""
+    meta = listing_data.get("meta", {})
+    if not meta:
+        return {}
+
+    pricing = {}
+    price_fields = {
+        # With viewing prices
+        "cremation_no_service_viewY": "cremation_no_service_with_viewing",
+        "cremation_single_viewY": "cremation_single_service_with_viewing",
+        "cremation_dual_viewY": "cremation_dual_service_with_viewing",
+        "cremation_graveside_viewY": "cremation_graveside_with_viewing",
+        "burial_single_viewY": "burial_single_service_with_viewing",
+        "burial_dual_viewY": "burial_dual_service_with_viewing",
+        "burial_graveside_viewY": "burial_graveside_with_viewing",
+        "burial_no_service_viewY": "burial_no_service_with_viewing",
+        # Without viewing prices
+        "cremation_no_service_viewN": "cremation_no_service",
+        "cremation_single_viewN": "cremation_single_service",
+        "cremation_dual_viewN": "cremation_dual_service",
+        "cremation_graveside_viewN": "cremation_graveside",
+        "burial_single_viewN": "burial_single_service",
+        "burial_dual_viewN": "burial_dual_service",
+        "burial_graveside_viewN": "burial_graveside",
+        "burial_no_service_viewN": "burial_no_service",
+    }
+
+    for meta_key, label in price_fields.items():
+        val = meta.get(meta_key, "")
+        if val:
+            # Parse price string like "$2,299" to float
+            cleaned = re.sub(r'[^\d.]', '', str(val))
+            if cleaned:
+                try:
+                    pricing[label] = float(cleaned)
+                except ValueError:
+                    pass
+
+    return pricing
+
+
+def pricing_to_packages(pricing: dict) -> list[dict]:
+    """Convert flat pricing dict to package format."""
+    packages = []
+
+    # Map pricing keys to funeral types
+    type_mappings = [
+        ("cremation_no_service", "Cremation Only"),
+        ("cremation_single_service", "Service & Cremation"),
+        ("cremation_single_service_with_viewing", "Service & Cremation"),
+        ("burial_single_service", "Service & Burial"),
+        ("burial_graveside", "Graveside Burial"),
+    ]
+
+    for price_key, funeral_type in type_mappings:
+        if price_key in pricing:
+            name = price_key.replace("_", " ").title()
+            packages.append({
+                "name": name,
+                "funeralType": funeral_type,
+                "price": pricing[price_key],
+                "inclusions": [],  # Not available from Gathered Here listing pages
+            })
+
+    return packages
+
+
+def to_normalized(listing_data: dict, url: str) -> dict:
+    """Convert Gathered Here listing data to intermediate format."""
+    meta = listing_data.get("meta", {}) if isinstance(listing_data.get("meta"), dict) else {}
+
+    name = (listing_data.get("title") or listing_data.get("name") or "").strip()
+    slug = listing_data.get("slug", "")
+
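+    # Extract location (meta fields from __NEXT_DATA__; top-level keys from the HTML fallback)
+    suburb = meta.get("geolocation_city", "")
+    state = normalize_state(meta.get("geolocation_state_short") or listing_data.get("state", ""))
+    postcode = meta.get("geolocation_postcode") or listing_data.get("postcode", "")
+    lat = meta.get("geolocation_lat") or listing_data.get("lat")
+    lng = meta.get("geolocation_long") or listing_data.get("lng")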
+
+    try:
+        lat = float(lat) if lat else None
+        lng = float(lng) if lng else None
+    except (ValueError, TypeError):
+        lat = lng = None
+
+    email = meta.get("email", "") or meta.get("_application", "")
+    phone = meta.get("phone", "") or listing_data.get("phone", "")
+
+    # Try to get description from content or excerpt
+    description = listing_data.get("excerpt") or listing_data.get("content") or ""
+    if description:
+        description = re.sub(r'<[^>]+>', '', description).strip()
+        if len(description) > 500:
+            description = description[:497] + "..."
+
+    # Website
+    website = listing_data.get("website") or meta.get("website") or None
+
+    # Pricing
+    pricing = extract_pricing(listing_data)
+    packages = pricing_to_packages(pricing)
+
+    business = {
+        "name": name,
+        "abn": None,
+        "phone": normalize_phone(phone),
+        "email": (email or "").strip() or None,
+        "website": website,
+        "description": description or None,
+    }
+
+    locations = [{
+        "address": meta.get("geolocation_formatted_address", ""),
+        "suburb": suburb,
+        "state": state,
+        "postcode": postcode,
+        "lat": lat,
+        "lng": lng,
+        "phone": normalize_phone(phone),
+    }]
+
+    source_id = slug or generate_slug(name)
+    return to_intermediate(
+        source=SOURCE_NAME,
+        source_id=source_id,
+        source_url=url,
+        business=business,
+        locations=locations,
+        packages=packages,
+    )
+
+
+def crawl_profile(url: str) -> dict | None:
+    """Crawl a single Gathered Here profile page."""
+    try:
+        html_text = fetch_url(url)
+    except Exception as e:
+        print(f"    Error fetching {url}: {e}")
+        return None
+
+    # Try __NEXT_DATA__ first (structured)
+    next_data = extract_next_data(html_text)
+    if next_data:
+        listing = extract_from_next_data(next_data)
+        if listing:
+            listing["_source"] = "next_data"
+            return listing
+
+    # Fallback to HTML parsing
+    data = extract_from_html(html_text, url)
+    data["_source"] = "html_fallback"
+    return data
+
+
+def run(limit: int | None = None):
+    """Run the full Gathered Here crawl.
+
+    Args:
+        limit: If set, only crawl this many profiles (for testing).
+    """
+    db = get_db()
+    log_id = start_crawl_log(db, SOURCE_NAME)
+    print(f"[{SOURCE_NAME}] Starting crawl (log_id={log_id})")
+
+    found = 0
+    new = 0
+    skipped = 0
+    errors = 0
+
+    try:
+        # Step 1: Get all profile URLs from sitemap
+        print("  Fetching sitemap...", end=" ", flush=True)
+        urls = fetch_all_listing_urls()
+        print(f"{len(urls)} profile URLs found")
+
+        if limit:
+            urls = urls[:limit]
+            print(f"  (limited to {limit} for testing)")
+
+        # Step 2: Crawl each profile
+        for i, url in enumerate(urls):
+            slug = url.rstrip("/").split("/")[-1]
+
+            if (i + 1) % 50 == 0 or i == 0:
+                print(f"  Crawling {i+1}/{len(urls)}: {slug}")
+
+            listing_data = crawl_profile(url)
+            found += 1
+
+            if not listing_data:
+                errors += 1
+                continue
+
+            source_id = slug
+            row_id = store_source_record(
+                db, SOURCE_NAME, source_id, url, listing_data, log_id
+            )
+
+            if row_id:
+                normalized = to_normalized(listing_data, url)
+                db.execute(
+                    "UPDATE source_record SET normalized_data = ? WHERE id = ?",
+                    (json.dumps(normalized), row_id)
+                )
+                new += 1
+            else:
+                skipped += 1
+
+            if (i + 1) % 10 == 0:
+                db.commit()  # periodic commit
+
+            time.sleep(CRAWL_DELAY)
+
+        db.commit()
+        finish_crawl_log(db, log_id, found, new, 0, skipped)
+        print(f"[{SOURCE_NAME}] Done: {found} found, {new} new, "
+              f"{skipped} skipped, {errors} errors")
+
+    except Exception as e:
+        finish_crawl_log(db, log_id, found, new, 0, skipped, "failed", str(e))
+        raise
+    finally:
+        db.close()
+
+
+if __name__ == "__main__":
+    import sys
+    limit = int(sys.argv[1]) if len(sys.argv) > 1 else None
+    run(limit=limit)
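
Note on `extract_pricing` in the Gathered Here crawler above: the cleanup keeps only digits and dots before calling `float`, so it is worth seeing what that does to a few inputs. The following is a minimal sketch, assuming a standalone helper named `parse_price` (hypothetical, not part of this commit) and made-up sample strings.

```python
import re
from typing import Optional


def parse_price(val) -> Optional[float]:
    """Parse a display price such as "$2,299" into a float, or return None.

    Hypothetical helper that mirrors the cleanup used in extract_pricing.
    """
    # Keep digits and dots only: "$2,299" becomes "2299".
    cleaned = re.sub(r"[^\d.]", "", str(val))
    if not cleaned:
        return None
    try:
        return float(cleaned)
    except ValueError:
        # Strings like "1.2.3" survive the regex but are not valid floats.
        return None


# Made-up sample inputs, not taken from any real listing.
samples = ["$2,299", "2299.50", "POA", "", "$1.2.3", "From $2,299 to $3,000"]
parsed = {s: parse_price(s) for s in samples}
```

One consequence of this approach: a field holding two prices, such as "From $2,299 to $3,000", collapses to a single large number, so a sanity range check downstream may be worth considering.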
--- a/crawlers/crawl_nfda.py
+++ b/crawlers/crawl_nfda.py
@@ -0,0 +1,163 @@
+"""Crawler for the NFDA (National Funeral Directors Association) directory.
+
+Source: https://nfda.com.au/find-your-local-nfda-member/
+Method: WPSL JSON API (GET requests with lat/lng search)
+Fields: name, address, city, state, postcode, lat/lng, phone, email
+"""
+
+import time
+import json
+from pathlib import Path
+
+from base import (
+    fetch_json, get_db, start_crawl_log, finish_crawl_log,
+    store_source_record, normalize_phone, normalize_state,
+    generate_slug, to_intermediate, CRAWL_DELAY,
+)
+
+SOURCE_NAME = "nfda"
+API_URL = "https://nfda.com.au/wp-admin/admin-ajax.php"
+
+# Search centroids covering Australia with large radius
+SEARCH_POINTS = [
+    {"name": "Sydney", "lat": -33.87, "lng": 151.21},
+    {"name": "Melbourne", "lat": -37.81, "lng": 144.96},
+    {"name": "Brisbane", "lat": -27.47, "lng": 153.03},
+    {"name": "Perth", "lat": -31.95, "lng": 115.86},
+    {"name": "Adelaide", "lat": -34.93, "lng": 138.60},
+    {"name": "Hobart", "lat": -42.88, "lng": 147.33},
+    {"name": "Darwin", "lat": -12.46, "lng": 130.85},
+    {"name": "Townsville", "lat": -19.26, "lng": 146.82},
+    {"name": "Central NSW", "lat": -30.0, "lng": 150.0},
+    {"name": "Inland QLD", "lat": -23.0, "lng": 145.0},
+]
+
+
+def fetch_members(lat: float, lng: float, max_results: int = 50,
+                  radius: int = 5000) -> list[dict]:
+    """Fetch NFDA members near a given lat/lng."""
+    params = {
+        "action": "store_search",
+        "lat": str(lat),
+        "lng": str(lng),
+        "max_results": str(max_results),
+        "search_radius": str(radius),
+        "autoload": "1",
+    }
+    data = fetch_json(API_URL, method="GET", data=params)
+    if isinstance(data, list):
+        return data
+    return []
+
+
+def to_normalized(record: dict) -> dict:
+    """Convert an NFDA record to intermediate format."""
+    state = normalize_state(record.get("state", ""))
+
+    business = {
+        "name": record.get("store", "").strip(),
+        "abn": None,
+        "phone": normalize_phone(record.get("phone")),
+        "email": record.get("email", "").strip() or None,
+        "website": record.get("url", "").strip() or None,
+        "description": None,
+    }
+
+    lat_val = record.get("lat")
+    lng_val = record.get("lng")
+    try:
+        lat_val = float(lat_val) if lat_val else None
+        lng_val = float(lng_val) if lng_val else None
+    except (ValueError, TypeError):
+        lat_val = lng_val = None
+
+    city = record.get("city", "").strip()
+    # Normalize city casing (some are ALL CAPS)
+    if city and city == city.upper():
+        city = city.title()
+
+    locations = [{
+        "address": record.get("address", "").strip(),
+        "suburb": city,
+        "state": state,
+        "postcode": record.get("zip", "").strip(),
+        "lat": lat_val,
+        "lng": lng_val,
+        "phone": normalize_phone(record.get("phone")),
+    }]
+
+    source_id = str(record.get("id", ""))
+    return to_intermediate(
+        source=SOURCE_NAME,
+        source_id=source_id,
+        source_url="https://nfda.com.au/find-your-local-nfda-member/",
+        business=business,
+        locations=locations,
+    )
+
+
+def run():
+    """Run the full NFDA crawl."""
+    db = get_db()
+    log_id = start_crawl_log(db, SOURCE_NAME)
+    print(f"[{SOURCE_NAME}] Starting crawl (log_id={log_id})")
+
+    seen_ids = set()
+    all_records = []
+    found = 0
+    new = 0
+    skipped = 0
+
+    try:
+        for point in SEARCH_POINTS:
+            print(f"  Searching near {point['name']}...", end=" ", flush=True)
+            members = fetch_members(point["lat"], point["lng"])
+            new_count = 0
+
+            for member in members:
+                member_id = str(member.get("id", ""))
+                if member_id in seen_ids:
+                    continue
+                seen_ids.add(member_id)
+                all_records.append(member)
+                new_count += 1
+
+            print(f"{len(members)} results, {new_count} new unique")
+            found += new_count  # count unique members only; search areas overlap
+            time.sleep(CRAWL_DELAY)
+
+        print(f"  Total unique members: {len(all_records)}")
+
+        # Store records
+        for record in all_records:
+            source_id = str(record.get("id", ""))
+            row_id = store_source_record(
+                db, SOURCE_NAME, source_id,
+                "https://nfda.com.au/find-your-local-nfda-member/",
+                record, log_id
+            )
+            if row_id:
+                normalized = to_normalized(record)
+                db.execute(
+                    "UPDATE source_record SET normalized_data = ? WHERE id = ?",
+                    (json.dumps(normalized), row_id)
+                )
+                new += 1
+            else:
+                skipped += 1
+
+        db.commit()
+        finish_crawl_log(db, log_id, found, new, 0, skipped)
+        print(f"[{SOURCE_NAME}] Done: {found} found, {new} new, {skipped} skipped")
+
+    except Exception as e:
+        finish_crawl_log(db, log_id, found, new, 0, skipped, "failed", str(e))
+        raise
+    finally:
+        db.close()
+
+    return all_records
+
+
+if __name__ == "__main__":
+    run()
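
Note on the NFDA crawl above: the search points overlap, so the same member can come back from several searches, and `run()` relies on `seen_ids` to keep one copy. The following is a minimal sketch of that merge step in isolation, assuming a hypothetical helper `merge_unique` and made-up records. Unlike the crawler, this sketch drops records that have no `id`, which is an assumption rather than the commit's behaviour.

```python
def merge_unique(batches, key="id"):
    """Merge result batches, keeping the first record seen for each key.

    Hypothetical helper; records without a key value are dropped here.
    """
    seen = set()
    unique = []
    for batch in batches:
        for rec in batch:
            rec_id = str(rec.get(key) or "")
            if not rec_id or rec_id in seen:
                continue
            seen.add(rec_id)
            unique.append(rec)
    return unique


# Made-up results from two overlapping search points.
near_sydney = [{"id": 1, "store": "A"}, {"id": 2, "store": "B"}]
near_central_nsw = [{"id": 2, "store": "B"}, {"id": 3, "store": "C"}, {"store": "no id"}]

unique = merge_unique([near_sydney, near_central_nsw])
raw_total = len(near_sydney) + len(near_central_nsw)
```

Comparing `raw_total` with `len(unique)` shows how much the raw result count overstates the number of distinct members when search areas overlap.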
--- a/crawlers/crawl_vic_register.py
+++ b/crawlers/crawl_vic_register.py
@@ -0,0 +1,220 @@
+"""Crawler for the VIC Consumer Affairs Public Register of Funeral Providers.
+
+Source: https://registers.consumer.vic.gov.au/fpsearch
+Method: HTTP GET per letter A-Z, parse HTML tables
+Fields: name, place of business, postcode, postal address, phone
+"""
+
+import re
+import time
+import json
+import html.parser
+from pathlib import Path
+
+from base import (
+    fetch_url, get_db, start_crawl_log, finish_crawl_log,
+    store_source_record, normalize_phone, generate_slug,
+    to_intermediate, CRAWL_DELAY,
+)
+
+SOURCE_NAME = "vic_register"
+BASE_URL = "https://registers.consumer.vic.gov.au/FpSearch/PerformSearch"
+
+
+class VICTableParser(html.parser.HTMLParser):
+    """Parse the VIC register HTML table into records."""
+
+    def __init__(self):
+        super().__init__()
+        self.records = []
+        self._in_table = False
+        self._in_tbody = False
+        self._in_row = False
+        self._in_cell = False
+        self._current_row = []
+        self._current_cell = ""
+
+    def handle_starttag(self, tag, attrs):
+        if tag == "table":
+            self._in_table = True
+        elif tag == "tbody" and self._in_table:
+            self._in_tbody = True
+        elif tag == "tr" and self._in_tbody:
+            self._in_row = True
+            self._current_row = []
+        elif tag == "td" and self._in_row:
+            self._in_cell = True
+            self._current_cell = ""
+
+    def handle_endtag(self, tag):
+        if tag == "td" and self._in_cell:
+            self._in_cell = False
+            self._current_row.append(self._current_cell.strip())
+        elif tag == "tr" and self._in_row:
+            self._in_row = False
+            if len(self._current_row) >= 4:
+                self.records.append(self._current_row)
+        elif tag == "tbody":
+            self._in_tbody = False
+        elif tag == "table":
+            self._in_table = False
+
+    def handle_data(self, data):
+        if self._in_cell:
+            self._current_cell += data
+
+
+def parse_address(place_of_business: str) -> dict:
+    """Parse a VIC register address into components."""
+    parts = place_of_business.strip()
+    # Try to extract postcode from the end
+    postcode_match = re.search(r'\b(\d{4})\s*$', parts)
+    postcode = postcode_match.group(1) if postcode_match else None
+
+    # Try to extract suburb (usually the last word(s) before postcode)
+    suburb = None
+    if postcode:
+        before_postcode = parts[:postcode_match.start()].strip().rstrip(",").strip()
+        # Last segment after comma is usually suburb
+        if "," in before_postcode:
+            suburb = before_postcode.split(",")[-1].strip()
+        else:
+            # Take last 1-2 words as suburb
+            words = before_postcode.split()
+            if len(words) >= 2:
+                suburb = " ".join(words[-2:]) if words[-1][0].isupper() else words[-1]
+
+    return {
+        "address": parts,
+        "suburb": suburb,
+        "state": "VIC",
+        "postcode": postcode,
+    }
+
+
+def crawl_letter(letter: str) -> list[dict]:
+    """Crawl all records for a single letter."""
+    url = f"{BASE_URL}?Letter={letter}"
+    html_text = fetch_url(url)
+
+    parser = VICTableParser()
+    parser.feed(html_text)
+
+    records = []
+    for row in parser.records:
+        # Columns: Name, Place of Business, Postcode, Postal Address, Phone
+        name = row[0] if len(row) > 0 else ""
+        place = row[1] if len(row) > 1 else ""
+        postcode = row[2] if len(row) > 2 else ""
+        postal = row[3] if len(row) > 3 else ""
+        phone = row[4] if len(row) > 4 else ""
+
+        if not name:
+            continue
+
+        records.append({
+            "name": name.strip(),
+            "place_of_business": place.strip(),
+            "postcode": postcode.strip(),
+            "postal_address": postal.strip(),
+            "phone": phone.strip(),
+        })
+
+    return records
+
+
+def make_source_id(record: dict) -> str:
+    """Create a stable source ID from name + postcode."""
+    name = record["name"].lower().strip()
+    # Note: two branches of one business in the same postcode share an ID.
+    return f"{generate_slug(name)}_{record['postcode']}"
+
+
+def to_normalized(record: dict) -> dict:
+    """Convert a VIC register record to intermediate format."""
+    addr = parse_address(record["place_of_business"])
+
+    business = {
+        "name": record["name"],
+        "abn": None,
+        "phone": normalize_phone(record["phone"]),
+        "email": None,
+        "website": None,
+        "description": None,
+    }
+
+    locations = [{
+        "address": record["place_of_business"],
+        "suburb": addr["suburb"],
+        "state": "VIC",
+        "postcode": record["postcode"] or addr["postcode"],
+        "lat": None,
+        "lng": None,
+        "phone": normalize_phone(record["phone"]),
+    }]
+
+    source_id = make_source_id(record)
+    return to_intermediate(
+        source=SOURCE_NAME,
+        source_id=source_id,
+        source_url=f"{BASE_URL}?Letter={record['name'][0].upper()}",
+        business=business,
+        locations=locations,
+    )
+
+
+def run():
+    """Run the full VIC register crawl."""
+    db = get_db()
+    log_id = start_crawl_log(db, SOURCE_NAME)
+    print(f"[{SOURCE_NAME}] Starting crawl (log_id={log_id})")
+
+    all_records = []
+    found = 0
+    new = 0
+    skipped = 0
+
+    try:
+        for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
+            print(f"  Crawling letter {letter}...", end=" ", flush=True)
+            records = crawl_letter(letter)
+            print(f"{len(records)} records")
+            all_records.extend(records)
+            found += len(records)
+
+            if letter != "Z":
+                time.sleep(CRAWL_DELAY)
+
+        # Store and normalize
+        for record in all_records:
+            source_id = make_source_id(record)
+            row_id = store_source_record(
+                db, SOURCE_NAME, source_id,
+                f"{BASE_URL}?Letter={record['name'][0].upper()}",
+                record, log_id
+            )
+            if row_id:
+                normalized = to_normalized(record)
+                db.execute(
+                    "UPDATE source_record SET normalized_data = ? WHERE id = ?",
+                    (json.dumps(normalized), row_id)
+                )
+                new += 1
+            else:
+                skipped += 1
+
+        db.commit()
+        finish_crawl_log(db, log_id, found, new, 0, skipped)
+        print(f"[{SOURCE_NAME}] Done: {found} found, {new} new, {skipped} skipped")
+
+    except Exception as e:
+        finish_crawl_log(db, log_id, found, new, 0, skipped, "failed", str(e))
+        raise
+    finally:
+        db.close()
+
+    return all_records
+
+
+if __name__ == "__main__":
+    run()
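
Note on `parse_address` in the VIC register crawler above: the postcode is taken from a trailing four-digit group, and the suburb from the last comma-separated segment before it. The following is a minimal sketch of those two steps, assuming hypothetical helper names (`split_postcode`, `suburb_after_comma`) and made-up addresses. It leaves out the "last one or two words" fallback that the crawler uses when there is no comma.

```python
import re
from typing import Optional, Tuple

# Same pattern as parse_address: four digits at the very end of the string.
POSTCODE_RE = re.compile(r"\b(\d{4})\s*$")


def split_postcode(address: str) -> Tuple[str, Optional[str]]:
    """Return (address without trailing postcode, postcode or None)."""
    text = address.strip()
    match = POSTCODE_RE.search(text)
    if not match:
        return text, None
    head = text[:match.start()].strip().rstrip(",").strip()
    return head, match.group(1)


def suburb_after_comma(head: str) -> Optional[str]:
    """Take the last comma-separated segment as the suburb, if any."""
    if "," not in head:
        return None
    return head.split(",")[-1].strip()


# Made-up addresses for illustration only.
head, postcode = split_postcode("12 High St, Kew 3101")
suburb = suburb_after_comma(head)
```

If an address carried a state abbreviation before the postcode (for example "1 Main Rd, Ballarat VIC 3350"), this approach would return "Ballarat VIC" as the suburb, so stripping a trailing state token first may be worth considering.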
--- a/crawlers/dedup.py
+++ b/crawlers/dedup.py
@@ -0,0 +1,425 @@
+"""Deduplication and merge engine.
+
+Processes source_records → funeral_brand + location + package entries.
+Handles cross-source matching and field-level merging.
+
+Matching hierarchy (strongest to weakest):
+1. source_key match — same record from same source (skip/update)
+2. ABN match — same business entity
+3. Name + Postcode exact match — likely same business
+4. Fuzzy name match (>= 85%) + same state — probable match, flag for review
+
+Merge priority (higher = preferred):
+  vic_register > funerals_australia > nfda > gathered_here
+
+Never overwrite verified provider data.
+"""
+
+import json
+import re
+import sqlite3
+from difflib import SequenceMatcher
+
+from base import get_db, generate_slug, normalize_state
+
+# Source priority for merge conflicts (higher number = more authoritative)
+SOURCE_PRIORITY = {
+    "vic_register": 40,
+    "funerals_australia": 30,
+    "nfda": 20,
+    "gathered_here": 10,
+}
+
+
+def normalize_name(name: str) -> str:
+    """Normalize a business name for comparison."""
+    name = name.strip().upper()
+    # Remove common suffixes
+    for suffix in [" PTY LTD", " PTY. LTD.", " P/L", " LIMITED",
+                   " PROPRIETARY LIMITED", " INC", " LLC",
+                   " FUNERAL DIRECTORS", " FUNERAL SERVICES",
+                   " FUNERALS", " FUNERAL HOME"]:
+        name = name.removesuffix(suffix)
+    # Remove punctuation
+    name = re.sub(r"[‘’'`\".,&()-]", " ", name)
+    name = re.sub(r"\s+", " ", name).strip()
+    return name
+
+
+def fuzzy_match(name1: str, name2: str) -> float:
+    """Return similarity ratio between two names (0.0 to 1.0)."""
+    n1 = normalize_name(name1)
+    n2 = normalize_name(name2)
+    return SequenceMatcher(None, n1, n2).ratio()
+
+
+def find_existing_brand(db: sqlite3.Connection, record: dict) -> tuple[int | None, str]:
+    """Find a matching funeral_brand for a source record.
+
+    Returns (brand_id, match_type) or (None, 'new').
+    """
+    biz = record.get("business", {})
+    locs = record.get("locations", [])
+    name = biz.get("name", "")
+    abn = biz.get("abn")
+    source = record.get("source", "")
+    source_id = record.get("sourceId", "")
+    source_key = f"{source}:{source_id}"
+
+    postcode = None
+    state = None
+    if locs:
+        postcode = locs[0].get("postcode")
+        state = locs[0].get("state")
+
+    # 1. Source key match (exact same record from same source)
+    row = db.execute(
+        "SELECT id FROM funeral_brand WHERE source_key = ?",
+        (source_key,)
+    ).fetchone()
+    if row:
+        return row["id"], "source_key"
+
+    # 2. ABN match
+    if abn:
+        row = db.execute(
+            "SELECT id FROM funeral_brand WHERE abn = ?",
+            (abn,)
+        ).fetchone()
+        if row:
+            return row["id"], "abn"
+
+    # 3. Exact name + postcode match
+    if name and postcode:
+        norm = normalize_name(name)
+        # Compare normalized names among brands that share this postcode
+        rows = db.execute(
+            "SELECT id, title FROM funeral_brand WHERE business_postcode = ?",
+            (postcode,)
+        ).fetchall()
+        for row in rows:
+            if normalize_name(row["title"]) == norm:
+                return row["id"], "name_postcode"
+
+    # 4. Fuzzy name + same state
+    if name and state:
+        rows = db.execute(
+            "SELECT id, title FROM funeral_brand WHERE business_state = ?",
+            (state,)
+        ).fetchall()
+        for row in rows:
+            score = fuzzy_match(name, row["title"])
+            if score >= 0.85:
+                return row["id"], "fuzzy"
+
+    return None, "new"
+
+
+def merge_field(existing: str | None, new_val: str | None,
+                existing_priority: int, new_priority: int) -> str | None:
+    """Merge a single field, preferring non-null and higher-priority."""
+    if not new_val:
+        return existing
+    if not existing:
+        return new_val
+    # Both have values — prefer higher priority source
+    if new_priority > existing_priority:
+        return new_val
+    return existing
+
+
+def create_brand(db: sqlite3.Connection, record: dict) -> int:
+    """Create a new funeral_brand from a source record."""
+    biz = record.get("business", {})
+    locs = record.get("locations", [])
+    source = record.get("source", "")
+    source_id = record.get("sourceId", "")
+    source_key = f"{source}:{source_id}"
+
+    loc = locs[0] if locs else {}
+    slug = generate_slug(biz.get("name", "unknown"))
+
+    # Ensure unique slug
+    base_slug = slug
+    counter = 1
+    while True:
+        existing = db.execute(
+            "SELECT id FROM funeral_brand WHERE code = ?", (slug,)
+        ).fetchone()
+        if not existing:
+            break
+        slug = f"{base_slug}-{counter}"
+        counter += 1
+
+    cur = db.execute(
+        """INSERT INTO funeral_brand (
+            title, description, email, phone, website, abn, code,
+            hidden, verified, source_key, source_url, enrichment_status,
+            business_address, business_suburb, business_state, business_postcode
+        ) VALUES (?, ?, ?, ?, ?, ?, ?, 1, 0, ?, ?, 'pending', ?, ?, ?, ?)""",
+        (
+            biz.get("name"),
+            biz.get("description"),
+            biz.get("email"),
+            biz.get("phone"),
+            biz.get("website"),
+            biz.get("abn"),
+            slug,
+            source_key,
+            record.get("sourceUrl"),
+            loc.get("address"),
+            loc.get("suburb"),
+            loc.get("state"),
+            loc.get("postcode"),
+        )
+    )
+    brand_id = cur.lastrowid
+
+    # Create locations
+    for loc_data in locs:
+        title_parts = [loc_data.get("suburb", ""), loc_data.get("state", "")]
+        loc_title = ", ".join(p for p in title_parts if p) or biz.get("name", "")
+
+        db.execute(
+            """INSERT INTO location (
+                title, address, suburb, state, postcode, lat, lng, brand_id
+            ) VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
+            (
+                loc_title,
+                loc_data.get("address"),
+                loc_data.get("suburb"),
+                loc_data.get("state"),
+                loc_data.get("postcode"),
+                loc_data.get("lat"),
+                loc_data.get("lng"),
+                brand_id,
+            )
+        )
+
+    # Create packages (from Gathered Here pricing)
+    packages = record.get("packages", [])
+    for pkg in packages:
+        if not pkg.get("price"):
+            continue
+        cur = db.execute(
+            """INSERT INTO package (
+                title, funeral_type, brand_id, source_url, extraction_confidence
+            ) VALUES (?, ?, ?, ?, ?)""",
+            (
+                pkg.get("name"),
+                pkg.get("funeralType"),
+                brand_id,
+                record.get("sourceUrl"),
+                0.8,  # Gathered Here pricing is structured, fairly reliable
+            )
+        )
+        pkg_id = cur.lastrowid
+
+        # Create inclusions if available
+        for inc in pkg.get("inclusions", []):
+            db.execute(
+                """INSERT INTO package_inclusion (
+                    price, optional, complimentary, inclusion_type_title, package_id
+                ) VALUES (?, ?, ?, ?, ?)""",
+                (
+                    inc.get("price", 0),
+                    1 if inc.get("optional") else 0,
+                    1 if inc.get("complimentary") else 0,
+                    inc.get("item", "Unknown"),
+                    pkg_id,
+                )
+            )
+
+    return brand_id
+
+
+def update_brand(db: sqlite3.Connection, brand_id: int,
+                 record: dict, match_type: str) -> bool:
+    """Merge new data into an existing brand. Returns True if a brand field changed."""
+    biz = record.get("business", {})
+    locs = record.get("locations", [])
+    source = record.get("source", "")
+    new_priority = SOURCE_PRIORITY.get(source, 0)
+
+    # Never overwrite verified providers
+    brand = db.execute(
+        "SELECT * FROM funeral_brand WHERE id = ?", (brand_id,)
+    ).fetchone()
+    if brand["verified"]:
+        return False
+
+    # Determine existing source priority
+    existing_source = ""
+    if brand["source_key"]:
+        existing_source = brand["source_key"].split(":")[0]
+    existing_priority = SOURCE_PRIORITY.get(existing_source, 0)
+
+    # Field-level merge — only fill blanks or upgrade from higher priority
+    updates = {}
+    field_map = {
+        "description": biz.get("description"),
+        "email": biz.get("email"),
+        "phone": biz.get("phone"),
+        "website": biz.get("website"),
+        "abn": biz.get("abn"),
+    }
+
+    for field, new_val in field_map.items():
+        merged = merge_field(brand[field], new_val, existing_priority, new_priority)
+        if merged != brand[field]:
+            updates[field] = merged
+
+    # Update location data if we have coords and existing doesn't
+    if locs:
+        loc = locs[0]
+        existing_locs = db.execute(
+            "SELECT * FROM location WHERE brand_id = ?", (brand_id,)
+        ).fetchall()
+
+        if not existing_locs and loc.get("suburb"):
+            title_parts = [loc.get("suburb", ""), loc.get("state", "")]
+            loc_title = ", ".join(p for p in title_parts if p)
+            db.execute(
+                """INSERT INTO location (
+                    title, address, suburb, state, postcode, lat, lng, brand_id
+                ) VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
+                (
+                    loc_title, loc.get("address"), loc.get("suburb"),
+                    loc.get("state"), loc.get("postcode"),
+                    loc.get("lat"), loc.get("lng"), brand_id,
+                )
+            )
+        elif existing_locs:
+            # Update first location with coords if missing
+            eloc = existing_locs[0]
+            if not eloc["lat"] and loc.get("lat"):
+                db.execute(
+                    "UPDATE location SET lat = ?, lng = ? WHERE id = ?",
+                    (loc.get("lat"), loc.get("lng"), eloc["id"])
+                )
+
+    # Add packages if we have them and brand doesn't yet
+    packages = record.get("packages", [])
+    if packages:
+        existing_pkgs = db.execute(
+            "SELECT COUNT(*) as n FROM package WHERE brand_id = ?", (brand_id,)
+        ).fetchone()["n"]
+
+        if existing_pkgs == 0:
+            for pkg in packages:
+                if not pkg.get("price"):
+                    continue
+                cur = db.execute(
+                    """INSERT INTO package (
+                        title, funeral_type, brand_id, source_url
+                    ) VALUES (?, ?, ?, ?)""",
+                    (pkg.get("name"), pkg.get("funeralType"),
+                     brand_id, record.get("sourceUrl"))
+                )
+
+    if updates:
+        set_clause = ", ".join(f"{k} = ?" for k in updates)
+        values = list(updates.values()) + [brand_id]
+        db.execute(
+            f"UPDATE funeral_brand SET {set_clause}, updated_at = datetime('now') WHERE id = ?",
+            values
+        )
+        return True
+
+    return False
+
+
+def process_all():
+    """Process all source_records through deduplication and create brand entries.
+
+    Order matters: process higher-priority sources first so their data
+    forms the base record that lower-priority sources merge into.
+    """
+    db = get_db()
+
+    # Process in priority order (highest first)
+    sources_ordered = sorted(SOURCE_PRIORITY.keys(),
+                             key=lambda s: SOURCE_PRIORITY[s], reverse=True)
+
+    stats = {"new": 0, "updated": 0, "skipped": 0, "matched": 0}
+
+    print("=" * 60)
+    print("DEDUPLICATION ENGINE")
+    print("=" * 60)
+
+    for source in sources_ordered:
+        records = db.execute(
+            """SELECT id, normalized_data FROM source_record
+               WHERE source_name = ? AND normalized_data IS NOT NULL""",
+            (source,)
+        ).fetchall()
+
+        if not records:
+            continue
+
+        print(f"\n  Processing {source}: {len(records)} records")
+        source_stats = {"new": 0, "updated": 0, "skipped": 0, "matched": 0}
+
+        for row in records:
+            record = json.loads(row["normalized_data"])
+            brand_id, match_type = find_existing_brand(db, record)
+
+            if match_type == "new":
+                brand_id = create_brand(db, record)
+                source_stats["new"] += 1
+            elif match_type == "source_key":
+                source_stats["skipped"] += 1
+            else:
+                # Matched to existing — merge
+                updated = update_brand(db, brand_id, record, match_type)
+                if updated:
+                    source_stats["updated"] += 1
+                else:
+                    source_stats["matched"] += 1
+
+            # Update source_record with match info
+            db.execute(
+                """UPDATE source_record
+                   SET matched_brand_id = ?, match_type = ?, processed_at = datetime('now')
+                   WHERE id = ?""",
+                (brand_id, match_type, row["id"])
+            )
+
+        db.commit()
+        print(f"    New: {source_stats['new']}, Updated: {source_stats['updated']}, "
+              f"Matched: {source_stats['matched']}, Skipped: {source_stats['skipped']}")
+
+        for k, v in source_stats.items():
+            stats[k] += v
+
+    # Final summary
+    total_brands = db.execute("SELECT COUNT(*) as n FROM funeral_brand").fetchone()["n"]
+    total_locations = db.execute("SELECT COUNT(*) as n FROM location").fetchone()["n"]
+    total_packages = db.execute("SELECT COUNT(*) as n FROM package").fetchone()["n"]
+
+    print(f"\n{'=' * 60}")
+    print("DEDUP RESULTS")
+    print(f"{'=' * 60}")
+    print(f"  New brands created:    {stats['new']}")
+    print(f"  Existing updated:      {stats['updated']}")
+    print(f"  Matched (no change):   {stats['matched']}")
+    print(f"  Skipped (source_key):  {stats['skipped']}")
+    print(f"\n  Total brands in DB:    {total_brands}")
+    print(f"  Total locations in DB: {total_locations}")
+    print(f"  Total packages in DB:  {total_packages}")
+
+    # Show match type breakdown
+    print("\n  Match type breakdown:")
+    rows = db.execute(
+        """SELECT match_type, COUNT(*) as n
+           FROM source_record WHERE processed_at IS NOT NULL
+           GROUP BY match_type ORDER BY n DESC"""
+    ).fetchall()
+    for row in rows:
+        print(f"    {row['match_type']:15s} {row['n']:5d}")
+
+    db.close()
+
+
+if __name__ == "__main__":
+    process_all()
--- a/crawlers/discover_websites.py
+++ b/crawlers/discover_websites.py
@@ -0,0 +1,320 @@
+"""Website discovery module.
+
+For each provider without a website URL, attempts to find their website
+using multiple strategies (tried in order):
+
+1. Serper.dev (2,500 free Google searches, no credit card needed)
+2. DuckDuckGo lite (free fallback, rate-limited)
+3. URL pattern guessing (businessname.com.au)
+
+Also validates discovered URLs to confirm they belong to the business.
+
+Configuration:
+  Set the SERPER_API_KEY env var, or `serper_api_key` in config.json, to
+  enable Serper.dev. Without it, search falls back to DuckDuckGo.
+"""
+
+import json
+import os
+import re
+import time
+import urllib.parse
+import urllib.request
+import urllib.error
+from pathlib import Path
+
+from base import (
+    fetch_url, get_db, normalize_phone, CRAWL_DELAY,
+)
+
+# Load Serper API key from env or config
+SERPER_API_KEY = os.environ.get("SERPER_API_KEY")
+if not SERPER_API_KEY:
+    config_path = Path(__file__).parent / "config.json"
+    if config_path.exists():
+        with open(config_path) as f:
+            config = json.load(f)
+            SERPER_API_KEY = config.get("serper_api_key")
+
+# Domains to skip when extracting search results
+SKIP_DOMAINS = [
+    "yellowpages", "whitepages", "truelocal", "yelp", "cylex",
+    "australia247", "showmelocal", "hotfrog", "localsearch",
+    "facebook.com", "linkedin.com", "instagram.com", "twitter.com",
+    "gatheredhere", "ezifunerals", "funeralocity", "funeraldirectory",
+    "deathsandfunerals", "mytributes", "obits.com",
+    "duckduckgo.com", "google.com", "bing.com",
+    "nfda.com.au", "funeralsaustralia.org",
+    "wikipedia.org", "youtube.com",
+]
+
+
+def search_serper(query: str) -> list[str]:
+    """Search via Serper.dev (Google results as JSON). 2,500 free queries."""
+    if not SERPER_API_KEY:
+        return []
+
+    url = "https://google.serper.dev/search"
+    data = json.dumps({"q": query, "gl": "au", "num": 10}).encode("utf-8")
+    req = urllib.request.Request(url, data=data, headers={
+        "X-API-KEY": SERPER_API_KEY,
+        "Content-Type": "application/json",
+    })
+
+    try:
+        with urllib.request.urlopen(req, timeout=15) as resp:
+            result = json.loads(resp.read().decode("utf-8"))
+    except Exception:
+        return []
+
+    results = []
+    for item in result.get("organic", []):
+        link = item.get("link", "")
+        if not link:
+            continue
+        if any(d in link.lower() for d in SKIP_DOMAINS):
+            continue
+        results.append(link)
+
+    return results
+
+
+def search_ddg(query: str) -> list[str]:
+    """Search DuckDuckGo lite and return result URLs (filtered)."""
+    encoded = urllib.parse.quote(query)
+    url = f"https://lite.duckduckgo.com/lite/?q={encoded}"
+
+    try:
+        html = fetch_url(url)
+    except Exception:
+        return []
+
+    # Extract redirect URLs from DDG lite format
+    raw_links = re.findall(
+        r'href="//duckduckgo\.com/l/\?uddg=([^&"]+)', html
+    )
+
+    results = []
+    for link in raw_links:
+        decoded = urllib.parse.unquote(link)
+        # Skip ads
+        if "ad_domain" in decoded or "ad_provider" in decoded:
+            continue
+        # Skip directory/aggregator sites
+        if any(d in decoded.lower() for d in SKIP_DOMAINS):
+            continue
+        results.append(decoded)
+
+    return results
+
+
+def validate_url(url: str, business_name: str) -> dict:
+    """Validate that a URL is a real website belonging to this business.
+
+    Returns: {valid: bool, confidence: str, reason: str}
+    """
+    try:
+        html = fetch_url(url, timeout=15)
+    except urllib.error.HTTPError as e:
+        return {"valid": False, "confidence": "none", "reason": f"HTTP {e.code}"}
+    except Exception as e:
+        return {"valid": False, "confidence": "none", "reason": str(e)[:100]}
+
+    html_lower = html.lower()
+
+    # Check if it's a parked/for-sale domain
+    parked_signals = ["domain is for sale", "buy this domain",
+                      "parked domain", "this domain", "godaddy",
+                      "domain parking"]
+    if any(s in html_lower for s in parked_signals):
+        return {"valid": False, "confidence": "none", "reason": "parked domain"}
+
+    # Check if the page mentions the business name
+    name_parts = business_name.lower().split()
+    # Require at least 2 name parts to match (or all if name is 1-2 words)
+    min_matches = min(2, len(name_parts))
+    matches = sum(1 for part in name_parts
+                  if len(part) > 2 and part in html_lower)
+
+    if matches >= min_matches:
+        return {"valid": True, "confidence": "confirmed", "reason": "name found in page"}
+
+    # Check title tag
+    title_match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
+    if title_match:
+        title = title_match.group(1).lower()
+        if any(part in title for part in name_parts if len(part) > 2):
+            return {"valid": True, "confidence": "probable",
+                    "reason": "partial name in title"}
+
+    # Check for funeral-related content (it's at least a funeral business)
+    funeral_signals = ["funeral", "cremation", "burial", "memorial",
+                       "chapel", "obituar", "condolence"]
+    if any(s in html_lower for s in funeral_signals):
+        return {"valid": True, "confidence": "probable",
+                "reason": "funeral content found, name not confirmed"}
+
+    return {"valid": False, "confidence": "low",
+            "reason": "business name not found on page"}
+
+
+def guess_urls(business_name: str) -> list[str]:
+    """Generate candidate URLs from a business name."""
+    # Clean name for domain guessing
+    slug = business_name.lower().strip()
+    slug = re.sub(r"['\u2018\u2019`]", "", slug)
+    slug = re.sub(r"\b(pty|ltd|limited|proprietary|inc)\b", "", slug)
+    slug = re.sub(r"[^a-z0-9]+", "", slug)
+
+    # Also try hyphenated version
+    slug_hyphen = business_name.lower().strip()
+    slug_hyphen = re.sub(r"['\u2018\u2019`]", "", slug_hyphen)
+    slug_hyphen = re.sub(r"\b(pty|ltd|limited|proprietary|inc)\b", "", slug_hyphen)
+    slug_hyphen = re.sub(r"[^a-z0-9]+", "-", slug_hyphen).strip("-")
+
+    candidates = []
+    for s in [slug, slug_hyphen]:
+        if s:
+            candidates.append(f"https://www.{s}.com.au")
+            candidates.append(f"https://{s}.com.au")
+
+    return candidates
+
+
+def discover_website(name: str, suburb: str | None, state: str | None,
+                     phone: str | None = None) -> dict | None:
+    """Attempt to discover a business website.
+
+    Returns: {url, confidence, method, validation} or None.
+    """
+    # Build search query
+    query_parts = [name]
+    if suburb:
+        query_parts.append(suburb)
+    if state:
+        query_parts.append(state)
+    query = " ".join(query_parts)
+
+    # Strategy 1: Serper.dev (Google results, 2500 free)
+    results = search_serper(query)
+
+    # Strategy 2: DuckDuckGo fallback
+    if not results:
+        results = search_ddg(query)
+
+    for url in results[:3]:
+        validation = validate_url(url, name)
+        if validation["valid"]:
+            return {
+                "url": url.rstrip("/"),
+                "confidence": validation["confidence"],
+                "method": "search",
+                "validation": validation,
+            }
+        time.sleep(0.5)
+
+    # Strategy 3: URL guessing
+    candidates = guess_urls(name)
+    for url in candidates:
+        try:
+            validation = validate_url(url, name)
+            if validation["valid"]:
+                return {
+                    "url": url.rstrip("/"),
+                    "confidence": validation["confidence"],
+                    "method": "guess",
+                    "validation": validation,
+                }
+        except Exception:
+            continue
+        time.sleep(0.3)
+
+    return None
+
+
+def run(limit: int | None = None, state_filter: str | None = None):
+    """Discover websites for all providers without one.
+
+    Args:
+        limit: Max providers to process (for testing).
+        state_filter: Only process providers in this state.
+    """
+    db = get_db()
+
+    query = """
+        SELECT id, title, business_suburb, business_state, phone
+        FROM funeral_brand
+        WHERE website IS NULL AND verified = 0
+    """
+    params = []
+
+    if state_filter:
+        query += " AND business_state = ?"
+        params.append(state_filter)
+
+    query += " ORDER BY id"
+
+    if limit:
+        query += f" LIMIT {limit}"
+
+    providers = db.execute(query, params).fetchall()
+    print(f"Providers without websites: {len(providers)}")
+
+    found = 0
+    not_found = 0
+
+    for i, prov in enumerate(providers):
+        name = prov["title"]
+        suburb = prov["business_suburb"]
+        state = prov["business_state"]
+        phone = prov["phone"]
+
+        if (i + 1) % 10 == 0 or i == 0:
+            print(f"  [{i+1}/{len(providers)}] Processing: {name}")
+
+        result = discover_website(name, suburb, state, phone)
+
+        if result:
+            db.execute(
+                """UPDATE funeral_brand
+                   SET website = ?, updated_at = datetime('now')
+                   WHERE id = ?""",
+                (result["url"], prov["id"])
+            )
+            found += 1
+            if (i + 1) <= 20 or result["confidence"] == "confirmed":
+                print(f"    FOUND ({result['confidence']}, {result['method']}): "
+                      f"{result['url']}")
+        else:
+            not_found += 1
+
+        if (i + 1) % 20 == 0:
+            db.commit()
+
+        # Rate limit: ~2s between providers (DDG + validation requests)
+        time.sleep(CRAWL_DELAY * 2)
+
+    db.commit()
+    print(f"\nDone: {found} websites found, {not_found} not found")
+    print(f"  Success rate: {found/(found+not_found)*100:.1f}%" if found + not_found > 0 else "")
+
+    db.close()
+
+
+if __name__ == "__main__":
+    import sys
+    limit = None
+    state = None
+
+    for arg in sys.argv[1:]:
+        if arg.startswith("--state="):
+            state = arg.split("=")[1]
+        elif arg.startswith("--limit="):
+            limit = int(arg.split("=")[1])
+        else:
+            try:
+                limit = int(arg)
+            except ValueError:
+                pass
+
+    run(limit=limit, state_filter=state)
--- a/crawlers/enrich_websites.py
+++ b/crawlers/enrich_websites.py
@@ -0,0 +1,393 @@
+"""Website enrichment module.
+
+For each provider with a website but no packages yet, crawls their site
+to find pricing/packages pages and extracts structured data.
+
+Two extraction modes:
+1. Direct HTML parsing (for sites with clear pricing structure)
+2. AI extraction via API call (for complex/varied layouts)
+
+This module handles the crawling and page discovery.
+AI extraction is delegated to the N8N workflow (Claude Haiku node).
+"""
+
+import json
+import re
+import time
+import urllib.parse
+import urllib.error
+from pathlib import Path
+
+from base import fetch_url, get_db, CRAWL_DELAY
+
+# Common URL patterns for pricing/packages pages
+PRICING_PATHS = [
+    "/pricing",
+    "/prices",
+    "/our-prices",
+    "/packages",
+    "/funeral-packages",
+    "/services",
+    "/our-services",
+    "/funeral-costs",
+    "/funeral-services",
+    "/service-options",
+    "/price-list",
+    "/transparency",
+    "/funeral-pricing",
+    "/costs",
+    "/cremation",
+    "/cremation-packages",
+    "/burial",
+    "/plan-a-funeral",
+    "/arrange",
+]
+
+# Keywords that suggest a link leads to pricing
+PRICING_KEYWORDS = [
+    "pric", "cost", "packag", "service", "plan",
+    "cremation", "burial", "funeral",
+    "transparency", "disclosure",
+]
+
+
+def find_pricing_page(base_url: str, homepage_html: str) -> str | None:
+    """Try to find the pricing/packages page URL.
+
+    Strategy:
+    1. Try common URL patterns
+    2. Parse homepage links for pricing-related keywords
+    """
+    base = base_url.rstrip("/")
+
+    # Strategy 1: Try common paths
+    for path in PRICING_PATHS:
+        test_url = base + path
+        try:
+            html = fetch_url(test_url, timeout=10)
+            # Verify it's not a 404 soft-redirect (check for pricing content)
+            if len(html) > 1000 and ("$" in html or "price" in html.lower()):
+                return test_url
+        except Exception:
+            continue
+        time.sleep(0.3)
+
+    # Strategy 2: Parse homepage links
+    link_pattern = re.compile(
+        r'<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
+        re.IGNORECASE | re.DOTALL
+    )
+
+    for match in link_pattern.finditer(homepage_html):
+        href = match.group(1)
+        text = re.sub(r"<[^>]+>", "", match.group(2)).lower().strip()
+        href_lower = href.lower()
+
+        # Check if link text or URL contains pricing keywords
+        if any(kw in text or kw in href_lower for kw in PRICING_KEYWORDS):
+            # Resolve relative URLs
+            if href.startswith("/"):
+                full_url = base + href
+            elif href.startswith("http"):
+                # Only follow links to the same domain
+                if urllib.parse.urlparse(base).netloc in href:
+                    full_url = href
+                else:
+                    continue
+            else:
+                full_url = base + "/" + href
+
+            try:
+                html = fetch_url(full_url, timeout=10)
+                if len(html) > 500:
+                    return full_url
+            except Exception:
+                continue
+            time.sleep(0.3)
+
+    return None
+
+
+def extract_description(html: str) -> str | None:
+    """Extract a business description from homepage HTML."""
+    # Try meta description first
+    meta_match = re.search(
+        r'<meta\s+(?:name="description"\s+content="([^"]+)"|content="([^"]+)"\s+name="description")',
+        html, re.IGNORECASE
+    )
+    if meta_match:
+        desc = meta_match.group(1) or meta_match.group(2)
+        if desc and len(desc) > 20:
+            return desc.strip()
+
+    # Try OG description
+    og_match = re.search(
+        r'<meta\s+property="og:description"\s+content="([^"]+)"',
+        html, re.IGNORECASE
+    )
+    if og_match and len(og_match.group(1)) > 20:
+        return og_match.group(1).strip()
+
+    return None
+
+
+def extract_contact_info(html: str) -> dict:
+    """Extract contact details from HTML."""
+    info = {}
+
+    # Phone
+    phone_match = re.search(r'href="tel:([^"]+)"', html)
+    if phone_match:
+        info["phone"] = phone_match.group(1).strip()
+
+    # Email
+    email_match = re.search(r'href="mailto:([^"?]+)"', html)
+    if email_match:
+        info["email"] = email_match.group(1).strip()
+
+    # Address from JSON-LD
+    addr_match = re.search(r'"streetAddress"\s*:\s*"([^"]*)"', html)
+    if addr_match:
+        info["address"] = addr_match.group(1)
+
+    return info
+
+
+def check_has_pricing(html: str) -> bool:
+    """Quick check whether a page contains pricing information."""
+    # Look for dollar signs near numbers
+    price_pattern = re.compile(r'\$[\d,]+(?:\.\d{2})?')
+    prices_found = price_pattern.findall(html)
+
+    # Filter out tiny amounts (likely not funeral pricing)
+    significant_prices = []
+    for p in prices_found:
+        cleaned = p.replace("$", "").replace(",", "").strip()
+        if not cleaned:
+            continue
+        try:
+            amount = float(cleaned)
+        except ValueError:
+            continue
+        if amount >= 100:
+            significant_prices.append(amount)
+
+    return len(significant_prices) >= 1
+
+
+def prepare_for_ai_extraction(html: str) -> str:
+    """Clean HTML for AI extraction — remove noise, keep content."""
+    # Remove script and style tags
+    cleaned = re.sub(r"<script[^>]*>.*?</script>", "", html,
+                     flags=re.DOTALL | re.IGNORECASE)
+    cleaned = re.sub(r"<style[^>]*>.*?</style>", "", cleaned,
+                     flags=re.DOTALL | re.IGNORECASE)
+
+    # Remove HTML comments
+    cleaned = re.sub(r"<!--.*?-->", "", cleaned, flags=re.DOTALL)
+
+    # Remove nav, header, footer elements
+    for tag in ["nav", "header", "footer"]:
+        cleaned = re.sub(
+            rf"<{tag}[^>]*>.*?</{tag}>", "", cleaned,
+            flags=re.DOTALL | re.IGNORECASE
+        )
+
+    # Strip remaining tags but keep text
+    text = re.sub(r"<[^>]+>", " ", cleaned)
+    # Collapse whitespace
+    text = re.sub(r"\s+", " ", text).strip()
+
+    # Truncate to ~8000 chars (fits well within Haiku context)
+    if len(text) > 8000:
+        text = text[:8000] + "..."
+
+    return text
+
+
+def enrich_provider(provider_id: int, website: str, db) -> dict:
+    """Crawl a provider's website and extract enrichment data.
+
+    Returns a dict with what was found.
+    """
+    result = {
+        "homepage_fetched": False,
+        "description": None,
+        "contact_info": {},
+        "pricing_page_url": None,
+        "has_pricing": False,
+        "pricing_page_text": None,  # cleaned text for AI extraction
+        "pdf_links": [],
+    }
+
+    # Step 1: Fetch homepage
+    try:
+        homepage = fetch_url(website, timeout=15)
+        result["homepage_fetched"] = True
+    except Exception as e:
+        result["error"] = str(e)[:200]
+        return result
+
+    # Step 2: Extract description and contact info
+    result["description"] = extract_description(homepage)
+    result["contact_info"] = extract_contact_info(homepage)
+
+    # Step 3: Find pricing page
+    time.sleep(CRAWL_DELAY)
+    pricing_url = find_pricing_page(website, homepage)
+
+    if pricing_url:
+        result["pricing_page_url"] = pricing_url
+        try:
+            pricing_html = fetch_url(pricing_url, timeout=15)
+            result["has_pricing"] = check_has_pricing(pricing_html)
+            result["pricing_page_text"] = prepare_for_ai_extraction(pricing_html)
+
+            # Check for PDF links
+            pdf_links = re.findall(
+                r'href="([^"]*\.pdf[^"]*)"', pricing_html, re.IGNORECASE
+            )
+            for pdf_href in pdf_links:
+                if pdf_href.startswith("/"):
+                    pdf_href = website.rstrip("/") + pdf_href
+                elif not pdf_href.startswith("http"):
+                    pdf_href = website.rstrip("/") + "/" + pdf_href
+                result["pdf_links"].append(pdf_href)
+
+        except Exception:
+            pass
+    else:
+        # Check homepage itself for pricing
+        if check_has_pricing(homepage):
+            result["has_pricing"] = True
+            result["pricing_page_url"] = website
+            result["pricing_page_text"] = prepare_for_ai_extraction(homepage)
+
+    return result
+
+
+def run(limit: int | None = None, state_filter: str | None = None):
+    """Enrich all providers that have a website but no packages."""
+    db = get_db()
+
+    query = """
+        SELECT fb.id, fb.title, fb.website, fb.business_state
+        FROM funeral_brand fb
+        LEFT JOIN package p ON p.brand_id = fb.id
+        WHERE fb.website IS NOT NULL
+          AND fb.verified = 0
+          AND p.id IS NULL
+    """
+    params = []
+
+    if state_filter:
+        query += " AND fb.business_state = ?"
+        params.append(state_filter)
+
+    query += " ORDER BY fb.id"
+
+    if limit:
+        query += f" LIMIT {limit}"
+
+    providers = db.execute(query, params).fetchall()
+    print(f"Providers to enrich: {len(providers)}")
+
+    enriched = 0
+    pricing_found = 0
+    failed = 0
+
+    for i, prov in enumerate(providers):
+        if (i + 1) % 5 == 0 or i == 0:
+            print(f"  [{i+1}/{len(providers)}] {prov['title']}")
+
+        result = enrich_provider(prov["id"], prov["website"], db)
+
+        if not result["homepage_fetched"]:
+            failed += 1
+            db.execute(
+                """UPDATE funeral_brand
+                   SET enrichment_status = 'failed', updated_at = datetime('now')
+                   WHERE id = ?""",
+                (prov["id"],)
+            )
+            continue
+
+        enriched += 1
+
+        # Update brand with discovered info
+        updates = {}
+        if result["description"] and not db.execute(
+            "SELECT description FROM funeral_brand WHERE id = ?", (prov["id"],)
+        ).fetchone()["description"]:
+            updates["description"] = result["description"]
+
+        contact = result["contact_info"]
+        brand = db.execute("SELECT * FROM funeral_brand WHERE id = ?",
+                           (prov["id"],)).fetchone()
+        if contact.get("email") and not brand["email"]:
+            updates["email"] = contact["email"]
+        if contact.get("phone") and not brand["phone"]:
+            updates["phone"] = contact["phone"]
+
+        if result["has_pricing"]:
+            pricing_found += 1
+            updates["enrichment_status"] = "partial"  # has pricing, needs AI extraction
+        else:
+            updates["enrichment_status"] = "partial"  # homepage enriched, no pricing
+
+        if updates:
+            set_parts = [f"{k} = ?" for k in updates]
+            values = list(updates.values()) + [prov["id"]]
+            db.execute(
+                f"UPDATE funeral_brand SET {', '.join(set_parts)}, "
+                f"updated_at = datetime('now') WHERE id = ?",
+                values
+            )
+
+        # Store pricing page text for later AI extraction
+        if result["pricing_page_text"]:
+            db.execute(
+                """INSERT OR REPLACE INTO source_record
+                   (source_name, source_id, source_url, raw_data,
+                    matched_brand_id, match_type)
+                   VALUES ('website_crawl', ?, ?, ?, ?, 'enrichment')""",
+                (
+                    f"brand_{prov['id']}",
+                    result["pricing_page_url"],
+                    json.dumps({
+                        "pricing_text": result["pricing_page_text"],
+                        "pdf_links": result["pdf_links"],
+                        "has_pricing": result["has_pricing"],
+                    }),
+                    prov["id"],
+                )
+            )
+
+        if (i + 1) % 10 == 0:
+            db.commit()
+
+        time.sleep(CRAWL_DELAY)
+
+    db.commit()
+    print(f"\nDone: {enriched} enriched, {pricing_found} with pricing, {failed} failed")
+
+    db.close()
+
+
+if __name__ == "__main__":
+    import sys
+    limit = None
+    state = None
+
+    for arg in sys.argv[1:]:
+        if arg.startswith("--state="):
+            state = arg.split("=")[1]
+        elif arg.startswith("--limit="):
+            limit = int(arg.split("=")[1])
+        else:
+            try:
+                limit = int(arg)
+            except ValueError:
+                pass
+
+    run(limit=limit, state_filter=state)
--- a/crawlers/lookup_abn.py
+++ b/crawlers/lookup_abn.py
@@ -0,0 +1,199 @@
+"""ABN Lookup module via the Australian Business Register (ABR) API.
+
+Enriches providers with their ABN (strongest dedup key) and validates
+that they are active registered businesses.
+
+The ABR API is FREE. Requires a GUID (authentication token) from:
+  https://abr.business.gov.au/Tools/WebServices
+
+Configuration:
+  Set the ABR_GUID env var, or `abr_guid` in config.json.
+"""
+
+import json
+import os
+import re
+import time
+import urllib.parse
+import xml.etree.ElementTree as ET
+
+from base import fetch_url, get_db, CRAWL_DELAY
+
+# Load ABR GUID from env or config
+ABR_GUID = os.environ.get("ABR_GUID")
+if not ABR_GUID:
+    config_path = os.path.join(os.path.dirname(__file__), "config.json")
+    if os.path.exists(config_path):
+        with open(config_path) as f:
+            config = json.load(f)
+            ABR_GUID = config.get("abr_guid")
+
+ABR_BASE = "https://abr.business.gov.au/abrxmlsearch/AbrXmlSearch.asmx"
+
+
+def search_by_name(name: str, state: str | None = None,
+                   postcode: str | None = None) -> list[dict]:
+    """Search ABR by business name. Returns matching records."""
+    if not ABR_GUID:
+        print("  WARNING: ABR_GUID not configured. Skipping ABN lookup.")
+        return []
+
+    params = {
+        "name": name,
+        "postcode": postcode or "",
+        "legalName": "Y",
+        "tradingName": "Y",
+        "NSW": "Y", "SA": "Y", "ACT": "Y", "VIC": "Y",
+        "WA": "Y", "NT": "Y", "QLD": "Y", "TAS": "Y",
+        "authenticationGuid": ABR_GUID,
+    }
+
+    # If state specified, only search that state
+    if state:
+        for s in ["NSW", "SA", "ACT", "VIC", "WA", "NT", "QLD", "TAS"]:
+            params[s] = "Y" if s == state else "N"
+
+    url = f"{ABR_BASE}/ABRSearchByNameSimpleProtocol"
+    try:
+        text = fetch_url(url, method="GET", data=params, timeout=15)
+    except Exception:
+        return []
+
+    # Parse XML response
+    results = []
+    try:
+        root = ET.fromstring(text)
+        # The ABR response uses a default namespace
+        ns = {"abr": "http://abr.business.gov.au/ABRXMLSearch/"}
+
+        for record in root.findall(".//abr:searchResultsRecord", ns):
+            abn_elem = record.find(".//abr:ABN/abr:identifierValue", ns)
+            status_elem = record.find(".//abr:ABN/abr:identifierStatus", ns)
+            # Leaf Elements are falsy, so chaining find() with `or` skips real matches.
+            name_elem = next((e for e in (
+                record.find(f".//abr:{tag}/abr:organisationName", ns)
+                for tag in ("mainName", "mainTradingName", "businessName")
+            ) if e is not None), None)
+            state_elem = record.find(".//abr:mainBusinessPhysicalAddress/abr:stateCode", ns)
+            postcode_elem = record.find(".//abr:mainBusinessPhysicalAddress/abr:postcode", ns)
+            score_elem = record.find(".//abr:nameScore", ns)
+
+            if abn_elem is not None:
+                results.append({
+                    "abn": abn_elem.text,
+                    "status": status_elem.text if status_elem is not None else None,
+                    "name": name_elem.text if name_elem is not None else None,
+                    "state": state_elem.text if state_elem is not None else None,
+                    "postcode": postcode_elem.text if postcode_elem is not None else None,
+                    "score": int(score_elem.text) if score_elem is not None else 0,
+                })
+    except ET.ParseError:
+        return []
+
+    return results
+
+
+def find_best_match(name: str, state: str | None = None,
+                    postcode: str | None = None) -> dict | None:
+    """Find the best ABR match for a business name.
+
+    Returns the highest-scoring active match, or None.
+    """
+    results = search_by_name(name, state, postcode)
+
+    # Filter to active businesses
+    active = [r for r in results if r.get("status") == "Active"]
+    if not active:
+        return None
+
+    # Sort by score descending
+    active.sort(key=lambda r: r.get("score", 0), reverse=True)
+
+    # Return best match if score is reasonable
+    best = active[0]
+    if best.get("score", 0) >= 80:
+        return best
+
+    return None
+
+
+def run(limit: int | None = None, state_filter: str | None = None):
+    """Look up ABNs for all providers that don't have one."""
+    db = get_db()
+
+    query = """
+        SELECT id, title, business_state, business_postcode
+        FROM funeral_brand
+        WHERE abn IS NULL AND verified = 0
+    """
+    params = []
+
+    if state_filter:
+        query += " AND business_state = ?"
+        params.append(state_filter)
+
+    query += " ORDER BY id"
+
+    if limit:
+        query += f" LIMIT {limit}"
+
+    providers = db.execute(query, params).fetchall()
+    print(f"Providers without ABN: {len(providers)}")
+
+    if not ABR_GUID:
+        print("ERROR: ABR_GUID not configured.")
+        print("  Register at: https://abr.business.gov.au/Tools/WebServices")
+        print("  Then set ABR_GUID env var or add 'abr_guid' to config.json")
+        return
+
+    found = 0
+    not_found = 0
+
+    for i, prov in enumerate(providers):
+        if (i + 1) % 20 == 0 or i == 0:
+            print(f"  [{i+1}/{len(providers)}] {prov['title']}")
+
+        match = find_best_match(
+            prov["title"],
+            prov["business_state"],
+            prov["business_postcode"]
+        )
+
+        if match:
+            db.execute(
+                "UPDATE funeral_brand SET abn = ?, updated_at = datetime('now') WHERE id = ?",
+                (match["abn"], prov["id"])
+            )
+            found += 1
+        else:
+            not_found += 1
+
+        if (i + 1) % 50 == 0:
+            db.commit()
+
+        time.sleep(0.5)  # Be gentle with the government API
+
+    db.commit()
+    print(f"\nDone: {found} ABNs found, {not_found} not found")
+    print(f"  Success rate: {found/(found+not_found)*100:.1f}%" if found + not_found > 0 else "")
+
+    db.close()
+
+
+if __name__ == "__main__":
+    import sys
+    limit = None
+    state = None
+
+    for arg in sys.argv[1:]:
+        if arg.startswith("--state="):
+            state = arg.split("=")[1]
+        elif arg.startswith("--limit="):
+            limit = int(arg.split("=")[1])
+        else:
+            try:
+                limit = int(arg)
+            except ValueError:
+                pass
+
+    run(limit=limit, state_filter=state)
--- a/crawlers/run_overnight.sh
+++ b/crawlers/run_overnight.sh
@@ -0,0 +1,111 @@
+#!/bin/bash
+# Full pipeline overnight run
+# Usage: ./run_overnight.sh
+#
+# Before running:
+#   1. Add your Serper API key to config.json
+#   2. Optionally add your Anthropic API key for AI extraction
+#
+# This script runs all steps sequentially and logs everything.
+
+set -eo pipefail
+cd "$(dirname "$0")"
+
+LOG="../logs/overnight_$(date +%Y%m%d_%H%M%S).log"
+mkdir -p ../logs
+
+echo "=== OVERNIGHT PIPELINE RUN ===" | tee "$LOG"
+echo "Started: $(date)" | tee -a "$LOG"
+echo "" | tee -a "$LOG"
+
+# Check config
+SERPER_KEY=$(python3 -c "import json; c=json.load(open('config.json')); print(c.get('serper_api_key') or '')")
+ANTHROPIC_KEY=$(python3 -c "import json; c=json.load(open('config.json')); print(c.get('anthropic_api_key') or '')")
+
+if [ -z "$SERPER_KEY" ]; then
+    echo "WARNING: No Serper API key — website discovery will use DDG (slower, lower hit rate)" | tee -a "$LOG"
+else
+    echo "Serper API key: configured" | tee -a "$LOG"
+fi
+
+if [ -z "$ANTHROPIC_KEY" ]; then
+    echo "WARNING: No Anthropic API key — AI extraction will be skipped" | tee -a "$LOG"
+else
+    echo "Anthropic API key: configured" | tee -a "$LOG"
+fi
+echo "" | tee -a "$LOG"
+
+# Step 1: Source crawlers
+echo "=== STEP 1: Source Crawlers ===" | tee -a "$LOG"
+echo "[$(date +%H:%M:%S)] Running VIC Register crawler..." | tee -a "$LOG"
+python3 crawl_vic_register.py 2>&1 | tee -a "$LOG"
+
+echo "[$(date +%H:%M:%S)] Running Funerals Australia crawler..." | tee -a "$LOG"
+python3 crawl_funerals_australia.py 2>&1 | tee -a "$LOG"
+
+echo "[$(date +%H:%M:%S)] Running NFDA crawler..." | tee -a "$LOG"
+python3 crawl_nfda.py 2>&1 | tee -a "$LOG"
+
+# Step 2: Deduplication
+echo "" | tee -a "$LOG"
+echo "=== STEP 2: Deduplication ===" | tee -a "$LOG"
+echo "[$(date +%H:%M:%S)] Running dedup..." | tee -a "$LOG"
+python3 dedup.py 2>&1 | tee -a "$LOG"
+
+# Step 3: Website discovery (all providers without one)
+echo "" | tee -a "$LOG"
+echo "=== STEP 3: Website Discovery ===" | tee -a "$LOG"
+NEED_WEBSITE=$(python3 -c "from base import get_db; db=get_db(); print(db.execute('SELECT COUNT(*) FROM funeral_brand WHERE website IS NULL AND verified=0').fetchone()[0])")
+echo "[$(date +%H:%M:%S)] Providers needing websites: $NEED_WEBSITE" | tee -a "$LOG"
+
+# Process in batches of 200; no offset is passed, so each run must skip providers already tried
+BATCH=200
+OFFSET=0
+while [ "$OFFSET" -lt "$NEED_WEBSITE" ]; do
+    REMAINING=$((NEED_WEBSITE - OFFSET))
+    CURRENT=$((REMAINING < BATCH ? REMAINING : BATCH))
+    echo "[$(date +%H:%M:%S)] Discovering websites batch $((OFFSET/BATCH + 1)) ($CURRENT providers)..." | tee -a "$LOG"
+    python3 discover_websites.py --limit=$CURRENT 2>&1 | tee -a "$LOG"
+    OFFSET=$((OFFSET + BATCH))
+    # Brief pause between batches
+    sleep 5
+done
+
+# Step 4: Website enrichment (all with website, not yet enriched)
+echo "" | tee -a "$LOG"
+echo "=== STEP 4: Website Enrichment ===" | tee -a "$LOG"
+NEED_ENRICH=$(python3 -c "from base import get_db; db=get_db(); print(db.execute(\"SELECT COUNT(*) FROM funeral_brand WHERE website IS NOT NULL AND enrichment_status='pending' AND verified=0\").fetchone()[0])")
+echo "[$(date +%H:%M:%S)] Providers needing enrichment: $NEED_ENRICH" | tee -a "$LOG"
+python3 enrich_websites.py --limit=$NEED_ENRICH 2>&1 | tee -a "$LOG"
+
+# Step 5: Compute tiers
+echo "" | tee -a "$LOG"
+echo "=== STEP 5: Compute Tiers ===" | tee -a "$LOG"
+python3 compute_tiers.py 2>&1 | tee -a "$LOG"
+
+# Final summary
+echo "" | tee -a "$LOG"
+echo "=== FINAL SUMMARY ===" | tee -a "$LOG"
+python3 - <<'PY' 2>&1 | tee -a "$LOG"
+from base import get_db
+db = get_db()
+def count(sql):  # Run a single-value COUNT query; quoted heredoc avoids backslashes in f-strings (an error before Python 3.12)
+    return db.execute(sql).fetchone()[0]
+print('Database Status:')
+print("  Total providers:    ", count("SELECT COUNT(*) FROM funeral_brand"))
+print("  With phone:         ", count("SELECT COUNT(*) FROM funeral_brand WHERE phone IS NOT NULL"))
+print("  With email:         ", count("SELECT COUNT(*) FROM funeral_brand WHERE email IS NOT NULL"))
+print("  With website:       ", count("SELECT COUNT(*) FROM funeral_brand WHERE website IS NOT NULL"))
+print("  With description:   ", count("SELECT COUNT(*) FROM funeral_brand WHERE description IS NOT NULL"))
+print('\nListing Tiers:')
+for row in db.execute('SELECT listing_tier, COUNT(*) as n FROM funeral_brand GROUP BY listing_tier ORDER BY n DESC'):
+    print(f'  {str(row[0]):12s} {row[1]:>6d}')  # str() keeps a NULL tier from breaking the format
+print('\nPricing Pages:')
+print("  Total crawled:      ", count("SELECT COUNT(*) FROM source_record WHERE source_name='website_crawl'"))
+print("  With pricing:       ", count("SELECT COUNT(*) FROM source_record WHERE source_name='website_crawl' AND json_extract(raw_data, '$.has_pricing')=1"))
+print("  With PDF links:     ", count("SELECT COUNT(*) FROM source_record WHERE source_name='website_crawl' AND json_extract(raw_data, '$.pdf_links') != '[]'"))
+PY
+
+echo "" | tee -a "$LOG"
+echo "Finished: $(date)" | tee -a "$LOG"
+echo "Log saved to: $LOG"
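The Step 3 loop in `run_overnight.sh` splits the providers that still need a website into batches of at most 200 and passes each batch size to `discover_websites.py --limit=N`. The batch arithmetic can be sketched in Python as follows. The function name `batch_sizes` is hypothetical and only mirrors the `OFFSET`/`REMAINING`/`CURRENT` calculation in the shell loop; it does not call the crawler.

```python
def batch_sizes(total, batch=200):
    """Yield the size of each batch needed to cover `total` items."""
    offset = 0
    while offset < total:
        remaining = total - offset
        # The last batch may be smaller than the configured batch size.
        yield min(batch, remaining)
        offset += batch


print(list(batch_sizes(450)))  # → [200, 200, 50]
```

Because the shell loop passes only a limit and no offset, this arithmetic covers every provider only if each run of `discover_websites.py` skips providers that an earlier batch already attempted.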