Initial commit: funeral provider discovery pipeline

Python crawlers for VIC Register, Funerals Australia, NFDA
n8n workflows for scheduled discovery and enrichment
SQLite schema and seeded dev database (1,463 providers)
End-to-end process documentation in n8n/PROCESS.md
Richie
2026-04-24 10:27:08 +10:00
commit cc91427789
30 changed files with 4706 additions and 0 deletions

crawlers/PIPELINE.md

@@ -0,0 +1,215 @@
# Provider Discovery & Enrichment Pipeline
## Architecture: Multi-Step Enrichment
The pipeline builds provider profiles progressively, never relying on
competitor data. Each step adds richer detail from more authoritative sources.
```
STEP 1: DISCOVER                 STEP 2: FIND WEBSITE          STEP 3: ENRICH
─────────────────                ────────────────────          ──────────────
VIC Register ─────┐                                            ┌─ Fetch homepage
NFDA Directory ───┼─▶ Basic      Google Places API ──┐         │  Find /pricing page
Funerals AU ──────┘   Provider   ABN Lookup ─────────┼─▶ URL ──┤  Download PDFs
                      Record     Search engines ─────┘         │  AI extract packages
                                                               └─▶ Structured data

name                             website URL                   description
address                          Google rating                 packages[]
phone                            Google reviews                inclusions[]
email                            place_id                      pricing
state                            ABN (validated)
```
## Step 1: Discovery (DONE — all modules built and tested)
Sources:
- VIC Consumer Affairs Register (796 records, VIC only) → `crawl_vic_register.py`
- Funerals Australia AJAX API (997 records, national) → `crawl_funerals_australia.py`
- NFDA WPSL API (209 records, national) → `crawl_nfda.py`
Orchestrator: `crawl_all.py`
Deduplication: `dedup.py` (fuzzy name + postcode + ABN matching)
Output: ~1,463 unique providers with basic contact info.
Stored in: funeral_brand + location tables in `database/providers.db`.
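
The matching rule in `dedup.py` is only summarised above. As a rough sketch of how a "fuzzy name + postcode + ABN" rule could work: an exact ABN match wins outright, otherwise the postcodes must agree and the normalised names must be similar. The function names, the noise-word list, and the 0.85 threshold are illustrative assumptions, not the module's actual logic.

```python
import re
from difflib import SequenceMatcher

# Hypothetical noise words stripped before comparing names (assumption).
NOISE = {"pty", "ltd", "funeral", "funerals", "service", "services",
         "directors", "the", "and"}


def name_key(name: str) -> str:
    """Lowercase the name, drop punctuation and common industry words."""
    words = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    return " ".join(w for w in words if w not in NOISE)


def is_same_provider(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Return True when two records likely describe the same provider."""
    # An exact ABN match is treated as the strongest key.
    if a.get("abn") and a.get("abn") == b.get("abn"):
        return True
    # Otherwise require the same postcode and a similar normalised name.
    if not a.get("postcode") or a.get("postcode") != b.get("postcode"):
        return False
    ratio = SequenceMatcher(None, name_key(a["name"]), name_key(b["name"])).ratio()
    return ratio >= threshold
```

Requiring a postcode match before comparing names keeps similarly named businesses in different suburbs apart, at the cost of missing records whose postcode is absent.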
## Step 2: Website Discovery (DONE — module built and tested)
Module: `discover_websites.py`
Test result: 50% success rate on the initial batch, using DDG search and URL guessing only.
Adding the Google Places API (2e) should raise the hit rate.
For each provider that lacks a website URL:
### 2a. Serper.dev — Google search API (PRIMARY)
- Input: "{business name} {suburb} {state}"
- Returns: Google organic search results as JSON (title, link, snippet)
- Cost: **2,500 free queries** (no CC needed), then $1/1K
- One query per provider keeps all 1,463 providers inside the free quota ($0)
- Filters out directories/aggregators, validates first result
- Module: `discover_websites.py` with `search_serper()`
### 2b. DuckDuckGo lite (FALLBACK)
- Free, no API key, but aggressive rate limiting
- Used when Serper key not configured or quota exhausted
- Module: `discover_websites.py` with `search_ddg()`
### 2c. URL pattern guessing (SUPPLEMENTARY)
- Generates candidate domains from business name (e.g. smithfunerals.com.au)
- HTTP HEAD to check if live, then validate content
- Module: `discover_websites.py` with `guess_urls()`
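
A minimal sketch of the candidate-generation step, assuming the guesser strips company suffixes and tries a few Australian TLDs. The TLD order, the filler-word list, and the function name are assumptions, and `guess_urls()` itself may work differently.

```python
import re

# Assumed TLD order and filler words; the real guess_urls() may differ.
TLDS = (".com.au", ".net.au", ".au", ".com")
FILLER = {"pty", "ltd", "the", "and"}


def candidate_domains(business_name: str) -> list[str]:
    """Generate plausible site URLs such as https://www.smithfunerals.com.au."""
    cleaned = business_name.lower().replace("'", "")
    words = [w for w in re.findall(r"[a-z0-9]+", cleaned) if w not in FILLER]
    if not words:
        return []
    # Try both the joined and the hyphenated form; dict.fromkeys() removes
    # duplicates (single-word names) while preserving order.
    stems = list(dict.fromkeys(["".join(words), "-".join(words)]))
    return [f"https://www.{stem}{tld}" for stem in stems for tld in TLDS]
```

Each candidate still needs the HEAD check and content validation described above, since a live domain is not necessarily the right business.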
### 2d. ABN Lookup — Australian Business Register (ENRICHMENT)
- Input: business name + state
- Returns: ABN, entity status, registered state/postcode
- Cost: **FREE** (government API, requires GUID registration)
- Validates business is active, gives strongest dedup key
- Does NOT return website URLs
- Module: `lookup_abn.py`
- Register for GUID: https://abr.business.gov.au/Tools/WebServices
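
Before spending an API call, an ABN that arrives from a source record can be sanity-checked locally with the standard ABN checksum (subtract 1 from the first digit, apply the weights, and test divisibility by 89). The sketch below is independent of `lookup_abn.py`, and the function name is illustrative.

```python
# Weights for the 11 ABN digits, as used by the standard ABN checksum.
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def is_valid_abn(abn: str) -> bool:
    """Check the ABN checksum; spaces and other separators are ignored."""
    digits = [int(c) for c in abn if c in "0123456789"]
    if len(digits) != 11:
        return False
    digits[0] -= 1  # the algorithm subtracts 1 from the leading digit
    total = sum(d * w for d, w in zip(digits, ABN_WEIGHTS))
    return total % 89 == 0
```

A passing checksum only shows the number is well formed. Whether the entity is active still has to come from the ABR response.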
### 2e. Google Places API (OPTIONAL PREMIUM)
- Input: "{business name}, {suburb} {state}"
- Returns: website, rating, review count, place_id, formatted phone
- Cost: 1,000 free/month (Enterprise tier), then ~$25/1K
- Best data quality but most expensive
- Not yet implemented — add when budget allows
### 2f. URL validation
- Fetch discovered URL, verify it loads
- Check page title/content mentions the business name
- Reject generic directories (yellowpages, truelocal, etc.)
- Mark confidence level: confirmed / probable / unverified
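
One way the confidence levels could be assigned once a page has been fetched is sketched below. The rules here are assumptions chosen for illustration: all name words in the title gives `confirmed`, at least half of them in the body gives `probable`. The extra `rejected` label for directory hosts and the function name are also hypothetical.

```python
import re
from urllib.parse import urlparse

DIRECTORY_HOSTS = ("yellowpages", "truelocal")  # from the list above; extend as needed
IGNORED_WORDS = {"pty", "ltd", "the", "and"}    # assumed filler words


def _tokens(text: str) -> set[str]:
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify_url(url: str, business_name: str,
                 page_title: str, page_text: str) -> str:
    """Return 'rejected', 'confirmed', 'probable' or 'unverified'."""
    host = (urlparse(url).hostname or "").lower()
    if any(d in host for d in DIRECTORY_HOSTS):
        return "rejected"
    name_tokens = _tokens(business_name) - IGNORED_WORDS
    if not name_tokens:
        return "unverified"
    if name_tokens <= _tokens(page_title):
        return "confirmed"  # every name word appears in the page title
    overlap = len(name_tokens & _tokens(page_text)) / len(name_tokens)
    return "probable" if overlap >= 0.5 else "unverified"
```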
## Step 3: Website Enrichment (DONE — module built and tested)
Module: `enrich_websites.py`
- Finds pricing pages via 20+ URL patterns + link following
- Extracts description from meta tags
- Extracts contact info (phone, email, address)
- Stores cleaned pricing page text for AI extraction
- Detects PDF links for PDF-based pricing extraction
For each provider with a confirmed website:
### 3a. Homepage crawl
- Fetch homepage HTML
- Extract: description/about text, contact details
- Look for links to pricing/services pages
### 3b. Pricing page discovery
Try common URL patterns:
/pricing, /prices, /packages, /services, /our-services,
/funeral-costs, /funeral-packages, /service-options,
/price-list, /transparency
Also:
- Parse sitemap.xml if available
- Follow links containing "pric", "packag", "cost", "service"
- Check for PDF links on pricing pages
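
The link-following rule can be illustrated with the standard library alone. This sketch collects same-site links whose path or anchor text contains one of the keyword stems listed above. It is not the code in `enrich_websites.py`, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

KEYWORDS = ("pric", "packag", "cost", "service")  # stems listed above


class _LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[tuple[str, str]] = []
        self._href: str | None = None
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_pricing_links(html: str, base_url: str) -> list[str]:
    """Return same-host links that look like pricing or service pages."""
    parser = _LinkCollector()
    parser.feed(html)
    base_host = urlparse(base_url).hostname
    found: list[str] = []
    for href, text in parser.links:
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        # Skip mailto:/tel: links and links that leave the provider's site.
        if parsed.scheme not in ("http", "https") or parsed.hostname != base_host:
            continue
        haystack = (parsed.path + " " + text).lower()
        if any(k in haystack for k in KEYWORDS) and absolute not in found:
            found.append(absolute)
    return found
```

Matching on anchor text as well as the path catches links such as `/plans.pdf` labelled "Our packages", which matters for the PDF-only sites handled in 3d.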
### 3c. AI extraction (Claude Haiku)
- Send pricing page HTML to Haiku
- Extract: package names, funeral types, prices, inclusions
- Map to known inclusion types where possible
- Return confidence score
### 3d. PDF extraction (for InvoCare-type sites)
- Download compliance PDFs
- Extract text (pdftotext or similar)
- Send to Haiku for structured extraction
- ~25% of sites are PDF-only for pricing
## Listing Tiers
Providers are assigned a `listing_tier` based on data quality. Computed
automatically by `compute_tiers.py` after each enrichment run.
| Tier | Label | Criteria | Display |
|------|-------|----------|---------|
| `verified` | Full partner | `verified = true` (signed up) | Full branding, packages, arrangements |
| `priced` | Full pricing | 2+ packages with itemized inclusion prices | Package comparison, line-item detail |
| `estimated` | Some pricing | At least 1 package with a total price | Package prices shown, "Contact for details" on breakdowns |
| `listed` | Contact only | Name + location + phone, no pricing | "Contact for pricing" CTA, upgrade prompt |
Each tier below `verified` motivates the provider to sign up:
- `listed` → "Publish your pricing to attract more families"
- `estimated` → "Add detailed breakdowns to stand out"
- `priced` → "Sign up to enable online arrangements"
## Enrichment Status Flow
```
pending ──▶ website_found ──▶ partial ──▶ complete
│ │ │
└──▶ no_website_found failed (retry later)
```
## N8N Workflow Design
### Workflow 1: Weekly Discovery
Cron → Run all source crawlers → Dedup into DB → Queue new providers
### Workflow 2: Daily Website Discovery
Cron → Fetch providers with no website → Google Places lookup
→ ABN lookup → Search fallback → Update DB
### Workflow 3: Daily Enrichment
Cron → Fetch providers with website but no packages
→ Crawl website → AI extract → Update DB
### Workflow 4: Monthly Re-check
Cron → Re-crawl enriched providers → Update pricing if changed
---
## Module Inventory
| Module | Purpose | N8N Workflow |
|--------|---------|-------------|
| `base.py` | Shared HTTP, DB, normalization utils | Used by all |
| `crawl_vic_register.py` | VIC government register (796 records) | Workflow 1 |
| `crawl_funerals_australia.py` | Funerals Australia API (997 records) | Workflow 1 |
| `crawl_nfda.py` | NFDA directory API (209 records) | Workflow 1 |
| `crawl_all.py` | Orchestrates all source crawlers | Workflow 1 |
| `dedup.py` | Cross-source dedup & merge engine | Workflow 1 |
| `discover_websites.py` | Find provider websites (Serper/DDG/guess) | Workflow 2 |
| `lookup_abn.py` | ABN validation via ABR API (free) | Workflow 2 |
| `enrich_websites.py` | Crawl provider sites, find pricing pages | Workflow 3 |
| `compute_tiers.py` | Compute listing_tier from data quality | After enrichment |
| `config.example.json` | API key template | — |
## API Keys Required
| Service | Key | Cost | Register |
|---------|-----|------|----------|
| Serper.dev | `serper_api_key` | 2,500 free, then $1/1K | https://serper.dev |
| ABR (ABN Lookup) | `abr_guid` | Free | https://abr.business.gov.au/Tools/WebServices |
| Anthropic (Haiku) | `anthropic_api_key` | ~$2/full run | https://console.anthropic.com |
## Quick Start
```bash
# 1. Configure API keys
cp config.example.json config.json
# Edit config.json with your keys
# 2. Reset database
cd ../database
sqlite3 providers.db < schema_sqlite.sql
# 3. Run full discovery pipeline
cd ../crawlers
python3 crawl_all.py # Step 1: Discover from registries
python3 dedup.py # Deduplicate across sources
python3 lookup_abn.py # Step 2d: Get ABNs (free)
python3 discover_websites.py # Steps 2a-2c: Find websites
python3 enrich_websites.py # Step 3: Crawl for pricing
python3 compute_tiers.py # Assign listing tiers
# Test mode (limited records)
python3 crawl_all.py --test
python3 discover_websites.py --limit=10 --state=VIC
python3 enrich_websites.py --limit=5
```

crawlers/base.py

@@ -0,0 +1,164 @@
"""Base crawler module with shared utilities."""
import gzip
import io
import json
import time
import sqlite3
import urllib.request
import urllib.parse
import urllib.error
from datetime import datetime, timezone
from pathlib import Path
DB_PATH = Path(__file__).parent.parent / "database" / "providers.db"
CRAWL_DELAY = 1.0 # seconds between requests (courtesy)
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
)

def fetch_url(url: str, method: str = "GET", data: dict | None = None,
              headers: dict | None = None, timeout: int = 30) -> str:
    """Fetch a URL and return the response body as text."""
    hdrs = {"User-Agent": USER_AGENT}
    if headers:
        hdrs.update(headers)
    body = None
    if data and method == "POST":
        body = urllib.parse.urlencode(data, doseq=True).encode("utf-8")
        hdrs.setdefault("Content-Type", "application/x-www-form-urlencoded")
    elif data and method == "GET":
        url = url + "?" + urllib.parse.urlencode(data, doseq=True)
    req = urllib.request.Request(url, data=body, headers=hdrs, method=method)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        raw = resp.read()
        # Handle gzip-compressed responses
        if resp.headers.get("Content-Encoding") == "gzip" or raw[:2] == b"\x1f\x8b":
            raw = gzip.decompress(raw)
        charset = resp.headers.get_content_charset() or "utf-8"
        return raw.decode(charset)


def fetch_json(url: str, method: str = "GET", data: dict | None = None,
               headers: dict | None = None) -> dict | list:
    """Fetch a URL and parse the response as JSON (object or array)."""
    text = fetch_url(url, method=method, data=data, headers=headers)
    return json.loads(text)


def get_db() -> sqlite3.Connection:
    """Get a connection to the SQLite database."""
    conn = sqlite3.connect(str(DB_PATH))
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA foreign_keys=ON")
    conn.row_factory = sqlite3.Row
    return conn

def start_crawl_log(db: sqlite3.Connection, source_name: str) -> int:
    """Create a source_log entry and return its ID."""
    cur = db.execute(
        "INSERT INTO source_log (source_name) VALUES (?)",
        (source_name,)
    )
    db.commit()
    return cur.lastrowid


def finish_crawl_log(db: sqlite3.Connection, log_id: int,
                     found: int, new: int, updated: int, skipped: int,
                     status: str = "completed", error: str | None = None):
    """Update a source_log entry with results."""
    db.execute(
        """UPDATE source_log
           SET run_finished_at = datetime('now'),
               records_found = ?, records_new = ?,
               records_updated = ?, records_skipped = ?,
               status = ?, error_message = ?
           WHERE id = ?""",
        (found, new, updated, skipped, status, error, log_id)
    )
    db.commit()


def store_source_record(db: sqlite3.Connection, source_name: str,
                        source_id: str, source_url: str | None,
                        raw_data: dict, log_id: int) -> int | None:
    """Store a raw source record. Returns the row ID, or None if duplicate."""
    try:
        cur = db.execute(
            """INSERT INTO source_record
               (source_name, source_id, source_url, raw_data, log_id)
               VALUES (?, ?, ?, ?, ?)""",
            (source_name, source_id, source_url, json.dumps(raw_data), log_id)
        )
        db.commit()
        return cur.lastrowid
    except sqlite3.IntegrityError:
        # Duplicate source_name + source_id: already have this record
        return None

def normalize_phone(phone: str | None) -> str | None:
    """Basic phone normalization."""
    if not phone:
        return None
    # Remove common noise
    phone = phone.strip().replace("\xa0", " ")
    # If multiple numbers, take the first
    for sep in [";", "/", "|", ","]:
        if sep in phone:
            phone = phone.split(sep)[0].strip()
    return phone or None


def normalize_state(state: str | None) -> str | None:
    """Normalize Australian state names to abbreviations."""
    if not state:
        return None
    state = state.strip().upper()
    mapping = {
        "NEW SOUTH WALES": "NSW",
        "VICTORIA": "VIC",
        "QUEENSLAND": "QLD",
        "SOUTH AUSTRALIA": "SA",
        "WESTERN AUSTRALIA": "WA",
        "TASMANIA": "TAS",
        "NORTHERN TERRITORY": "NT",
        "AUSTRALIAN CAPITAL TERRITORY": "ACT",
        "AUSTRALIA CAPITAL TERRITORY": "ACT",
    }
    result = mapping.get(state, state)
    # Only return valid Australian states
    valid = {"NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT"}
    return result if result in valid else None

def generate_slug(name: str) -> str:
    """Generate a URL-safe slug from a business name."""
    import re
    slug = name.lower().strip()
    slug = re.sub(r"['’`]", "", slug)  # remove straight/curly apostrophes and backticks
    slug = re.sub(r"[^a-z0-9]+", "-", slug)  # non-alphanum -> hyphen
    slug = slug.strip("-")
    return slug

def to_intermediate(source: str, source_id: str, source_url: str | None,
                    business: dict, locations: list[dict],
                    packages: list[dict] | None = None) -> dict:
    """Build the normalized intermediate format record."""
    return {
        "source": source,
        "sourceId": source_id,
        "sourceUrl": source_url,
        "scrapedAt": datetime.now(timezone.utc).isoformat(),
        "business": business,
        "locations": locations,
        "packages": packages or [],
    }

crawlers/compute_tiers.py

@@ -0,0 +1,102 @@
"""Compute listing_tier for all providers based on their data quality.

Tier logic:
    verified  — brand.verified = true (signed up to platform)
    priced    — has 2+ packages, each with at least two inclusions priced > 0
    estimated — has at least one package with a priced inclusion, or (for now)
                a package with no inclusions recorded at all
    listed    — everything else (contact info only)

Run this after enrichment to update tiers across the board.
"""

from base import get_db


def compute_tier(db, brand_id: int, verified: bool) -> str:
    """Compute the listing tier for a single brand."""
    if verified:
        return "verified"

    # Check packages
    packages = db.execute(
        "SELECT id, title, funeral_type FROM package WHERE brand_id = ?",
        (brand_id,)
    ).fetchall()
    if not packages:
        return "listed"

    # Count packages that have a meaningful total price
    # A package's price = sum of non-optional, non-complimentary inclusions
    packages_with_price = 0
    packages_with_itemized = 0
    for pkg in packages:
        inclusions = db.execute(
            """SELECT price, optional, complimentary
               FROM package_inclusion
               WHERE package_id = ?""",
            (pkg["id"],)
        ).fetchall()
        if inclusions:
            # Has itemized inclusions with prices
            priced_inclusions = [
                i for i in inclusions
                if i["price"] and float(i["price"]) > 0
            ]
            if len(priced_inclusions) >= 2:
                packages_with_itemized += 1
                packages_with_price += 1
            elif len(priced_inclusions) >= 1:
                packages_with_price += 1
        else:
            # Package exists but no inclusions — check if we stored a total
            # price in the package description or via source data
            # For now, a package with a funeral_type means we at least know
            # what kind of service it is, even without breakdown
            packages_with_price += 1

    # Tier 2 (priced): 2+ packages with itemized breakdowns
    if packages_with_itemized >= 2:
        return "priced"
    # Tier 3 (estimated): at least one package with some price
    if packages_with_price >= 1:
        return "estimated"
    return "listed"

def run():
    """Recompute listing_tier for all brands."""
    db = get_db()
    brands = db.execute(
        "SELECT id, verified FROM funeral_brand"
    ).fetchall()
    counts = {"verified": 0, "priced": 0, "estimated": 0, "listed": 0}
    for brand in brands:
        tier = compute_tier(db, brand["id"], brand["verified"])
        db.execute(
            "UPDATE funeral_brand SET listing_tier = ? WHERE id = ?",
            (tier, brand["id"])
        )
        counts[tier] += 1
    db.commit()

    print("Listing Tier Distribution:")
    print(f"  verified:  {counts['verified']:>6d} (signed-up partners)")
    print(f"  priced:    {counts['priced']:>6d} (full package breakdowns)")
    print(f"  estimated: {counts['estimated']:>6d} (some pricing info)")
    print(f"  listed:    {counts['listed']:>6d} (contact info only)")
    print(f"  TOTAL:     {sum(counts.values()):>6d}")
    db.close()


if __name__ == "__main__":
    run()


@@ -0,0 +1,5 @@
{
"serper_api_key": null,
"abr_guid": null,
"anthropic_api_key": null
}

crawlers/crawl_all.py

@@ -0,0 +1,70 @@
"""Run all source crawlers and then deduplicate into the provider database."""

import sys
import time
from pathlib import Path

from base import get_db


def run_all(gathered_here_limit: int | None = None):
    """Run all crawlers sequentially."""
    print("=" * 60)
    print("PROVIDER DISCOVERY PIPELINE")
    print("=" * 60)

    # Import crawlers
    import crawl_nfda
    import crawl_funerals_australia
    import crawl_vic_register
    import crawl_gathered_here

    # Run in order: fast API sources first, then slower HTML scraping
    print("\n--- 1/4: NFDA Directory ---")
    crawl_nfda.run()
    print("\n--- 2/4: Funerals Australia ---")
    crawl_funerals_australia.run()
    print("\n--- 3/4: VIC Consumer Affairs Register ---")
    crawl_vic_register.run()
    print("\n--- 4/4: Gathered Here ---")
    crawl_gathered_here.run(limit=gathered_here_limit)

    # Summary
    db = get_db()
    print("\n" + "=" * 60)
    print("CRAWL SUMMARY")
    print("=" * 60)
    rows = db.execute(
        """SELECT source_name,
                  COUNT(*) as total,
                  SUM(CASE WHEN matched_brand_id IS NOT NULL THEN 1 ELSE 0 END) as matched
           FROM source_record
           GROUP BY source_name"""
    ).fetchall()
    for row in rows:
        print(f"  {row['source_name']:25s} {row['total']:5d} records "
              f"({row['matched']} matched)")
    total = db.execute("SELECT COUNT(*) as n FROM source_record").fetchone()["n"]
    print(f"  {'TOTAL':25s} {total:5d} records")
    db.close()


if __name__ == "__main__":
    limit = None
    if "--test" in sys.argv:
        limit = 10
        print("TEST MODE: Gathered Here limited to 10 profiles")
    elif len(sys.argv) > 1:
        try:
            limit = int(sys.argv[1])
        except ValueError:
            pass
    run_all(gathered_here_limit=limit)


@@ -0,0 +1,179 @@
"""Crawler for the Funerals Australia (formerly AFDA) member directory.

Source: https://funeralsaustralia.org.au/find-a-member/
Method: WordPress AJAX API (POST with get_clients_list action)
Fields: name, address (structured), phone, email, website, lat/lng, displayImage
"""

import time
import json
from pathlib import Path

from base import (
    fetch_url, get_db, start_crawl_log, finish_crawl_log,
    store_source_record, normalize_phone, normalize_state,
    generate_slug, to_intermediate, CRAWL_DELAY,
)

SOURCE_NAME = "funerals_australia"
API_URL = "https://funeralsaustralia.org.au/wp-admin/admin-ajax.php"
PAGE_SIZE = 200  # API supports up to 200 per page


def fetch_page(offset: int = 0) -> dict:
    """Fetch a page of all members from the Funerals Australia API.

    The API returns all members when no postcode/suburb filter is given,
    which is more reliable than geo-filtered searches.
    """
    form_data = {
        "action": "get_clients_list",
        "params[size]": str(PAGE_SIZE),
        "params[from]": str(offset),
        "params[forceResults]": "true",
        "params[paginated]": "true",
    }
    text = fetch_url(API_URL, method="POST", data=form_data,
                     headers={"X-Requested-With": "XMLHttpRequest"})
    return json.loads(text)

def fetch_all_members() -> list[dict]:
    """Fetch all members via pagination."""
    all_results = []
    offset = 0
    while True:
        data = fetch_page(offset)
        results = data.get("results", [])
        total = data.get("total", 0)
        if not results:
            break
        all_results.extend(results)
        print(f"  Fetched {len(all_results)}/{total}...")
        offset += PAGE_SIZE
        if offset >= total:
            break
        time.sleep(CRAWL_DELAY)
    return all_results


def parse_address(record: dict) -> dict:
    """Extract structured address from a Funerals Australia record."""
    addr_list = record.get("address", [])
    if addr_list and isinstance(addr_list, list) and len(addr_list) > 0:
        addr = addr_list[0]
        return {
            "line1": addr.get("line1", "").strip(),
            "city": addr.get("city", "").strip(),
            "state": normalize_state(addr.get("state")),
            "postcode": addr.get("postcode", "").strip(),
        }
    return {"line1": "", "city": "", "state": None, "postcode": ""}

def to_normalized(record: dict) -> dict:
    """Convert a Funerals Australia record to intermediate format."""
    addr = parse_address(record)
    city = addr["city"]
    if city and city == city.upper():
        city = city.title()

    lat_val = record.get("latitude")
    lng_val = record.get("longitude")
    try:
        lat_val = float(lat_val) if lat_val else None
        lng_val = float(lng_val) if lng_val else None
    except (ValueError, TypeError):
        lat_val = lng_val = None

    website = record.get("website", "").strip() or None
    if website and not website.startswith("http"):
        website = "https://" + website

    business = {
        "name": record.get("name", "").strip(),
        "abn": None,
        "phone": normalize_phone(record.get("phone")),
        "email": record.get("email", "").strip() or None,
        "website": website,
        "description": None,
    }
    locations = [{
        "address": addr["line1"],
        "suburb": city,
        "state": addr["state"],
        "postcode": addr["postcode"],
        "lat": lat_val,
        "lng": lng_val,
        "phone": normalize_phone(record.get("phone")),
    }]
    source_id = record.get("id", "")
    return to_intermediate(
        source=SOURCE_NAME,
        source_id=source_id,
        source_url="https://funeralsaustralia.org.au/find-a-member/",
        business=business,
        locations=locations,
    )

def run():
    """Run the full Funerals Australia crawl."""
    db = get_db()
    log_id = start_crawl_log(db, SOURCE_NAME)
    print(f"[{SOURCE_NAME}] Starting crawl (log_id={log_id})")
    all_records = []
    found = 0
    new = 0
    skipped = 0
    try:
        print("  Fetching all members (paginated)...")
        all_records = fetch_all_members()
        found = len(all_records)
        print(f"  Total members fetched: {found}")

        # Store records
        for record in all_records:
            source_id = record.get("id", "")
            row_id = store_source_record(
                db, SOURCE_NAME, source_id,
                "https://funeralsaustralia.org.au/find-a-member/",
                record, log_id
            )
            if row_id:
                normalized = to_normalized(record)
                db.execute(
                    "UPDATE source_record SET normalized_data = ? WHERE id = ?",
                    (json.dumps(normalized), row_id)
                )
                new += 1
            else:
                skipped += 1
        db.commit()
        finish_crawl_log(db, log_id, found, new, 0, skipped)
        print(f"[{SOURCE_NAME}] Done: {found} found, {new} new, {skipped} skipped")
    except Exception as e:
        finish_crawl_log(db, log_id, found, new, 0, skipped, "failed", str(e))
        raise
    finally:
        db.close()
    return all_records


if __name__ == "__main__":
    run()


@@ -0,0 +1,362 @@
"""Crawler for Gathered Here funeral director directory.

Source: https://www.gatheredhere.com.au
Method: XML sitemap → fetch individual profile pages → parse HTML
Fields: name, address, coords, phone, email, website, description, pricing, reviews
"""

import re
import time
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from pathlib import Path

from base import (
    fetch_url, get_db, start_crawl_log, finish_crawl_log,
    store_source_record, normalize_phone, normalize_state,
    generate_slug, to_intermediate, CRAWL_DELAY,
)

SOURCE_NAME = "gathered_here"
SITEMAP_URL = "https://www.gatheredhere.com.au/sitemap/sitemap-funerals-listings-0.xml"
BASE_URL = "https://www.gatheredhere.com.au"


def fetch_all_listing_urls() -> list[str]:
    """Fetch and parse the sitemap to get all funeral director profile URLs."""
    xml_text = fetch_url(SITEMAP_URL)
    root = ET.fromstring(xml_text)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    urls = []
    for url_elem in root.findall("sm:url", ns):
        loc = url_elem.find("sm:loc", ns)
        if loc is not None and loc.text:
            url = loc.text.strip()
            # Only include individual profile pages (singular /funeral-director/)
            if "/funeral-director/" in url and "/funeral-directors/" not in url:
                urls.append(url)
    return urls

def extract_next_data(html_text: str) -> dict | None:
    """Extract __NEXT_DATA__ JSON from a Next.js page."""
    pattern = r'<script\s+id="__NEXT_DATA__"\s+type="application/json">(.*?)</script>'
    match = re.search(pattern, html_text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            return None
    return None


def extract_from_next_data(next_data: dict) -> dict | None:
    """Extract listing data from __NEXT_DATA__ props."""
    try:
        props = next_data.get("props", {}).get("pageProps", {})
        # Structure: singleListing.listing contains the actual data
        single = props.get("singleListing", {})
        if single:
            listing = single.get("listing")
            if listing and isinstance(listing, dict):
                return listing
        # Fallback paths
        listing = props.get("listing") or props.get("post") or props.get("data")
        return listing
    except (KeyError, TypeError):
        return None

def extract_from_html(html_text: str, url: str) -> dict:
    """Extract listing data from page HTML using regex patterns as fallback."""
    data = {"url": url}

    # Title
    title_match = re.search(r'<h1[^>]*>(.*?)</h1>', html_text, re.DOTALL)
    if title_match:
        data["title"] = re.sub(r'<[^>]+>', '', title_match.group(1)).strip()

    # Phone
    phone_match = re.search(r'href="tel:([^"]+)"', html_text)
    if phone_match:
        data["phone"] = phone_match.group(1).strip()

    # Email
    email_match = re.search(r'href="mailto:([^"]+)"', html_text)
    if email_match:
        data["email"] = email_match.group(1).strip()

    # Website
    website_match = re.search(
        r'<a[^>]*class="[^"]*website[^"]*"[^>]*href="([^"]+)"', html_text
    )
    if website_match:
        data["website"] = website_match.group(1).strip()

    # Address from structured data
    addr_match = re.search(
        r'"streetAddress"\s*:\s*"([^"]*)"', html_text
    )
    if addr_match:
        data["address"] = addr_match.group(1)
    locality_match = re.search(r'"addressLocality"\s*:\s*"([^"]*)"', html_text)
    if locality_match:
        data["suburb"] = locality_match.group(1)
    region_match = re.search(r'"addressRegion"\s*:\s*"([^"]*)"', html_text)
    if region_match:
        data["state"] = region_match.group(1)
    postcode_match = re.search(r'"postalCode"\s*:\s*"([^"]*)"', html_text)
    if postcode_match:
        data["postcode"] = postcode_match.group(1)

    # Coordinates
    lat_match = re.search(r'"latitude"\s*:\s*"?(-?[\d.]+)"?', html_text)
    lng_match = re.search(r'"longitude"\s*:\s*"?(-?[\d.]+)"?', html_text)
    if lat_match:
        data["lat"] = float(lat_match.group(1))
    if lng_match:
        data["lng"] = float(lng_match.group(1))
    return data

def extract_pricing(listing_data: dict) -> dict:
    """Extract pricing from listing meta fields."""
    meta = listing_data.get("meta", {})
    if not meta:
        return {}
    pricing = {}
    price_fields = {
        # With viewing prices
        "cremation_no_service_viewY": "cremation_no_service_with_viewing",
        "cremation_single_viewY": "cremation_single_service_with_viewing",
        "cremation_dual_viewY": "cremation_dual_service_with_viewing",
        "cremation_graveside_viewY": "cremation_graveside_with_viewing",
        "burial_single_viewY": "burial_single_service_with_viewing",
        "burial_dual_viewY": "burial_dual_service_with_viewing",
        "burial_graveside_viewY": "burial_graveside_with_viewing",
        "burial_no_service_viewY": "burial_no_service_with_viewing",
        # Without viewing prices
        "cremation_no_service_viewN": "cremation_no_service",
        "cremation_single_viewN": "cremation_single_service",
        "cremation_dual_viewN": "cremation_dual_service",
        "cremation_graveside_viewN": "cremation_graveside",
        "burial_single_viewN": "burial_single_service",
        "burial_dual_viewN": "burial_dual_service",
        "burial_graveside_viewN": "burial_graveside",
        "burial_no_service_viewN": "burial_no_service",
    }
    for meta_key, label in price_fields.items():
        val = meta.get(meta_key, "")
        if val:
            # Parse price string like "$2,299" to float
            cleaned = re.sub(r'[^\d.]', '', str(val))
            if cleaned:
                try:
                    pricing[label] = float(cleaned)
                except ValueError:
                    pass
    return pricing


def pricing_to_packages(pricing: dict) -> list[dict]:
    """Convert flat pricing dict to package format."""
    packages = []
    # Map pricing keys to funeral types
    type_mappings = [
        ("cremation_no_service", "Cremation Only"),
        ("cremation_single_service", "Service & Cremation"),
        ("cremation_single_service_with_viewing", "Service & Cremation"),
        ("burial_single_service", "Service & Burial"),
        ("burial_graveside", "Graveside Burial"),
    ]
    for price_key, funeral_type in type_mappings:
        if price_key in pricing:
            name = price_key.replace("_", " ").title()
            packages.append({
                "name": name,
                "funeralType": funeral_type,
                "price": pricing[price_key],
                "inclusions": [],  # Not available from Gathered Here listing pages
            })
    return packages

def to_normalized(listing_data: dict, url: str) -> dict:
    """Convert Gathered Here listing data to intermediate format."""
    meta = listing_data.get("meta", {}) if isinstance(listing_data.get("meta"), dict) else {}
    name = listing_data.get("title", listing_data.get("name", "")).strip()
    slug = listing_data.get("slug", "")

    # Extract location
    suburb = meta.get("geolocation_city", "")
    state = normalize_state(meta.get("geolocation_state_short", ""))
    postcode = meta.get("geolocation_postcode", "")
    lat = meta.get("geolocation_lat")
    lng = meta.get("geolocation_long")
    try:
        lat = float(lat) if lat else None
        lng = float(lng) if lng else None
    except (ValueError, TypeError):
        lat = lng = None

    email = meta.get("email", "") or meta.get("_application", "")
    phone = meta.get("phone", "") or listing_data.get("phone", "")

    # Try to get description from content or excerpt
    description = listing_data.get("excerpt", listing_data.get("content", ""))
    if description:
        description = re.sub(r'<[^>]+>', '', description).strip()
        if len(description) > 500:
            description = description[:497] + "..."

    # Website
    website = listing_data.get("website") or meta.get("website") or None

    # Pricing
    pricing = extract_pricing(listing_data)
    packages = pricing_to_packages(pricing)

    business = {
        "name": name,
        "abn": None,
        "phone": normalize_phone(phone),
        "email": email.strip() or None,
        "website": website,
        "description": description or None,
    }
    locations = [{
        "address": meta.get("geolocation_formatted_address", ""),
        "suburb": suburb,
        "state": state,
        "postcode": postcode,
        "lat": lat,
        "lng": lng,
        "phone": normalize_phone(phone),
    }]
    source_id = slug or generate_slug(name)
    return to_intermediate(
        source=SOURCE_NAME,
        source_id=source_id,
        source_url=url,
        business=business,
        locations=locations,
        packages=packages,
    )

def crawl_profile(url: str) -> dict | None:
    """Crawl a single Gathered Here profile page."""
    try:
        html_text = fetch_url(url)
    except Exception as e:
        print(f"  Error fetching {url}: {e}")
        return None

    # Try __NEXT_DATA__ first (structured)
    next_data = extract_next_data(html_text)
    if next_data:
        listing = extract_from_next_data(next_data)
        if listing:
            listing["_source"] = "next_data"
            return listing

    # Fallback to HTML parsing
    data = extract_from_html(html_text, url)
    data["_source"] = "html_fallback"
    return data

def run(limit: int | None = None):
    """Run the full Gathered Here crawl.

    Args:
        limit: If set, only crawl this many profiles (for testing).
    """
    db = get_db()
    log_id = start_crawl_log(db, SOURCE_NAME)
    print(f"[{SOURCE_NAME}] Starting crawl (log_id={log_id})")
    found = 0
    new = 0
    skipped = 0
    errors = 0
    try:
        # Step 1: Get all profile URLs from sitemap
        print("  Fetching sitemap...", end=" ", flush=True)
        urls = fetch_all_listing_urls()
        print(f"{len(urls)} profile URLs found")
        if limit:
            urls = urls[:limit]
            print(f"  (limited to {limit} for testing)")

        # Step 2: Crawl each profile
        for i, url in enumerate(urls):
            slug = url.rstrip("/").split("/")[-1]
            if (i + 1) % 50 == 0 or i == 0:
                print(f"  Crawling {i+1}/{len(urls)}: {slug}")
            listing_data = crawl_profile(url)
            found += 1
            if not listing_data:
                errors += 1
                continue
            source_id = slug
            row_id = store_source_record(
                db, SOURCE_NAME, source_id, url, listing_data, log_id
            )
            if row_id:
                normalized = to_normalized(listing_data, url)
                db.execute(
                    "UPDATE source_record SET normalized_data = ? WHERE id = ?",
                    (json.dumps(normalized), row_id)
                )
                new += 1
            else:
                skipped += 1
            if (i + 1) % 10 == 0:
                db.commit()  # periodic commit
            time.sleep(CRAWL_DELAY)
        db.commit()
        finish_crawl_log(db, log_id, found, new, 0, skipped)
        print(f"[{SOURCE_NAME}] Done: {found} found, {new} new, "
              f"{skipped} skipped, {errors} errors")
    except Exception as e:
        finish_crawl_log(db, log_id, found, new, 0, skipped, "failed", str(e))
        raise
    finally:
        db.close()


if __name__ == "__main__":
    import sys
    limit = int(sys.argv[1]) if len(sys.argv) > 1 else None
    run(limit=limit)

crawlers/crawl_nfda.py

@@ -0,0 +1,163 @@
"""Crawler for the NFDA (National Funeral Directors Association) directory.

Source: https://nfda.com.au/find-your-local-nfda-member/
Method: WPSL JSON API (GET requests with lat/lng search)
Fields: name, address, city, state, postcode, lat/lng, phone, email
"""

import time
import json
from pathlib import Path

from base import (
    fetch_json, get_db, start_crawl_log, finish_crawl_log,
    store_source_record, normalize_phone, normalize_state,
    generate_slug, to_intermediate, CRAWL_DELAY,
)

SOURCE_NAME = "nfda"
API_URL = "https://nfda.com.au/wp-admin/admin-ajax.php"

# Search centroids covering Australia with large radius
SEARCH_POINTS = [
    {"name": "Sydney", "lat": -33.87, "lng": 151.21},
    {"name": "Melbourne", "lat": -37.81, "lng": 144.96},
    {"name": "Brisbane", "lat": -27.47, "lng": 153.03},
    {"name": "Perth", "lat": -31.95, "lng": 115.86},
    {"name": "Adelaide", "lat": -34.93, "lng": 138.60},
    {"name": "Hobart", "lat": -42.88, "lng": 147.33},
    {"name": "Darwin", "lat": -12.46, "lng": 130.85},
    {"name": "Townsville", "lat": -19.26, "lng": 146.82},
    {"name": "Central NSW", "lat": -30.0, "lng": 150.0},
    {"name": "Inland QLD", "lat": -23.0, "lng": 145.0},
]


def fetch_members(lat: float, lng: float, max_results: int = 50,
                  radius: int = 5000) -> list[dict]:
    """Fetch NFDA members near a given lat/lng."""
    params = {
        "action": "store_search",
        "lat": str(lat),
        "lng": str(lng),
        "max_results": str(max_results),
        "search_radius": str(radius),
        "autoload": "1",
    }
    data = fetch_json(API_URL, method="GET", data=params)
    if isinstance(data, list):
        return data
    return []

def to_normalized(record: dict) -> dict:
"""Convert an NFDA record to intermediate format."""
state = normalize_state(record.get("state", ""))
business = {
"name": record.get("store", "").strip(),
"abn": None,
"phone": normalize_phone(record.get("phone")),
"email": record.get("email", "").strip() or None,
"website": record.get("url", "").strip() or None,
"description": None,
}
lat_val = record.get("lat")
lng_val = record.get("lng")
try:
lat_val = float(lat_val) if lat_val else None
lng_val = float(lng_val) if lng_val else None
except (ValueError, TypeError):
lat_val = lng_val = None
city = record.get("city", "").strip()
# Normalize city casing (some are ALL CAPS)
if city and city == city.upper():
city = city.title()
locations = [{
"address": record.get("address", "").strip(),
"suburb": city,
"state": state,
"postcode": record.get("zip", "").strip(),
"lat": lat_val,
"lng": lng_val,
"phone": normalize_phone(record.get("phone")),
}]
source_id = str(record.get("id", ""))
return to_intermediate(
source=SOURCE_NAME,
source_id=source_id,
source_url="https://nfda.com.au/find-your-local-nfda-member/",
business=business,
locations=locations,
)
def run():
"""Run the full NFDA crawl."""
db = get_db()
log_id = start_crawl_log(db, SOURCE_NAME)
print(f"[{SOURCE_NAME}] Starting crawl (log_id={log_id})")
seen_ids = set()
all_records = []
found = 0
new = 0
skipped = 0
try:
for point in SEARCH_POINTS:
print(f" Searching near {point['name']}...", end=" ", flush=True)
members = fetch_members(point["lat"], point["lng"])
new_count = 0
for member in members:
member_id = str(member.get("id", ""))
if member_id in seen_ids:
continue
seen_ids.add(member_id)
all_records.append(member)
new_count += 1
print(f"{len(members)} results, {new_count} new unique")
found += len(members)
time.sleep(CRAWL_DELAY)
print(f" Total unique members: {len(all_records)}")
# Store records
for record in all_records:
source_id = str(record.get("id", ""))
row_id = store_source_record(
db, SOURCE_NAME, source_id,
"https://nfda.com.au/find-your-local-nfda-member/",
record, log_id
)
if row_id:
normalized = to_normalized(record)
db.execute(
"UPDATE source_record SET normalized_data = ? WHERE id = ?",
(json.dumps(normalized), row_id)
)
new += 1
else:
skipped += 1
db.commit()
finish_crawl_log(db, log_id, found, new, 0, skipped)
print(f"[{SOURCE_NAME}] Done: {found} found, {new} new, {skipped} skipped")
except Exception as e:
finish_crawl_log(db, log_id, found, new, 0, skipped, "failed", str(e))
raise
finally:
db.close()
return all_records
if __name__ == "__main__":
run()
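
The search centroids in `SEARCH_POINTS` overlap, so the same member can come back from several queries. Below is a minimal, self-contained sketch of the de-duplication step that `run()` performs with `seen_ids`. The helper name `merge_unique` and the sample records are hypothetical and are not real API output.

```python
def merge_unique(batches: list[list[dict]]) -> list[dict]:
    """Merge result batches, keeping the first record seen for each id."""
    seen_ids: set[str] = set()
    unique: list[dict] = []
    for batch in batches:
        for member in batch:
            member_id = str(member.get("id", ""))
            if member_id in seen_ids:
                continue
            seen_ids.add(member_id)
            unique.append(member)
    return unique


# Hypothetical sample data standing in for two overlapping radius searches.
sydney = [
    {"id": "1", "store": "Alpha Funerals"},
    {"id": "2", "store": "Beta Funerals"},
]
central_nsw = [
    {"id": "2", "store": "Beta Funerals"},
    {"id": "3", "store": "Gamma Funerals"},
]

merged = merge_unique([sydney, central_nsw])
print([m["id"] for m in merged])
```

One consequence of keying on `str(member.get("id", ""))` is that records without an `id` all share the empty-string key, so only the first of them would be kept.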


@@ -0,0 +1,220 @@
"""Crawler for the VIC Consumer Affairs Public Register of Funeral Providers.
Source: https://registers.consumer.vic.gov.au/fpsearch
Method: HTTP GET per letter A-Z, parse HTML tables
Fields: name, place of business, postcode, postal address, phone
"""
import re
import time
import json
import html.parser
from pathlib import Path
from base import (
fetch_url, get_db, start_crawl_log, finish_crawl_log,
store_source_record, normalize_phone, generate_slug,
to_intermediate, CRAWL_DELAY,
)
SOURCE_NAME = "vic_register"
BASE_URL = "https://registers.consumer.vic.gov.au/FpSearch/PerformSearch"
class VICTableParser(html.parser.HTMLParser):
"""Parse the VIC register HTML table into records."""
def __init__(self):
super().__init__()
self.records = []
self._in_table = False
self._in_tbody = False
self._in_row = False
self._in_cell = False
self._current_row = []
self._current_cell = ""
def handle_starttag(self, tag, attrs):
if tag == "table":
self._in_table = True
elif tag == "tbody" and self._in_table:
self._in_tbody = True
elif tag == "tr" and self._in_tbody:
self._in_row = True
self._current_row = []
elif tag == "td" and self._in_row:
self._in_cell = True
self._current_cell = ""
def handle_endtag(self, tag):
if tag == "td" and self._in_cell:
self._in_cell = False
self._current_row.append(self._current_cell.strip())
elif tag == "tr" and self._in_row:
self._in_row = False
if len(self._current_row) >= 4:
self.records.append(self._current_row)
elif tag == "tbody":
self._in_tbody = False
elif tag == "table":
self._in_table = False
def handle_data(self, data):
if self._in_cell:
self._current_cell += data
def parse_address(place_of_business: str) -> dict:
"""Parse a VIC register address into components."""
parts = place_of_business.strip()
# Try to extract postcode from the end
postcode_match = re.search(r'\b(\d{4})\s*$', parts)
postcode = postcode_match.group(1) if postcode_match else None
# Try to extract suburb (usually the last word(s) before postcode)
suburb = None
if postcode:
before_postcode = parts[:postcode_match.start()].strip().rstrip(",").strip()
# Last segment after comma is usually suburb
if "," in before_postcode:
suburb = before_postcode.split(",")[-1].strip()
else:
# Take last 1-2 words as suburb
words = before_postcode.split()
if len(words) >= 2:
suburb = " ".join(words[-2:]) if words[-1][0].isupper() else words[-1]
return {
"address": parts,
"suburb": suburb,
"state": "VIC",
"postcode": postcode,
}
def crawl_letter(letter: str) -> list[dict]:
"""Crawl all records for a single letter."""
url = f"{BASE_URL}?Letter={letter}"
html_text = fetch_url(url)
parser = VICTableParser()
parser.feed(html_text)
records = []
for row in parser.records:
# Columns: Name, Place of Business, Postcode, Postal Address, Phone
name = row[0] if len(row) > 0 else ""
place = row[1] if len(row) > 1 else ""
postcode = row[2] if len(row) > 2 else ""
postal = row[3] if len(row) > 3 else ""
phone = row[4] if len(row) > 4 else ""
if not name:
continue
records.append({
"name": name.strip(),
"place_of_business": place.strip(),
"postcode": postcode.strip(),
"postal_address": postal.strip(),
"phone": phone.strip(),
})
return records
def make_source_id(record: dict) -> str:
    """Create a stable source ID from the slugified name + postcode."""
    name = record["name"].lower().strip()
    return f"{generate_slug(name)}_{record['postcode']}"
def to_normalized(record: dict) -> dict:
"""Convert a VIC register record to intermediate format."""
addr = parse_address(record["place_of_business"])
business = {
"name": record["name"],
"abn": None,
"phone": normalize_phone(record["phone"]),
"email": None,
"website": None,
"description": None,
}
locations = [{
"address": record["place_of_business"],
"suburb": addr["suburb"],
"state": "VIC",
"postcode": record["postcode"] or addr["postcode"],
"lat": None,
"lng": None,
"phone": normalize_phone(record["phone"]),
}]
source_id = make_source_id(record)
return to_intermediate(
source=SOURCE_NAME,
source_id=source_id,
source_url=f"{BASE_URL}?Letter={record['name'][0].upper()}",
business=business,
locations=locations,
)
def run():
"""Run the full VIC register crawl."""
db = get_db()
log_id = start_crawl_log(db, SOURCE_NAME)
print(f"[{SOURCE_NAME}] Starting crawl (log_id={log_id})")
all_records = []
found = 0
new = 0
skipped = 0
try:
for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
print(f" Crawling letter {letter}...", end=" ", flush=True)
records = crawl_letter(letter)
print(f"{len(records)} records")
all_records.extend(records)
found += len(records)
if letter != "Z":
time.sleep(CRAWL_DELAY)
# Store and normalize
for record in all_records:
source_id = make_source_id(record)
row_id = store_source_record(
db, SOURCE_NAME, source_id,
f"{BASE_URL}?Letter={record['name'][0].upper()}",
record, log_id
)
if row_id:
normalized = to_normalized(record)
db.execute(
"UPDATE source_record SET normalized_data = ? WHERE id = ?",
(json.dumps(normalized), row_id)
)
new += 1
else:
skipped += 1
db.commit()
finish_crawl_log(db, log_id, found, new, 0, skipped)
print(f"[{SOURCE_NAME}] Done: {found} found, {new} new, {skipped} skipped")
except Exception as e:
finish_crawl_log(db, log_id, found, new, 0, skipped, "failed", str(e))
raise
finally:
db.close()
return all_records
if __name__ == "__main__":
run()
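
The register is scraped by feeding each letter page to an `html.parser.HTMLParser` subclass that collects `<td>` text row by row. The following is a simplified sketch of that technique, not the module's `VICTableParser` itself: `RowParser` is a hypothetical name, it skips the `<table>`/`<tbody>` tracking, and the sample HTML is invented for illustration.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # list of cells while inside a <tr>, else None
        self._cell = None  # accumulated text while inside a <td>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = ""

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append(self._cell.strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Text between cells (newlines, indentation) is ignored.
        if self._cell is not None:
            self._cell += data


# Invented sample markup with the five columns the crawler expects.
SAMPLE = """
<table><tbody>
<tr><td>Example Funerals Pty Ltd</td><td>1 Sample St, Exampleville 3000</td>
<td>3000</td><td>PO Box 1, Exampleville 3000</td><td>(03) 0000 0000</td></tr>
</tbody></table>
"""

parser = RowParser()
parser.feed(SAMPLE)
rows = parser.rows
print(rows)
```

Header rows built from `<th>` cells produce no `<td>` text here, so they are dropped as empty rows, which is similar in effect to the length check in `VICTableParser.handle_endtag`.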

425
crawlers/dedup.py Normal file

@@ -0,0 +1,425 @@
"""Deduplication and merge engine.
Processes source_records → funeral_brand + location + package entries.
Handles cross-source matching and field-level merging.
Matching hierarchy (strongest to weakest):
1. source_key match — same record from same source (skip/update)
2. ABN match — same business entity
3. Name + Postcode exact match — likely same business
4. Fuzzy name match (>85%) + same state — probable match, flag for review
Merge priority (higher = preferred):
vic_register > funerals_australia > nfda > gathered_here
Never overwrite verified provider data.
"""
import json
import re
import sqlite3
from difflib import SequenceMatcher
from base import get_db, generate_slug, normalize_state
# Source priority for merge conflicts (higher number = more authoritative)
SOURCE_PRIORITY = {
"vic_register": 40,
"funerals_australia": 30,
"nfda": 20,
"gathered_here": 10,
}
def normalize_name(name: str) -> str:
    """Normalize a business name for comparison."""
    name = name.strip().upper()
    # Remove common suffixes. Longer variants are listed before their shorter
    # tails (" PROPRIETARY LIMITED" before " LIMITED") so they strip whole.
    for suffix in [" PTY LTD", " PTY. LTD.", " P/L", " PROPRIETARY LIMITED",
                   " LIMITED", " INC", " LLC",
                   " FUNERAL DIRECTORS", " FUNERAL SERVICES",
                   " FUNERALS", " FUNERAL HOME"]:
        name = name.removesuffix(suffix)
    # Remove punctuation
    name = re.sub(r"[''`\".,&()-]", " ", name)
    name = re.sub(r"\s+", " ", name).strip()
    return name
def fuzzy_match(name1: str, name2: str) -> float:
"""Return similarity ratio between two names (0.0 to 1.0)."""
n1 = normalize_name(name1)
n2 = normalize_name(name2)
return SequenceMatcher(None, n1, n2).ratio()
def find_existing_brand(db: sqlite3.Connection, record: dict) -> tuple[int | None, str]:
    """Find a matching funeral_brand for a source record.

    Returns (brand_id, match_type) or (None, 'new').
    """
    biz = record.get("business", {})
    locs = record.get("locations", [])
    name = biz.get("name", "")
    abn = biz.get("abn")
    source = record.get("source", "")
    source_id = record.get("sourceId", "")
    source_key = f"{source}:{source_id}"
    postcode = None
    state = None
    if locs:
        postcode = locs[0].get("postcode")
        state = locs[0].get("state")
    # 1. Source key match (exact same record from same source)
    row = db.execute(
        "SELECT id FROM funeral_brand WHERE source_key = ?",
        (source_key,)
    ).fetchone()
    if row:
        return row["id"], "source_key"
    # 2. ABN match
    if abn:
        row = db.execute(
            "SELECT id FROM funeral_brand WHERE abn = ?",
            (abn,)
        ).fetchone()
        if row:
            return row["id"], "abn"
    # 3. Exact normalized-name + postcode match
    if name and postcode:
        norm = normalize_name(name)
        # Filter by postcode in SQL, then compare normalized names in Python
        rows = db.execute(
            "SELECT id, title FROM funeral_brand WHERE business_postcode = ?",
            (postcode,)
        ).fetchall()
        for row in rows:
            if normalize_name(row["title"]) == norm:
                return row["id"], "name_postcode"
    # 4. Fuzzy name + same state: keep the best-scoring candidate rather
    #    than the first one that clears the threshold
    if name and state:
        rows = db.execute(
            "SELECT id, title FROM funeral_brand WHERE business_state = ?",
            (state,)
        ).fetchall()
        best_id, best_score = None, 0.0
        for row in rows:
            score = fuzzy_match(name, row["title"])
            if score > best_score:
                best_id, best_score = row["id"], score
        if best_id is not None and best_score >= 0.85:
            return best_id, "fuzzy"
    return None, "new"
def merge_field(existing: str | None, new_val: str | None,
existing_priority: int, new_priority: int) -> str | None:
"""Merge a single field, preferring non-null and higher-priority."""
if not new_val:
return existing
if not existing:
return new_val
# Both have values — prefer higher priority source
if new_priority > existing_priority:
return new_val
return existing
def create_brand(db: sqlite3.Connection, record: dict) -> int:
"""Create a new funeral_brand from a source record."""
biz = record.get("business", {})
locs = record.get("locations", [])
source = record.get("source", "")
source_id = record.get("sourceId", "")
source_key = f"{source}:{source_id}"
loc = locs[0] if locs else {}
slug = generate_slug(biz.get("name", "unknown"))
# Ensure unique slug
base_slug = slug
counter = 1
while True:
existing = db.execute(
"SELECT id FROM funeral_brand WHERE code = ?", (slug,)
).fetchone()
if not existing:
break
slug = f"{base_slug}-{counter}"
counter += 1
cur = db.execute(
"""INSERT INTO funeral_brand (
title, description, email, phone, website, abn, code,
hidden, verified, source_key, source_url, enrichment_status,
business_address, business_suburb, business_state, business_postcode
) VALUES (?, ?, ?, ?, ?, ?, ?, 1, 0, ?, ?, 'pending', ?, ?, ?, ?)""",
(
biz.get("name"),
biz.get("description"),
biz.get("email"),
biz.get("phone"),
biz.get("website"),
biz.get("abn"),
slug,
source_key,
record.get("sourceUrl"),
loc.get("address"),
loc.get("suburb"),
loc.get("state"),
loc.get("postcode"),
)
)
brand_id = cur.lastrowid
# Create locations
for loc_data in locs:
title_parts = [loc_data.get("suburb", ""), loc_data.get("state", "")]
loc_title = ", ".join(p for p in title_parts if p) or biz.get("name", "")
db.execute(
"""INSERT INTO location (
title, address, suburb, state, postcode, lat, lng, brand_id
) VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
(
loc_title,
loc_data.get("address"),
loc_data.get("suburb"),
loc_data.get("state"),
loc_data.get("postcode"),
loc_data.get("lat"),
loc_data.get("lng"),
brand_id,
)
)
# Create packages (from Gathered Here pricing)
packages = record.get("packages", [])
for pkg in packages:
if not pkg.get("price"):
continue
cur = db.execute(
"""INSERT INTO package (
title, funeral_type, brand_id, source_url, extraction_confidence
) VALUES (?, ?, ?, ?, ?)""",
(
pkg.get("name"),
pkg.get("funeralType"),
brand_id,
record.get("sourceUrl"),
0.8, # Gathered Here pricing is structured, fairly reliable
)
)
pkg_id = cur.lastrowid
# Create inclusions if available
for inc in pkg.get("inclusions", []):
db.execute(
"""INSERT INTO package_inclusion (
price, optional, complimentary, inclusion_type_title, package_id
) VALUES (?, ?, ?, ?, ?)""",
(
inc.get("price", 0),
1 if inc.get("optional") else 0,
1 if inc.get("complimentary") else 0,
inc.get("item", "Unknown"),
pkg_id,
)
)
return brand_id
def update_brand(db: sqlite3.Connection, brand_id: int,
record: dict, match_type: str) -> bool:
"""Merge new data into an existing brand. Returns True if updated."""
biz = record.get("business", {})
locs = record.get("locations", [])
source = record.get("source", "")
new_priority = SOURCE_PRIORITY.get(source, 0)
# Never overwrite verified providers
brand = db.execute(
"SELECT * FROM funeral_brand WHERE id = ?", (brand_id,)
).fetchone()
if brand["verified"]:
return False
# Determine existing source priority
existing_source = ""
if brand["source_key"]:
existing_source = brand["source_key"].split(":")[0]
existing_priority = SOURCE_PRIORITY.get(existing_source, 0)
# Field-level merge — only fill blanks or upgrade from higher priority
updates = {}
field_map = {
"description": biz.get("description"),
"email": biz.get("email"),
"phone": biz.get("phone"),
"website": biz.get("website"),
"abn": biz.get("abn"),
}
for field, new_val in field_map.items():
merged = merge_field(brand[field], new_val, existing_priority, new_priority)
if merged != brand[field]:
updates[field] = merged
# Update location data if we have coords and existing doesn't
if locs:
loc = locs[0]
existing_locs = db.execute(
"SELECT * FROM location WHERE brand_id = ?", (brand_id,)
).fetchall()
if not existing_locs and loc.get("suburb"):
title_parts = [loc.get("suburb", ""), loc.get("state", "")]
loc_title = ", ".join(p for p in title_parts if p)
db.execute(
"""INSERT INTO location (
title, address, suburb, state, postcode, lat, lng, brand_id
) VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
(
loc_title, loc.get("address"), loc.get("suburb"),
loc.get("state"), loc.get("postcode"),
loc.get("lat"), loc.get("lng"), brand_id,
)
)
elif existing_locs:
# Update first location with coords if missing
eloc = existing_locs[0]
if not eloc["lat"] and loc.get("lat"):
db.execute(
"UPDATE location SET lat = ?, lng = ? WHERE id = ?",
(loc.get("lat"), loc.get("lng"), eloc["id"])
)
# Add packages if we have them and brand doesn't yet
packages = record.get("packages", [])
if packages:
existing_pkgs = db.execute(
"SELECT COUNT(*) as n FROM package WHERE brand_id = ?", (brand_id,)
).fetchone()["n"]
if existing_pkgs == 0:
for pkg in packages:
if not pkg.get("price"):
continue
cur = db.execute(
"""INSERT INTO package (
title, funeral_type, brand_id, source_url
) VALUES (?, ?, ?, ?)""",
(pkg.get("name"), pkg.get("funeralType"),
brand_id, record.get("sourceUrl"))
)
if updates:
set_clause = ", ".join(f"{k} = ?" for k in updates)
values = list(updates.values()) + [brand_id]
db.execute(
f"UPDATE funeral_brand SET {set_clause}, updated_at = datetime('now') WHERE id = ?",
values
)
return True
return False
def process_all():
"""Process all source_records through deduplication and create brand entries.
Order matters: process higher-priority sources first so their data
forms the base record that lower-priority sources merge into.
"""
db = get_db()
# Process in priority order (highest first)
sources_ordered = sorted(SOURCE_PRIORITY.keys(),
key=lambda s: SOURCE_PRIORITY[s], reverse=True)
stats = {"new": 0, "updated": 0, "skipped": 0, "matched": 0}
print("=" * 60)
print("DEDUPLICATION ENGINE")
print("=" * 60)
for source in sources_ordered:
records = db.execute(
"""SELECT id, normalized_data FROM source_record
WHERE source_name = ? AND normalized_data IS NOT NULL""",
(source,)
).fetchall()
if not records:
continue
print(f"\n Processing {source}: {len(records)} records")
source_stats = {"new": 0, "updated": 0, "skipped": 0, "matched": 0}
for row in records:
record = json.loads(row["normalized_data"])
brand_id, match_type = find_existing_brand(db, record)
if match_type == "new":
brand_id = create_brand(db, record)
source_stats["new"] += 1
elif match_type == "source_key":
source_stats["skipped"] += 1
else:
# Matched to existing — merge
updated = update_brand(db, brand_id, record, match_type)
if updated:
source_stats["updated"] += 1
else:
source_stats["matched"] += 1
# Update source_record with match info
db.execute(
"""UPDATE source_record
SET matched_brand_id = ?, match_type = ?, processed_at = datetime('now')
WHERE id = ?""",
(brand_id, match_type, row["id"])
)
db.commit()
print(f" New: {source_stats['new']}, Updated: {source_stats['updated']}, "
f"Matched: {source_stats['matched']}, Skipped: {source_stats['skipped']}")
for k, v in source_stats.items():
stats[k] += v
# Final summary
total_brands = db.execute("SELECT COUNT(*) as n FROM funeral_brand").fetchone()["n"]
total_locations = db.execute("SELECT COUNT(*) as n FROM location").fetchone()["n"]
total_packages = db.execute("SELECT COUNT(*) as n FROM package").fetchone()["n"]
print(f"\n{'=' * 60}")
print(f"DEDUP RESULTS")
print(f"{'=' * 60}")
print(f" New brands created: {stats['new']}")
print(f" Existing updated: {stats['updated']}")
print(f" Matched (no change): {stats['matched']}")
print(f" Skipped (source_key): {stats['skipped']}")
print(f"\n Total brands in DB: {total_brands}")
print(f" Total locations in DB: {total_locations}")
print(f" Total packages in DB: {total_packages}")
# Show match type breakdown
print(f"\n Match type breakdown:")
rows = db.execute(
"""SELECT match_type, COUNT(*) as n
FROM source_record WHERE processed_at IS NOT NULL
GROUP BY match_type ORDER BY n DESC"""
).fetchall()
for row in rows:
print(f" {row['match_type']:15s} {row['n']:5d}")
db.close()
if __name__ == "__main__":
process_all()
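
The name-based steps of the matching hierarchy rest on two ideas: strip legal and trade suffixes before comparing, then score what remains with `difflib.SequenceMatcher`. The sketch below illustrates both with a shortened suffix list. `SUFFIXES`, `normalize`, `similarity`, and the business names are hypothetical stand-ins, and the 0.85 threshold is the one used by `find_existing_brand`.

```python
import re
from difflib import SequenceMatcher

# Shortened, illustrative suffix list (the module's list is longer).
SUFFIXES = (" PTY LTD", " FUNERAL DIRECTORS", " FUNERAL SERVICES", " FUNERALS")


def normalize(name: str) -> str:
    """Upper-case, strip known suffixes, and collapse punctuation to spaces."""
    name = name.strip().upper()
    for suffix in SUFFIXES:
        name = name.removesuffix(suffix)
    name = re.sub(r"[.,&'-]", " ", name)
    return re.sub(r"\s+", " ", name).strip()


def similarity(name1: str, name2: str) -> float:
    """Similarity ratio of the normalized names, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(name1), normalize(name2)).ratio()


THRESHOLD = 0.85

same = similarity("Smith Funerals Pty Ltd", "SMITH FUNERAL DIRECTORS")
different = similarity("Smith Funerals", "Jones Funerals")
print(same >= THRESHOLD, different >= THRESHOLD)
```

Both names in the first pair reduce to `SMITH`, so they score as identical. The flip side is that aggressive suffix stripping can make unrelated businesses that share a surname look alike, which is one reason the fuzzy step is also restricted to the same state.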


@@ -0,0 +1,320 @@
"""Website discovery module.
For each provider without a website URL, attempts to find their website
using multiple strategies (tried in order):
1. Serper.dev (2,500 free Google searches, no CC needed)
2. DuckDuckGo lite (free fallback, rate-limited)
3. URL pattern guessing (businessname.com.au)
Also validates discovered URLs to confirm they belong to the business.
Configuration:
Set SERPER_API_KEY env var or in config.json to enable Serper.dev.
Without it, falls back to DuckDuckGo.
"""
import json
import os
import re
import time
import urllib.parse
import urllib.request
import urllib.error
from pathlib import Path
from base import (
fetch_url, get_db, normalize_phone, CRAWL_DELAY,
)
# Load Serper API key from env or config
SERPER_API_KEY = os.environ.get("SERPER_API_KEY")
if not SERPER_API_KEY:
config_path = Path(__file__).parent / "config.json"
if config_path.exists():
with open(config_path) as f:
config = json.load(f)
SERPER_API_KEY = config.get("serper_api_key")
# Domains to skip when extracting search results
SKIP_DOMAINS = [
"yellowpages", "whitepages", "truelocal", "yelp", "cylex",
"australia247", "showmelocal", "hotfrog", "localsearch",
"facebook.com", "linkedin.com", "instagram.com", "twitter.com",
"gatheredhere", "ezifunerals", "funeralocity", "funeraldirectory",
"deathsandfunerals", "mytributes", "obits.com",
"duckduckgo.com", "google.com", "bing.com",
"nfda.com.au", "funeralsaustralia.org",
"wikipedia.org", "youtube.com",
]
def search_serper(query: str) -> list[str]:
"""Search via Serper.dev (Google results as JSON). 2,500 free queries."""
if not SERPER_API_KEY:
return []
url = "https://google.serper.dev/search"
data = json.dumps({"q": query, "gl": "au", "num": 10}).encode("utf-8")
req = urllib.request.Request(url, data=data, headers={
"X-API-KEY": SERPER_API_KEY,
"Content-Type": "application/json",
})
try:
with urllib.request.urlopen(req, timeout=15) as resp:
result = json.loads(resp.read().decode("utf-8"))
except Exception:
return []
results = []
for item in result.get("organic", []):
link = item.get("link", "")
if not link:
continue
if any(d in link.lower() for d in SKIP_DOMAINS):
continue
results.append(link)
return results
def search_ddg(query: str) -> list[str]:
"""Search DuckDuckGo lite and return result URLs (filtered)."""
encoded = urllib.parse.quote(query)
url = f"https://lite.duckduckgo.com/lite/?q={encoded}"
try:
html = fetch_url(url)
except Exception:
return []
# Extract redirect URLs from DDG lite format
raw_links = re.findall(
r'href="//duckduckgo\.com/l/\?uddg=([^&"]+)', html
)
results = []
for link in raw_links:
decoded = urllib.parse.unquote(link)
# Skip ads
if "ad_domain" in decoded or "ad_provider" in decoded:
continue
# Skip directory/aggregator sites
if any(d in decoded.lower() for d in SKIP_DOMAINS):
continue
results.append(decoded)
return results
def validate_url(url: str, business_name: str) -> dict:
    """Validate that a URL is a real website belonging to this business.

    Returns: {valid: bool, confidence: str, reason: str}
    """
    try:
        html = fetch_url(url, timeout=15)
    except urllib.error.HTTPError as e:
        return {"valid": False, "confidence": "none", "reason": f"HTTP {e.code}"}
    except Exception as e:
        return {"valid": False, "confidence": "none", "reason": str(e)[:100]}
    html_lower = html.lower()
    # Check if it's a parked/for-sale domain
    parked_signals = ["domain is for sale", "buy this domain",
                      "parked domain", "this domain", "godaddy",
                      "domain parking"]
    if any(s in html_lower for s in parked_signals):
        return {"valid": False, "confidence": "none", "reason": "parked domain"}
    # Check if the page mentions the business name.
    # Only words longer than 2 characters count as name parts, so the
    # required match count is based on the words that can actually match.
    name_parts = [p for p in business_name.lower().split() if len(p) > 2]
    # Require at least 2 name parts to match (or all, if fewer are usable)
    min_matches = min(2, len(name_parts))
    matches = sum(1 for part in name_parts if part in html_lower)
    if name_parts and matches >= min_matches:
        return {"valid": True, "confidence": "confirmed", "reason": "name found in page"}
    # Check title tag
    title_match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if title_match:
        title = title_match.group(1).lower()
        if any(part in title for part in name_parts):
            return {"valid": True, "confidence": "probable",
                    "reason": "partial name in title"}
    # Check for funeral-related content (it's at least a funeral business)
    funeral_signals = ["funeral", "cremation", "burial", "memorial",
                       "chapel", "obituar", "condolence"]
    if any(s in html_lower for s in funeral_signals):
        return {"valid": True, "confidence": "probable",
                "reason": "funeral content found, name not confirmed"}
    return {"valid": False, "confidence": "low",
            "reason": "business name not found on page"}
def guess_urls(business_name: str) -> list[str]:
"""Generate candidate URLs from a business name."""
# Clean name for domain guessing
slug = business_name.lower().strip()
slug = re.sub(r"[''`]", "", slug)
slug = re.sub(r"\b(pty|ltd|limited|proprietary|inc)\b", "", slug)
slug = re.sub(r"[^a-z0-9]+", "", slug)
# Also try hyphenated version
slug_hyphen = business_name.lower().strip()
slug_hyphen = re.sub(r"[''`]", "", slug_hyphen)
slug_hyphen = re.sub(r"\b(pty|ltd|limited|proprietary|inc)\b", "", slug_hyphen)
slug_hyphen = re.sub(r"[^a-z0-9]+", "-", slug_hyphen).strip("-")
candidates = []
for s in [slug, slug_hyphen]:
if s:
candidates.append(f"https://www.{s}.com.au")
candidates.append(f"https://{s}.com.au")
return candidates
def discover_website(name: str, suburb: str | None, state: str | None,
phone: str | None = None) -> dict | None:
"""Attempt to discover a business website.
Returns: {url, confidence, method, validation} or None.
"""
# Build search query
query_parts = [name]
if suburb:
query_parts.append(suburb)
if state:
query_parts.append(state)
query = " ".join(query_parts)
# Strategy 1: Serper.dev (Google results, 2500 free)
results = search_serper(query)
# Strategy 2: DuckDuckGo fallback
if not results:
results = search_ddg(query)
for url in results[:3]:
validation = validate_url(url, name)
if validation["valid"]:
return {
"url": url.rstrip("/"),
"confidence": validation["confidence"],
"method": "search",
"validation": validation,
}
time.sleep(0.5)
# Strategy 3: URL guessing
candidates = guess_urls(name)
for url in candidates:
try:
validation = validate_url(url, name)
if validation["valid"]:
return {
"url": url.rstrip("/"),
"confidence": validation["confidence"],
"method": "guess",
"validation": validation,
}
except Exception:
continue
time.sleep(0.3)
return None
def run(limit: int | None = None, state_filter: str | None = None):
"""Discover websites for all providers without one.
Args:
limit: Max providers to process (for testing).
state_filter: Only process providers in this state.
"""
db = get_db()
query = """
SELECT id, title, business_suburb, business_state, phone
FROM funeral_brand
WHERE website IS NULL AND verified = 0
"""
params = []
if state_filter:
query += " AND business_state = ?"
params.append(state_filter)
query += " ORDER BY id"
if limit:
query += " LIMIT ?"
params.append(limit)
providers = db.execute(query, params).fetchall()
print(f"Providers without websites: {len(providers)}")
found = 0
not_found = 0
for i, prov in enumerate(providers):
name = prov["title"]
suburb = prov["business_suburb"]
state = prov["business_state"]
phone = prov["phone"]
if (i + 1) % 10 == 0 or i == 0:
print(f" [{i+1}/{len(providers)}] Processing: {name}")
result = discover_website(name, suburb, state, phone)
if result:
db.execute(
"""UPDATE funeral_brand
SET website = ?, updated_at = datetime('now')
WHERE id = ?""",
(result["url"], prov["id"])
)
found += 1
if (i + 1) <= 20 or result["confidence"] == "confirmed":
print(f" FOUND ({result['confidence']}, {result['method']}): "
f"{result['url']}")
else:
not_found += 1
if (i + 1) % 20 == 0:
db.commit()
# Rate limit: ~2s between providers (DDG + validation requests)
time.sleep(CRAWL_DELAY * 2)
db.commit()
print(f"\nDone: {found} websites found, {not_found} not found")
print(f" Success rate: {found/(found+not_found)*100:.1f}%" if found + not_found > 0 else "")
db.close()
if __name__ == "__main__":
import sys
limit = None
state = None
for arg in sys.argv[1:]:
if arg.startswith("--state="):
state = arg.split("=")[1]
elif arg.startswith("--limit="):
limit = int(arg.split("=")[1])
else:
try:
limit = int(arg)
except ValueError:
pass
run(limit=limit, state_filter=state)
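
Search results are filtered against `SKIP_DOMAINS` before any candidate is validated. The module matches each skip entry as a substring of the whole link. The sketch below shows a slightly stricter variant that matches against the hostname only, so a skip entry appearing in a query string does not discard an otherwise good result. `SKIP`, `is_skipped`, `first_candidates`, and the URLs are hypothetical and only illustrate the idea.

```python
from urllib.parse import urlparse

# Shortened, illustrative skip list (the module's SKIP_DOMAINS is longer).
SKIP = ("facebook.com", "yellowpages", "google.com")


def is_skipped(url: str) -> bool:
    """True if the URL's hostname contains any skip-list entry."""
    host = (urlparse(url).hostname or "").lower()
    return any(entry in host for entry in SKIP)


def first_candidates(urls: list[str], limit: int = 3) -> list[str]:
    """Drop directory/social results and keep the first few survivors."""
    return [u for u in urls if not is_skipped(u)][:limit]


# Invented search results for illustration.
results = [
    "https://www.facebook.com/examplefunerals",
    "https://www.yellowpages.com.au/vic/example-funerals",
    "https://www.example-funerals.com.au/?ref=google.com",
]

candidates = first_candidates(results)
print(candidates)
```

The default `limit=3` mirrors `discover_website`, which validates only the first three filtered results.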

393
crawlers/enrich_websites.py Normal file

@@ -0,0 +1,393 @@
"""Website enrichment module.
For each provider with a website but no packages yet, crawls their site
to find pricing/packages pages and extracts structured data.
Two extraction modes:
1. Direct HTML parsing (for sites with clear pricing structure)
2. AI extraction via API call (for complex/varied layouts)
This module handles the crawling and page discovery.
AI extraction is delegated to the N8N workflow (Claude Haiku node).
"""
import json
import re
import time
import urllib.parse
import urllib.error
from pathlib import Path
from base import fetch_url, get_db, CRAWL_DELAY
# Common URL patterns for pricing/packages pages
PRICING_PATHS = [
"/pricing",
"/prices",
"/our-prices",
"/packages",
"/funeral-packages",
"/services",
"/our-services",
"/funeral-costs",
"/funeral-services",
"/service-options",
"/price-list",
"/transparency",
"/funeral-pricing",
"/costs",
"/cremation",
"/cremation-packages",
"/burial",
"/plan-a-funeral",
"/arrange",
]
# Keywords that suggest a link leads to pricing
PRICING_KEYWORDS = [
"pric", "cost", "packag", "service", "plan",
"cremation", "burial", "funeral",
"transparency", "disclosure",
]
def find_pricing_page(base_url: str, homepage_html: str) -> str | None:
    """Try to find the pricing/packages page URL.

    Strategy:
    1. Try common URL patterns
    2. Parse homepage links for pricing-related keywords
    """
    base = base_url.rstrip("/")
    base_host = urllib.parse.urlparse(base).netloc.lower().removeprefix("www.")
    # Strategy 1: Try common paths
    for path in PRICING_PATHS:
        test_url = base + path
        try:
            html = fetch_url(test_url, timeout=10)
        except Exception:
            # Covers HTTPError and URLError, which are Exception subclasses
            continue
        # Verify it's not a 404 soft-redirect (check for pricing content)
        if len(html) > 1000 and ("$" in html or "price" in html.lower()):
            return test_url
        time.sleep(0.3)
    # Strategy 2: Parse homepage links
    link_pattern = re.compile(
        r'<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
        re.IGNORECASE | re.DOTALL
    )
    for match in link_pattern.finditer(homepage_html):
        href = match.group(1).strip()
        text = re.sub(r"<[^>]+>", "", match.group(2)).lower().strip()
        # Skip empty and in-page anchor links
        if not href or href.startswith("#"):
            continue
        # Check if link text or URL contains pricing keywords
        if not any(kw in text or kw in href.lower() for kw in PRICING_KEYWORDS):
            continue
        # Resolve relative URLs against the site root
        full_url = urllib.parse.urljoin(base + "/", href)
        parsed = urllib.parse.urlparse(full_url)
        # Skip mailto:/tel:/javascript: links and links to other domains
        link_host = parsed.netloc.lower().removeprefix("www.")
        if parsed.scheme not in ("http", "https") or link_host != base_host:
            continue
        try:
            html = fetch_url(full_url, timeout=10)
        except Exception:
            continue
        if len(html) > 500:
            return full_url
        time.sleep(0.3)
    return None
def extract_description(html: str) -> str | None:
    """Extract a business description from homepage HTML."""
    # Try meta description first
    meta_match = re.search(
        r'<meta\s+(?:name="description"\s+content="([^"]+)"|content="([^"]+)"\s+name="description")',
        html, re.IGNORECASE
    )
    if meta_match:
        desc = meta_match.group(1) or meta_match.group(2)
        if desc and len(desc) > 20:
            return desc.strip()
    # Try OG description
    og_match = re.search(
        r'<meta\s+property="og:description"\s+content="([^"]+)"',
        html, re.IGNORECASE
    )
    if og_match and len(og_match.group(1)) > 20:
        return og_match.group(1).strip()
    return None
def extract_contact_info(html: str) -> dict:
    """Extract contact details from HTML."""
    info = {}
    # Phone
    phone_match = re.search(r'href="tel:([^"]+)"', html)
    if phone_match:
        info["phone"] = phone_match.group(1).strip()
    # Email
    email_match = re.search(r'href="mailto:([^"?]+)"', html)
    if email_match:
        info["email"] = email_match.group(1).strip()
    # Address from JSON-LD
    addr_match = re.search(r'"streetAddress"\s*:\s*"([^"]*)"', html)
    if addr_match:
        info["address"] = addr_match.group(1)
    return info
def check_has_pricing(html: str) -> bool:
    """Quick check whether a page contains pricing information."""
    # Look for dollar signs near numbers
    price_pattern = re.compile(r'\$[\d,]+(?:\.\d{2})?')
    prices_found = price_pattern.findall(html)
    # Filter out tiny amounts (likely not funeral pricing)
    significant_prices = []
    for p in prices_found:
        cleaned = p.replace("$", "").replace(",", "").strip()
        if not cleaned:
            continue
        try:
            amount = float(cleaned)
        except ValueError:
            continue
        if amount >= 100:
            significant_prices.append(amount)
    return len(significant_prices) >= 1
def prepare_for_ai_extraction(html: str) -> str:
    """Clean HTML for AI extraction: remove noise, keep content."""
    # Remove script and style tags
    cleaned = re.sub(r"<script[^>]*>.*?</script>", "", html,
                     flags=re.DOTALL | re.IGNORECASE)
    cleaned = re.sub(r"<style[^>]*>.*?</style>", "", cleaned,
                     flags=re.DOTALL | re.IGNORECASE)
    # Remove HTML comments
    cleaned = re.sub(r"<!--.*?-->", "", cleaned, flags=re.DOTALL)
    # Remove nav, header, footer elements
    for tag in ["nav", "header", "footer"]:
        cleaned = re.sub(
            rf"<{tag}[^>]*>.*?</{tag}>", "", cleaned,
            flags=re.DOTALL | re.IGNORECASE
        )
    # Strip remaining tags but keep text
    text = re.sub(r"<[^>]+>", " ", cleaned)
    # Collapse whitespace
    text = re.sub(r"\s+", " ", text).strip()
    # Truncate to ~8000 chars (fits well within Haiku context)
    if len(text) > 8000:
        text = text[:8000] + "..."
    return text
def enrich_provider(provider_id: int, website: str, db) -> dict:
    """Crawl a provider's website and extract enrichment data.
    Returns a dict with what was found.
    """
    result = {
        "homepage_fetched": False,
        "description": None,
        "contact_info": {},
        "pricing_page_url": None,
        "has_pricing": False,
        "pricing_page_text": None,  # cleaned text for AI extraction
        "pdf_links": [],
    }
    # Step 1: Fetch homepage
    try:
        homepage = fetch_url(website, timeout=15)
        result["homepage_fetched"] = True
    except Exception as e:
        result["error"] = str(e)[:200]
        return result
    # Step 2: Extract description and contact info
    result["description"] = extract_description(homepage)
    result["contact_info"] = extract_contact_info(homepage)
    # Step 3: Find pricing page
    time.sleep(CRAWL_DELAY)
    pricing_url = find_pricing_page(website, homepage)
    if pricing_url:
        result["pricing_page_url"] = pricing_url
        try:
            pricing_html = fetch_url(pricing_url, timeout=15)
            result["has_pricing"] = check_has_pricing(pricing_html)
            result["pricing_page_text"] = prepare_for_ai_extraction(pricing_html)
            # Check for PDF links
            pdf_links = re.findall(
                r'href="([^"]*\.pdf[^"]*)"', pricing_html, re.IGNORECASE
            )
            for pdf_href in pdf_links:
                if pdf_href.startswith("/"):
                    pdf_href = website.rstrip("/") + pdf_href
                elif not pdf_href.startswith("http"):
                    pdf_href = website.rstrip("/") + "/" + pdf_href
                result["pdf_links"].append(pdf_href)
        except Exception:
            pass
    else:
        # Check homepage itself for pricing
        if check_has_pricing(homepage):
            result["has_pricing"] = True
            result["pricing_page_url"] = website
            result["pricing_page_text"] = prepare_for_ai_extraction(homepage)
    return result
def run(limit: int | None = None, state_filter: str | None = None):
    """Enrich all providers that have a website but no packages."""
    db = get_db()
    query = """
        SELECT fb.id, fb.title, fb.website, fb.business_state
        FROM funeral_brand fb
        LEFT JOIN package p ON p.brand_id = fb.id
        WHERE fb.website IS NOT NULL
        AND fb.verified = 0
        AND p.id IS NULL
    """
    params = []
    if state_filter:
        query += " AND fb.business_state = ?"
        params.append(state_filter)
    query += " ORDER BY fb.id"
    # Compare against None so that a limit of 0 selects nothing instead of everything
    if limit is not None:
        query += " LIMIT ?"
        params.append(limit)
    providers = db.execute(query, params).fetchall()
    print(f"Providers to enrich: {len(providers)}")
    enriched = 0
    pricing_found = 0
    failed = 0
    for i, prov in enumerate(providers):
        if (i + 1) % 5 == 0 or i == 0:
            print(f" [{i+1}/{len(providers)}] {prov['title']}")
        result = enrich_provider(prov["id"], prov["website"], db)
        if not result["homepage_fetched"]:
            failed += 1
            db.execute(
                """UPDATE funeral_brand
                   SET enrichment_status = 'failed', updated_at = datetime('now')
                   WHERE id = ?""",
                (prov["id"],)
            )
            continue
        enriched += 1
        # Update brand with discovered info, never overwriting existing values
        updates = {}
        brand = db.execute("SELECT * FROM funeral_brand WHERE id = ?",
                           (prov["id"],)).fetchone()
        if result["description"] and not brand["description"]:
            updates["description"] = result["description"]
        contact = result["contact_info"]
        if contact.get("email") and not brand["email"]:
            updates["email"] = contact["email"]
        if contact.get("phone") and not brand["phone"]:
            updates["phone"] = contact["phone"]
        # Status is 'partial' either way: with pricing the page still needs
        # AI extraction, without pricing only the homepage was enriched
        updates["enrichment_status"] = "partial"
        if result["has_pricing"]:
            pricing_found += 1
        # Column names come from the fixed keys above, values are bound parameters
        set_parts = [f"{k} = ?" for k in updates]
        values = list(updates.values()) + [prov["id"]]
        db.execute(
            f"UPDATE funeral_brand SET {', '.join(set_parts)}, "
            f"updated_at = datetime('now') WHERE id = ?",
            values
        )
        # Store pricing page text for later AI extraction
        if result["pricing_page_text"]:
            db.execute(
                """INSERT OR REPLACE INTO source_record
                   (source_name, source_id, source_url, raw_data,
                    matched_brand_id, match_type)
                   VALUES ('website_crawl', ?, ?, ?, ?, 'enrichment')""",
                (
                    f"brand_{prov['id']}",
                    result["pricing_page_url"],
                    json.dumps({
                        "pricing_text": result["pricing_page_text"],
                        "pdf_links": result["pdf_links"],
                        "has_pricing": result["has_pricing"],
                    }),
                    prov["id"],
                )
            )
        if (i + 1) % 10 == 0:
            db.commit()
        time.sleep(CRAWL_DELAY)
    db.commit()
    print(f"\nDone: {enriched} enriched, {pricing_found} with pricing, {failed} failed")
    db.close()
if __name__ == "__main__":
    import sys
    limit = None
    state = None
    for arg in sys.argv[1:]:
        if arg.startswith("--state="):
            state = arg.split("=", 1)[1]
        elif arg.startswith("--limit="):
            limit = int(arg.split("=", 1)[1])
        else:
            # A bare integer argument is also accepted as the limit
            try:
                limit = int(arg)
            except ValueError:
                pass
    run(limit=limit, state_filter=state)
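
Link following in `find_pricing_page` depends on resolving each `href` and checking that it stays on the provider's own host. The following is a minimal standard-library sketch of that check. `resolve_same_site` is a hypothetical helper used for illustration only and is not part of the crawler modules; treating `www.` and bare hosts as the same site is an assumption.

```python
from __future__ import annotations

from urllib.parse import urljoin, urlparse


def resolve_same_site(base_url: str, href: str) -> str | None:
    """Resolve href against base_url; return None for off-site or non-HTTP links."""
    # urljoin handles "/path", "relative/path" and absolute URLs uniformly
    full_url = urljoin(base_url.rstrip("/") + "/", href)
    parsed = urlparse(full_url)
    if parsed.scheme not in ("http", "https"):
        return None  # mailto:, tel:, javascript: and similar
    # Assumption: "www.example.com" and "example.com" are the same site
    base_host = urlparse(base_url).netloc.lower().removeprefix("www.")
    link_host = parsed.netloc.lower().removeprefix("www.")
    return full_url if link_host == base_host else None


print(resolve_same_site("https://example.com.au", "/pricing"))
print(resolve_same_site("https://example.com.au", "https://evil.test/?ref=example.com.au"))
print(resolve_same_site("https://example.com.au", "mailto:info@example.com.au"))
```

Comparing parsed hosts, rather than testing whether the domain appears anywhere in the link, keeps an off-site URL that merely mentions the domain in its query string from being followed.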

199
crawlers/lookup_abn.py Normal file

@@ -0,0 +1,199 @@
"""ABN Lookup module via the Australian Business Register (ABR) API.
Enriches providers with their ABN (strongest dedup key) and validates
that they are active registered businesses.
The ABR API is FREE. Requires a GUID (authentication token) from:
https://abr.business.gov.au/Tools/WebServices
Configuration:
Set ABR_GUID env var or in config.json.
"""
import json
import os
import re
import time
import urllib.parse
import xml.etree.ElementTree as ET
from base import fetch_url, get_db, CRAWL_DELAY
# Load ABR GUID from env or config
ABR_GUID = os.environ.get("ABR_GUID")
if not ABR_GUID:
    config_path = os.path.join(os.path.dirname(__file__), "config.json")
    if os.path.exists(config_path):
        with open(config_path) as f:
            config = json.load(f)
        ABR_GUID = config.get("abr_guid")
ABR_BASE = "https://abr.business.gov.au/abrxmlsearch/AbrXmlSearch.asmx"
def search_by_name(name: str, state: str | None = None,
                   postcode: str | None = None) -> list[dict]:
    """Search ABR by business name. Returns matching records."""
    if not ABR_GUID:
        print(" WARNING: ABR_GUID not configured. Skipping ABN lookup.")
        return []
    params = {
        "name": name,
        "postcode": postcode or "",
        "legalName": "Y",
        "tradingName": "Y",
        "NSW": "Y", "SA": "Y", "ACT": "Y", "VIC": "Y",
        "WA": "Y", "NT": "Y", "QLD": "Y", "TAS": "Y",
        "authenticationGuid": ABR_GUID,
    }
    # If state specified, only search that state
    if state:
        for s in ["NSW", "SA", "ACT", "VIC", "WA", "NT", "QLD", "TAS"]:
            params[s] = "Y" if s == state else "N"
    url = f"{ABR_BASE}/ABRSearchByNameSimpleProtocol"
    try:
        text = fetch_url(url, method="GET", data=params, timeout=15)
    except Exception:
        return []
    # Parse XML response
    results = []
    try:
        root = ET.fromstring(text)
        # The ABR response uses a default namespace
        ns = {"abr": "http://abr.business.gov.au/ABRXMLSearch/"}
        # Candidate locations for the business name, in order of preference
        name_paths = (
            ".//abr:mainName/abr:organisationName",
            ".//abr:mainTradingName/abr:organisationName",
            ".//abr:businessName/abr:organisationName",
        )
        for record in root.findall(".//abr:searchResultsRecord", ns):
            abn_elem = record.find(".//abr:ABN/abr:identifierValue", ns)
            status_elem = record.find(".//abr:ABN/abr:identifierStatus", ns)
            # Take the first name element that exists. Elements must be tested
            # with "is not None": a childless Element is falsy, so chaining
            # find() calls with "or" would skip valid matches.
            name_elem = next(
                (elem for elem in (record.find(p, ns) for p in name_paths)
                 if elem is not None),
                None,
            )
            state_elem = record.find(".//abr:mainBusinessPhysicalAddress/abr:stateCode", ns)
            postcode_elem = record.find(".//abr:mainBusinessPhysicalAddress/abr:postcode", ns)
            score_elem = record.find(".//abr:nameScore", ns)
            if abn_elem is not None:
                results.append({
                    "abn": abn_elem.text,
                    "status": status_elem.text if status_elem is not None else None,
                    "name": name_elem.text if name_elem is not None else None,
                    "state": state_elem.text if state_elem is not None else None,
                    "postcode": postcode_elem.text if postcode_elem is not None else None,
                    "score": int(score_elem.text) if score_elem is not None else 0,
                })
    except ET.ParseError:
        return []
    return results
def find_best_match(name: str, state: str | None = None,
                    postcode: str | None = None) -> dict | None:
    """Find the best ABR match for a business name.
    Returns the highest-scoring active match, or None.
    """
    results = search_by_name(name, state, postcode)
    # Filter to active businesses
    active = [r for r in results if r.get("status") == "Active"]
    if not active:
        return None
    # Sort by score descending
    active.sort(key=lambda r: r.get("score", 0), reverse=True)
    # Return best match if score is reasonable
    best = active[0]
    if best.get("score", 0) >= 80:
        return best
    return None
def run(limit: int | None = None, state_filter: str | None = None):
    """Look up ABNs for all providers that don't have one."""
    db = get_db()
    query = """
        SELECT id, title, business_state, business_postcode
        FROM funeral_brand
        WHERE abn IS NULL AND verified = 0
    """
    params = []
    if state_filter:
        query += " AND business_state = ?"
        params.append(state_filter)
    query += " ORDER BY id"
    # Compare against None so that a limit of 0 selects nothing instead of everything
    if limit is not None:
        query += " LIMIT ?"
        params.append(limit)
    providers = db.execute(query, params).fetchall()
    print(f"Providers without ABN: {len(providers)}")
    if not ABR_GUID:
        print("ERROR: ABR_GUID not configured.")
        print(" Register at: https://abr.business.gov.au/Tools/WebServices")
        print(" Then set ABR_GUID env var or add 'abr_guid' to config.json")
        return
    found = 0
    not_found = 0
    for i, prov in enumerate(providers):
        if (i + 1) % 20 == 0 or i == 0:
            print(f" [{i+1}/{len(providers)}] {prov['title']}")
        match = find_best_match(
            prov["title"],
            prov["business_state"],
            prov["business_postcode"]
        )
        if match:
            db.execute(
                "UPDATE funeral_brand SET abn = ?, updated_at = datetime('now') WHERE id = ?",
                (match["abn"], prov["id"])
            )
            found += 1
        else:
            not_found += 1
        if (i + 1) % 50 == 0:
            db.commit()
        time.sleep(0.5)  # Be gentle with the government API
    db.commit()
    print(f"\nDone: {found} ABNs found, {not_found} not found")
    # Only report a rate when at least one provider was processed
    total = found + not_found
    if total > 0:
        print(f" Success rate: {found / total * 100:.1f}%")
    db.close()
if __name__ == "__main__":
    import sys
    limit = None
    state = None
    for arg in sys.argv[1:]:
        if arg.startswith("--state="):
            state = arg.split("=", 1)[1]
        elif arg.startswith("--limit="):
            limit = int(arg.split("=", 1)[1])
        else:
            # A bare integer argument is also accepted as the limit
            try:
                limit = int(arg)
            except ValueError:
                pass
    run(limit=limit, state_filter=state)
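
The lookup stores whatever ABN string the ABR returns. As a cheap offline sanity check before writing to `funeral_brand.abn`, the standard ABN checksum could be applied: subtract 1 from the first digit, weight the 11 digits by 10, 1, 3, 5, ..., 19, and require the weighted sum to be divisible by 89. The sketch below is illustrative only; `is_valid_abn` and `ABN_WEIGHTS` are hypothetical names, not part of `lookup_abn.py`.

```python
import re

# Weights for the 11 ABN digits, applied after subtracting 1 from the first digit
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def is_valid_abn(abn: str) -> bool:
    """Return True if abn has exactly 11 digits and passes the ABN checksum."""
    digits = re.sub(r"\s+", "", abn or "")
    if not re.fullmatch(r"[0-9]{11}", digits):
        return False
    values = [int(d) for d in digits]
    values[0] -= 1  # the algorithm subtracts 1 from the leading digit
    total = sum(v * w for v, w in zip(values, ABN_WEIGHTS))
    return total % 89 == 0


print(is_valid_abn("51 824 753 556"))  # weighted sum is 534 = 6 * 89, so valid
print(is_valid_abn("51 824 753 557"))  # last digit altered, so invalid
```

A checksum pass only shows the number is well formed; whether the business is active still has to come from the ABR status field.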

111
crawlers/run_overnight.sh Executable file

@@ -0,0 +1,111 @@
#!/bin/bash
# Full pipeline overnight run
# Usage: ./run_overnight.sh
#
# Before running:
# 1. Add your Serper API key to config.json
# 2. Optionally add your Anthropic API key for AI extraction
#
# This script runs all steps sequentially and logs everything.
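# pipefail is needed because every step is piped through tee: with plain
# "set -e" a failing python3 step would be masked by tee's exit status.
set -eo pipefail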
cd "$(dirname "$0")"
LOG="../logs/overnight_$(date +%Y%m%d_%H%M%S).log"
mkdir -p ../logs
echo "=== OVERNIGHT PIPELINE RUN ===" | tee "$LOG"
echo "Started: $(date)" | tee -a "$LOG"
echo "" | tee -a "$LOG"
# Check config
SERPER_KEY=$(python3 -c "import json; c=json.load(open('config.json')); print(c.get('serper_api_key') or '')")
ANTHROPIC_KEY=$(python3 -c "import json; c=json.load(open('config.json')); print(c.get('anthropic_api_key') or '')")
if [ -z "$SERPER_KEY" ]; then
    echo "WARNING: No Serper API key: website discovery will use DDG (slower, lower hit rate)" | tee -a "$LOG"
else
    echo "Serper API key: configured" | tee -a "$LOG"
fi
if [ -z "$ANTHROPIC_KEY" ]; then
    echo "WARNING: No Anthropic API key: AI extraction will be skipped" | tee -a "$LOG"
else
    echo "Anthropic API key: configured" | tee -a "$LOG"
fi
echo "" | tee -a "$LOG"
# Step 1: Source crawlers
echo "=== STEP 1: Source Crawlers ===" | tee -a "$LOG"
echo "[$(date +%H:%M:%S)] Running VIC Register crawler..." | tee -a "$LOG"
python3 crawl_vic_register.py 2>&1 | tee -a "$LOG"
echo "[$(date +%H:%M:%S)] Running Funerals Australia crawler..." | tee -a "$LOG"
python3 crawl_funerals_australia.py 2>&1 | tee -a "$LOG"
echo "[$(date +%H:%M:%S)] Running NFDA crawler..." | tee -a "$LOG"
python3 crawl_nfda.py 2>&1 | tee -a "$LOG"
# Step 2: Deduplication
echo "" | tee -a "$LOG"
echo "=== STEP 2: Deduplication ===" | tee -a "$LOG"
echo "[$(date +%H:%M:%S)] Running dedup..." | tee -a "$LOG"
python3 dedup.py 2>&1 | tee -a "$LOG"
# Step 3: Website discovery (all providers without one)
echo "" | tee -a "$LOG"
echo "=== STEP 3: Website Discovery ===" | tee -a "$LOG"
NEED_WEBSITE=$(python3 -c "from base import get_db; db=get_db(); print(db.execute('SELECT COUNT(*) FROM funeral_brand WHERE website IS NULL AND verified=0').fetchone()[0])")
echo "[$(date +%H:%M:%S)] Providers needing websites: $NEED_WEBSITE" | tee -a "$LOG"
# Process in batches of 200 to avoid issues
BATCH=200
OFFSET=0
while [ "$OFFSET" -lt "$NEED_WEBSITE" ]; do
    REMAINING=$((NEED_WEBSITE - OFFSET))
    CURRENT=$((REMAINING < BATCH ? REMAINING : BATCH))
    echo "[$(date +%H:%M:%S)] Discovering websites batch $((OFFSET/BATCH + 1)) ($CURRENT providers)..." | tee -a "$LOG"
    python3 discover_websites.py --limit="$CURRENT" 2>&1 | tee -a "$LOG"
    OFFSET=$((OFFSET + BATCH))
    # Brief pause between batches
    sleep 5
done
# Step 4: Website enrichment (all with website, not yet enriched)
echo "" | tee -a "$LOG"
echo "=== STEP 4: Website Enrichment ===" | tee -a "$LOG"
NEED_ENRICH=$(python3 -c "from base import get_db; db=get_db(); print(db.execute('SELECT COUNT(*) FROM funeral_brand WHERE website IS NOT NULL AND enrichment_status=? AND verified=0', ('pending',)).fetchone()[0])")
echo "[$(date +%H:%M:%S)] Providers needing enrichment: $NEED_ENRICH" | tee -a "$LOG"
# Nothing to do when no providers are pending
if [ "$NEED_ENRICH" -gt 0 ]; then
    python3 enrich_websites.py --limit="$NEED_ENRICH" 2>&1 | tee -a "$LOG"
fi
# Step 5: Compute tiers
echo "" | tee -a "$LOG"
echo "=== STEP 5: Compute Tiers ===" | tee -a "$LOG"
python3 compute_tiers.py 2>&1 | tee -a "$LOG"
# Final summary
echo "" | tee -a "$LOG"
echo "=== FINAL SUMMARY ===" | tee -a "$LOG"
python3 -c "
from base import get_db
db = get_db()

def count(sql, params=()):
    # Run a COUNT(*) query and return the single integer result
    return db.execute(sql, params).fetchone()[0]

print('Database Status:')
print(' Total providers:', count('SELECT COUNT(*) FROM funeral_brand'))
print(' With phone:', count('SELECT COUNT(*) FROM funeral_brand WHERE phone IS NOT NULL'))
print(' With email:', count('SELECT COUNT(*) FROM funeral_brand WHERE email IS NOT NULL'))
print(' With website:', count('SELECT COUNT(*) FROM funeral_brand WHERE website IS NOT NULL'))
print(' With description:', count('SELECT COUNT(*) FROM funeral_brand WHERE description IS NOT NULL'))
print()
print('Listing Tiers:')
for row in db.execute('SELECT listing_tier, COUNT(*) as n FROM funeral_brand GROUP BY listing_tier ORDER BY n DESC'):
    print(f' {str(row[0]):12s} {row[1]:>6d}')
print()
print('Pricing Pages:')
# String values are bound as parameters, so no quotes or backslashes are
# needed inside the SQL (backslashes in f-string expressions fail before Python 3.12)
print(' Total crawled:', count('SELECT COUNT(*) FROM source_record WHERE source_name=?', ('website_crawl',)))
print(' With pricing:', count('SELECT COUNT(*) FROM source_record WHERE source_name=? AND json_extract(raw_data, ?)=1', ('website_crawl', '\$.has_pricing')))
print(' With PDF links:', count('SELECT COUNT(*) FROM source_record WHERE source_name=? AND json_extract(raw_data, ?) != ?', ('website_crawl', '\$.pdf_links', '[]')))
" 2>&1 | tee -a "$LOG"
echo "" | tee -a "$LOG"
echo "Finished: $(date)" | tee -a "$LOG"
echo "Log saved to: $LOG"