Initial commit: funeral provider discovery pipeline

- Python crawlers for VIC Register, Funerals Australia, NFDA
- n8n workflows for scheduled discovery and enrichment
- SQLite schema and seeded dev database (1,463 providers)
- End-to-end process documentation in n8n/PROCESS.md
215
crawlers/PIPELINE.md
Normal file
@@ -0,0 +1,215 @@
# Provider Discovery & Enrichment Pipeline

## Architecture: Multi-Step Enrichment

The pipeline builds provider profiles progressively, never relying on
competitor data. Each step adds richer detail from more authoritative sources.

```
STEP 1: DISCOVER                  STEP 2: FIND WEBSITE          STEP 3: ENRICH
─────────────────                 ────────────────────          ──────────────

VIC Register ─────┐                                             ┌─ Fetch homepage
NFDA Directory ───┼─▶ Basic       Google Places API ──┐         │  Find /pricing page
Funerals AU ──────┘   Provider    ABN Lookup ─────────┼─▶ URL ──┤  Download PDFs
                      Record      Search engines ─────┘         │  AI extract packages
                                                                └─▶ Structured data

                      name        website URL                      description
                      address     Google rating                    packages[]
                      phone       Google reviews                   inclusions[]
                      email       place_id                         pricing
                      state       ABN (validated)
```

## Step 1: Discovery (DONE: all modules built and tested)

Sources:
- VIC Consumer Affairs Register (796 records, VIC only) → `crawl_vic_register.py`
- Funerals Australia AJAX API (997 records, national) → `crawl_funerals_australia.py`
- NFDA WPSL API (209 records, national) → `crawl_nfda.py`

Orchestrator: `crawl_all.py`
Deduplication: `dedup.py` (fuzzy name + postcode + ABN matching)

Output: ~1,463 unique providers with basic contact info.
Stored in: funeral_brand + location tables in `database/providers.db`.

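The dedup rule named above (fuzzy name + postcode + ABN) could look roughly like the sketch below. This is not the code in `dedup.py`: the function names, the noise-word list and the 0.85 similarity threshold are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical noise words stripped before names are compared.
NOISE = {"funerals", "funeral", "directors", "director",
         "services", "service", "pty", "ltd"}


def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation and common business suffixes."""
    words = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    return " ".join(w for w in words if w not in NOISE)


def is_same_provider(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Match on ABN when both records have one, else fuzzy name + postcode."""
    if a.get("abn") and b.get("abn"):
        return a["abn"] == b["abn"]
    if a.get("postcode") != b.get("postcode"):
        return False
    ratio = SequenceMatcher(
        None, normalize_name(a["name"]), normalize_name(b["name"])
    ).ratio()
    return ratio >= threshold
```

ABN is checked first because it is the strongest key; the fuzzy comparison only runs for records in the same postcode, which keeps false merges between similarly named providers in different suburbs down.
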
## Step 2: Website Discovery (DONE: module built and tested)

Module: `discover_websites.py`
Test result: 50% success rate on the initial batch (DuckDuckGo search + URL guessing).
Adding the Google Places API should raise the hit rate.

For each provider that lacks a website URL:

### 2a. Serper.dev (Google search API, PRIMARY)
- Input: "{business name} {suburb} {state}"
- Returns: Google organic search results as JSON (title, link, snippet)
- Cost: **2,500 free queries** (no credit card needed), then $1 per 1,000 queries
- Covers all 1,463 providers within the free quota
- Filters out directories/aggregators, validates first result
- Module: `discover_websites.py` with `search_serper()`

### 2b. DuckDuckGo lite (FALLBACK)
- Free, no API key, but aggressive rate limiting
- Used when Serper key not configured or quota exhausted
- Module: `discover_websites.py` with `search_ddg()`

### 2c. URL pattern guessing (SUPPLEMENTARY)
- Generates candidate domains from business name (e.g. smithfunerals.com.au)
- HTTP HEAD to check if live, then validate content
- Module: `discover_websites.py` with `guess_urls()`

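Candidate generation might work along these lines. The sketch covers only the name-to-domain step, not the HTTP HEAD check, and it is not the actual `guess_urls()`: the function name, the TLD list and the dropped words are assumptions.

```python
import re

# Assumed suffix list; the real module may try other TLDs.
TLDS = (".com.au", ".net.au", ".com")
# Assumed filler words that rarely appear in a domain name.
DROP = {"pty", "ltd", "the"}


def candidate_domains(business_name: str) -> list[str]:
    """Build candidate domains from a business name, joined form first."""
    cleaned = re.sub(r"[^a-z0-9 ]+", "", business_name.lower())
    words = [w for w in cleaned.split() if w not in DROP]
    if not words:
        return []
    stems = ["".join(words), "-".join(words)]
    out: list[str] = []
    for stem in stems:
        for tld in TLDS:
            domain = stem + tld
            if domain not in out:  # single-word names give identical stems
                out.append(domain)
    return out
```

For "Smith Funerals" the first candidate is `smithfunerals.com.au`, matching the example in the list above.
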
### 2d. ABN Lookup (Australian Business Register, ENRICHMENT)
- Input: business name + state
- Returns: ABN, entity status, registered state/postcode
- Cost: **FREE** (government API, requires GUID registration)
- Validates that the business is active and provides the strongest dedup key
- Does NOT return website URLs
- Module: `lookup_abn.py`
- Register for GUID: https://abr.business.gov.au/Tools/WebServices

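Before spending an API call, an ABN taken from source data can be sanity-checked locally with the standard ABN checksum: subtract 1 from the first digit, multiply the 11 digits by fixed weights, and require the sum to be divisible by 89. Whether `lookup_abn.py` does this is not stated here, so treat the sketch below as an optional pre-check; `is_valid_abn` is a hypothetical name.

```python
# Weighting factors of the ABN checksum, one per digit.
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def is_valid_abn(abn: str) -> bool:
    """Check the format and checksum of an ABN; spaces are ignored."""
    digits = [int(c) for c in abn if c in "0123456789"]
    if len(digits) != 11:
        return False
    digits[0] -= 1  # the algorithm subtracts 1 from the leading digit
    return sum(d * w for d, w in zip(digits, ABN_WEIGHTS)) % 89 == 0
```

A passing checksum only means the number is well formed. Entity status still has to come from the ABR lookup.
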
### 2e. Google Places API (OPTIONAL PREMIUM)
- Input: "{business name}, {suburb} {state}"
- Returns: website, rating, review count, place_id, formatted phone
- Cost: 1,000 free/month (Enterprise tier), then ~$25/1K
- Best data quality but most expensive
- Not yet implemented; add when budget allows

### 2f. URL validation
- Fetch discovered URL, verify it loads
- Check page title/content mentions the business name
- Reject generic directories (yellowpages, truelocal, etc.)
- Mark confidence level: confirmed / probable / unverified

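One way to turn the name check into the three confidence levels is a token-overlap rule, sketched below. The rule itself is an assumption (all name words in the title gives `confirmed`, at least half of them in the body gives `probable`), as are the function names and the ignored words. Directory rejection is left out.

```python
import re

# Assumed filler words ignored when comparing names.
IGNORED = {"pty", "ltd", "the", "and"}


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric words in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify_confidence(business_name: str, page_title: str, page_text: str) -> str:
    """Return 'confirmed', 'probable' or 'unverified' for a fetched page."""
    name_tokens = tokens(business_name) - IGNORED
    if not name_tokens:
        return "unverified"
    if name_tokens <= tokens(page_title):
        return "confirmed"  # every name word appears in the page title
    overlap = len(name_tokens & tokens(page_text)) / len(name_tokens)
    return "probable" if overlap >= 0.5 else "unverified"
```

Keeping the check to plain word sets makes it cheap to run on every discovered URL; the thresholds would need tuning against a hand-labelled sample.
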
## Step 3: Website Enrichment (DONE: module built and tested)

Module: `enrich_websites.py`
- Finds pricing pages via 20+ URL patterns + link following
- Extracts description from meta tags
- Extracts contact info (phone, email, address)
- Stores cleaned pricing page text for AI extraction
- Detects PDF links for PDF-based pricing extraction

For each provider with a confirmed website:

### 3a. Homepage crawl
- Fetch homepage HTML
- Extract: description/about text, contact details
- Look for links to pricing/services pages

### 3b. Pricing page discovery
Try common URL patterns:

    /pricing, /prices, /packages, /services, /our-services,
    /funeral-costs, /funeral-packages, /service-options,
    /price-list, /transparency

Also:
- Parse sitemap.xml if available
- Follow links containing "pric", "packag", "cost", "service"
- Check for PDF links on pricing pages

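The link-following rule can be sketched with the standard library alone. The version below is an illustration rather than the logic in `enrich_websites.py`: it matches the keyword stems against the URL path only (an assumption, since the rule could equally apply to link text), and `LinkCollector` and `find_pricing_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

KEYWORDS = ("pric", "packag", "cost", "service")  # stems from the list above


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_pricing_links(html: str, base_url: str) -> list[str]:
    """Return absolute http(s) links whose path contains a pricing keyword."""
    collector = LinkCollector()
    collector.feed(html)
    found: list[str] = []
    for href in collector.hrefs:
        url = urljoin(base_url, href)
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript: links
        if any(k in parsed.path.lower() for k in KEYWORDS) and url not in found:
            found.append(url)
    return found
```

Matching on the path rather than the full URL avoids false hits on domains that themselves contain "service".
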
### 3c. AI extraction (Claude Haiku)
- Send pricing page HTML to Haiku
- Extract: package names, funeral types, prices, inclusions
- Map to known inclusion types where possible
- Return confidence score

### 3d. PDF extraction (for InvoCare-type sites)
- Download compliance PDFs
- Extract text (pdftotext or similar)
- Send to Haiku for structured extraction
- ~25% of sites are PDF-only for pricing

## Listing Tiers

Providers are assigned a `listing_tier` based on data quality. Computed
automatically by `compute_tiers.py` after each enrichment run.

| Tier | Label | Criteria | Display |
|------|-------|----------|---------|
| `verified` | Full partner | `verified = true` (signed up) | Full branding, packages, arrangements |
| `priced` | Full pricing | 2+ packages with itemized inclusion prices | Package comparison, line-item detail |
| `estimated` | Some pricing | At least 1 package with a total price | Package prices shown, "Contact for details" on breakdowns |
| `listed` | Contact only | Name + location + phone, no pricing | "Contact for pricing" CTA, upgrade prompt |

Each tier below `verified` motivates the provider to sign up:
- `listed` → "Publish your pricing to attract more families"
- `estimated` → "Add detailed breakdowns to stand out"
- `priced` → "Sign up to enable online arrangements"

## Enrichment Status Flow

```
pending ──▶ website_found ──▶ partial ──▶ complete
   │                             │           │
   └──▶ no_website_found       failed (retry later)
```

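The status flow can be enforced with a small transition table. The diagram does not spell out every edge, so the table below is an assumption: in particular, which states may move to `failed` and where a retry resumes are guesses, and `advance` is a hypothetical helper name.

```python
# Assumed transition table; adjust to the edges the pipeline really uses.
ALLOWED: dict[str, set[str]] = {
    "pending": {"website_found", "no_website_found"},
    "website_found": {"partial", "failed"},
    "partial": {"complete", "failed"},
    "complete": {"failed"},         # e.g. a later re-check that breaks
    "failed": {"website_found"},    # "retry later" resumes enrichment
    "no_website_found": set(),      # terminal in this sketch
}


def advance(current: str, new: str) -> str:
    """Return the new status, or raise ValueError for an illegal move."""
    if current not in ALLOWED:
        raise ValueError(f"unknown status: {current}")
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new
```

Routing every status update through one function like this means a buggy workflow cannot silently mark a provider `complete` without a website having been found first.
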
## n8n Workflow Design

### Workflow 1: Weekly Discovery
Cron → Run all source crawlers → Dedup into DB → Queue new providers

### Workflow 2: Daily Website Discovery
Cron → Fetch providers with no website → Google Places lookup (optional, not yet implemented)
     → ABN lookup → Search fallback → Update DB

### Workflow 3: Daily Enrichment
Cron → Fetch providers with website but no packages
     → Crawl website → AI extract → Update DB

### Workflow 4: Monthly Re-check
Cron → Re-crawl enriched providers → Update pricing if changed

---

## Module Inventory

| Module | Purpose | n8n Workflow |
|--------|---------|--------------|
| `base.py` | Shared HTTP, DB, normalization utils | Used by all |
| `crawl_vic_register.py` | VIC government register (796 records) | Workflow 1 |
| `crawl_funerals_australia.py` | Funerals Australia API (997 records) | Workflow 1 |
| `crawl_nfda.py` | NFDA directory API (209 records) | Workflow 1 |
| `crawl_all.py` | Orchestrates all source crawlers | Workflow 1 |
| `dedup.py` | Cross-source dedup & merge engine | Workflow 1 |
| `discover_websites.py` | Find provider websites (Serper/DDG/guess) | Workflow 2 |
| `lookup_abn.py` | ABN validation via ABR API (free) | Workflow 2 |
| `enrich_websites.py` | Crawl provider sites, find pricing pages | Workflow 3 |
| `compute_tiers.py` | Compute listing_tier from data quality | After enrichment |
| `config.example.json` | API key template | n/a |

## API Keys Required

| Service | Key | Cost | Register |
|---------|-----|------|----------|
| Serper.dev | `serper_api_key` | 2,500 free, then $1/1K | https://serper.dev |
| ABR (ABN Lookup) | `abr_guid` | Free | https://abr.business.gov.au/Tools/WebServices |
| Anthropic (Haiku) | `anthropic_api_key` | ~$2/full run | https://console.anthropic.com |

## Quick Start

```bash
# 1. Configure API keys
cp config.example.json config.json
# Edit config.json with your keys

# 2. Reset database
cd ../database
sqlite3 providers.db < schema_sqlite.sql

# 3. Run full discovery pipeline
cd ../crawlers
python3 crawl_all.py            # Step 1: Discover from registries
python3 dedup.py                # Deduplicate across sources
python3 lookup_abn.py           # Step 2d: Look up ABNs (free)
python3 discover_websites.py    # Step 2a-2c: Find websites
python3 enrich_websites.py      # Step 3: Crawl for pricing
python3 compute_tiers.py        # Assign listing tiers

# Test mode (limited records)
python3 crawl_all.py --test
python3 discover_websites.py --limit=10 --state=VIC
python3 enrich_websites.py --limit=5
```
164
crawlers/base.py
Normal file
@@ -0,0 +1,164 @@
"""Base crawler module with shared utilities."""

import gzip
import io
import json
import time
import sqlite3
import urllib.request
import urllib.parse
import urllib.error
from datetime import datetime, timezone
from pathlib import Path

DB_PATH = Path(__file__).parent.parent / "database" / "providers.db"
CRAWL_DELAY = 1.0  # seconds between requests (courtesy)

USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
)


def fetch_url(url: str, method: str = "GET", data: dict | None = None,
              headers: dict | None = None, timeout: int = 30) -> str:
    """Fetch a URL and return the response body as text."""
    hdrs = {"User-Agent": USER_AGENT}
    if headers:
        hdrs.update(headers)

    body = None
    if data and method == "POST":
        body = urllib.parse.urlencode(data, doseq=True).encode("utf-8")
        hdrs.setdefault("Content-Type", "application/x-www-form-urlencoded")
    elif data and method == "GET":
        # Append to an existing query string instead of adding a second "?"
        sep = "&" if "?" in url else "?"
        url = url + sep + urllib.parse.urlencode(data, doseq=True)

    req = urllib.request.Request(url, data=body, headers=hdrs, method=method)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        raw = resp.read()
        # Handle gzip-compressed responses (by header or by magic bytes)
        if resp.headers.get("Content-Encoding") == "gzip" or raw[:2] == b"\x1f\x8b":
            raw = gzip.decompress(raw)
        charset = resp.headers.get_content_charset() or "utf-8"
        return raw.decode(charset)


def fetch_json(url: str, method: str = "GET", data: dict | None = None,
               headers: dict | None = None) -> dict:
    """Fetch a URL and parse the response as JSON."""
    text = fetch_url(url, method=method, data=data, headers=headers)
    return json.loads(text)


def get_db() -> sqlite3.Connection:
    """Get a connection to the SQLite database."""
    conn = sqlite3.connect(str(DB_PATH))
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA foreign_keys=ON")
    conn.row_factory = sqlite3.Row
    return conn


def start_crawl_log(db: sqlite3.Connection, source_name: str) -> int:
    """Create a source_log entry and return its ID."""
    cur = db.execute(
        "INSERT INTO source_log (source_name) VALUES (?)",
        (source_name,)
    )
    db.commit()
    return cur.lastrowid


def finish_crawl_log(db: sqlite3.Connection, log_id: int,
                     found: int, new: int, updated: int, skipped: int,
                     status: str = "completed", error: str | None = None) -> None:
    """Update a source_log entry with results."""
    db.execute(
        """UPDATE source_log
           SET run_finished_at = datetime('now'),
               records_found = ?, records_new = ?,
               records_updated = ?, records_skipped = ?,
               status = ?, error_message = ?
           WHERE id = ?""",
        (found, new, updated, skipped, status, error, log_id)
    )
    db.commit()


def store_source_record(db: sqlite3.Connection, source_name: str,
                        source_id: str, source_url: str | None,
                        raw_data: dict, log_id: int) -> int | None:
    """Store a raw source record. Returns the row ID, or None if duplicate."""
    try:
        cur = db.execute(
            """INSERT INTO source_record
               (source_name, source_id, source_url, raw_data, log_id)
               VALUES (?, ?, ?, ?, ?)""",
            (source_name, source_id, source_url, json.dumps(raw_data), log_id)
        )
        db.commit()
        return cur.lastrowid
    except sqlite3.IntegrityError:
        # Duplicate source_name + source_id: we already have this record
        return None


def normalize_phone(phone: str | None) -> str | None:
    """Basic phone normalization."""
    if not phone:
        return None
    # Remove common noise
    phone = phone.strip().replace("\xa0", " ")
    # If multiple numbers, take the first
    for sep in [";", "/", "|", ","]:
        if sep in phone:
            phone = phone.split(sep)[0].strip()
    return phone or None


def normalize_state(state: str | None) -> str | None:
    """Normalize Australian state names to abbreviations."""
    if not state:
        return None
    state = state.strip().upper()
    mapping = {
        "NEW SOUTH WALES": "NSW",
        "VICTORIA": "VIC",
        "QUEENSLAND": "QLD",
        "SOUTH AUSTRALIA": "SA",
        "WESTERN AUSTRALIA": "WA",
        "TASMANIA": "TAS",
        "NORTHERN TERRITORY": "NT",
        "AUSTRALIAN CAPITAL TERRITORY": "ACT",
        "AUSTRALIA CAPITAL TERRITORY": "ACT",  # tolerate a misspelled variant
    }
    result = mapping.get(state, state)
    # Only return valid Australian states
    valid = {"NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT"}
    return result if result in valid else None


def generate_slug(name: str) -> str:
    """Generate a URL-safe slug from a business name."""
    import re
    slug = name.lower().strip()
    slug = re.sub(r"['’`]", "", slug)         # remove straight/curly apostrophes
    slug = re.sub(r"[^a-z0-9]+", "-", slug)   # non-alphanum -> hyphen
    slug = slug.strip("-")
    return slug


def to_intermediate(source: str, source_id: str, source_url: str | None,
                    business: dict, locations: list[dict],
                    packages: list[dict] | None = None) -> dict:
    """Build the normalized intermediate format record."""
    return {
        "source": source,
        "sourceId": source_id,
        "sourceUrl": source_url,
        "scrapedAt": datetime.now(timezone.utc).isoformat(),
        "business": business,
        "locations": locations,
        "packages": packages or [],
    }
102
crawlers/compute_tiers.py
Normal file
@@ -0,0 +1,102 @@
"""Compute listing_tier for all providers based on their data quality.

Tier logic:
  verified  : brand.verified = true (signed up to platform)
  priced    : 2+ packages, each with 2+ inclusions priced > 0
  estimated : at least one package with a priced inclusion, or a package
              stored without any inclusions
  listed    : everything else (contact info only)

Run this after enrichment to update tiers across the board.
"""

from base import get_db


def compute_tier(db, brand_id: int, verified: bool) -> str:
    """Compute the listing tier for a single brand."""
    if verified:
        return "verified"

    # Check packages
    packages = db.execute(
        "SELECT id, title, funeral_type FROM package WHERE brand_id = ?",
        (brand_id,)
    ).fetchall()

    if not packages:
        return "listed"

    # Count packages that carry pricing information.
    # A package's price = sum of non-optional, non-complimentary inclusions;
    # the check below only tests whether priced inclusions exist.
    packages_with_price = 0
    packages_with_itemized = 0

    for pkg in packages:
        inclusions = db.execute(
            """SELECT price, optional, complimentary
               FROM package_inclusion
               WHERE package_id = ?""",
            (pkg["id"],)
        ).fetchall()

        if inclusions:
            # Inclusions with a positive price
            priced_inclusions = [
                i for i in inclusions
                if i["price"] and float(i["price"]) > 0
            ]
            if len(priced_inclusions) >= 2:
                # Two or more priced inclusions count as an itemized breakdown
                packages_with_itemized += 1
                packages_with_price += 1
            elif len(priced_inclusions) >= 1:
                packages_with_price += 1
        else:
            # Package exists but has no inclusions. For now it still counts
            # towards "estimated": a stored package at least tells us what
            # kind of service is offered, even without a breakdown.
            packages_with_price += 1

    # Tier 2 (priced): 2+ packages with itemized breakdowns
    if packages_with_itemized >= 2:
        return "priced"

    # Tier 3 (estimated): at least one package with some price
    if packages_with_price >= 1:
        return "estimated"

    return "listed"


def run():
    """Recompute listing_tier for all brands."""
    db = get_db()

    brands = db.execute(
        "SELECT id, verified FROM funeral_brand"
    ).fetchall()

    counts = {"verified": 0, "priced": 0, "estimated": 0, "listed": 0}

    for brand in brands:
        tier = compute_tier(db, brand["id"], brand["verified"])
        db.execute(
            "UPDATE funeral_brand SET listing_tier = ? WHERE id = ?",
            (tier, brand["id"])
        )
        counts[tier] += 1

    db.commit()

    print("Listing Tier Distribution:")
    print(f"  verified:  {counts['verified']:>6d}  (signed-up partners)")
    print(f"  priced:    {counts['priced']:>6d}  (full package breakdowns)")
    print(f"  estimated: {counts['estimated']:>6d}  (some pricing info)")
    print(f"  listed:    {counts['listed']:>6d}  (contact info only)")
    print(f"  TOTAL:     {sum(counts.values()):>6d}")

    db.close()


if __name__ == "__main__":
    run()
5
crawlers/config.example.json
Normal file
@@ -0,0 +1,5 @@
{
  "serper_api_key": null,
  "abr_guid": null,
  "anthropic_api_key": null
}
70
crawlers/crawl_all.py
Normal file
@@ -0,0 +1,70 @@
"""Run all source crawlers and print a summary of the stored source records.

Deduplication into the provider database is a separate step (dedup.py).
"""

import sys
import time
from pathlib import Path

from base import get_db


def run_all(gathered_here_limit: int | None = None):
    """Run all crawlers sequentially."""
    print("=" * 60)
    print("PROVIDER DISCOVERY PIPELINE")
    print("=" * 60)

    # Import crawlers
    import crawl_nfda
    import crawl_funerals_australia
    import crawl_vic_register
    import crawl_gathered_here

    # Run in order: fast API sources first, then slower HTML scraping
    print("\n--- 1/4: NFDA Directory ---")
    crawl_nfda.run()

    print("\n--- 2/4: Funerals Australia ---")
    crawl_funerals_australia.run()

    print("\n--- 3/4: VIC Consumer Affairs Register ---")
    crawl_vic_register.run()

    print("\n--- 4/4: Gathered Here ---")
    crawl_gathered_here.run(limit=gathered_here_limit)

    # Summary
    db = get_db()
    print("\n" + "=" * 60)
    print("CRAWL SUMMARY")
    print("=" * 60)

    rows = db.execute(
        """SELECT source_name,
                  COUNT(*) as total,
                  SUM(CASE WHEN matched_brand_id IS NOT NULL THEN 1 ELSE 0 END) as matched
           FROM source_record
           GROUP BY source_name"""
    ).fetchall()

    for row in rows:
        print(f"  {row['source_name']:25s} {row['total']:5d} records "
              f"({row['matched']} matched)")

    total = db.execute("SELECT COUNT(*) as n FROM source_record").fetchone()["n"]
    print(f"  {'TOTAL':25s} {total:5d} records")

    db.close()


if __name__ == "__main__":
    limit = None
    if "--test" in sys.argv:
        limit = 10
        print("TEST MODE: Gathered Here limited to 10 profiles")
    elif len(sys.argv) > 1:
        # A bare integer argument is used as the Gathered Here profile limit
        try:
            limit = int(sys.argv[1])
        except ValueError:
            pass

    run_all(gathered_here_limit=limit)
179
crawlers/crawl_funerals_australia.py
Normal file
@@ -0,0 +1,179 @@
"""Crawler for the Funerals Australia (formerly AFDA) member directory.

Source: https://funeralsaustralia.org.au/find-a-member/
Method: WordPress AJAX API (POST with get_clients_list action)
Fields: name, address (structured), phone, email, website, lat/lng, displayImage
"""

import time
import json
from pathlib import Path

from base import (
    fetch_url, get_db, start_crawl_log, finish_crawl_log,
    store_source_record, normalize_phone, normalize_state,
    generate_slug, to_intermediate, CRAWL_DELAY,
)

SOURCE_NAME = "funerals_australia"
API_URL = "https://funeralsaustralia.org.au/wp-admin/admin-ajax.php"

PAGE_SIZE = 200  # API supports up to 200 per page


def fetch_page(offset: int = 0) -> dict:
    """Fetch a page of all members from the Funerals Australia API.

    The API returns all members when no postcode/suburb filter is given,
    which is more reliable than geo-filtered searches.
    """
    form_data = {
        "action": "get_clients_list",
        "params[size]": str(PAGE_SIZE),
        "params[from]": str(offset),
        "params[forceResults]": "true",
        "params[paginated]": "true",
    }

    text = fetch_url(API_URL, method="POST", data=form_data,
                     headers={"X-Requested-With": "XMLHttpRequest"})
    return json.loads(text)


def fetch_all_members() -> list[dict]:
    """Fetch all members via pagination."""
    all_results = []
    offset = 0

    while True:
        data = fetch_page(offset)
        results = data.get("results", [])
        total = data.get("total", 0)

        if not results:
            break

        all_results.extend(results)
        print(f"  Fetched {len(all_results)}/{total}...")
        offset += PAGE_SIZE

        if offset >= total:
            break

        time.sleep(CRAWL_DELAY)

    return all_results


def parse_address(record: dict) -> dict:
    """Extract structured address from a Funerals Australia record."""
    addr_list = record.get("address", [])
    if addr_list and isinstance(addr_list, list) and len(addr_list) > 0:
        addr = addr_list[0]
        # `or ""` guards against keys that are present but null
        return {
            "line1": (addr.get("line1") or "").strip(),
            "city": (addr.get("city") or "").strip(),
            "state": normalize_state(addr.get("state")),
            "postcode": str(addr.get("postcode") or "").strip(),
        }
    return {"line1": "", "city": "", "state": None, "postcode": ""}


def to_normalized(record: dict) -> dict:
    """Convert a Funerals Australia record to intermediate format."""
    addr = parse_address(record)
    city = addr["city"]
    # Source data is often ALL CAPS; convert to title case
    if city and city == city.upper():
        city = city.title()

    lat_val = record.get("latitude")
    lng_val = record.get("longitude")
    try:
        lat_val = float(lat_val) if lat_val else None
        lng_val = float(lng_val) if lng_val else None
    except (ValueError, TypeError):
        lat_val = lng_val = None

    website = (record.get("website") or "").strip() or None
    if website and not website.startswith("http"):
        website = "https://" + website

    business = {
        "name": (record.get("name") or "").strip(),
        "abn": None,
        "phone": normalize_phone(record.get("phone")),
        "email": (record.get("email") or "").strip() or None,
        "website": website,
        "description": None,
    }

    locations = [{
        "address": addr["line1"],
        "suburb": city,
        "state": addr["state"],
        "postcode": addr["postcode"],
        "lat": lat_val,
        "lng": lng_val,
        "phone": normalize_phone(record.get("phone")),
    }]

    source_id = record.get("id", "")
    return to_intermediate(
        source=SOURCE_NAME,
        source_id=source_id,
        source_url="https://funeralsaustralia.org.au/find-a-member/",
        business=business,
        locations=locations,
    )


def run():
    """Run the full Funerals Australia crawl."""
    db = get_db()
    log_id = start_crawl_log(db, SOURCE_NAME)
    print(f"[{SOURCE_NAME}] Starting crawl (log_id={log_id})")

    all_records = []
    found = 0
    new = 0
    skipped = 0

    try:
        print("  Fetching all members (paginated)...")
        all_records = fetch_all_members()
        found = len(all_records)
        print(f"  Total members fetched: {found}")

        # Store records
        for record in all_records:
            source_id = record.get("id", "")
            row_id = store_source_record(
                db, SOURCE_NAME, source_id,
                "https://funeralsaustralia.org.au/find-a-member/",
                record, log_id
            )
            if row_id:
                normalized = to_normalized(record)
                db.execute(
                    "UPDATE source_record SET normalized_data = ? WHERE id = ?",
                    (json.dumps(normalized), row_id)
                )
                new += 1
            else:
                skipped += 1

        db.commit()
        finish_crawl_log(db, log_id, found, new, 0, skipped)
        print(f"[{SOURCE_NAME}] Done: {found} found, {new} new, {skipped} skipped")

    except Exception as e:
        finish_crawl_log(db, log_id, found, new, 0, skipped, "failed", str(e))
        raise
    finally:
        db.close()

    return all_records


if __name__ == "__main__":
    run()
362
crawlers/crawl_gathered_here.py
Normal file
@@ -0,0 +1,362 @@
"""Crawler for Gathered Here funeral director directory.

Source: https://www.gatheredhere.com.au
Method: XML sitemap → fetch individual profile pages → parse HTML
Fields: name, address, coords, phone, email, website, description, pricing, reviews
"""

import re
import time
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from pathlib import Path

from base import (
    fetch_url, get_db, start_crawl_log, finish_crawl_log,
    store_source_record, normalize_phone, normalize_state,
    generate_slug, to_intermediate, CRAWL_DELAY,
)

SOURCE_NAME = "gathered_here"
SITEMAP_URL = "https://www.gatheredhere.com.au/sitemap/sitemap-funerals-listings-0.xml"
BASE_URL = "https://www.gatheredhere.com.au"


def fetch_all_listing_urls() -> list[str]:
    """Fetch and parse the sitemap to get all funeral director profile URLs."""
    xml_text = fetch_url(SITEMAP_URL)
    root = ET.fromstring(xml_text)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    urls = []
    for url_elem in root.findall("sm:url", ns):
        loc = url_elem.find("sm:loc", ns)
        if loc is not None and loc.text:
            url = loc.text.strip()
            # Only include individual profile pages (singular /funeral-director/)
            if "/funeral-director/" in url and "/funeral-directors/" not in url:
                urls.append(url)

    return urls


def extract_next_data(html_text: str) -> dict | None:
    """Extract __NEXT_DATA__ JSON from a Next.js page."""
    pattern = r'<script\s+id="__NEXT_DATA__"\s+type="application/json">(.*?)</script>'
    match = re.search(pattern, html_text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            return None
    return None


def extract_from_next_data(next_data: dict) -> dict | None:
    """Extract listing data from __NEXT_DATA__ props."""
    try:
        props = next_data.get("props", {}).get("pageProps", {})

        # Structure: singleListing.listing contains the actual data
        single = props.get("singleListing", {})
        if single:
            listing = single.get("listing")
            if listing and isinstance(listing, dict):
                return listing

        # Fallback paths
        listing = props.get("listing") or props.get("post") or props.get("data")
        return listing
    except (KeyError, TypeError, AttributeError):
        # AttributeError covers a null or non-dict value where a dict is expected
        return None


def extract_from_html(html_text: str, url: str) -> dict:
    """Extract listing data from page HTML using regex patterns as fallback."""
    data = {"url": url}

    # Title
    title_match = re.search(r'<h1[^>]*>(.*?)</h1>', html_text, re.DOTALL)
    if title_match:
        data["title"] = re.sub(r'<[^>]+>', '', title_match.group(1)).strip()

    # Phone
    phone_match = re.search(r'href="tel:([^"]+)"', html_text)
    if phone_match:
        data["phone"] = phone_match.group(1).strip()

    # Email
    email_match = re.search(r'href="mailto:([^"]+)"', html_text)
    if email_match:
        data["email"] = email_match.group(1).strip()

    # Website
    website_match = re.search(
        r'<a[^>]*class="[^"]*website[^"]*"[^>]*href="([^"]+)"', html_text
    )
    if website_match:
        data["website"] = website_match.group(1).strip()

    # Address from structured data
    addr_match = re.search(
        r'"streetAddress"\s*:\s*"([^"]*)"', html_text
    )
    if addr_match:
        data["address"] = addr_match.group(1)

    locality_match = re.search(r'"addressLocality"\s*:\s*"([^"]*)"', html_text)
    if locality_match:
        data["suburb"] = locality_match.group(1)

    region_match = re.search(r'"addressRegion"\s*:\s*"([^"]*)"', html_text)
    if region_match:
        data["state"] = region_match.group(1)

    postcode_match = re.search(r'"postalCode"\s*:\s*"([^"]*)"', html_text)
    if postcode_match:
        data["postcode"] = postcode_match.group(1)

    # Coordinates
    lat_match = re.search(r'"latitude"\s*:\s*"?(-?[\d.]+)"?', html_text)
    lng_match = re.search(r'"longitude"\s*:\s*"?(-?[\d.]+)"?', html_text)
    if lat_match:
        data["lat"] = float(lat_match.group(1))
    if lng_match:
        data["lng"] = float(lng_match.group(1))

    return data


def extract_pricing(listing_data: dict) -> dict:
|
||||
"""Extract pricing from listing meta fields."""
|
||||
    # Guard against a missing or non-dict "meta" (the HTML fallback has none)
    meta = listing_data.get("meta")
    if not isinstance(meta, dict) or not meta:
        return {}
|
||||
|
||||
pricing = {}
|
||||
price_fields = {
|
||||
# With viewing prices
|
||||
"cremation_no_service_viewY": "cremation_no_service_with_viewing",
|
||||
"cremation_single_viewY": "cremation_single_service_with_viewing",
|
||||
"cremation_dual_viewY": "cremation_dual_service_with_viewing",
|
||||
"cremation_graveside_viewY": "cremation_graveside_with_viewing",
|
||||
"burial_single_viewY": "burial_single_service_with_viewing",
|
||||
"burial_dual_viewY": "burial_dual_service_with_viewing",
|
||||
"burial_graveside_viewY": "burial_graveside_with_viewing",
|
||||
"burial_no_service_viewY": "burial_no_service_with_viewing",
|
||||
# Without viewing prices
|
||||
"cremation_no_service_viewN": "cremation_no_service",
|
||||
"cremation_single_viewN": "cremation_single_service",
|
||||
"cremation_dual_viewN": "cremation_dual_service",
|
||||
"cremation_graveside_viewN": "cremation_graveside",
|
||||
"burial_single_viewN": "burial_single_service",
|
||||
"burial_dual_viewN": "burial_dual_service",
|
||||
"burial_graveside_viewN": "burial_graveside",
|
||||
"burial_no_service_viewN": "burial_no_service",
|
||||
}
|
||||
|
||||
for meta_key, label in price_fields.items():
|
||||
val = meta.get(meta_key, "")
|
||||
if val:
|
||||
# Parse price string like "$2,299" to float
|
||||
cleaned = re.sub(r'[^\d.]', '', str(val))
|
||||
if cleaned:
|
||||
try:
|
||||
pricing[label] = float(cleaned)
|
||||
except ValueError:
|
||||
pass
|
||||
|
||||
return pricing
|
||||
|
||||
|
||||
def pricing_to_packages(pricing: dict) -> list[dict]:
|
||||
"""Convert flat pricing dict to package format."""
|
||||
packages = []
|
||||
|
||||
# Map pricing keys to funeral types
|
||||
type_mappings = [
|
||||
("cremation_no_service", "Cremation Only"),
|
||||
("cremation_single_service", "Service & Cremation"),
|
||||
("cremation_single_service_with_viewing", "Service & Cremation"),
|
||||
("burial_single_service", "Service & Burial"),
|
||||
("burial_graveside", "Graveside Burial"),
|
||||
]
|
||||
|
||||
for price_key, funeral_type in type_mappings:
|
||||
if price_key in pricing:
|
||||
name = price_key.replace("_", " ").title()
|
||||
packages.append({
|
||||
"name": name,
|
||||
"funeralType": funeral_type,
|
||||
"price": pricing[price_key],
|
||||
"inclusions": [], # Not available from Gathered Here listing pages
|
||||
})
|
||||
|
||||
return packages
|
||||
|
||||
|
||||
def to_normalized(listing_data: dict, url: str) -> dict:
|
||||
"""Convert Gathered Here listing data to intermediate format."""
|
||||
    meta = listing_data.get("meta")
    if not isinstance(meta, dict):
        meta = {}

    # Fall back to "name" when "title" is missing, empty, or None
    name = (listing_data.get("title") or listing_data.get("name") or "").strip()
    slug = listing_data.get("slug", "")

    # Extract location. Prefer the structured meta fields; fall back to the
    # flat keys that extract_from_html() sets when __NEXT_DATA__ is unavailable.
    suburb = meta.get("geolocation_city") or listing_data.get("suburb", "")
    state = normalize_state(
        meta.get("geolocation_state_short") or listing_data.get("state", "")
    )
    postcode = meta.get("geolocation_postcode") or listing_data.get("postcode", "")
    lat = meta.get("geolocation_lat") or listing_data.get("lat")
    lng = meta.get("geolocation_long") or listing_data.get("lng")

    try:
        lat = float(lat) if lat else None
        lng = float(lng) if lng else None
    except (ValueError, TypeError):
        lat = lng = None

    email = (meta.get("email") or meta.get("_application")
             or listing_data.get("email") or "")
    phone = meta.get("phone") or listing_data.get("phone") or ""
|
||||
|
||||
    # Prefer the excerpt; fall back to the full content when it is missing or empty
    description = listing_data.get("excerpt") or listing_data.get("content") or ""
    if description:
        # Strip HTML tags, then cap the text at 500 characters
        description = re.sub(r'<[^>]+>', '', description).strip()
        if len(description) > 500:
            description = description[:497] + "..."
|
||||
|
||||
# Website
|
||||
website = listing_data.get("website") or meta.get("website") or None
|
||||
|
||||
# Pricing
|
||||
pricing = extract_pricing(listing_data)
|
||||
packages = pricing_to_packages(pricing)
|
||||
|
||||
business = {
|
||||
"name": name,
|
||||
"abn": None,
|
||||
"phone": normalize_phone(phone),
|
||||
"email": email.strip() or None,
|
||||
"website": website,
|
||||
"description": description or None,
|
||||
}
|
||||
|
||||
    locations = [{
        # The HTML fallback stores the street address under the flat "address" key
        "address": meta.get("geolocation_formatted_address") or listing_data.get("address", ""),
|
||||
"suburb": suburb,
|
||||
"state": state,
|
||||
"postcode": postcode,
|
||||
"lat": lat,
|
||||
"lng": lng,
|
||||
"phone": normalize_phone(phone),
|
||||
}]
|
||||
|
||||
source_id = slug or generate_slug(name)
|
||||
return to_intermediate(
|
||||
source=SOURCE_NAME,
|
||||
source_id=source_id,
|
||||
source_url=url,
|
||||
business=business,
|
||||
locations=locations,
|
||||
packages=packages,
|
||||
)
|
||||
|
||||
|
||||
def crawl_profile(url: str) -> dict | None:
|
||||
"""Crawl a single Gathered Here profile page."""
|
||||
try:
|
||||
html_text = fetch_url(url)
|
||||
except Exception as e:
|
||||
print(f" Error fetching {url}: {e}")
|
||||
return None
|
||||
|
||||
# Try __NEXT_DATA__ first (structured)
|
||||
next_data = extract_next_data(html_text)
|
||||
if next_data:
|
||||
listing = extract_from_next_data(next_data)
|
||||
if listing:
|
||||
listing["_source"] = "next_data"
|
||||
return listing
|
||||
|
||||
# Fallback to HTML parsing
|
||||
data = extract_from_html(html_text, url)
|
||||
data["_source"] = "html_fallback"
|
||||
return data
|
||||
|
||||
|
||||
def run(limit: int | None = None):
|
||||
"""Run the full Gathered Here crawl.
|
||||
|
||||
Args:
|
||||
limit: If set, only crawl this many profiles (for testing).
|
||||
"""
|
||||
db = get_db()
|
||||
log_id = start_crawl_log(db, SOURCE_NAME)
|
||||
print(f"[{SOURCE_NAME}] Starting crawl (log_id={log_id})")
|
||||
|
||||
found = 0
|
||||
new = 0
|
||||
skipped = 0
|
||||
errors = 0
|
||||
|
||||
try:
|
||||
# Step 1: Get all profile URLs from sitemap
|
||||
print(" Fetching sitemap...", end=" ", flush=True)
|
||||
urls = fetch_all_listing_urls()
|
||||
print(f"{len(urls)} profile URLs found")
|
||||
|
||||
if limit:
|
||||
urls = urls[:limit]
|
||||
print(f" (limited to {limit} for testing)")
|
||||
|
||||
# Step 2: Crawl each profile
|
||||
for i, url in enumerate(urls):
|
||||
slug = url.rstrip("/").split("/")[-1]
|
||||
|
||||
if (i + 1) % 50 == 0 or i == 0:
|
||||
print(f" Crawling {i+1}/{len(urls)}: {slug}")
|
||||
|
||||
listing_data = crawl_profile(url)
|
||||
found += 1
|
||||
|
||||
if not listing_data:
|
||||
errors += 1
|
||||
continue
|
||||
|
||||
source_id = slug
|
||||
row_id = store_source_record(
|
||||
db, SOURCE_NAME, source_id, url, listing_data, log_id
|
||||
)
|
||||
|
||||
if row_id:
|
||||
normalized = to_normalized(listing_data, url)
|
||||
db.execute(
|
||||
"UPDATE source_record SET normalized_data = ? WHERE id = ?",
|
||||
(json.dumps(normalized), row_id)
|
||||
)
|
||||
new += 1
|
||||
else:
|
||||
skipped += 1
|
||||
|
||||
if (i + 1) % 10 == 0:
|
||||
db.commit() # periodic commit
|
||||
|
||||
time.sleep(CRAWL_DELAY)
|
||||
|
||||
db.commit()
|
||||
finish_crawl_log(db, log_id, found, new, 0, skipped)
|
||||
print(f"[{SOURCE_NAME}] Done: {found} found, {new} new, "
|
||||
f"{skipped} skipped, {errors} errors")
|
||||
|
||||
except Exception as e:
|
||||
finish_crawl_log(db, log_id, found, new, 0, skipped, "failed", str(e))
|
||||
raise
|
||||
finally:
|
||||
db.close()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import sys
|
||||
limit = int(sys.argv[1]) if len(sys.argv) > 1 else None
|
||||
run(limit=limit)
|
||||
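The price cleanup inside `extract_pricing` can be looked at in isolation. The following is a minimal sketch, assuming the listing meta values are strings such as "$2,299"; the helper name `parse_price` is hypothetical and does not exist in the crawler.

```python
from __future__ import annotations

import re


def parse_price(raw) -> float | None:
    """Hypothetical helper mirroring the cleanup used in extract_pricing."""
    # Drop everything except digits and the decimal point ("$2,299" -> "2299")
    cleaned = re.sub(r"[^\d.]", "", str(raw))
    if not cleaned:
        return None
    try:
        return float(cleaned)
    except ValueError:
        # A value like "1.2.3" survives the regex but is not a valid float
        return None


if __name__ == "__main__":
    # Made-up sample values for illustration only
    for sample in ["$2,299", "4500.50", "POA", "1.2.3"]:
        print(sample, "->", parse_price(sample))
```

Returning `None` for unparseable values matches the crawler's behaviour of leaving the label out of the pricing dict.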
163
crawlers/crawl_nfda.py
Normal file
@@ -0,0 +1,163 @@
|
||||
"""Crawler for the NFDA (National Funeral Directors Association) directory.
|
||||
|
||||
Source: https://nfda.com.au/find-your-local-nfda-member/
|
||||
Method: WPSL JSON API (GET requests with lat/lng search)
|
||||
Fields: name, address, city, state, postcode, lat/lng, phone, email
|
||||
"""
|
||||
|
||||
import time
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
from base import (
|
||||
fetch_json, get_db, start_crawl_log, finish_crawl_log,
|
||||
store_source_record, normalize_phone, normalize_state,
|
||||
generate_slug, to_intermediate, CRAWL_DELAY,
|
||||
)
|
||||
|
||||
SOURCE_NAME = "nfda"
|
||||
API_URL = "https://nfda.com.au/wp-admin/admin-ajax.php"
|
||||
|
||||
# Search centroids covering Australia with large radius
|
||||
SEARCH_POINTS = [
|
||||
{"name": "Sydney", "lat": -33.87, "lng": 151.21},
|
||||
{"name": "Melbourne", "lat": -37.81, "lng": 144.96},
|
||||
{"name": "Brisbane", "lat": -27.47, "lng": 153.03},
|
||||
{"name": "Perth", "lat": -31.95, "lng": 115.86},
|
||||
{"name": "Adelaide", "lat": -34.93, "lng": 138.60},
|
||||
{"name": "Hobart", "lat": -42.88, "lng": 147.33},
|
||||
{"name": "Darwin", "lat": -12.46, "lng": 130.85},
|
||||
{"name": "Townsville", "lat": -19.26, "lng": 146.82},
|
||||
{"name": "Central NSW", "lat": -30.0, "lng": 150.0},
|
||||
{"name": "Inland QLD", "lat": -23.0, "lng": 145.0},
|
||||
]
|
||||
|
||||
|
||||
def fetch_members(lat: float, lng: float, max_results: int = 50,
|
||||
radius: int = 5000) -> list[dict]:
|
||||
"""Fetch NFDA members near a given lat/lng."""
|
||||
params = {
|
||||
"action": "store_search",
|
||||
"lat": str(lat),
|
||||
"lng": str(lng),
|
||||
"max_results": str(max_results),
|
||||
"search_radius": str(radius),
|
||||
"autoload": "1",
|
||||
}
|
||||
data = fetch_json(API_URL, method="GET", data=params)
|
||||
if isinstance(data, list):
|
||||
return data
|
||||
return []
|
||||
|
||||
|
||||
def to_normalized(record: dict) -> dict:
|
||||
"""Convert an NFDA record to intermediate format."""
|
||||
state = normalize_state(record.get("state", ""))
|
||||
|
||||
business = {
|
||||
"name": record.get("store", "").strip(),
|
||||
"abn": None,
|
||||
"phone": normalize_phone(record.get("phone")),
|
||||
"email": record.get("email", "").strip() or None,
|
||||
"website": record.get("url", "").strip() or None,
|
||||
"description": None,
|
||||
}
|
||||
|
||||
lat_val = record.get("lat")
|
||||
lng_val = record.get("lng")
|
||||
try:
|
||||
lat_val = float(lat_val) if lat_val else None
|
||||
lng_val = float(lng_val) if lng_val else None
|
||||
except (ValueError, TypeError):
|
||||
lat_val = lng_val = None
|
||||
|
||||
city = record.get("city", "").strip()
|
||||
# Normalize city casing (some are ALL CAPS)
|
||||
if city and city == city.upper():
|
||||
city = city.title()
|
||||
|
||||
locations = [{
|
||||
"address": record.get("address", "").strip(),
|
||||
"suburb": city,
|
||||
"state": state,
|
||||
"postcode": record.get("zip", "").strip(),
|
||||
"lat": lat_val,
|
||||
"lng": lng_val,
|
||||
"phone": normalize_phone(record.get("phone")),
|
||||
}]
|
||||
|
||||
source_id = str(record.get("id", ""))
|
||||
return to_intermediate(
|
||||
source=SOURCE_NAME,
|
||||
source_id=source_id,
|
||||
source_url="https://nfda.com.au/find-your-local-nfda-member/",
|
||||
business=business,
|
||||
locations=locations,
|
||||
)
|
||||
|
||||
|
||||
def run():
|
||||
"""Run the full NFDA crawl."""
|
||||
db = get_db()
|
||||
log_id = start_crawl_log(db, SOURCE_NAME)
|
||||
print(f"[{SOURCE_NAME}] Starting crawl (log_id={log_id})")
|
||||
|
||||
seen_ids = set()
|
||||
all_records = []
|
||||
found = 0
|
||||
new = 0
|
||||
skipped = 0
|
||||
|
||||
try:
|
||||
for point in SEARCH_POINTS:
|
||||
print(f" Searching near {point['name']}...", end=" ", flush=True)
|
||||
members = fetch_members(point["lat"], point["lng"])
|
||||
new_count = 0
|
||||
|
||||
for member in members:
|
||||
member_id = str(member.get("id", ""))
|
||||
if member_id in seen_ids:
|
||||
continue
|
||||
seen_ids.add(member_id)
|
||||
all_records.append(member)
|
||||
new_count += 1
|
||||
|
||||
print(f"{len(members)} results, {new_count} new unique")
|
||||
found += len(members)
|
||||
time.sleep(CRAWL_DELAY)
|
||||
|
||||
print(f" Total unique members: {len(all_records)}")
|
||||
|
||||
# Store records
|
||||
for record in all_records:
|
||||
source_id = str(record.get("id", ""))
|
||||
row_id = store_source_record(
|
||||
db, SOURCE_NAME, source_id,
|
||||
"https://nfda.com.au/find-your-local-nfda-member/",
|
||||
record, log_id
|
||||
)
|
||||
if row_id:
|
||||
normalized = to_normalized(record)
|
||||
db.execute(
|
||||
"UPDATE source_record SET normalized_data = ? WHERE id = ?",
|
||||
(json.dumps(normalized), row_id)
|
||||
)
|
||||
new += 1
|
||||
else:
|
||||
skipped += 1
|
||||
|
||||
db.commit()
|
||||
finish_crawl_log(db, log_id, found, new, 0, skipped)
|
||||
print(f"[{SOURCE_NAME}] Done: {found} found, {new} new, {skipped} skipped")
|
||||
|
||||
except Exception as e:
|
||||
finish_crawl_log(db, log_id, found, new, 0, skipped, "failed", str(e))
|
||||
raise
|
||||
finally:
|
||||
db.close()
|
||||
|
||||
return all_records
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
run()
|
||||
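Because every search point in the NFDA crawler uses a large radius, neighbouring searches return overlapping members, and `run()` keeps only the first occurrence of each `id`. The sketch below shows that step on its own. The helper name `merge_unique` and the sample records are hypothetical, not real NFDA data.

```python
from __future__ import annotations


def merge_unique(batches: list[list[dict]]) -> list[dict]:
    """Hypothetical helper: keep the first occurrence of each member id."""
    seen_ids: set[str] = set()
    unique: list[dict] = []
    for batch in batches:
        for member in batch:
            # Ids are compared as strings, as in the crawler's run() loop
            member_id = str(member.get("id", ""))
            if member_id in seen_ids:
                continue
            seen_ids.add(member_id)
            unique.append(member)
    return unique


# Made-up sample data standing in for two overlapping search results
sydney = [{"id": 1, "store": "Example A"}, {"id": 2, "store": "Example B"}]
melbourne = [{"id": "2", "store": "Example B"}, {"id": 3, "store": "Example C"}]
merged = merge_unique([sydney, melbourne])
print([str(m["id"]) for m in merged])
```

One caveat of keying on `str(member.get("id", ""))`: records without an `id` all collapse into a single entry under the empty string.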
220
crawlers/crawl_vic_register.py
Normal file
@@ -0,0 +1,220 @@
|
||||
"""Crawler for the VIC Consumer Affairs Public Register of Funeral Providers.
|
||||
|
||||
Source: https://registers.consumer.vic.gov.au/fpsearch
|
||||
Method: HTTP GET per letter A-Z, parse HTML tables
|
||||
Fields: name, place of business, postcode, postal address, phone
|
||||
"""
|
||||
|
||||
import re
|
||||
import time
|
||||
import json
|
||||
import html.parser
|
||||
from pathlib import Path
|
||||
|
||||
from base import (
|
||||
fetch_url, get_db, start_crawl_log, finish_crawl_log,
|
||||
store_source_record, normalize_phone, generate_slug,
|
||||
to_intermediate, CRAWL_DELAY,
|
||||
)
|
||||
|
||||
SOURCE_NAME = "vic_register"
|
||||
BASE_URL = "https://registers.consumer.vic.gov.au/FpSearch/PerformSearch"
|
||||
|
||||
|
||||
class VICTableParser(html.parser.HTMLParser):
|
||||
"""Parse the VIC register HTML table into records."""
|
||||
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
self.records = []
|
||||
self._in_table = False
|
||||
self._in_tbody = False
|
||||
self._in_row = False
|
||||
self._in_cell = False
|
||||
self._current_row = []
|
||||
self._current_cell = ""
|
||||
|
||||
def handle_starttag(self, tag, attrs):
|
||||
if tag == "table":
|
||||
self._in_table = True
|
||||
elif tag == "tbody" and self._in_table:
|
||||
self._in_tbody = True
|
||||
elif tag == "tr" and self._in_tbody:
|
||||
self._in_row = True
|
||||
self._current_row = []
|
||||
elif tag == "td" and self._in_row:
|
||||
self._in_cell = True
|
||||
self._current_cell = ""
|
||||
|
||||
def handle_endtag(self, tag):
|
||||
if tag == "td" and self._in_cell:
|
||||
self._in_cell = False
|
||||
self._current_row.append(self._current_cell.strip())
|
||||
elif tag == "tr" and self._in_row:
|
||||
self._in_row = False
|
||||
if len(self._current_row) >= 4:
|
||||
self.records.append(self._current_row)
|
||||
elif tag == "tbody":
|
||||
self._in_tbody = False
|
||||
elif tag == "table":
|
||||
self._in_table = False
|
||||
|
||||
def handle_data(self, data):
|
||||
if self._in_cell:
|
||||
self._current_cell += data
|
||||
|
||||
|
||||
def parse_address(place_of_business: str) -> dict:
|
||||
"""Parse a VIC register address into components."""
|
||||
parts = place_of_business.strip()
|
||||
# Try to extract postcode from the end
|
||||
postcode_match = re.search(r'\b(\d{4})\s*$', parts)
|
||||
postcode = postcode_match.group(1) if postcode_match else None
|
||||
|
||||
    # Try to extract suburb (usually the last word(s) before postcode)
    suburb = None
    if postcode:
        before_postcode = parts[:postcode_match.start()].strip().rstrip(",").strip()
        # Last segment after comma is usually suburb
        if "," in before_postcode:
            suburb = before_postcode.split(",")[-1].strip()
        else:
            # No comma: fall back to the trailing 1-2 words as a best guess
            words = before_postcode.split()
            if len(words) >= 2:
                suburb = " ".join(words[-2:]) if words[-1][0].isupper() else words[-1]
            elif words:
                # Only one word precedes the postcode, so treat it as the suburb
                suburb = words[0]
|
||||
|
||||
return {
|
||||
"address": parts,
|
||||
"suburb": suburb,
|
||||
"state": "VIC",
|
||||
"postcode": postcode,
|
||||
}
|
||||
|
||||
|
||||
def crawl_letter(letter: str) -> list[dict]:
|
||||
"""Crawl all records for a single letter."""
|
||||
url = f"{BASE_URL}?Letter={letter}"
|
||||
html_text = fetch_url(url)
|
||||
|
||||
parser = VICTableParser()
|
||||
parser.feed(html_text)
|
||||
|
||||
records = []
|
||||
for row in parser.records:
|
||||
# Columns: Name, Place of Business, Postcode, Postal Address, Phone
|
||||
name = row[0] if len(row) > 0 else ""
|
||||
place = row[1] if len(row) > 1 else ""
|
||||
postcode = row[2] if len(row) > 2 else ""
|
||||
postal = row[3] if len(row) > 3 else ""
|
||||
phone = row[4] if len(row) > 4 else ""
|
||||
|
||||
if not name:
|
||||
continue
|
||||
|
||||
records.append({
|
||||
"name": name.strip(),
|
||||
"place_of_business": place.strip(),
|
||||
"postcode": postcode.strip(),
|
||||
"postal_address": postal.strip(),
|
||||
"phone": phone.strip(),
|
||||
})
|
||||
|
||||
return records
|
||||
|
||||
|
||||
def make_source_id(record: dict) -> str:
    """Create a stable source ID from the slugified name and the postcode."""
    name = record["name"].lower().strip()
    return f"{generate_slug(name)}_{record['postcode']}"
|
||||
|
||||
|
||||
def to_normalized(record: dict) -> dict:
|
||||
"""Convert a VIC register record to intermediate format."""
|
||||
addr = parse_address(record["place_of_business"])
|
||||
|
||||
business = {
|
||||
"name": record["name"],
|
||||
"abn": None,
|
||||
"phone": normalize_phone(record["phone"]),
|
||||
"email": None,
|
||||
"website": None,
|
||||
"description": None,
|
||||
}
|
||||
|
||||
locations = [{
|
||||
"address": record["place_of_business"],
|
||||
"suburb": addr["suburb"],
|
||||
"state": "VIC",
|
||||
"postcode": record["postcode"] or addr["postcode"],
|
||||
"lat": None,
|
||||
"lng": None,
|
||||
"phone": normalize_phone(record["phone"]),
|
||||
}]
|
||||
|
||||
source_id = make_source_id(record)
|
||||
return to_intermediate(
|
||||
source=SOURCE_NAME,
|
||||
source_id=source_id,
|
||||
source_url=f"{BASE_URL}?Letter={record['name'][0].upper()}",
|
||||
business=business,
|
||||
locations=locations,
|
||||
)
|
||||
|
||||
|
||||
def run():
|
||||
"""Run the full VIC register crawl."""
|
||||
db = get_db()
|
||||
log_id = start_crawl_log(db, SOURCE_NAME)
|
||||
print(f"[{SOURCE_NAME}] Starting crawl (log_id={log_id})")
|
||||
|
||||
all_records = []
|
||||
found = 0
|
||||
new = 0
|
||||
skipped = 0
|
||||
|
||||
try:
|
||||
for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
|
||||
print(f" Crawling letter {letter}...", end=" ", flush=True)
|
||||
records = crawl_letter(letter)
|
||||
print(f"{len(records)} records")
|
||||
all_records.extend(records)
|
||||
found += len(records)
|
||||
|
||||
if letter != "Z":
|
||||
time.sleep(CRAWL_DELAY)
|
||||
|
||||
# Store and normalize
|
||||
for record in all_records:
|
||||
source_id = make_source_id(record)
|
||||
row_id = store_source_record(
|
||||
db, SOURCE_NAME, source_id,
|
||||
f"{BASE_URL}?Letter={record['name'][0].upper()}",
|
||||
record, log_id
|
||||
)
|
||||
if row_id:
|
||||
normalized = to_normalized(record)
|
||||
db.execute(
|
||||
"UPDATE source_record SET normalized_data = ? WHERE id = ?",
|
||||
(json.dumps(normalized), row_id)
|
||||
)
|
||||
new += 1
|
||||
else:
|
||||
skipped += 1
|
||||
|
||||
db.commit()
|
||||
finish_crawl_log(db, log_id, found, new, 0, skipped)
|
||||
print(f"[{SOURCE_NAME}] Done: {found} found, {new} new, {skipped} skipped")
|
||||
|
||||
except Exception as e:
|
||||
finish_crawl_log(db, log_id, found, new, 0, skipped, "failed", str(e))
|
||||
raise
|
||||
finally:
|
||||
db.close()
|
||||
|
||||
return all_records
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
run()
|
||||
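The postcode step of `parse_address` relies on a trailing four-digit match. The following is a minimal sketch of that step alone. The helper name `split_postcode` and the sample addresses are hypothetical, and real register rows may be formatted differently.

```python
from __future__ import annotations

import re


def split_postcode(address: str) -> tuple[str, str | None]:
    """Hypothetical helper: split a trailing 4-digit postcode off an address."""
    text = address.strip()
    # Same pattern as parse_address: four digits at the very end of the string
    match = re.search(r"\b(\d{4})\s*$", text)
    if not match:
        return text, None
    # Everything before the postcode, minus a trailing comma and whitespace
    rest = text[:match.start()].strip().rstrip(",").strip()
    return rest, match.group(1)


# Made-up addresses for illustration only
print(split_postcode("12 Example St, Sampletown 3000"))
print(split_postcode("PO Box 5"))
```

Addresses with no trailing postcode come back unchanged with `None`, which is why `to_normalized` prefers the register's own postcode column and only falls back to the parsed value.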
425
crawlers/dedup.py
Normal file
@@ -0,0 +1,425 @@
|
||||
"""Deduplication and merge engine.
|
||||
|
||||
Processes source_records → funeral_brand + location + package entries.
|
||||
Handles cross-source matching and field-level merging.
|
||||
|
||||
Matching hierarchy (strongest to weakest):
  1. source_key match: same record from same source (skipped)
  2. ABN match: same business entity
  3. Normalized name + postcode exact match: likely same business
  4. Fuzzy name match (>= 85%) + same state: probable match, stored with
     match_type "fuzzy" so it can be reviewed
|
||||
|
||||
Merge priority (higher = preferred):
|
||||
vic_register > funerals_australia > nfda > gathered_here
|
||||
|
||||
Never overwrite verified provider data.
|
||||
"""
|
||||
|
||||
import json
|
||||
import re
|
||||
import sqlite3
|
||||
from difflib import SequenceMatcher
|
||||
|
||||
from base import get_db, generate_slug, normalize_state
|
||||
|
||||
# Source priority for merge conflicts (higher number = more authoritative)
|
||||
SOURCE_PRIORITY = {
|
||||
"vic_register": 40,
|
||||
"funerals_australia": 30,
|
||||
"nfda": 20,
|
||||
"gathered_here": 10,
|
||||
}
|
||||
|
||||
|
||||
def normalize_name(name: str) -> str:
    """Normalize a business name for comparison."""
    name = name.strip().upper()
    # Remove common suffixes. Order matters: " PROPRIETARY LIMITED" must come
    # before " LIMITED", otherwise only " LIMITED" is stripped and
    # " PROPRIETARY" is left behind. Legal suffixes precede trade suffixes so
    # "X FUNERALS PTY LTD" reduces to "X" in one pass.
    for suffix in [" PTY LTD", " PTY. LTD.", " P/L", " PROPRIETARY LIMITED",
                   " LIMITED", " INC", " LLC",
                   " FUNERAL DIRECTORS", " FUNERAL SERVICES",
                   " FUNERALS", " FUNERAL HOME"]:
        name = name.removesuffix(suffix)
    # Replace punctuation with spaces (straight and curly apostrophes included)
    name = re.sub(r"['\u2019`\".,&()-]", " ", name)
    # Collapse runs of whitespace
    name = re.sub(r"\s+", " ", name).strip()
    return name
|
||||
|
||||
|
||||
def fuzzy_match(name1: str, name2: str) -> float:
|
||||
"""Return similarity ratio between two names (0.0 to 1.0)."""
|
||||
n1 = normalize_name(name1)
|
||||
n2 = normalize_name(name2)
|
||||
return SequenceMatcher(None, n1, n2).ratio()
|
||||
|
||||
|
||||
def find_existing_brand(db: sqlite3.Connection, record: dict) -> tuple[int | None, str]:
|
||||
"""Find a matching funeral_brand for a source record.
|
||||
|
||||
Returns (brand_id, match_type) or (None, 'new').
|
||||
"""
|
||||
biz = record.get("business", {})
|
||||
locs = record.get("locations", [])
|
||||
name = biz.get("name", "")
|
||||
abn = biz.get("abn")
|
||||
source = record.get("source", "")
|
||||
source_id = record.get("sourceId", "")
|
||||
source_key = f"{source}:{source_id}"
|
||||
|
||||
postcode = None
|
||||
state = None
|
||||
if locs:
|
||||
postcode = locs[0].get("postcode")
|
||||
state = locs[0].get("state")
|
||||
|
||||
# 1. Source key match (exact same record from same source)
|
||||
row = db.execute(
|
||||
"SELECT id FROM funeral_brand WHERE source_key = ?",
|
||||
(source_key,)
|
||||
).fetchone()
|
||||
if row:
|
||||
return row["id"], "source_key"
|
||||
|
||||
# 2. ABN match
|
||||
if abn:
|
||||
row = db.execute(
|
||||
"SELECT id FROM funeral_brand WHERE abn = ?",
|
||||
(abn,)
|
||||
).fetchone()
|
||||
if row:
|
||||
return row["id"], "abn"
|
||||
|
||||
    # 3. Exact normalized-name + postcode match
    if name and postcode:
        norm = normalize_name(name)
        # Names are normalized in Python, so fetch every brand in the postcode
        # and compare the normalized titles
        rows = db.execute(
            "SELECT id, title FROM funeral_brand WHERE business_postcode = ?",
            (postcode,)
        ).fetchall()
        for row in rows:
            if normalize_name(row["title"]) == norm:
                return row["id"], "name_postcode"
|
||||
|
||||
    # 4. Fuzzy name + same state: keep the best-scoring candidate rather than
    # the first row that happens to clear the threshold
    if name and state:
        rows = db.execute(
            "SELECT id, title FROM funeral_brand WHERE business_state = ?",
            (state,)
        ).fetchall()
        best_id, best_score = None, 0.0
        for row in rows:
            score = fuzzy_match(name, row["title"])
            if score >= 0.85 and score > best_score:
                best_id, best_score = row["id"], score
        if best_id is not None:
            return best_id, "fuzzy"

    return None, "new"
|
||||
|
||||
|
||||
def merge_field(existing: str | None, new_val: str | None,
|
||||
existing_priority: int, new_priority: int) -> str | None:
|
||||
"""Merge a single field, preferring non-null and higher-priority."""
|
||||
if not new_val:
|
||||
return existing
|
||||
if not existing:
|
||||
return new_val
|
||||
# Both have values — prefer higher priority source
|
||||
if new_priority > existing_priority:
|
||||
return new_val
|
||||
return existing
|
||||
|
||||
|
||||
def create_brand(db: sqlite3.Connection, record: dict) -> int:
|
||||
"""Create a new funeral_brand from a source record."""
|
||||
biz = record.get("business", {})
|
||||
locs = record.get("locations", [])
|
||||
source = record.get("source", "")
|
||||
source_id = record.get("sourceId", "")
|
||||
source_key = f"{source}:{source_id}"
|
||||
|
||||
loc = locs[0] if locs else {}
|
||||
slug = generate_slug(biz.get("name", "unknown"))
|
||||
|
||||
# Ensure unique slug
|
||||
base_slug = slug
|
||||
counter = 1
|
||||
while True:
|
||||
existing = db.execute(
|
||||
"SELECT id FROM funeral_brand WHERE code = ?", (slug,)
|
||||
).fetchone()
|
||||
if not existing:
|
||||
break
|
||||
slug = f"{base_slug}-{counter}"
|
||||
counter += 1
|
||||
|
||||
cur = db.execute(
|
||||
"""INSERT INTO funeral_brand (
|
||||
title, description, email, phone, website, abn, code,
|
||||
hidden, verified, source_key, source_url, enrichment_status,
|
||||
business_address, business_suburb, business_state, business_postcode
|
||||
) VALUES (?, ?, ?, ?, ?, ?, ?, 1, 0, ?, ?, 'pending', ?, ?, ?, ?)""",
|
||||
(
|
||||
biz.get("name"),
|
||||
biz.get("description"),
|
||||
biz.get("email"),
|
||||
biz.get("phone"),
|
||||
biz.get("website"),
|
||||
biz.get("abn"),
|
||||
slug,
|
||||
source_key,
|
||||
record.get("sourceUrl"),
|
||||
loc.get("address"),
|
||||
loc.get("suburb"),
|
||||
loc.get("state"),
|
||||
loc.get("postcode"),
|
||||
)
|
||||
)
|
||||
brand_id = cur.lastrowid
|
||||
|
||||
# Create locations
|
||||
for loc_data in locs:
|
||||
title_parts = [loc_data.get("suburb", ""), loc_data.get("state", "")]
|
||||
loc_title = ", ".join(p for p in title_parts if p) or biz.get("name", "")
|
||||
|
||||
db.execute(
|
||||
"""INSERT INTO location (
|
||||
title, address, suburb, state, postcode, lat, lng, brand_id
|
||||
) VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
|
||||
(
|
||||
loc_title,
|
||||
loc_data.get("address"),
|
||||
loc_data.get("suburb"),
|
||||
loc_data.get("state"),
|
||||
loc_data.get("postcode"),
|
||||
loc_data.get("lat"),
|
||||
loc_data.get("lng"),
|
||||
brand_id,
|
||||
)
|
||||
)
|
||||
|
||||
# Create packages (from Gathered Here pricing)
|
||||
packages = record.get("packages", [])
|
||||
for pkg in packages:
|
||||
if not pkg.get("price"):
|
||||
continue
|
||||
cur = db.execute(
|
||||
"""INSERT INTO package (
|
||||
title, funeral_type, brand_id, source_url, extraction_confidence
|
||||
) VALUES (?, ?, ?, ?, ?)""",
|
||||
(
|
||||
pkg.get("name"),
|
||||
pkg.get("funeralType"),
|
||||
brand_id,
|
||||
record.get("sourceUrl"),
|
||||
0.8, # Gathered Here pricing is structured, fairly reliable
|
||||
)
|
||||
)
|
||||
pkg_id = cur.lastrowid
|
||||
|
||||
# Create inclusions if available
|
||||
for inc in pkg.get("inclusions", []):
|
||||
db.execute(
|
||||
"""INSERT INTO package_inclusion (
|
||||
price, optional, complimentary, inclusion_type_title, package_id
|
||||
) VALUES (?, ?, ?, ?, ?)""",
|
||||
(
|
||||
inc.get("price", 0),
|
||||
1 if inc.get("optional") else 0,
|
||||
1 if inc.get("complimentary") else 0,
|
||||
inc.get("item", "Unknown"),
|
||||
pkg_id,
|
||||
)
|
||||
)
|
||||
|
||||
return brand_id
|
||||
|
||||
|
||||
def update_brand(db: sqlite3.Connection, brand_id: int,
|
||||
record: dict, match_type: str) -> bool:
|
||||
"""Merge new data into an existing brand. Returns True if updated."""
|
||||
biz = record.get("business", {})
|
||||
locs = record.get("locations", [])
|
||||
source = record.get("source", "")
|
||||
new_priority = SOURCE_PRIORITY.get(source, 0)
|
||||
|
||||
# Never overwrite verified providers
|
||||
brand = db.execute(
|
||||
"SELECT * FROM funeral_brand WHERE id = ?", (brand_id,)
|
||||
).fetchone()
|
||||
if brand["verified"]:
|
||||
return False
|
||||
|
||||
# Determine existing source priority
|
||||
existing_source = ""
|
||||
if brand["source_key"]:
|
||||
existing_source = brand["source_key"].split(":")[0]
|
||||
existing_priority = SOURCE_PRIORITY.get(existing_source, 0)
|
||||
|
||||
# Field-level merge — only fill blanks or upgrade from higher priority
|
||||
updates = {}
|
||||
field_map = {
|
||||
"description": biz.get("description"),
|
||||
"email": biz.get("email"),
|
||||
"phone": biz.get("phone"),
|
||||
"website": biz.get("website"),
|
||||
"abn": biz.get("abn"),
|
||||
}
|
||||
|
||||
for field, new_val in field_map.items():
|
||||
merged = merge_field(brand[field], new_val, existing_priority, new_priority)
|
||||
if merged != brand[field]:
|
||||
updates[field] = merged
|
||||
|
||||
# Update location data if we have coords and existing doesn't
|
||||
if locs:
|
||||
loc = locs[0]
|
||||
existing_locs = db.execute(
|
||||
"SELECT * FROM location WHERE brand_id = ?", (brand_id,)
|
||||
).fetchall()
|
||||
|
||||
if not existing_locs and loc.get("suburb"):
|
||||
title_parts = [loc.get("suburb", ""), loc.get("state", "")]
|
||||
loc_title = ", ".join(p for p in title_parts if p)
|
||||
db.execute(
|
||||
"""INSERT INTO location (
|
||||
title, address, suburb, state, postcode, lat, lng, brand_id
|
||||
) VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
|
||||
(
|
||||
loc_title, loc.get("address"), loc.get("suburb"),
|
||||
loc.get("state"), loc.get("postcode"),
|
||||
loc.get("lat"), loc.get("lng"), brand_id,
|
||||
)
|
||||
)
|
||||
elif existing_locs:
|
||||
# Update first location with coords if missing
|
||||
eloc = existing_locs[0]
|
||||
if not eloc["lat"] and loc.get("lat"):
|
||||
db.execute(
|
||||
"UPDATE location SET lat = ?, lng = ? WHERE id = ?",
|
||||
(loc.get("lat"), loc.get("lng"), eloc["id"])
|
||||
)
|
||||
|
||||
# Add packages if we have them and brand doesn't yet
|
||||
packages = record.get("packages", [])
|
||||
if packages:
|
||||
existing_pkgs = db.execute(
|
||||
"SELECT COUNT(*) as n FROM package WHERE brand_id = ?", (brand_id,)
|
||||
).fetchone()["n"]
|
||||
|
||||
if existing_pkgs == 0:
|
||||
for pkg in packages:
|
||||
if not pkg.get("price"):
|
||||
continue
|
||||
cur = db.execute(
|
||||
"""INSERT INTO package (
|
||||
title, funeral_type, brand_id, source_url
|
||||
) VALUES (?, ?, ?, ?)""",
|
||||
(pkg.get("name"), pkg.get("funeralType"),
|
||||
brand_id, record.get("sourceUrl"))
|
||||
)
|
||||
|
||||
if updates:
|
||||
set_clause = ", ".join(f"{k} = ?" for k in updates)
|
||||
values = list(updates.values()) + [brand_id]
|
||||
db.execute(
|
||||
f"UPDATE funeral_brand SET {set_clause}, updated_at = datetime('now') WHERE id = ?",
|
||||
values
|
||||
)
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
|
||||
def process_all():
|
||||
"""Process all source_records through deduplication and create brand entries.
|
||||
|
||||
Order matters: process higher-priority sources first so their data
|
||||
forms the base record that lower-priority sources merge into.
|
||||
"""
|
||||
db = get_db()
|
||||
|
||||
# Process in priority order (highest first)
|
||||
sources_ordered = sorted(SOURCE_PRIORITY.keys(),
|
||||
key=lambda s: SOURCE_PRIORITY[s], reverse=True)
|
||||
|
||||
stats = {"new": 0, "updated": 0, "skipped": 0, "matched": 0}
|
||||
|
||||
print("=" * 60)
|
||||
print("DEDUPLICATION ENGINE")
|
||||
print("=" * 60)
|
||||
|
||||
for source in sources_ordered:
|
||||
records = db.execute(
|
||||
"""SELECT id, normalized_data FROM source_record
|
||||
WHERE source_name = ? AND normalized_data IS NOT NULL""",
|
||||
(source,)
|
||||
).fetchall()
|
||||
|
||||
if not records:
|
||||
continue
|
||||
|
||||
print(f"\n Processing {source}: {len(records)} records")
|
||||
source_stats = {"new": 0, "updated": 0, "skipped": 0, "matched": 0}
|
||||
|
||||
for row in records:
|
||||
record = json.loads(row["normalized_data"])
|
||||
brand_id, match_type = find_existing_brand(db, record)
|
||||
|
||||
if match_type == "new":
|
||||
brand_id = create_brand(db, record)
|
||||
source_stats["new"] += 1
|
||||
elif match_type == "source_key":
|
||||
source_stats["skipped"] += 1
|
||||
else:
|
||||
# Matched to existing — merge
|
||||
updated = update_brand(db, brand_id, record, match_type)
|
||||
if updated:
|
||||
source_stats["updated"] += 1
|
||||
else:
|
||||
source_stats["matched"] += 1
|
||||
|
||||
# Update source_record with match info
|
||||
db.execute(
|
||||
"""UPDATE source_record
|
||||
SET matched_brand_id = ?, match_type = ?, processed_at = datetime('now')
|
||||
WHERE id = ?""",
|
||||
(brand_id, match_type, row["id"])
|
||||
)
|
||||
|
||||
db.commit()
|
||||
print(f" New: {source_stats['new']}, Updated: {source_stats['updated']}, "
|
||||
f"Matched: {source_stats['matched']}, Skipped: {source_stats['skipped']}")
|
||||
|
||||
for k, v in source_stats.items():
|
||||
stats[k] += v
|
||||
|
||||
# Final summary
|
||||
total_brands = db.execute("SELECT COUNT(*) as n FROM funeral_brand").fetchone()["n"]
|
||||
total_locations = db.execute("SELECT COUNT(*) as n FROM location").fetchone()["n"]
|
||||
total_packages = db.execute("SELECT COUNT(*) as n FROM package").fetchone()["n"]
|
||||
|
||||
print(f"\n{'=' * 60}")
|
||||
print(f"DEDUP RESULTS")
|
||||
print(f"{'=' * 60}")
|
||||
print(f" New brands created: {stats['new']}")
|
||||
print(f" Existing updated: {stats['updated']}")
|
||||
print(f" Matched (no change): {stats['matched']}")
|
||||
print(f" Skipped (source_key): {stats['skipped']}")
|
||||
print(f"\n Total brands in DB: {total_brands}")
|
||||
print(f" Total locations in DB: {total_locations}")
|
||||
print(f" Total packages in DB: {total_packages}")
|
||||
|
||||
# Show match type breakdown
|
||||
print(f"\n Match type breakdown:")
|
||||
rows = db.execute(
|
||||
"""SELECT match_type, COUNT(*) as n
|
||||
FROM source_record WHERE processed_at IS NOT NULL
|
||||
GROUP BY match_type ORDER BY n DESC"""
|
||||
).fetchall()
|
||||
for row in rows:
|
||||
print(f" {row['match_type']:15s} {row['n']:5d}")
|
||||
|
||||
db.close()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
process_all()
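The priority ordering described in `process_all()` can be illustrated with a minimal, standalone sketch. The priority values, source keys and field names below are hypothetical (the real `SOURCE_PRIORITY` mapping is defined elsewhere in `dedup.py` and is not shown here), and the "fill only empty fields" merge rule is an assumption about how lower-priority data is merged, not a copy of `update_brand()`.

```python
# Hypothetical priorities: a higher number means a more trusted source.
SOURCE_PRIORITY = {"vic_register": 3, "funerals_australia": 2, "nfda": 1}


def merge_in_priority_order(records_by_source: dict) -> dict:
    """Merge one provider's records, letting higher-priority sources win."""
    merged = {}
    ordered = sorted(SOURCE_PRIORITY, key=SOURCE_PRIORITY.get, reverse=True)
    for source in ordered:
        for field, value in records_by_source.get(source, {}).items():
            # Lower-priority sources only fill fields that are still empty.
            if value and not merged.get(field):
                merged[field] = value
    return merged
```

Calling `merge_in_priority_order` with one record per source returns a single dict in which the highest-priority non-empty value for each field is kept.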
|
||||
320
crawlers/discover_websites.py
Normal file
@@ -0,0 +1,320 @@
|
||||
"""Website discovery module.
|
||||
|
||||
For each provider without a website URL, attempts to find their website
|
||||
using multiple strategies (tried in order):
|
||||
|
||||
1. Serper.dev (2,500 free Google searches, no CC needed)
|
||||
2. DuckDuckGo lite (free fallback, rate-limited)
|
||||
3. URL pattern guessing (businessname.com.au)
|
||||
|
||||
Also validates discovered URLs to confirm they belong to the business.
|
||||
|
||||
Configuration:
|
||||
    Set the SERPER_API_KEY env var, or "serper_api_key" in config.json, to
    enable Serper.dev. Without it, the module falls back to DuckDuckGo.
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import time
|
||||
import urllib.parse
|
||||
import urllib.request
|
||||
import urllib.error
|
||||
from pathlib import Path
|
||||
|
||||
from base import (
|
||||
fetch_url, get_db, normalize_phone, CRAWL_DELAY,
|
||||
)
|
||||
|
||||
# Load Serper API key from env or config
|
||||
SERPER_API_KEY = os.environ.get("SERPER_API_KEY")
|
||||
if not SERPER_API_KEY:
|
||||
config_path = Path(__file__).parent / "config.json"
|
||||
if config_path.exists():
|
||||
with open(config_path) as f:
|
||||
config = json.load(f)
|
||||
SERPER_API_KEY = config.get("serper_api_key")
|
||||
|
||||
# Domains to skip when extracting search results
|
||||
SKIP_DOMAINS = [
|
||||
"yellowpages", "whitepages", "truelocal", "yelp", "cylex",
|
||||
"australia247", "showmelocal", "hotfrog", "localsearch",
|
||||
"facebook.com", "linkedin.com", "instagram.com", "twitter.com",
|
||||
"gatheredhere", "ezifunerals", "funeralocity", "funeraldirectory",
|
||||
"deathsandfunerals", "mytributes", "obits.com",
|
||||
"duckduckgo.com", "google.com", "bing.com",
|
||||
"nfda.com.au", "funeralsaustralia.org",
|
||||
"wikipedia.org", "youtube.com",
|
||||
]
|
||||
|
||||
|
||||
def search_serper(query: str) -> list[str]:
|
||||
"""Search via Serper.dev (Google results as JSON). 2,500 free queries."""
|
||||
if not SERPER_API_KEY:
|
||||
return []
|
||||
|
||||
url = "https://google.serper.dev/search"
|
||||
data = json.dumps({"q": query, "gl": "au", "num": 10}).encode("utf-8")
|
||||
req = urllib.request.Request(url, data=data, headers={
|
||||
"X-API-KEY": SERPER_API_KEY,
|
||||
"Content-Type": "application/json",
|
||||
})
|
||||
|
||||
try:
|
||||
with urllib.request.urlopen(req, timeout=15) as resp:
|
||||
result = json.loads(resp.read().decode("utf-8"))
|
||||
except Exception:
|
||||
return []
|
||||
|
||||
results = []
|
||||
for item in result.get("organic", []):
|
||||
link = item.get("link", "")
|
||||
if not link:
|
||||
continue
|
||||
if any(d in link.lower() for d in SKIP_DOMAINS):
|
||||
continue
|
||||
results.append(link)
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def search_ddg(query: str) -> list[str]:
|
||||
"""Search DuckDuckGo lite and return result URLs (filtered)."""
|
||||
encoded = urllib.parse.quote(query)
|
||||
url = f"https://lite.duckduckgo.com/lite/?q={encoded}"
|
||||
|
||||
try:
|
||||
html = fetch_url(url)
|
||||
except Exception:
|
||||
return []
|
||||
|
||||
# Extract redirect URLs from DDG lite format
|
||||
raw_links = re.findall(
|
||||
r'href="//duckduckgo\.com/l/\?uddg=([^&"]+)', html
|
||||
)
|
||||
|
||||
results = []
|
||||
for link in raw_links:
|
||||
decoded = urllib.parse.unquote(link)
|
||||
# Skip ads
|
||||
if "ad_domain" in decoded or "ad_provider" in decoded:
|
||||
continue
|
||||
# Skip directory/aggregator sites
|
||||
if any(d in decoded.lower() for d in SKIP_DOMAINS):
|
||||
continue
|
||||
results.append(decoded)
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def validate_url(url: str, business_name: str) -> dict:
|
||||
"""Validate that a URL is a real website belonging to this business.
|
||||
|
||||
Returns: {valid: bool, confidence: str, reason: str}
|
||||
"""
|
||||
try:
|
||||
html = fetch_url(url, timeout=15)
|
||||
except urllib.error.HTTPError as e:
|
||||
return {"valid": False, "confidence": "none", "reason": f"HTTP {e.code}"}
|
||||
except Exception as e:
|
||||
return {"valid": False, "confidence": "none", "reason": str(e)[:100]}
|
||||
|
||||
html_lower = html.lower()
|
||||
|
||||
# Check if it's a parked/for-sale domain
|
||||
parked_signals = ["domain is for sale", "buy this domain",
|
||||
"parked domain", "this domain", "godaddy",
|
||||
"domain parking"]
|
||||
if any(s in html_lower for s in parked_signals):
|
||||
return {"valid": False, "confidence": "none", "reason": "parked domain"}
|
||||
|
||||
    # Check if the page mentions the business name. Only words longer than
    # two characters are counted, so initials or "&" cannot block a match.
    name_parts = [p for p in business_name.lower().split() if len(p) > 2]
    # Require at least 2 name parts to match (or all if fewer are usable)
    min_matches = min(2, len(name_parts))
    matches = sum(1 for part in name_parts if part in html_lower)

    if name_parts and matches >= min_matches:
        return {"valid": True, "confidence": "confirmed", "reason": "name found in page"}
|
||||
|
||||
# Check title tag
|
||||
title_match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
|
||||
if title_match:
|
||||
title = title_match.group(1).lower()
|
||||
if any(part in title for part in name_parts if len(part) > 2):
|
||||
return {"valid": True, "confidence": "probable",
|
||||
"reason": "partial name in title"}
|
||||
|
||||
# Check for funeral-related content (it's at least a funeral business)
|
||||
funeral_signals = ["funeral", "cremation", "burial", "memorial",
|
||||
"chapel", "obituar", "condolence"]
|
||||
if any(s in html_lower for s in funeral_signals):
|
||||
return {"valid": True, "confidence": "probable",
|
||||
"reason": "funeral content found, name not confirmed"}
|
||||
|
||||
return {"valid": False, "confidence": "low",
|
||||
"reason": "business name not found on page"}
|
||||
|
||||
|
||||
def guess_urls(business_name: str) -> list[str]:
|
||||
"""Generate candidate URLs from a business name."""
|
||||
    # Clean the name for domain guessing: drop apostrophes (straight or
    # curly) and backticks, then common company suffixes.
    cleaned = business_name.lower().strip()
    cleaned = re.sub(r"['’`]", "", cleaned)
    cleaned = re.sub(r"\b(pty|ltd|limited|proprietary|inc)\b", "", cleaned)

    # Joined form, e.g. "smithfunerals"
    slug = re.sub(r"[^a-z0-9]+", "", cleaned)
    # Hyphenated form, e.g. "smith-funerals"
    slug_hyphen = re.sub(r"[^a-z0-9]+", "-", cleaned).strip("-")
|
||||
|
||||
candidates = []
|
||||
for s in [slug, slug_hyphen]:
|
||||
if s:
|
||||
candidates.append(f"https://www.{s}.com.au")
|
||||
candidates.append(f"https://{s}.com.au")
|
||||
|
||||
return candidates
|
||||
|
||||
|
||||
def discover_website(name: str, suburb: str | None, state: str | None,
|
||||
phone: str | None = None) -> dict | None:
|
||||
"""Attempt to discover a business website.
|
||||
|
||||
Returns: {url, confidence, method, validation} or None.
|
||||
"""
|
||||
# Build search query
|
||||
query_parts = [name]
|
||||
if suburb:
|
||||
query_parts.append(suburb)
|
||||
if state:
|
||||
query_parts.append(state)
|
||||
query = " ".join(query_parts)
|
||||
|
||||
# Strategy 1: Serper.dev (Google results, 2500 free)
|
||||
results = search_serper(query)
|
||||
|
||||
# Strategy 2: DuckDuckGo fallback
|
||||
if not results:
|
||||
results = search_ddg(query)
|
||||
|
||||
for url in results[:3]:
|
||||
validation = validate_url(url, name)
|
||||
if validation["valid"]:
|
||||
return {
|
||||
"url": url.rstrip("/"),
|
||||
"confidence": validation["confidence"],
|
||||
"method": "search",
|
||||
"validation": validation,
|
||||
}
|
||||
time.sleep(0.5)
|
||||
|
||||
    # Strategy 3: URL guessing
|
||||
candidates = guess_urls(name)
|
||||
for url in candidates:
|
||||
try:
|
||||
validation = validate_url(url, name)
|
||||
if validation["valid"]:
|
||||
return {
|
||||
"url": url.rstrip("/"),
|
||||
"confidence": validation["confidence"],
|
||||
"method": "guess",
|
||||
"validation": validation,
|
||||
}
|
||||
except Exception:
|
||||
continue
|
||||
time.sleep(0.3)
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def run(limit: int | None = None, state_filter: str | None = None):
|
||||
"""Discover websites for all providers without one.
|
||||
|
||||
Args:
|
||||
limit: Max providers to process (for testing).
|
||||
state_filter: Only process providers in this state.
|
||||
"""
|
||||
db = get_db()
|
||||
|
||||
query = """
|
||||
SELECT id, title, business_suburb, business_state, phone
|
||||
FROM funeral_brand
|
||||
WHERE website IS NULL AND verified = 0
|
||||
"""
|
||||
params = []
|
||||
|
||||
if state_filter:
|
||||
query += " AND business_state = ?"
|
||||
params.append(state_filter)
|
||||
|
||||
query += " ORDER BY id"
|
||||
|
||||
    if limit:
        # Bind the limit as a parameter instead of formatting it into the SQL
        query += " LIMIT ?"
        params.append(limit)
|
||||
|
||||
providers = db.execute(query, params).fetchall()
|
||||
print(f"Providers without websites: {len(providers)}")
|
||||
|
||||
found = 0
|
||||
not_found = 0
|
||||
|
||||
for i, prov in enumerate(providers):
|
||||
name = prov["title"]
|
||||
suburb = prov["business_suburb"]
|
||||
state = prov["business_state"]
|
||||
phone = prov["phone"]
|
||||
|
||||
if (i + 1) % 10 == 0 or i == 0:
|
||||
print(f" [{i+1}/{len(providers)}] Processing: {name}")
|
||||
|
||||
result = discover_website(name, suburb, state, phone)
|
||||
|
||||
if result:
|
||||
db.execute(
|
||||
"""UPDATE funeral_brand
|
||||
SET website = ?, updated_at = datetime('now')
|
||||
WHERE id = ?""",
|
||||
(result["url"], prov["id"])
|
||||
)
|
||||
found += 1
|
||||
if (i + 1) <= 20 or result["confidence"] == "confirmed":
|
||||
print(f" FOUND ({result['confidence']}, {result['method']}): "
|
||||
f"{result['url']}")
|
||||
else:
|
||||
not_found += 1
|
||||
|
||||
if (i + 1) % 20 == 0:
|
||||
db.commit()
|
||||
|
||||
# Rate limit: ~2s between providers (DDG + validation requests)
|
||||
time.sleep(CRAWL_DELAY * 2)
|
||||
|
||||
db.commit()
|
||||
print(f"\nDone: {found} websites found, {not_found} not found")
|
||||
print(f" Success rate: {found/(found+not_found)*100:.1f}%" if found + not_found > 0 else "")
|
||||
|
||||
db.close()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import sys
|
||||
limit = None
|
||||
state = None
|
||||
|
||||
for arg in sys.argv[1:]:
|
||||
if arg.startswith("--state="):
|
||||
state = arg.split("=")[1]
|
||||
elif arg.startswith("--limit="):
|
||||
limit = int(arg.split("=")[1])
|
||||
else:
|
||||
try:
|
||||
limit = int(arg)
|
||||
except ValueError:
|
||||
pass
|
||||
|
||||
run(limit=limit, state_filter=state)
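The search helpers above filter results by testing each `SKIP_DOMAINS` entry as a substring of the whole URL, so a directory name that appears only in a path or query string also causes a skip. A stricter variant, sketched here as an assumption rather than as the module's behaviour, compares against the hostname only. The `is_skipped` helper and the shortened `SKIP` list are hypothetical names introduced for illustration.

```python
from urllib.parse import urlparse

# Shortened, illustrative subset of the SKIP_DOMAINS list.
SKIP = ["yellowpages", "facebook.com", "yelp"]


def is_skipped(url: str, skip: list[str] = SKIP) -> bool:
    """Return True when the URL's hostname contains a skip entry."""
    # hostname is None for strings that are not absolute URLs.
    host = (urlparse(url).hostname or "").lower()
    return any(entry in host for entry in skip)
```

With this check, a provider site linked with a `?ref=facebook.com` query string is kept, while a page hosted on `www.facebook.com` is still dropped.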
|
||||
393
crawlers/enrich_websites.py
Normal file
@@ -0,0 +1,393 @@
|
||||
"""Website enrichment module.
|
||||
|
||||
For each provider with a website but no packages yet, crawls their site
|
||||
to find pricing/packages pages and extracts structured data.
|
||||
|
||||
Two extraction modes:
|
||||
1. Direct HTML parsing (for sites with clear pricing structure)
|
||||
2. AI extraction via API call (for complex/varied layouts)
|
||||
|
||||
This module handles the crawling and page discovery.
|
||||
AI extraction is delegated to the N8N workflow (Claude Haiku node).
|
||||
"""
|
||||
|
||||
import json
|
||||
import re
|
||||
import time
|
||||
import urllib.parse
|
||||
import urllib.error
|
||||
from pathlib import Path
|
||||
|
||||
from base import fetch_url, get_db, CRAWL_DELAY
|
||||
|
||||
# Common URL patterns for pricing/packages pages
|
||||
PRICING_PATHS = [
|
||||
"/pricing",
|
||||
"/prices",
|
||||
"/our-prices",
|
||||
"/packages",
|
||||
"/funeral-packages",
|
||||
"/services",
|
||||
"/our-services",
|
||||
"/funeral-costs",
|
||||
"/funeral-services",
|
||||
"/service-options",
|
||||
"/price-list",
|
||||
"/transparency",
|
||||
"/funeral-pricing",
|
||||
"/costs",
|
||||
"/cremation",
|
||||
"/cremation-packages",
|
||||
"/burial",
|
||||
"/plan-a-funeral",
|
||||
"/arrange",
|
||||
]
|
||||
|
||||
# Keywords that suggest a link leads to pricing
|
||||
PRICING_KEYWORDS = [
|
||||
"pric", "cost", "packag", "service", "plan",
|
||||
"cremation", "burial", "funeral",
|
||||
"transparency", "disclosure",
|
||||
]
|
||||
|
||||
|
||||
def find_pricing_page(base_url: str, homepage_html: str) -> str | None:
|
||||
"""Try to find the pricing/packages page URL.
|
||||
|
||||
Strategy:
|
||||
1. Try common URL patterns
|
||||
2. Parse homepage links for pricing-related keywords
|
||||
"""
|
||||
base = base_url.rstrip("/")
|
||||
|
||||
# Strategy 1: Try common paths
|
||||
for path in PRICING_PATHS:
|
||||
test_url = base + path
|
||||
try:
|
||||
html = fetch_url(test_url, timeout=10)
|
||||
# Verify it's not a 404 soft-redirect (check for pricing content)
|
||||
if len(html) > 1000 and ("$" in html or "price" in html.lower()):
|
||||
return test_url
|
||||
        except Exception:  # already covers HTTPError and URLError
|
||||
continue
|
||||
time.sleep(0.3)
|
||||
|
||||
# Strategy 2: Parse homepage links
|
||||
link_pattern = re.compile(
|
||||
r'<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
|
||||
re.IGNORECASE | re.DOTALL
|
||||
)
|
||||
|
||||
for match in link_pattern.finditer(homepage_html):
|
||||
href = match.group(1)
|
||||
text = re.sub(r"<[^>]+>", "", match.group(2)).lower().strip()
|
||||
href_lower = href.lower()
|
||||
|
||||
        # Check if link text or URL contains pricing keywords
        if any(kw in text or kw in href_lower for kw in PRICING_KEYWORDS):
            # Skip links that are not pages (anchors, mail, phone, scripts)
            if href_lower.startswith(("#", "mailto:", "tel:", "javascript:")):
                continue
            # Resolve relative URLs against the site root
            full_url = urllib.parse.urljoin(base + "/", href)
            # Only follow links to the same domain
            if (urllib.parse.urlparse(full_url).netloc
                    != urllib.parse.urlparse(base).netloc):
                continue
|
||||
|
||||
try:
|
||||
html = fetch_url(full_url, timeout=10)
|
||||
if len(html) > 500:
|
||||
return full_url
|
||||
except Exception:
|
||||
continue
|
||||
time.sleep(0.3)
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def extract_description(html: str) -> str | None:
|
||||
"""Extract a business description from homepage HTML."""
|
||||
# Try meta description first
|
||||
meta_match = re.search(
|
||||
r'<meta\s+(?:name="description"\s+content="([^"]+)"|content="([^"]+)"\s+name="description")',
|
||||
html, re.IGNORECASE
|
||||
)
|
||||
if meta_match:
|
||||
desc = meta_match.group(1) or meta_match.group(2)
|
||||
if desc and len(desc) > 20:
|
||||
return desc.strip()
|
||||
|
||||
# Try OG description
|
||||
og_match = re.search(
|
||||
r'<meta\s+property="og:description"\s+content="([^"]+)"',
|
||||
html, re.IGNORECASE
|
||||
)
|
||||
if og_match and len(og_match.group(1)) > 20:
|
||||
return og_match.group(1).strip()
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def extract_contact_info(html: str) -> dict:
|
||||
"""Extract contact details from HTML."""
|
||||
info = {}
|
||||
|
||||
# Phone
|
||||
phone_match = re.search(r'href="tel:([^"]+)"', html)
|
||||
if phone_match:
|
||||
info["phone"] = phone_match.group(1).strip()
|
||||
|
||||
# Email
|
||||
email_match = re.search(r'href="mailto:([^"?]+)"', html)
|
||||
if email_match:
|
||||
info["email"] = email_match.group(1).strip()
|
||||
|
||||
# Address from JSON-LD
|
||||
addr_match = re.search(r'"streetAddress"\s*:\s*"([^"]*)"', html)
|
||||
if addr_match:
|
||||
info["address"] = addr_match.group(1)
|
||||
|
||||
return info
|
||||
|
||||
|
||||
def check_has_pricing(html: str) -> bool:
|
||||
"""Quick check whether a page contains pricing information."""
|
||||
# Look for dollar signs near numbers
|
||||
price_pattern = re.compile(r'\$[\d,]+(?:\.\d{2})?')
|
||||
prices_found = price_pattern.findall(html)
|
||||
|
||||
# Filter out tiny amounts (likely not funeral pricing)
|
||||
significant_prices = []
|
||||
for p in prices_found:
|
||||
cleaned = p.replace("$", "").replace(",", "").strip()
|
||||
if not cleaned:
|
||||
continue
|
||||
try:
|
||||
amount = float(cleaned)
|
||||
except ValueError:
|
||||
continue
|
||||
if amount >= 100:
|
||||
significant_prices.append(amount)
|
||||
|
||||
return len(significant_prices) >= 1
|
||||
|
||||
|
||||
def prepare_for_ai_extraction(html: str) -> str:
|
||||
"""Clean HTML for AI extraction — remove noise, keep content."""
|
||||
# Remove script and style tags
|
||||
cleaned = re.sub(r"<script[^>]*>.*?</script>", "", html,
|
||||
flags=re.DOTALL | re.IGNORECASE)
|
||||
cleaned = re.sub(r"<style[^>]*>.*?</style>", "", cleaned,
|
||||
flags=re.DOTALL | re.IGNORECASE)
|
||||
|
||||
# Remove HTML comments
|
||||
cleaned = re.sub(r"<!--.*?-->", "", cleaned, flags=re.DOTALL)
|
||||
|
||||
# Remove nav, header, footer elements
|
||||
for tag in ["nav", "header", "footer"]:
|
||||
cleaned = re.sub(
|
||||
rf"<{tag}[^>]*>.*?</{tag}>", "", cleaned,
|
||||
flags=re.DOTALL | re.IGNORECASE
|
||||
)
|
||||
|
||||
# Strip remaining tags but keep text
|
||||
text = re.sub(r"<[^>]+>", " ", cleaned)
|
||||
# Collapse whitespace
|
||||
text = re.sub(r"\s+", " ", text).strip()
|
||||
|
||||
# Truncate to ~8000 chars (fits well within Haiku context)
|
||||
if len(text) > 8000:
|
||||
text = text[:8000] + "..."
|
||||
|
||||
return text
|
||||
|
||||
|
||||
def enrich_provider(provider_id: int, website: str, db) -> dict:
|
||||
"""Crawl a provider's website and extract enrichment data.
|
||||
|
||||
Returns a dict with what was found.
|
||||
"""
|
||||
result = {
|
||||
"homepage_fetched": False,
|
||||
"description": None,
|
||||
"contact_info": {},
|
||||
"pricing_page_url": None,
|
||||
"has_pricing": False,
|
||||
"pricing_page_text": None, # cleaned text for AI extraction
|
||||
"pdf_links": [],
|
||||
}
|
||||
|
||||
# Step 1: Fetch homepage
|
||||
try:
|
||||
homepage = fetch_url(website, timeout=15)
|
||||
result["homepage_fetched"] = True
|
||||
except Exception as e:
|
||||
result["error"] = str(e)[:200]
|
||||
return result
|
||||
|
||||
# Step 2: Extract description and contact info
|
||||
result["description"] = extract_description(homepage)
|
||||
result["contact_info"] = extract_contact_info(homepage)
|
||||
|
||||
# Step 3: Find pricing page
|
||||
time.sleep(CRAWL_DELAY)
|
||||
pricing_url = find_pricing_page(website, homepage)
|
||||
|
||||
if pricing_url:
|
||||
result["pricing_page_url"] = pricing_url
|
||||
try:
|
||||
pricing_html = fetch_url(pricing_url, timeout=15)
|
||||
result["has_pricing"] = check_has_pricing(pricing_html)
|
||||
result["pricing_page_text"] = prepare_for_ai_extraction(pricing_html)
|
||||
|
||||
# Check for PDF links
|
||||
pdf_links = re.findall(
|
||||
r'href="([^"]*\.pdf[^"]*)"', pricing_html, re.IGNORECASE
|
||||
)
|
||||
            for pdf_href in pdf_links:
                # Resolve relative links against the page they appeared on
                result["pdf_links"].append(
                    urllib.parse.urljoin(pricing_url, pdf_href))
|
||||
|
||||
except Exception:
|
||||
pass
|
||||
else:
|
||||
# Check homepage itself for pricing
|
||||
if check_has_pricing(homepage):
|
||||
result["has_pricing"] = True
|
||||
result["pricing_page_url"] = website
|
||||
result["pricing_page_text"] = prepare_for_ai_extraction(homepage)
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def run(limit: int | None = None, state_filter: str | None = None):
|
||||
"""Enrich all providers that have a website but no packages."""
|
||||
db = get_db()
|
||||
|
||||
query = """
|
||||
SELECT fb.id, fb.title, fb.website, fb.business_state
|
||||
FROM funeral_brand fb
|
||||
LEFT JOIN package p ON p.brand_id = fb.id
|
||||
WHERE fb.website IS NOT NULL
|
||||
AND fb.verified = 0
|
||||
AND p.id IS NULL
|
||||
"""
|
||||
params = []
|
||||
|
||||
if state_filter:
|
||||
query += " AND fb.business_state = ?"
|
||||
params.append(state_filter)
|
||||
|
||||
query += " ORDER BY fb.id"
|
||||
|
||||
    if limit:
        # Bind the limit as a parameter instead of formatting it into the SQL
        query += " LIMIT ?"
        params.append(limit)
|
||||
|
||||
providers = db.execute(query, params).fetchall()
|
||||
print(f"Providers to enrich: {len(providers)}")
|
||||
|
||||
enriched = 0
|
||||
pricing_found = 0
|
||||
failed = 0
|
||||
|
||||
for i, prov in enumerate(providers):
|
||||
if (i + 1) % 5 == 0 or i == 0:
|
||||
print(f" [{i+1}/{len(providers)}] {prov['title']}")
|
||||
|
||||
result = enrich_provider(prov["id"], prov["website"], db)
|
||||
|
||||
if not result["homepage_fetched"]:
|
||||
failed += 1
|
||||
db.execute(
|
||||
"""UPDATE funeral_brand
|
||||
SET enrichment_status = 'failed', updated_at = datetime('now')
|
||||
WHERE id = ?""",
|
||||
(prov["id"],)
|
||||
)
|
||||
continue
|
||||
|
||||
enriched += 1
|
||||
|
||||
        # Update brand with discovered info (only fill fields that are empty)
        updates = {}
        brand = db.execute("SELECT * FROM funeral_brand WHERE id = ?",
                           (prov["id"],)).fetchone()
        if result["description"] and not brand["description"]:
            updates["description"] = result["description"]

        contact = result["contact_info"]
        if contact.get("email") and not brand["email"]:
            updates["email"] = contact["email"]
        if contact.get("phone") and not brand["phone"]:
            updates["phone"] = contact["phone"]

        if result["has_pricing"]:
            pricing_found += 1
        # Status is "partial" either way: a pricing page still needs AI
        # extraction, and without pricing only the homepage was enriched.
        updates["enrichment_status"] = "partial"
|
||||
|
||||
if updates:
|
||||
set_parts = [f"{k} = ?" for k in updates]
|
||||
values = list(updates.values()) + [prov["id"]]
|
||||
db.execute(
|
||||
f"UPDATE funeral_brand SET {', '.join(set_parts)}, "
|
||||
f"updated_at = datetime('now') WHERE id = ?",
|
||||
values
|
||||
)
|
||||
|
||||
# Store pricing page text for later AI extraction
|
||||
if result["pricing_page_text"]:
|
||||
db.execute(
|
||||
"""INSERT OR REPLACE INTO source_record
|
||||
(source_name, source_id, source_url, raw_data,
|
||||
matched_brand_id, match_type)
|
||||
VALUES ('website_crawl', ?, ?, ?, ?, 'enrichment')""",
|
||||
(
|
||||
f"brand_{prov['id']}",
|
||||
result["pricing_page_url"],
|
||||
json.dumps({
|
||||
"pricing_text": result["pricing_page_text"],
|
||||
"pdf_links": result["pdf_links"],
|
||||
"has_pricing": result["has_pricing"],
|
||||
}),
|
||||
prov["id"],
|
||||
)
|
||||
)
|
||||
|
||||
if (i + 1) % 10 == 0:
|
||||
db.commit()
|
||||
|
||||
time.sleep(CRAWL_DELAY)
|
||||
|
||||
db.commit()
|
||||
print(f"\nDone: {enriched} enriched, {pricing_found} with pricing, {failed} failed")
|
||||
|
||||
db.close()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import sys
|
||||
limit = None
|
||||
state = None
|
||||
|
||||
for arg in sys.argv[1:]:
|
||||
if arg.startswith("--state="):
|
||||
state = arg.split("=")[1]
|
||||
elif arg.startswith("--limit="):
|
||||
limit = int(arg.split("=")[1])
|
||||
else:
|
||||
try:
|
||||
limit = int(arg)
|
||||
except ValueError:
|
||||
pass
|
||||
|
||||
run(limit=limit, state_filter=state)
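The price check in `check_has_pricing()` can be tried on plain text with a small standalone sketch. It reuses the same regular expression and the same 100-dollar threshold, but returns the amounts instead of a boolean. The function name `significant_prices` and the sample sentence in the note below are made up for illustration.

```python
import re

# Same pattern as check_has_pricing(): "$", digits/commas, optional cents.
PRICE_PATTERN = re.compile(r"\$[\d,]+(?:\.\d{2})?")


def significant_prices(text: str, minimum: float = 100.0) -> list[float]:
    """Return dollar amounts in text that are at least `minimum`."""
    amounts = []
    for raw in PRICE_PATTERN.findall(text):
        cleaned = raw.replace("$", "").replace(",", "")
        if not cleaned:
            continue  # the match was only "$" followed by commas
        try:
            amount = float(cleaned)
        except ValueError:
            continue
        if amount >= minimum:
            amounts.append(amount)
    return amounts
```

For a sentence such as "Direct cremation from $2,450. Flowers $45.00, chapel hire $350.50", the sketch keeps 2450.0 and 350.5 and drops the 45.00 amount as too small to be funeral pricing.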
|
||||
199
crawlers/lookup_abn.py
Normal file
@@ -0,0 +1,199 @@
|
||||
"""ABN Lookup module via the Australian Business Register (ABR) API.
|
||||
|
||||
Enriches providers with their ABN (strongest dedup key) and validates
|
||||
that they are active registered businesses.
|
||||
|
||||
The ABR API is FREE. Requires a GUID (authentication token) from:
|
||||
https://abr.business.gov.au/Tools/WebServices
|
||||
|
||||
Configuration:
|
||||
    Set the ABR_GUID env var, or "abr_guid" in config.json.
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import time
|
||||
import urllib.parse
|
||||
import xml.etree.ElementTree as ET
|
||||
|
||||
from base import fetch_url, get_db, CRAWL_DELAY
|
||||
|
||||
# Load ABR GUID from env or config
|
||||
ABR_GUID = os.environ.get("ABR_GUID")
|
||||
if not ABR_GUID:
|
||||
config_path = os.path.join(os.path.dirname(__file__), "config.json")
|
||||
if os.path.exists(config_path):
|
||||
with open(config_path) as f:
|
||||
config = json.load(f)
|
||||
ABR_GUID = config.get("abr_guid")
|
||||
|
||||
ABR_BASE = "https://abr.business.gov.au/abrxmlsearch/AbrXmlSearch.asmx"
|
||||
|
||||
|
||||
def search_by_name(name: str, state: str | None = None,
|
||||
postcode: str | None = None) -> list[dict]:
|
||||
"""Search ABR by business name. Returns matching records."""
|
||||
if not ABR_GUID:
|
||||
print(" WARNING: ABR_GUID not configured. Skipping ABN lookup.")
|
||||
return []
|
||||
|
||||
params = {
|
||||
"name": name,
|
||||
"postcode": postcode or "",
|
||||
"legalName": "Y",
|
||||
"tradingName": "Y",
|
||||
"NSW": "Y", "SA": "Y", "ACT": "Y", "VIC": "Y",
|
||||
"WA": "Y", "NT": "Y", "QLD": "Y", "TAS": "Y",
|
||||
"authenticationGuid": ABR_GUID,
|
||||
}
|
||||
|
||||
# If state specified, only search that state
|
||||
if state:
|
||||
for s in ["NSW", "SA", "ACT", "VIC", "WA", "NT", "QLD", "TAS"]:
|
||||
params[s] = "Y" if s == state else "N"
|
||||
|
||||
url = f"{ABR_BASE}/ABRSearchByNameSimpleProtocol"
|
||||
try:
|
||||
text = fetch_url(url, method="GET", data=params, timeout=15)
|
||||
    except Exception:  # network or HTTP failure: treat as "no results"
|
||||
return []
|
||||
|
||||
# Parse XML response
|
||||
results = []
|
||||
try:
|
||||
root = ET.fromstring(text)
|
||||
# The ABR response uses a default namespace
|
||||
ns = {"abr": "http://abr.business.gov.au/ABRXMLSearch/"}
|
||||
|
||||
for record in root.findall(".//abr:searchResultsRecord", ns):
|
||||
abn_elem = record.find(".//abr:ABN/abr:identifierValue", ns)
|
||||
status_elem = record.find(".//abr:ABN/abr:identifierStatus", ns)
|
||||
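                # Compare with None explicitly: an Element without child
                # elements is falsy, so chaining find() calls with "or"
                # would skip a name element that was actually found.
                name_elem = None
                for path in (
                    ".//abr:mainName/abr:organisationName",
                    ".//abr:mainTradingName/abr:organisationName",
                    ".//abr:businessName/abr:organisationName",
                ):
                    name_elem = record.find(path, ns)
                    if name_elem is not None:
                        break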
|
||||
state_elem = record.find(".//abr:mainBusinessPhysicalAddress/abr:stateCode", ns)
|
||||
postcode_elem = record.find(".//abr:mainBusinessPhysicalAddress/abr:postcode", ns)
|
||||
score_elem = record.find(".//abr:nameScore", ns)
|
||||
|
||||
if abn_elem is not None:
|
||||
results.append({
|
||||
"abn": abn_elem.text,
|
||||
"status": status_elem.text if status_elem is not None else None,
|
||||
"name": name_elem.text if name_elem is not None else None,
|
||||
"state": state_elem.text if state_elem is not None else None,
|
||||
"postcode": postcode_elem.text if postcode_elem is not None else None,
|
||||
"score": int(score_elem.text) if score_elem is not None else 0,
|
||||
})
|
||||
except ET.ParseError:
|
||||
return []
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def find_best_match(name: str, state: str | None = None,
|
||||
postcode: str | None = None) -> dict | None:
|
||||
"""Find the best ABR match for a business name.
|
||||
|
||||
Returns the highest-scoring active match, or None.
|
||||
"""
|
||||
results = search_by_name(name, state, postcode)
|
||||
|
||||
# Filter to active businesses
|
||||
active = [r for r in results if r.get("status") == "Active"]
|
||||
if not active:
|
||||
return None
|
||||
|
||||
# Sort by score descending
|
||||
active.sort(key=lambda r: r.get("score", 0), reverse=True)
|
||||
|
||||
# Return best match if score is reasonable
|
||||
best = active[0]
|
||||
if best.get("score", 0) >= 80:
|
||||
return best
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def run(limit: int | None = None, state_filter: str | None = None):
|
||||
"""Look up ABNs for all providers that don't have one."""
|
||||
db = get_db()
|
||||
|
||||
query = """
|
||||
SELECT id, title, business_state, business_postcode
|
||||
FROM funeral_brand
|
||||
WHERE abn IS NULL AND verified = 0
|
||||
"""
|
||||
params = []
|
||||
|
||||
if state_filter:
|
||||
query += " AND business_state = ?"
|
||||
params.append(state_filter)
|
||||
|
||||
query += " ORDER BY id"
|
||||
|
||||
    if limit:
        # Bind the limit as a parameter instead of formatting it into the SQL
        query += " LIMIT ?"
        params.append(limit)
|
||||
|
||||
providers = db.execute(query, params).fetchall()
|
||||
print(f"Providers without ABN: {len(providers)}")
|
||||
|
||||
if not ABR_GUID:
|
||||
print("ERROR: ABR_GUID not configured.")
|
||||
print(" Register at: https://abr.business.gov.au/Tools/WebServices")
|
||||
print(" Then set ABR_GUID env var or add 'abr_guid' to config.json")
|
||||
return
|
||||
|
||||
found = 0
|
||||
not_found = 0
|
||||
|
||||
for i, prov in enumerate(providers):
|
||||
if (i + 1) % 20 == 0 or i == 0:
|
||||
print(f" [{i+1}/{len(providers)}] {prov['title']}")
|
||||
|
||||
match = find_best_match(
|
||||
prov["title"],
|
||||
prov["business_state"],
|
||||
prov["business_postcode"]
|
||||
)
|
||||
|
||||
if match:
|
||||
db.execute(
|
||||
"UPDATE funeral_brand SET abn = ?, updated_at = datetime('now') WHERE id = ?",
|
||||
(match["abn"], prov["id"])
|
||||
)
|
||||
found += 1
|
||||
else:
|
||||
not_found += 1
|
||||
|
||||
if (i + 1) % 50 == 0:
|
||||
db.commit()
|
||||
|
||||
time.sleep(0.5) # Be gentle with the government API
|
||||
|
||||
db.commit()
|
||||
print(f"\nDone: {found} ABNs found, {not_found} not found")
|
||||
print(f" Success rate: {found/(found+not_found)*100:.1f}%" if found + not_found > 0 else "")
|
||||
|
||||
db.close()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import sys
|
||||
limit = None
|
||||
state = None
|
||||
|
||||
for arg in sys.argv[1:]:
|
||||
if arg.startswith("--state="):
|
||||
state = arg.split("=")[1]
|
||||
elif arg.startswith("--limit="):
|
||||
limit = int(arg.split("=")[1])
|
||||
else:
|
||||
try:
|
||||
limit = int(arg)
|
||||
except ValueError:
|
||||
pass
|
||||
|
||||
run(limit=limit, state_filter=state)
|
||||
111
crawlers/run_overnight.sh
Executable file
@@ -0,0 +1,111 @@
#!/bin/bash
# Full pipeline overnight run
# Usage: ./run_overnight.sh
#
# Before running:
# 1. Add your Serper API key to config.json
# 2. Optionally add your Anthropic API key for AI extraction
#
# This script runs all steps sequentially and logs everything.

# -e aborts on the first failing command; pipefail makes a failing python3
# step count as a failure even though its output is piped through tee.
set -eo pipefail
cd "$(dirname "$0")"

LOG="../logs/overnight_$(date +%Y%m%d_%H%M%S).log"
mkdir -p ../logs

echo "=== OVERNIGHT PIPELINE RUN ===" | tee "$LOG"
echo "Started: $(date)" | tee -a "$LOG"
echo "" | tee -a "$LOG"

# Check config
SERPER_KEY=$(python3 -c "import json; c=json.load(open('config.json')); print(c.get('serper_api_key') or '')")
ANTHROPIC_KEY=$(python3 -c "import json; c=json.load(open('config.json')); print(c.get('anthropic_api_key') or '')")

if [ -z "$SERPER_KEY" ]; then
    echo "WARNING: No Serper API key — website discovery will use DDG (slower, lower hit rate)" | tee -a "$LOG"
else
    echo "Serper API key: configured" | tee -a "$LOG"
fi

if [ -z "$ANTHROPIC_KEY" ]; then
    echo "WARNING: No Anthropic API key — AI extraction will be skipped" | tee -a "$LOG"
else
    echo "Anthropic API key: configured" | tee -a "$LOG"
fi
echo "" | tee -a "$LOG"

# Step 1: Source crawlers
echo "=== STEP 1: Source Crawlers ===" | tee -a "$LOG"
echo "[$(date +%H:%M:%S)] Running VIC Register crawler..." | tee -a "$LOG"
python3 crawl_vic_register.py 2>&1 | tee -a "$LOG"

echo "[$(date +%H:%M:%S)] Running Funerals Australia crawler..." | tee -a "$LOG"
python3 crawl_funerals_australia.py 2>&1 | tee -a "$LOG"

echo "[$(date +%H:%M:%S)] Running NFDA crawler..." | tee -a "$LOG"
python3 crawl_nfda.py 2>&1 | tee -a "$LOG"

# Step 2: Deduplication
echo "" | tee -a "$LOG"
echo "=== STEP 2: Deduplication ===" | tee -a "$LOG"
echo "[$(date +%H:%M:%S)] Running dedup..." | tee -a "$LOG"
python3 dedup.py 2>&1 | tee -a "$LOG"

# Step 3: Website discovery (all providers without one)
echo "" | tee -a "$LOG"
echo "=== STEP 3: Website Discovery ===" | tee -a "$LOG"
NEED_WEBSITE=$(python3 -c "from base import get_db; db=get_db(); print(db.execute('SELECT COUNT(*) FROM funeral_brand WHERE website IS NULL AND verified=0').fetchone()[0])")
echo "[$(date +%H:%M:%S)] Providers needing websites: $NEED_WEBSITE" | tee -a "$LOG"

# Process in batches of 200 to avoid issues
BATCH=200
OFFSET=0
while [ "$OFFSET" -lt "$NEED_WEBSITE" ]; do
    REMAINING=$((NEED_WEBSITE - OFFSET))
    CURRENT=$((REMAINING < BATCH ? REMAINING : BATCH))
    echo "[$(date +%H:%M:%S)] Discovering websites batch $((OFFSET/BATCH + 1)) ($CURRENT providers)..." | tee -a "$LOG"
    python3 discover_websites.py --limit="$CURRENT" 2>&1 | tee -a "$LOG"
    OFFSET=$((OFFSET + BATCH))
    # Brief pause between batches
    sleep 5
done

# Step 4: Website enrichment (all with website, not yet enriched)
echo "" | tee -a "$LOG"
echo "=== STEP 4: Website Enrichment ===" | tee -a "$LOG"
# 'pending' is single-quoted so SQLite always reads it as a string literal, never as an identifier
NEED_ENRICH=$(python3 -c "from base import get_db; db=get_db(); print(db.execute(\"SELECT COUNT(*) FROM funeral_brand WHERE website IS NOT NULL AND enrichment_status='pending' AND verified=0\").fetchone()[0])")
echo "[$(date +%H:%M:%S)] Providers needing enrichment: $NEED_ENRICH" | tee -a "$LOG"
python3 enrich_websites.py --limit="$NEED_ENRICH" 2>&1 | tee -a "$LOG"

# Step 5: Compute tiers
echo "" | tee -a "$LOG"
echo "=== STEP 5: Compute Tiers ===" | tee -a "$LOG"
python3 compute_tiers.py 2>&1 | tee -a "$LOG"

# Final summary
echo "" | tee -a "$LOG"
echo "=== FINAL SUMMARY ===" | tee -a "$LOG"
# Quoted heredoc: no shell escaping needed, and no backslashes inside
# f-string expressions (a SyntaxError before Python 3.12)
python3 - <<'PY' 2>&1 | tee -a "$LOG"
from base import get_db

db = get_db()


def count(sql):
    """Run a COUNT(*) query and return the number."""
    return db.execute(sql).fetchone()[0]


crawled = "SELECT COUNT(*) FROM source_record WHERE source_name='website_crawl'"
with_pricing = crawled + " AND json_extract(raw_data, '$.has_pricing')=1"
with_pdfs = crawled + " AND json_extract(raw_data, '$.pdf_links') != '[]'"

print('Database Status:')
print(f" Total providers: {count('SELECT COUNT(*) FROM funeral_brand')}")
print(f" With phone: {count('SELECT COUNT(*) FROM funeral_brand WHERE phone IS NOT NULL')}")
print(f" With email: {count('SELECT COUNT(*) FROM funeral_brand WHERE email IS NOT NULL')}")
print(f" With website: {count('SELECT COUNT(*) FROM funeral_brand WHERE website IS NOT NULL')}")
print(f" With description: {count('SELECT COUNT(*) FROM funeral_brand WHERE description IS NOT NULL')}")
print()
print('Listing Tiers:')
for row in db.execute('SELECT listing_tier, COUNT(*) as n FROM funeral_brand GROUP BY listing_tier ORDER BY n DESC'):
    print(f' {row[0]:12s} {row[1]:>6d}')
print()
print('Pricing Pages:')
print(f" Total crawled: {count(crawled)}")
print(f" With pricing: {count(with_pricing)}")
print(f" With PDF links: {count(with_pdfs)}")
PY

echo "" | tee -a "$LOG"
echo "Finished: $(date)" | tee -a "$LOG"
echo "Log saved to: $LOG"
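
The batching arithmetic in Step 3 of the script (`BATCH`, `OFFSET`, `REMAINING`, `CURRENT`) can be checked in isolation. The following is a small sketch in Python, offered only as an illustration: `plan_batches` is a hypothetical name, and it mirrors the loop's size calculation without calling `discover_websites.py`.

```python
def plan_batches(total, batch_size=200):
    """Return the list of batch sizes the Step 3 loop would request.

    Mirrors the shell logic: each pass takes min(batch_size, remaining)
    and then advances the offset by a full batch_size.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    sizes = []
    offset = 0
    while offset < total:
        remaining = total - offset
        sizes.append(min(batch_size, remaining))
        offset += batch_size
    return sizes


example = plan_batches(450)
print(example)
```

Note that this only models how many providers each pass asks for. Whether consecutive passes pick up different providers depends on how `discover_websites.py` selects rows, which is not shown in this file.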