Initial commit: funeral provider discovery pipeline
- Python crawlers for VIC Register, Funerals Australia, NFDA
- n8n workflows for scheduled discovery and enrichment
- SQLite schema and seeded dev database (1,463 providers)
- End-to-end process documentation in n8n/PROCESS.md
25
.gitignore
vendored
Normal file
@@ -0,0 +1,25 @@
# Secrets — never commit
crawlers/config.json
.env
.env.*

# Python
__pycache__/
*.pyc
*.pyo
*.pyd
.venv/
venv/

# Logs
logs/
*.log

# OS / editor
.DS_Store
.idea/
.vscode/

# n8n local state (if anyone docker-composes in-repo)
n8n/.n8n/
n8n/n8n_data/
35
CLAUDE.md
Normal file
@@ -0,0 +1,35 @@
# Claude Code orientation

You've been handed a funeral-provider discovery pipeline. Before doing anything:

1. Read `README.md` for the repo layout.
2. Read `n8n/PROCESS.md` for the end-to-end flow and how data conforms to the DB schema. **This is the authoritative doc.**
3. Read `crawlers/PIPELINE.md` for Python module internals.

## Project shape

- `crawlers/` — Python modules, one per data source. Invoked either by `run_overnight.sh` (manual) or by n8n workflows via `executeCommand`.
- `n8n/workflows/*.json` — four scheduled workflows that drive the pipeline end-to-end.
- `database/providers.db` — live SQLite snapshot (~1,463 providers, 121 with pricing). Safe to inspect; re-creatable from `schema_sqlite.sql`.

## Key constraints

- **Never write to `funeral_brand.verified` or `funeral_brand.hidden`** — those are admin-only. The pipeline keeps providers hidden and unverified until a human reviews them.
- **Do not use Gathered Here data as a source of truth.** It's a competitor. `crawl_gathered_here.py` exists as historical tooling but isn't part of the active pipeline — all enrichment comes from providers' own websites or regulatory disclosure PDFs.
- **Listing tier is computed, not stored as the source of truth.** `compute_tiers.py` derives it from package/inclusion data. Don't set it manually.

## Running locally

You'll need a Serper API key (free 2,500/mo at serper.dev) to do website discovery. Everything else can run without keys, though AI pricing extraction in Workflow 3 needs an Anthropic key.

```
cp crawlers/config.example.json crawlers/config.json
# add keys to config.json
cd crawlers && ./run_overnight.sh
```

## Things that aren't here

- No live secrets / API keys — `crawlers/config.json` is gitignored, use `config.example.json` as a template.
- No admin review UI — that's a separate frontend project.
- No Postgres migration tooling — `database/schema.sql` is the target, but the repo uses SQLite for dev.
98
README.md
Normal file
@@ -0,0 +1,98 @@
# Provider Crawl

Automated pipeline for discovering Australian funeral directors from public
registries, finding their websites, and extracting pricing data. Feeds the
Funeral Arranger platform with a seed of unverified providers that an admin
reviews before publishing.

## What's in this repo

```
crawlers/    Python modules — one per data source + shared pipeline utilities
n8n/         n8n workflows that orchestrate the crawlers on a schedule
database/    SQLite + Postgres schemas, a seed dump, and a pre-populated dev DB
```

Three documents explain how it works, in increasing depth:

1. **[n8n/PROCESS.md](n8n/PROCESS.md)** — Plain-English end-to-end walkthrough
   of the four workflows and how their output maps to database tables.
   **Start here.**
2. **[crawlers/PIPELINE.md](crawlers/PIPELINE.md)** — Architecture of the
   Python modules, source-by-source notes, listing-tier logic.
3. **[n8n/README.md](n8n/README.md)** — How to stand up n8n locally with
   Docker and import the workflow JSONs.

## Quick start (local, no n8n)

```bash
# 1. Install Python deps (requests, beautifulsoup4, rapidfuzz, pdfplumber)
python3 -m venv .venv && source .venv/bin/activate
pip install requests beautifulsoup4 rapidfuzz pdfplumber

# 2. Configure API keys
cp crawlers/config.example.json crawlers/config.json
# Edit crawlers/config.json and add your Serper key (free 2,500/month from serper.dev)
# Optionally add an Anthropic API key for AI pricing extraction

# 3. Inspect the included dev database (1,463 providers, 121 with pricing)
sqlite3 database/providers.db ".tables"
sqlite3 database/providers.db "SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier"

# 4. Or start fresh and run the pipeline end-to-end
rm database/providers.db
sqlite3 database/providers.db < database/schema_sqlite.sql
cd crawlers
./run_overnight.sh        # equivalent to workflows 1–3 back to back
```

## API keys needed

| Service | Purpose | Cost | Where |
|---|---|---|---|
| Serper.dev | Website discovery via Google search | 2,500 free/mo | https://serper.dev |
| ABR (Australian Business Register) | ABN validation | Free (GUID required) | https://abr.business.gov.au/Tools/WebServices |
| Anthropic | AI pricing extraction (Claude Haiku) | ~$2 / full run | https://console.anthropic.com |

Only Serper is required to run the full pipeline. Anthropic is only needed for
the AI extraction step in Workflow 3 (can be skipped — Python crawl still
populates pricing text into `source_record` for manual review).
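
The key requirements above could be turned into a small pre-flight check. The sketch below is illustrative only: the key names come from `crawlers/config.example.json`, but the function name `available_steps` and the step labels are hypothetical and not part of the repo.

```python
def available_steps(config: dict) -> dict:
    """Report which key-dependent steps a config dict would enable.

    Key names match crawlers/config.example.json; the step labels are
    made up for this example.
    """
    def has(key: str) -> bool:
        # A null or empty value in config.json counts as "not configured".
        return bool(config.get(key))

    return {
        "serper_website_discovery": has("serper_api_key"),
        "abn_lookup": has("abr_guid"),
        "ai_pricing_extraction": has("anthropic_api_key"),
    }
```

With only a Serper key set, `available_steps` marks the Serper step as enabled and the other two as disabled, which matches the "only Serper is required" note.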

## Database

The included `database/providers.db` is a live SQLite snapshot from the last
overnight run: 1,463 unique providers, 994 with websites, 121 with pricing.

Schema files:

- `database/schema_sqlite.sql` — SQLite schema, used for local dev
- `database/schema.sql` — Postgres schema (production target)
- `database/seed_verified.sql` — seed data for partner (`verified = 1`) providers
- `database/PROVIDER-SCHEMA-SPEC.md` — schema commentary and design notes
- `database/IMAGE-MAPPING.md` — asset conventions for verified providers

Per `n8n/PROCESS.md`, the pipeline only writes to a defined subset of columns —
`verified`, `hidden`, branding, and `funeral_home_id` are admin-only.
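
One way to enforce the admin-only rule in code is a guard that runs before any `funeral_brand` update. This is a minimal sketch under assumptions: the helper name `assert_pipeline_writable` is hypothetical, and the branding columns are left out because this README does not enumerate them.

```python
# Columns named as admin-only above. The branding columns are not listed
# in this README, so they are intentionally omitted from the sketch.
ADMIN_ONLY_COLUMNS = frozenset({"verified", "hidden", "funeral_home_id"})


def assert_pipeline_writable(updates: dict) -> dict:
    """Return `updates` unchanged, or raise if it touches an admin-only column."""
    blocked = sorted(ADMIN_ONLY_COLUMNS.intersection(updates))
    if blocked:
        raise ValueError(f"pipeline may not write admin-only columns: {blocked}")
    return updates
```

Calling the guard at the point where a crawler builds its column/value dict makes the violation fail immediately, before any SQL is issued.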

## Status

What's built and working:

- All three source crawlers (VIC Register, Funerals Australia, NFDA)
- Cross-source deduplication with fuzzy name/postcode/ABN matching
- Website discovery via Serper + DDG + URL guessing
- ABN lookup via ABR
- Website enrichment: meta descriptions, pricing page discovery, PDF detection
- Listing tier computation
- Four n8n workflow JSONs ready to import

What's not yet built (open to pick up):

- Google Places API integration for richer location data (ratings, place_id)
- Playwright-based enrichment for JS-rendered sites (~37% of sites)
- Admin review UI for approving hidden providers before they go live
- Stale package cleanup in the monthly refresh

## Contact

Richie — richie@tensordesign.com.au
215
crawlers/PIPELINE.md
Normal file
@@ -0,0 +1,215 @@
# Provider Discovery & Enrichment Pipeline

## Architecture: Multi-Step Enrichment

The pipeline builds provider profiles progressively, never relying on
competitor data. Each step adds richer detail from more authoritative sources.

```
STEP 1: DISCOVER                 STEP 2: FIND WEBSITE          STEP 3: ENRICH
─────────────────                ────────────────────          ──────────────

VIC Register ─────┐                                            ┌─ Fetch homepage
NFDA Directory ───┼─▶ Basic      Google Places API ──┐         │  Find /pricing page
Funerals AU ──────┘   Provider   ABN Lookup ─────────┼─▶ URL ──┤  Download PDFs
                      Record     Search engines ─────┘         │  AI extract packages
                                                               └─▶ Structured data
name                             website URL                   description
address                          Google rating                 packages[]
phone                            Google reviews                inclusions[]
email                            place_id                      pricing
state                            ABN (validated)
```

## Step 1: Discovery (DONE — all modules built and tested)

Sources:
- VIC Consumer Affairs Register (796 records, VIC only) → `crawl_vic_register.py`
- Funerals Australia AJAX API (997 records, national) → `crawl_funerals_australia.py`
- NFDA WPSL API (209 records, national) → `crawl_nfda.py`

Orchestrator: `crawl_all.py`
Deduplication: `dedup.py` (fuzzy name + postcode + ABN matching)

Output: ~1,463 unique providers with basic contact info.
Stored in: funeral_brand + location tables in `database/providers.db`.
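
To make "fuzzy name + postcode + ABN matching" concrete, here is a minimal sketch of one possible matching rule. It is not the logic in `dedup.py`, which is not shown in this document. It uses the standard library's `difflib` rather than `rapidfuzz` so that it runs without extra dependencies, and the function names and the 0.85 threshold are assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    name = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return re.sub(r"\s+", " ", name).strip()


def is_same_provider(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Decide whether two records describe the same provider.

    Assumed rule: when both records carry an ABN, the ABN decides.
    Otherwise the postcodes must agree and the names must be similar.
    """
    if a.get("abn") and b.get("abn"):
        return a["abn"] == b["abn"]
    if not a.get("postcode") or a.get("postcode") != b.get("postcode"):
        return False
    ratio = SequenceMatcher(
        None, normalize_name(a["name"]), normalize_name(b["name"])
    ).ratio()
    return ratio >= threshold
```

Treating the ABN as decisive reflects the note in section 2d that it is the strongest dedup key. The name comparison only matters for records that lack one.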

## Step 2: Website Discovery (DONE — module built and tested)

Module: `discover_websites.py`
Test result: 50% success rate on initial batch (DDG search + URL guessing)
Can be improved with Google Places API for higher hit rate.

For each provider that lacks a website URL:

### 2a. Serper.dev — Google search API (PRIMARY)
- Input: "{business name} {suburb} {state}"
- Returns: Google organic search results as JSON (title, link, snippet)
- Cost: **2,500 free queries** (no CC needed), then $1/1K
- Covers our entire 1,463 providers for $0
- Filters out directories/aggregators, validates first result
- Module: `discover_websites.py` with `search_serper()`

### 2b. DuckDuckGo lite (FALLBACK)
- Free, no API key, but aggressive rate limiting
- Used when Serper key not configured or quota exhausted
- Module: `discover_websites.py` with `search_ddg()`

### 2c. URL pattern guessing (SUPPLEMENTARY)
- Generates candidate domains from business name (e.g. smithfunerals.com.au)
- HTTP HEAD to check if live, then validate content
- Module: `discover_websites.py` with `guess_urls()`
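
As a rough illustration of candidate generation, the sketch below derives domains from a business name. It is not the repo's `guess_urls()`: the name `guess_domains`, the TLD order, and the dropped words are all assumptions. The HTTP HEAD and content checks are left out.

```python
import re


def guess_domains(
    business_name: str,
    tlds: tuple[str, ...] = (".com.au", ".au", ".com"),
) -> list[str]:
    """Generate candidate domains from a business name, most likely first."""
    cleaned = business_name.lower().replace("'", "")
    words = re.findall(r"[a-z0-9]+", cleaned)
    # Drop words that rarely appear in a domain name (an assumed list).
    words = [w for w in words if w not in {"pty", "ltd", "the"}]
    if not words:
        return []

    candidates: list[str] = []
    for stem in ("".join(words), "-".join(words)):
        for tld in tlds:
            domain = stem + tld
            if domain not in candidates:  # single-word names yield duplicate stems
                candidates.append(domain)
    return candidates
```

For "Smith Funerals Pty Ltd" the first candidate is `smithfunerals.com.au`, matching the example in the bullet above.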

### 2d. ABN Lookup — Australian Business Register (ENRICHMENT)
- Input: business name + state
- Returns: ABN, entity status, registered state/postcode
- Cost: **FREE** (government API, requires GUID registration)
- Validates business is active, gives strongest dedup key
- Does NOT return website URLs
- Module: `lookup_abn.py`
- Register for GUID: https://abr.business.gov.au/Tools/WebServices
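
Because the ABN doubles as a dedup key, it can be worth rejecting malformed values locally before storing or comparing them. The sketch below applies the standard ABN checksum (subtract 1 from the first digit, apply the fixed weights, test divisibility by 89). Nothing in this document says `lookup_abn.py` does this; the function name is hypothetical, and the check only validates the format, not whether the business is active.

```python
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def is_valid_abn_format(abn: str) -> bool:
    """Check the 11-digit ABN checksum. Spaces are ignored."""
    compact = abn.replace(" ", "")
    if len(compact) != 11 or not (compact.isascii() and compact.isdigit()):
        return False
    digits = [int(c) for c in compact]
    digits[0] -= 1  # the algorithm subtracts 1 from the leading digit
    total = sum(d * w for d, w in zip(digits, ABN_WEIGHTS))
    return total % 89 == 0
```

A value that fails this check can be dropped before it is used for matching. A value that passes still needs the ABR lookup to confirm entity status.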

### 2e. Google Places API (OPTIONAL PREMIUM)
- Input: "{business name}, {suburb} {state}"
- Returns: website, rating, review count, place_id, formatted phone
- Cost: 1,000 free/month (Enterprise tier), then ~$25/1K
- Best data quality but most expensive
- Not yet implemented — add when budget allows

### 2f. URL validation
- Fetch discovered URL, verify it loads
- Check page title/content mentions the business name
- Reject generic directories (yellowpages, truelocal, etc.)
- Mark confidence level: confirmed / probable / unverified
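
The document names the three confidence levels but not the rule behind them. The sketch below shows one plausible rule, purely as an assumption: `confirmed` when every significant word of the business name appears in the page title, `probable` when the words appear only in the body text, `unverified` otherwise. The function name and the ignored-word list are hypothetical.

```python
import re

# Words ignored when comparing names (an assumed list).
IGNORED_WORDS = {"pty", "ltd", "the", "and"}


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify_confidence(business_name: str, page_title: str, page_text: str) -> str:
    """Label a fetched page as confirmed, probable or unverified."""
    name_tokens = _tokens(business_name) - IGNORED_WORDS
    if not name_tokens:
        return "unverified"
    if name_tokens <= _tokens(page_title):
        return "confirmed"
    if name_tokens <= _tokens(page_text):
        return "probable"
    return "unverified"
```

Token-subset matching tolerates extra words on the page, so a title like "Smith Family Funerals" still matches "Smith Funerals". It can therefore produce false positives for very short names.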

## Step 3: Website Enrichment (DONE — module built and tested)

Module: `enrich_websites.py`
- Finds pricing pages via 20+ URL patterns + link following
- Extracts description from meta tags
- Extracts contact info (phone, email, address)
- Stores cleaned pricing page text for AI extraction
- Detects PDF links for PDF-based pricing extraction

For each provider with a confirmed website:

### 3a. Homepage crawl
- Fetch homepage HTML
- Extract: description/about text, contact details
- Look for links to pricing/services pages

### 3b. Pricing page discovery
Try common URL patterns:
  /pricing, /prices, /packages, /services, /our-services,
  /funeral-costs, /funeral-packages, /service-options,
  /price-list, /transparency

Also:
- Parse sitemap.xml if available
- Follow links containing "pric", "packag", "cost", "service"
- Check for PDF links on pricing pages
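
The link-following rule above can be sketched with the standard library's `html.parser`. This is not the code in `enrich_websites.py`: the class and function names are hypothetical, and as a simplification it only reports PDF links whose URL or anchor text matches one of the keywords.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Substrings taken from the "Follow links containing ..." bullet above.
KEYWORDS = ("pric", "packag", "cost", "service")


class _LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs for every <a> element."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[tuple[str, str]] = []
        self._href: str | None = None
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_pricing_links(html: str, base_url: str) -> dict:
    """Split keyword-matching links into HTML pages and PDFs."""
    parser = _LinkCollector()
    parser.feed(html)
    pages: list[str] = []
    pdfs: list[str] = []
    for href, text in parser.links:
        if not any(k in (href + " " + text).lower() for k in KEYWORDS):
            continue
        url = urljoin(base_url, href)
        is_pdf = href.lower().split("?")[0].endswith(".pdf")
        target = pdfs if is_pdf else pages
        if url not in target:
            target.append(url)
    return {"pages": pages, "pdfs": pdfs}
```

Matching on both the URL and the anchor text catches links such as `/info` labelled "Our prices", which a URL-only match would miss.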

### 3c. AI extraction (Claude Haiku)
- Send pricing page HTML to Haiku
- Extract: package names, funeral types, prices, inclusions
- Map to known inclusion types where possible
- Return confidence score

### 3d. PDF extraction (for InvoCare-type sites)
- Download compliance PDFs
- Extract text (pdftotext or similar)
- Send to Haiku for structured extraction
- ~25% of sites are PDF-only for pricing

## Listing Tiers

Providers are assigned a `listing_tier` based on data quality. Computed
automatically by `compute_tiers.py` after each enrichment run.

| Tier | Label | Criteria | Display |
|------|-------|----------|---------|
| `verified` | Full partner | `verified = true` (signed up) | Full branding, packages, arrangements |
| `priced` | Full pricing | 2+ packages with itemized inclusion prices | Package comparison, line-item detail |
| `estimated` | Some pricing | At least 1 package with a total price | Package prices shown, "Contact for details" on breakdowns |
| `listed` | Contact only | Name + location + phone, no pricing | "Contact for pricing" CTA, upgrade prompt |

Each tier below `verified` motivates the provider to sign up:
- `listed` → "Publish your pricing to attract more families"
- `estimated` → "Add detailed breakdowns to stand out"
- `priced` → "Sign up to enable online arrangements"
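
The table implies a precedence order: `verified` wins outright, then `priced`, then `estimated`, with `listed` as the fallback. A simplified sketch of that ordering follows. It takes pre-computed counts as inputs, whereas `compute_tiers.py` derives them from the `package` and `package_inclusion` tables. The function and parameter names are illustrative.

```python
def listing_tier(verified: bool, itemized_packages: int, priced_packages: int) -> str:
    """Pick a tier from pre-computed package counts, highest tier first."""
    if verified:
        return "verified"
    if itemized_packages >= 2:   # 2+ packages with itemized inclusion prices
        return "priced"
    if priced_packages >= 1:     # at least 1 package with a price
        return "estimated"
    return "listed"
```

Checking `verified` first means a signed-up partner keeps the top tier even if no packages have been loaded for them yet.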

## Enrichment Status Flow

```
pending ──▶ website_found ──▶ partial ──▶ complete
   │              │              │
   └──▶ no_website_found      failed (retry later)
```

## N8N Workflow Design

### Workflow 1: Weekly Discovery
Cron → Run all source crawlers → Dedup into DB → Queue new providers

### Workflow 2: Daily Website Discovery
Cron → Fetch providers with no website → Google Places lookup (planned, see 2e)
→ ABN lookup → Search fallback → Update DB

### Workflow 3: Daily Enrichment
Cron → Fetch providers with website but no packages
→ Crawl website → AI extract → Update DB

### Workflow 4: Monthly Re-check
Cron → Re-crawl enriched providers → Update pricing if changed

---

## Module Inventory

| Module | Purpose | N8N Workflow |
|--------|---------|-------------|
| `base.py` | Shared HTTP, DB, normalization utils | Used by all |
| `crawl_vic_register.py` | VIC government register (796 records) | Workflow 1 |
| `crawl_funerals_australia.py` | Funerals Australia API (997 records) | Workflow 1 |
| `crawl_nfda.py` | NFDA directory API (209 records) | Workflow 1 |
| `crawl_all.py` | Orchestrates all source crawlers | Workflow 1 |
| `dedup.py` | Cross-source dedup & merge engine | Workflow 1 |
| `discover_websites.py` | Find provider websites (Serper/DDG/guess) | Workflow 2 |
| `lookup_abn.py` | ABN validation via ABR API (free) | Workflow 2 |
| `enrich_websites.py` | Crawl provider sites, find pricing pages | Workflow 3 |
| `compute_tiers.py` | Compute listing_tier from data quality | After enrichment |
| `config.example.json` | API key template | — |

## API Keys Required

| Service | Key | Cost | Register |
|---------|-----|------|----------|
| Serper.dev | `serper_api_key` | 2,500 free, then $1/1K | https://serper.dev |
| ABR (ABN Lookup) | `abr_guid` | Free | https://abr.business.gov.au/Tools/WebServices |
| Anthropic (Haiku) | `anthropic_api_key` | ~$2/full run | https://console.anthropic.com |

## Quick Start

```bash
# 1. Configure API keys
cp config.example.json config.json
# Edit config.json with your keys

# 2. Reset database
cd ../database
sqlite3 providers.db < schema_sqlite.sql

# 3. Run full discovery pipeline
cd ../crawlers
python3 crawl_all.py               # Step 1: Discover from registries
python3 dedup.py                   # Deduplicate across sources
python3 lookup_abn.py              # Step 2d: Get ABNs (free)
python3 discover_websites.py       # Steps 2a-2c: Find websites
python3 enrich_websites.py         # Step 3: Crawl for pricing
python3 compute_tiers.py           # Assign listing tiers

# Test mode (limited records)
python3 crawl_all.py --test
python3 discover_websites.py --limit=10 --state=VIC
python3 enrich_websites.py --limit=5
```
164
crawlers/base.py
Normal file
@@ -0,0 +1,164 @@
"""Base crawler module with shared utilities."""

import gzip
import io
import json
import time
import sqlite3
import urllib.request
import urllib.parse
import urllib.error
from datetime import datetime, timezone
from pathlib import Path

DB_PATH = Path(__file__).parent.parent / "database" / "providers.db"
CRAWL_DELAY = 1.0  # seconds between requests (courtesy)

USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
)


def fetch_url(url: str, method: str = "GET", data: dict | None = None,
              headers: dict | None = None, timeout: int = 30) -> str:
    """Fetch a URL and return the response body as text."""
    hdrs = {"User-Agent": USER_AGENT}
    if headers:
        hdrs.update(headers)

    body = None
    if data and method == "POST":
        body = urllib.parse.urlencode(data, doseq=True).encode("utf-8")
        hdrs.setdefault("Content-Type", "application/x-www-form-urlencoded")
    elif data and method == "GET":
        url = url + "?" + urllib.parse.urlencode(data, doseq=True)

    req = urllib.request.Request(url, data=body, headers=hdrs, method=method)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        raw = resp.read()
        # Handle gzip-compressed responses
        if resp.headers.get("Content-Encoding") == "gzip" or raw[:2] == b"\x1f\x8b":
            raw = gzip.decompress(raw)
        charset = resp.headers.get_content_charset() or "utf-8"
        return raw.decode(charset)


def fetch_json(url: str, method: str = "GET", data: dict | None = None,
               headers: dict | None = None) -> dict:
    """Fetch a URL and parse the response as JSON."""
    text = fetch_url(url, method=method, data=data, headers=headers)
    return json.loads(text)


def get_db() -> sqlite3.Connection:
    """Get a connection to the SQLite database."""
    conn = sqlite3.connect(str(DB_PATH))
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA foreign_keys=ON")
    conn.row_factory = sqlite3.Row
    return conn


def start_crawl_log(db: sqlite3.Connection, source_name: str) -> int:
    """Create a source_log entry and return its ID."""
    cur = db.execute(
        "INSERT INTO source_log (source_name) VALUES (?)",
        (source_name,)
    )
    db.commit()
    return cur.lastrowid


def finish_crawl_log(db: sqlite3.Connection, log_id: int,
                     found: int, new: int, updated: int, skipped: int,
                     status: str = "completed", error: str | None = None):
    """Update a source_log entry with results."""
    db.execute(
        """UPDATE source_log
           SET run_finished_at = datetime('now'),
               records_found = ?, records_new = ?,
               records_updated = ?, records_skipped = ?,
               status = ?, error_message = ?
           WHERE id = ?""",
        (found, new, updated, skipped, status, error, log_id)
    )
    db.commit()


def store_source_record(db: sqlite3.Connection, source_name: str,
                        source_id: str, source_url: str | None,
                        raw_data: dict, log_id: int) -> int | None:
    """Store a raw source record. Returns the row ID, or None if duplicate."""
    try:
        cur = db.execute(
            """INSERT INTO source_record
               (source_name, source_id, source_url, raw_data, log_id)
               VALUES (?, ?, ?, ?, ?)""",
            (source_name, source_id, source_url, json.dumps(raw_data), log_id)
        )
        db.commit()
        return cur.lastrowid
    except sqlite3.IntegrityError:
        # Duplicate source_name + source_id — already have this record
        return None


def normalize_phone(phone: str | None) -> str | None:
    """Basic phone normalization."""
    if not phone:
        return None
    # Remove common noise
    phone = phone.strip().replace("\xa0", " ")
    # If multiple numbers, take the first
    for sep in [";", "/", "|", ","]:
        if sep in phone:
            phone = phone.split(sep)[0].strip()
    return phone or None


def normalize_state(state: str | None) -> str | None:
    """Normalize Australian state names to abbreviations."""
    if not state:
        return None
    state = state.strip().upper()
    mapping = {
        "NEW SOUTH WALES": "NSW",
        "VICTORIA": "VIC",
        "QUEENSLAND": "QLD",
        "SOUTH AUSTRALIA": "SA",
        "WESTERN AUSTRALIA": "WA",
        "TASMANIA": "TAS",
        "NORTHERN TERRITORY": "NT",
        "AUSTRALIAN CAPITAL TERRITORY": "ACT",
        "AUSTRALIA CAPITAL TERRITORY": "ACT",
    }
    result = mapping.get(state, state)
    # Only return valid Australian states
    valid = {"NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT"}
    return result if result in valid else None


def generate_slug(name: str) -> str:
    """Generate a URL-safe slug from a business name."""
    import re
    slug = name.lower().strip()
    slug = re.sub(r"['’`]", "", slug)         # remove straight/curly apostrophes and backticks
    slug = re.sub(r"[^a-z0-9]+", "-", slug)  # non-alphanum -> hyphen
    slug = slug.strip("-")
    return slug


def to_intermediate(source: str, source_id: str, source_url: str | None,
                    business: dict, locations: list[dict],
                    packages: list[dict] | None = None) -> dict:
    """Build the normalized intermediate format record."""
    return {
        "source": source,
        "sourceId": source_id,
        "sourceUrl": source_url,
        "scrapedAt": datetime.now(timezone.utc).isoformat(),
        "business": business,
        "locations": locations,
        "packages": packages or [],
    }
102
crawlers/compute_tiers.py
Normal file
@@ -0,0 +1,102 @@
"""Compute listing_tier for all providers based on their data quality.

Tier logic:
  verified  — brand.verified = true (signed up to platform)
  priced    — has 2+ packages, each with 2+ inclusions priced > 0
  estimated — has at least one package with a priced inclusion, or a
              package with no inclusion rows recorded
  listed    — everything else (contact info only)

Run this after enrichment to update tiers across the board.
"""

from base import get_db


def compute_tier(db, brand_id: int, verified: bool) -> str:
    """Compute the listing tier for a single brand."""
    if verified:
        return "verified"

    # Check packages
    packages = db.execute(
        "SELECT id, title, funeral_type FROM package WHERE brand_id = ?",
        (brand_id,)
    ).fetchall()

    if not packages:
        return "listed"

    # Count packages that have a meaningful total price
    # A package's price = sum of non-optional, non-complimentary inclusions
    packages_with_price = 0
    packages_with_itemized = 0

    for pkg in packages:
        inclusions = db.execute(
            """SELECT price, optional, complimentary
               FROM package_inclusion
               WHERE package_id = ?""",
            (pkg["id"],)
        ).fetchall()

        if inclusions:
            # Has itemized inclusions with prices
            priced_inclusions = [
                i for i in inclusions
                if i["price"] and float(i["price"]) > 0
            ]
            if len(priced_inclusions) >= 2:
                packages_with_itemized += 1
                packages_with_price += 1
            elif len(priced_inclusions) >= 1:
                packages_with_price += 1
        else:
            # Package exists but no inclusions — check if we stored a total
            # price in the package description or via source data
            # For now, a package with a funeral_type means we at least know
            # what kind of service it is, even without breakdown
            packages_with_price += 1

    # Tier 2 (priced): 2+ packages with itemized breakdowns
    if packages_with_itemized >= 2:
        return "priced"

    # Tier 3 (estimated): at least one package with some price
    if packages_with_price >= 1:
        return "estimated"

    return "listed"


def run():
    """Recompute listing_tier for all brands."""
    db = get_db()

    brands = db.execute(
        "SELECT id, verified FROM funeral_brand"
    ).fetchall()

    counts = {"verified": 0, "priced": 0, "estimated": 0, "listed": 0}

    for brand in brands:
        tier = compute_tier(db, brand["id"], brand["verified"])
        db.execute(
            "UPDATE funeral_brand SET listing_tier = ? WHERE id = ?",
            (tier, brand["id"])
        )
        counts[tier] += 1

    db.commit()

    print("Listing Tier Distribution:")
    print(f" verified: {counts['verified']:>6d} (signed-up partners)")
    print(f" priced: {counts['priced']:>6d} (full package breakdowns)")
    print(f" estimated: {counts['estimated']:>6d} (some pricing info)")
    print(f" listed: {counts['listed']:>6d} (contact info only)")
    print(f" TOTAL: {sum(counts.values()):>6d}")

    db.close()


if __name__ == "__main__":
    run()
5
crawlers/config.example.json
Normal file
@@ -0,0 +1,5 @@
{
  "serper_api_key": null,
  "abr_guid": null,
  "anthropic_api_key": null
}
70
crawlers/crawl_all.py
Normal file
@@ -0,0 +1,70 @@
"""Run all source crawlers and then deduplicate into the provider database."""

import sys
import time
from pathlib import Path

from base import get_db


def run_all(gathered_here_limit: int | None = None):
    """Run all crawlers sequentially."""
    print("=" * 60)
    print("PROVIDER DISCOVERY PIPELINE")
    print("=" * 60)

    # Import crawlers
    import crawl_nfda
    import crawl_funerals_australia
    import crawl_vic_register
    import crawl_gathered_here

    # Run in order: fast API sources first, then slower HTML scraping
    print("\n--- 1/4: NFDA Directory ---")
    crawl_nfda.run()

    print("\n--- 2/4: Funerals Australia ---")
    crawl_funerals_australia.run()

    print("\n--- 3/4: VIC Consumer Affairs Register ---")
    crawl_vic_register.run()

    print("\n--- 4/4: Gathered Here ---")
    crawl_gathered_here.run(limit=gathered_here_limit)

    # Summary
    db = get_db()
    print("\n" + "=" * 60)
    print("CRAWL SUMMARY")
    print("=" * 60)

    rows = db.execute(
        """SELECT source_name,
                  COUNT(*) as total,
                  SUM(CASE WHEN matched_brand_id IS NOT NULL THEN 1 ELSE 0 END) as matched
           FROM source_record
           GROUP BY source_name"""
    ).fetchall()

    for row in rows:
        print(f" {row['source_name']:25s} {row['total']:5d} records "
              f"({row['matched']} matched)")

    total = db.execute("SELECT COUNT(*) as n FROM source_record").fetchone()["n"]
    print(f" {'TOTAL':25s} {total:5d} records")

    db.close()


if __name__ == "__main__":
    limit = None
    if "--test" in sys.argv:
        limit = 10
        print("TEST MODE: Gathered Here limited to 10 profiles")
    elif len(sys.argv) > 1:
        try:
            limit = int(sys.argv[1])
        except ValueError:
            pass

    run_all(gathered_here_limit=limit)
179
crawlers/crawl_funerals_australia.py
Normal file
@@ -0,0 +1,179 @@
|
||||
"""Crawler for the Funerals Australia (formerly AFDA) member directory.
|
||||
|
||||
Source: https://funeralsaustralia.org.au/find-a-member/
|
||||
Method: WordPress AJAX API (POST with get_clients_list action)
|
||||
Fields: name, address (structured), phone, email, website, lat/lng, displayImage
|
||||
"""
|
||||
|
||||
import time
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
from base import (
|
||||
fetch_url, get_db, start_crawl_log, finish_crawl_log,
|
||||
store_source_record, normalize_phone, normalize_state,
|
||||
generate_slug, to_intermediate, CRAWL_DELAY,
|
||||
)
|
||||
|
||||
SOURCE_NAME = "funerals_australia"
|
||||
API_URL = "https://funeralsaustralia.org.au/wp-admin/admin-ajax.php"
|
||||
|
||||
PAGE_SIZE = 200 # API supports up to 200 per page
|
||||
|
||||
|
||||
def fetch_page(offset: int = 0) -> dict:
|
||||
"""Fetch a page of all members from the Funerals Australia API.
|
||||
|
||||
The API returns all members when no postcode/suburb filter is given,
|
||||
which is more reliable than geo-filtered searches.
|
||||
"""
|
||||
form_data = {
|
||||
"action": "get_clients_list",
|
||||
"params[size]": str(PAGE_SIZE),
|
||||
"params[from]": str(offset),
|
||||
"params[forceResults]": "true",
|
||||
"params[paginated]": "true",
|
||||
}
|
||||
|
||||
text = fetch_url(API_URL, method="POST", data=form_data,
|
||||
headers={"X-Requested-With": "XMLHttpRequest"})
|
||||
return json.loads(text)
|
||||
|
||||
|
||||
def fetch_all_members() -> list[dict]:
|
||||
"""Fetch all members via pagination."""
|
||||
all_results = []
|
||||
offset = 0
|
||||
|
||||
while True:
|
||||
data = fetch_page(offset)
|
||||
results = data.get("results", [])
|
||||
total = data.get("total", 0)
|
||||
|
||||
if not results:
|
||||
break
|
||||
|
||||
all_results.extend(results)
|
||||
print(f" Fetched {len(all_results)}/{total}...")
|
||||
offset += PAGE_SIZE
|
||||
|
||||
if offset >= total:
|
||||
break
|
||||
|
||||
time.sleep(CRAWL_DELAY)
|
||||
|
||||
return all_results
|
||||
|
||||
|
||||
def parse_address(record: dict) -> dict:
|
||||
"""Extract structured address from a Funerals Australia record."""
|
||||
addr_list = record.get("address", [])
|
||||
if addr_list and isinstance(addr_list, list) and len(addr_list) > 0:
|
||||
addr = addr_list[0]
|
||||
return {
|
||||
"line1": addr.get("line1", "").strip(),
|
||||
"city": addr.get("city", "").strip(),
|
||||
"state": normalize_state(addr.get("state")),
|
||||
"postcode": addr.get("postcode", "").strip(),
|
||||
}
|
||||
return {"line1": "", "city": "", "state": None, "postcode": ""}
|
||||
|
||||
|
||||
def to_normalized(record: dict) -> dict:
    """Convert a Funerals Australia record to intermediate format."""
    addr = parse_address(record)
    city = addr["city"]
    # Normalize city casing (some are ALL CAPS)
    if city and city == city.upper():
        city = city.title()

    lat_val = record.get("latitude")
    lng_val = record.get("longitude")
    try:
        lat_val = float(lat_val) if lat_val else None
        lng_val = float(lng_val) if lng_val else None
    except (ValueError, TypeError):
        lat_val = lng_val = None

    # `or ""` guards against fields that are present but null in the JSON
    website = (record.get("website") or "").strip() or None
    if website and not website.startswith("http"):
        website = "https://" + website

    business = {
        "name": (record.get("name") or "").strip(),
        "abn": None,
        "phone": normalize_phone(record.get("phone")),
        "email": (record.get("email") or "").strip() or None,
        "website": website,
        "description": None,
    }

    locations = [{
        "address": addr["line1"],
        "suburb": city,
        "state": addr["state"],
        "postcode": addr["postcode"],
        "lat": lat_val,
        "lng": lng_val,
        "phone": normalize_phone(record.get("phone")),
    }]

    source_id = record.get("id", "")
    return to_intermediate(
        source=SOURCE_NAME,
        source_id=source_id,
        source_url="https://funeralsaustralia.org.au/find-a-member/",
        business=business,
        locations=locations,
    )


def run():
    """Run the full Funerals Australia crawl."""
    db = get_db()
    log_id = start_crawl_log(db, SOURCE_NAME)
    print(f"[{SOURCE_NAME}] Starting crawl (log_id={log_id})")

    all_records = []
    found = 0
    new = 0
    skipped = 0

    try:
        print(" Fetching all members (paginated)...")
        all_records = fetch_all_members()
        found = len(all_records)
        print(f" Total members fetched: {found}")

        # Store records
        for record in all_records:
            source_id = record.get("id", "")
            row_id = store_source_record(
                db, SOURCE_NAME, source_id,
                "https://funeralsaustralia.org.au/find-a-member/",
                record, log_id
            )
            if row_id:
                normalized = to_normalized(record)
                db.execute(
                    "UPDATE source_record SET normalized_data = ? WHERE id = ?",
                    (json.dumps(normalized), row_id)
                )
                new += 1
            else:
                skipped += 1

        db.commit()
        finish_crawl_log(db, log_id, found, new, 0, skipped)
        print(f"[{SOURCE_NAME}] Done: {found} found, {new} new, {skipped} skipped")

    except Exception as e:
        finish_crawl_log(db, log_id, found, new, 0, skipped, "failed", str(e))
        raise
    finally:
        db.close()

    return all_records


if __name__ == "__main__":
    run()
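The offset-based pagination used by `fetch_all_members` can be exercised without network access. The following is a minimal sketch, not the crawler's own code: `paginate` and `fake_fetch_page` are hypothetical names, and the stub assumes the same `results` and `total` response keys that the crawler reads.

```python
from typing import Callable

PAGE_SIZE = 200


def paginate(fetch: Callable[[int], dict], page_size: int = PAGE_SIZE) -> list:
    """Collect every record from an offset-paginated endpoint."""
    collected = []
    offset = 0
    while True:
        data = fetch(offset)
        results = data.get("results", [])
        if not results:
            break  # empty page: nothing more to read
        collected.extend(results)
        offset += page_size
        if offset >= data.get("total", 0):
            break  # reached the advertised total
    return collected


# Hypothetical stand-in for the real API: 450 members served in pages.
_MEMBERS = [{"id": str(i)} for i in range(450)]


def fake_fetch_page(offset: int) -> dict:
    return {"results": _MEMBERS[offset:offset + PAGE_SIZE], "total": len(_MEMBERS)}


members = paginate(fake_fetch_page)
print(len(members))
```

Passing the fetch function in as a parameter keeps the loop testable; the real crawler would also sleep between pages, which this sketch omits.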
362
crawlers/crawl_gathered_here.py
Normal file
@@ -0,0 +1,362 @@
"""Crawler for Gathered Here funeral director directory.

Source: https://www.gatheredhere.com.au
Method: XML sitemap → fetch individual profile pages → parse HTML
Fields: name, address, coords, phone, email, website, description, pricing, reviews
"""

import re
import time
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from pathlib import Path

from base import (
    fetch_url, get_db, start_crawl_log, finish_crawl_log,
    store_source_record, normalize_phone, normalize_state,
    generate_slug, to_intermediate, CRAWL_DELAY,
)

SOURCE_NAME = "gathered_here"
SITEMAP_URL = "https://www.gatheredhere.com.au/sitemap/sitemap-funerals-listings-0.xml"
BASE_URL = "https://www.gatheredhere.com.au"


def fetch_all_listing_urls() -> list[str]:
    """Fetch and parse the sitemap to get all funeral director profile URLs."""
    xml_text = fetch_url(SITEMAP_URL)
    root = ET.fromstring(xml_text)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    urls = []
    for url_elem in root.findall("sm:url", ns):
        loc = url_elem.find("sm:loc", ns)
        if loc is not None and loc.text:
            url = loc.text.strip()
            # Only include individual profile pages (singular /funeral-director/)
            if "/funeral-director/" in url and "/funeral-directors/" not in url:
                urls.append(url)

    return urls


def extract_next_data(html_text: str) -> dict | None:
    """Extract __NEXT_DATA__ JSON from a Next.js page."""
    pattern = r'<script\s+id="__NEXT_DATA__"\s+type="application/json">(.*?)</script>'
    match = re.search(pattern, html_text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            return None
    return None


def extract_from_next_data(next_data: dict) -> dict | None:
    """Extract listing data from __NEXT_DATA__ props."""
    try:
        props = next_data.get("props", {}).get("pageProps", {})

        # Structure: singleListing.listing contains the actual data
        single = props.get("singleListing", {})
        if single:
            listing = single.get("listing")
            if listing and isinstance(listing, dict):
                return listing

        # Fallback paths
        listing = props.get("listing") or props.get("post") or props.get("data")
        return listing
    except (AttributeError, KeyError, TypeError):
        # AttributeError covers a null "props"/"pageProps" value (None.get)
        return None


def extract_from_html(html_text: str, url: str) -> dict:
    """Extract listing data from page HTML using regex patterns as fallback."""
    data = {"url": url}

    # Title
    title_match = re.search(r'<h1[^>]*>(.*?)</h1>', html_text, re.DOTALL)
    if title_match:
        data["title"] = re.sub(r'<[^>]+>', '', title_match.group(1)).strip()

    # Phone
    phone_match = re.search(r'href="tel:([^"]+)"', html_text)
    if phone_match:
        data["phone"] = phone_match.group(1).strip()

    # Email
    email_match = re.search(r'href="mailto:([^"]+)"', html_text)
    if email_match:
        data["email"] = email_match.group(1).strip()

    # Website
    website_match = re.search(
        r'<a[^>]*class="[^"]*website[^"]*"[^>]*href="([^"]+)"', html_text
    )
    if website_match:
        data["website"] = website_match.group(1).strip()

    # Address from structured data
    addr_match = re.search(
        r'"streetAddress"\s*:\s*"([^"]*)"', html_text
    )
    if addr_match:
        data["address"] = addr_match.group(1)

    locality_match = re.search(r'"addressLocality"\s*:\s*"([^"]*)"', html_text)
    if locality_match:
        data["suburb"] = locality_match.group(1)

    region_match = re.search(r'"addressRegion"\s*:\s*"([^"]*)"', html_text)
    if region_match:
        data["state"] = region_match.group(1)

    postcode_match = re.search(r'"postalCode"\s*:\s*"([^"]*)"', html_text)
    if postcode_match:
        data["postcode"] = postcode_match.group(1)

    # Coordinates
    lat_match = re.search(r'"latitude"\s*:\s*"?(-?[\d.]+)"?', html_text)
    lng_match = re.search(r'"longitude"\s*:\s*"?(-?[\d.]+)"?', html_text)
    if lat_match:
        data["lat"] = float(lat_match.group(1))
    if lng_match:
        data["lng"] = float(lng_match.group(1))

    return data


def extract_pricing(listing_data: dict) -> dict:
    """Extract pricing from listing meta fields."""
    meta = listing_data.get("meta", {})
    if not meta:
        return {}

    pricing = {}
    price_fields = {
        # With viewing prices
        "cremation_no_service_viewY": "cremation_no_service_with_viewing",
        "cremation_single_viewY": "cremation_single_service_with_viewing",
        "cremation_dual_viewY": "cremation_dual_service_with_viewing",
        "cremation_graveside_viewY": "cremation_graveside_with_viewing",
        "burial_single_viewY": "burial_single_service_with_viewing",
        "burial_dual_viewY": "burial_dual_service_with_viewing",
        "burial_graveside_viewY": "burial_graveside_with_viewing",
        "burial_no_service_viewY": "burial_no_service_with_viewing",
        # Without viewing prices
        "cremation_no_service_viewN": "cremation_no_service",
        "cremation_single_viewN": "cremation_single_service",
        "cremation_dual_viewN": "cremation_dual_service",
        "cremation_graveside_viewN": "cremation_graveside",
        "burial_single_viewN": "burial_single_service",
        "burial_dual_viewN": "burial_dual_service",
        "burial_graveside_viewN": "burial_graveside",
        "burial_no_service_viewN": "burial_no_service",
    }

    for meta_key, label in price_fields.items():
        val = meta.get(meta_key, "")
        if val:
            # Parse price string like "$2,299" to float
            cleaned = re.sub(r'[^\d.]', '', str(val))
            if cleaned:
                try:
                    pricing[label] = float(cleaned)
                except ValueError:
                    pass

    return pricing


def pricing_to_packages(pricing: dict) -> list[dict]:
    """Convert flat pricing dict to package format."""
    packages = []

    # Map pricing keys to funeral types
    type_mappings = [
        ("cremation_no_service", "Cremation Only"),
        ("cremation_single_service", "Service & Cremation"),
        ("cremation_single_service_with_viewing", "Service & Cremation"),
        ("burial_single_service", "Service & Burial"),
        ("burial_graveside", "Graveside Burial"),
    ]

    for price_key, funeral_type in type_mappings:
        if price_key in pricing:
            name = price_key.replace("_", " ").title()
            packages.append({
                "name": name,
                "funeralType": funeral_type,
                "price": pricing[price_key],
                "inclusions": [],  # Not available from Gathered Here listing pages
            })

    return packages


def to_normalized(listing_data: dict, url: str) -> dict:
    """Convert Gathered Here listing data to intermediate format."""
    meta = listing_data.get("meta", {}) if isinstance(listing_data.get("meta"), dict) else {}

    name = (listing_data.get("title") or listing_data.get("name") or "").strip()
    slug = listing_data.get("slug", "")

    # Extract location. The HTML fallback (extract_from_html) has no "meta"
    # and stores these fields at the top level, so fall back to those keys.
    address = meta.get("geolocation_formatted_address") or listing_data.get("address") or ""
    suburb = meta.get("geolocation_city") or listing_data.get("suburb") or ""
    state = normalize_state(meta.get("geolocation_state_short") or listing_data.get("state") or "")
    postcode = meta.get("geolocation_postcode") or listing_data.get("postcode") or ""
    lat = meta.get("geolocation_lat") or listing_data.get("lat")
    lng = meta.get("geolocation_long") or listing_data.get("lng")

    try:
        lat = float(lat) if lat else None
        lng = float(lng) if lng else None
    except (ValueError, TypeError):
        lat = lng = None

    email = meta.get("email") or meta.get("_application") or listing_data.get("email") or ""
    phone = meta.get("phone") or listing_data.get("phone") or ""

    # Use the excerpt when it is non-empty, otherwise fall back to the content
    description = listing_data.get("excerpt") or listing_data.get("content") or ""
    if description:
        description = re.sub(r'<[^>]+>', '', description).strip()
        if len(description) > 500:
            description = description[:497] + "..."

    # Website
    website = listing_data.get("website") or meta.get("website") or None

    # Pricing
    pricing = extract_pricing(listing_data)
    packages = pricing_to_packages(pricing)

    business = {
        "name": name,
        "abn": None,
        "phone": normalize_phone(phone),
        "email": email.strip() or None,
        "website": website,
        "description": description or None,
    }

    locations = [{
        "address": address,
        "suburb": suburb,
        "state": state,
        "postcode": postcode,
        "lat": lat,
        "lng": lng,
        "phone": normalize_phone(phone),
    }]

    source_id = slug or generate_slug(name)
    return to_intermediate(
        source=SOURCE_NAME,
        source_id=source_id,
        source_url=url,
        business=business,
        locations=locations,
        packages=packages,
    )


def crawl_profile(url: str) -> dict | None:
    """Crawl a single Gathered Here profile page."""
    try:
        html_text = fetch_url(url)
    except Exception as e:
        print(f" Error fetching {url}: {e}")
        return None

    # Try __NEXT_DATA__ first (structured)
    next_data = extract_next_data(html_text)
    if next_data:
        listing = extract_from_next_data(next_data)
        if listing:
            listing["_source"] = "next_data"
            return listing

    # Fallback to HTML parsing
    data = extract_from_html(html_text, url)
    data["_source"] = "html_fallback"
    return data


def run(limit: int | None = None):
    """Run the full Gathered Here crawl.

    Args:
        limit: If set, only crawl this many profiles (for testing).
    """
    db = get_db()
    log_id = start_crawl_log(db, SOURCE_NAME)
    print(f"[{SOURCE_NAME}] Starting crawl (log_id={log_id})")

    found = 0
    new = 0
    skipped = 0
    errors = 0

    try:
        # Step 1: Get all profile URLs from sitemap
        print(" Fetching sitemap...", end=" ", flush=True)
        urls = fetch_all_listing_urls()
        print(f"{len(urls)} profile URLs found")

        if limit:
            urls = urls[:limit]
            print(f" (limited to {limit} for testing)")

        # Step 2: Crawl each profile
        for i, url in enumerate(urls):
            slug = url.rstrip("/").split("/")[-1]

            if (i + 1) % 50 == 0 or i == 0:
                print(f" Crawling {i+1}/{len(urls)}: {slug}")

            listing_data = crawl_profile(url)
            found += 1

            if not listing_data:
                errors += 1
                continue

            source_id = slug
            row_id = store_source_record(
                db, SOURCE_NAME, source_id, url, listing_data, log_id
            )

            if row_id:
                normalized = to_normalized(listing_data, url)
                db.execute(
                    "UPDATE source_record SET normalized_data = ? WHERE id = ?",
                    (json.dumps(normalized), row_id)
                )
                new += 1
            else:
                skipped += 1

            if (i + 1) % 10 == 0:
                db.commit()  # periodic commit

            time.sleep(CRAWL_DELAY)

        db.commit()
        finish_crawl_log(db, log_id, found, new, 0, skipped)
        print(f"[{SOURCE_NAME}] Done: {found} found, {new} new, "
              f"{skipped} skipped, {errors} errors")

    except Exception as e:
        finish_crawl_log(db, log_id, found, new, 0, skipped, "failed", str(e))
        raise
    finally:
        db.close()


if __name__ == "__main__":
    import sys
    limit = int(sys.argv[1]) if len(sys.argv) > 1 else None
    run(limit=limit)
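The price handling inside `extract_pricing` (strip everything except digits and dots, then convert) can be checked in isolation. This is a small sketch under the assumption that prices arrive as display strings such as "$2,299"; `parse_price` is a hypothetical helper name, not a function in this repository.

```python
import re
from typing import Optional


def parse_price(value) -> Optional[float]:
    """Turn a display price such as "$2,299" into a float, or None."""
    cleaned = re.sub(r"[^\d.]", "", str(value or ""))
    if not cleaned:
        return None  # nothing numeric in the input, e.g. "POA"
    try:
        return float(cleaned)
    except ValueError:
        return None  # e.g. "1.2.3" survives cleaning but is not a number


for raw in ["$2,299", "$1,050.50", "POA", ""]:
    print(repr(raw), parse_price(raw))
```

Returning `None` rather than raising mirrors how the crawler silently skips unparseable prices.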
163
crawlers/crawl_nfda.py
Normal file
@@ -0,0 +1,163 @@
"""Crawler for the NFDA (National Funeral Directors Association) directory.

Source: https://nfda.com.au/find-your-local-nfda-member/
Method: WPSL JSON API (GET requests with lat/lng search)
Fields: name, address, city, state, postcode, lat/lng, phone, email
"""

import time
import json
from pathlib import Path

from base import (
    fetch_json, get_db, start_crawl_log, finish_crawl_log,
    store_source_record, normalize_phone, normalize_state,
    generate_slug, to_intermediate, CRAWL_DELAY,
)

SOURCE_NAME = "nfda"
API_URL = "https://nfda.com.au/wp-admin/admin-ajax.php"

# Search centroids covering Australia with large radius
SEARCH_POINTS = [
    {"name": "Sydney", "lat": -33.87, "lng": 151.21},
    {"name": "Melbourne", "lat": -37.81, "lng": 144.96},
    {"name": "Brisbane", "lat": -27.47, "lng": 153.03},
    {"name": "Perth", "lat": -31.95, "lng": 115.86},
    {"name": "Adelaide", "lat": -34.93, "lng": 138.60},
    {"name": "Hobart", "lat": -42.88, "lng": 147.33},
    {"name": "Darwin", "lat": -12.46, "lng": 130.85},
    {"name": "Townsville", "lat": -19.26, "lng": 146.82},
    {"name": "Central NSW", "lat": -30.0, "lng": 150.0},
    {"name": "Inland QLD", "lat": -23.0, "lng": 145.0},
]


def fetch_members(lat: float, lng: float, max_results: int = 50,
                  radius: int = 5000) -> list[dict]:
    """Fetch NFDA members near a given lat/lng."""
    params = {
        "action": "store_search",
        "lat": str(lat),
        "lng": str(lng),
        "max_results": str(max_results),
        "search_radius": str(radius),
        "autoload": "1",
    }
    data = fetch_json(API_URL, method="GET", data=params)
    if isinstance(data, list):
        return data
    return []


def to_normalized(record: dict) -> dict:
    """Convert an NFDA record to intermediate format."""
    state = normalize_state(record.get("state", ""))

    business = {
        "name": record.get("store", "").strip(),
        "abn": None,
        "phone": normalize_phone(record.get("phone")),
        "email": record.get("email", "").strip() or None,
        "website": record.get("url", "").strip() or None,
        "description": None,
    }

    lat_val = record.get("lat")
    lng_val = record.get("lng")
    try:
        lat_val = float(lat_val) if lat_val else None
        lng_val = float(lng_val) if lng_val else None
    except (ValueError, TypeError):
        lat_val = lng_val = None

    city = record.get("city", "").strip()
    # Normalize city casing (some are ALL CAPS)
    if city and city == city.upper():
        city = city.title()

    locations = [{
        "address": record.get("address", "").strip(),
        "suburb": city,
        "state": state,
        "postcode": record.get("zip", "").strip(),
        "lat": lat_val,
        "lng": lng_val,
        "phone": normalize_phone(record.get("phone")),
    }]

    source_id = str(record.get("id", ""))
    return to_intermediate(
        source=SOURCE_NAME,
        source_id=source_id,
        source_url="https://nfda.com.au/find-your-local-nfda-member/",
        business=business,
        locations=locations,
    )


def run():
    """Run the full NFDA crawl."""
    db = get_db()
    log_id = start_crawl_log(db, SOURCE_NAME)
    print(f"[{SOURCE_NAME}] Starting crawl (log_id={log_id})")

    seen_ids = set()
    all_records = []
    found = 0
    new = 0
    skipped = 0

    try:
        for point in SEARCH_POINTS:
            print(f" Searching near {point['name']}...", end=" ", flush=True)
            members = fetch_members(point["lat"], point["lng"])
            new_count = 0

            for member in members:
                member_id = str(member.get("id", ""))
                if member_id in seen_ids:
                    continue
                seen_ids.add(member_id)
                all_records.append(member)
                new_count += 1

            print(f"{len(members)} results, {new_count} new unique")
            found += len(members)
            time.sleep(CRAWL_DELAY)

        print(f" Total unique members: {len(all_records)}")

        # Store records
        for record in all_records:
            source_id = str(record.get("id", ""))
            row_id = store_source_record(
                db, SOURCE_NAME, source_id,
                "https://nfda.com.au/find-your-local-nfda-member/",
                record, log_id
            )
            if row_id:
                normalized = to_normalized(record)
                db.execute(
                    "UPDATE source_record SET normalized_data = ? WHERE id = ?",
                    (json.dumps(normalized), row_id)
                )
                new += 1
            else:
                skipped += 1

        db.commit()
        finish_crawl_log(db, log_id, found, new, 0, skipped)
        print(f"[{SOURCE_NAME}] Done: {found} found, {new} new, {skipped} skipped")

    except Exception as e:
        finish_crawl_log(db, log_id, found, new, 0, skipped, "failed", str(e))
        raise
    finally:
        db.close()

    return all_records


if __name__ == "__main__":
    run()
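Because the search circles around the centroids overlap, the same member can come back from several searches, which is why `run` tracks `seen_ids`. The sketch below isolates that step. `merge_unique` is a hypothetical helper, and unlike the crawler it skips records with no ID instead of collapsing them into one entry; that difference is an assumption of this example, not the crawler's behavior.

```python
def merge_unique(batches, key="id"):
    """Merge result batches, keeping the first record seen for each ID."""
    seen = set()
    unique = []
    for batch in batches:
        for record in batch:
            record_id = str(record.get(key, ""))
            if not record_id or record_id in seen:
                continue  # no usable ID, or already returned by an earlier search
            seen.add(record_id)
            unique.append(record)
    return unique


# Two overlapping searches: member 2 appears in both.
sydney = [{"id": 1, "store": "A"}, {"id": 2, "store": "B"}]
central_nsw = [{"id": "2", "store": "B"}, {"id": 3, "store": "C"}]

merged = merge_unique([sydney, central_nsw])
print([m["store"] for m in merged])
```

IDs are compared as strings, as in the crawler, so an integer `2` and a string `"2"` count as the same member.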
220
crawlers/crawl_vic_register.py
Normal file
@@ -0,0 +1,220 @@
"""Crawler for the VIC Consumer Affairs Public Register of Funeral Providers.

Source: https://registers.consumer.vic.gov.au/fpsearch
Method: HTTP GET per letter A-Z, parse HTML tables
Fields: name, place of business, postcode, postal address, phone
"""

import re
import time
import json
import html.parser
from pathlib import Path

from base import (
    fetch_url, get_db, start_crawl_log, finish_crawl_log,
    store_source_record, normalize_phone, generate_slug,
    to_intermediate, CRAWL_DELAY,
)

SOURCE_NAME = "vic_register"
BASE_URL = "https://registers.consumer.vic.gov.au/FpSearch/PerformSearch"


class VICTableParser(html.parser.HTMLParser):
    """Parse the VIC register HTML table into records."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._in_table = False
        self._in_tbody = False
        self._in_row = False
        self._in_cell = False
        self._current_row = []
        self._current_cell = ""

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._in_table = True
        elif tag == "tbody" and self._in_table:
            self._in_tbody = True
        elif tag == "tr" and self._in_tbody:
            self._in_row = True
            self._current_row = []
        elif tag == "td" and self._in_row:
            self._in_cell = True
            self._current_cell = ""

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._in_cell = False
            self._current_row.append(self._current_cell.strip())
        elif tag == "tr" and self._in_row:
            self._in_row = False
            if len(self._current_row) >= 4:
                self.records.append(self._current_row)
        elif tag == "tbody":
            self._in_tbody = False
        elif tag == "table":
            self._in_table = False

    def handle_data(self, data):
        if self._in_cell:
            self._current_cell += data


def parse_address(place_of_business: str) -> dict:
    """Parse a VIC register address into components."""
    parts = place_of_business.strip()
    # Try to extract postcode from the end
    postcode_match = re.search(r'\b(\d{4})\s*$', parts)
    postcode = postcode_match.group(1) if postcode_match else None

    # Try to extract suburb (usually the last word(s) before postcode)
    suburb = None
    if postcode:
        before_postcode = parts[:postcode_match.start()].strip().rstrip(",").strip()
        # Last segment after comma is usually suburb
        if "," in before_postcode:
            suburb = before_postcode.split(",")[-1].strip()
        else:
            # Take last 1-2 words as suburb
            words = before_postcode.split()
            if len(words) >= 2:
                suburb = " ".join(words[-2:]) if words[-1][0].isupper() else words[-1]

    return {
        "address": parts,
        "suburb": suburb,
        "state": "VIC",
        "postcode": postcode,
    }


def crawl_letter(letter: str) -> list[dict]:
    """Crawl all records for a single letter."""
    url = f"{BASE_URL}?Letter={letter}"
    html_text = fetch_url(url)

    parser = VICTableParser()
    parser.feed(html_text)

    records = []
    for row in parser.records:
        # Columns: Name, Place of Business, Postcode, Postal Address, Phone
        name = row[0] if len(row) > 0 else ""
        place = row[1] if len(row) > 1 else ""
        postcode = row[2] if len(row) > 2 else ""
        postal = row[3] if len(row) > 3 else ""
        phone = row[4] if len(row) > 4 else ""

        if not name:
            continue

        records.append({
            "name": name.strip(),
            "place_of_business": place.strip(),
            "postcode": postcode.strip(),
            "postal_address": postal.strip(),
            "phone": phone.strip(),
        })

    return records


def make_source_id(record: dict) -> str:
    """Create a stable source ID from the name slug and the postcode.

    The street address is not part of the ID, so two register entries with
    the same name and postcode share one source ID.
    """
    name = record["name"].lower().strip()
    return f"{generate_slug(name)}_{record['postcode']}"


def to_normalized(record: dict) -> dict:
    """Convert a VIC register record to intermediate format."""
    addr = parse_address(record["place_of_business"])

    business = {
        "name": record["name"],
        "abn": None,
        "phone": normalize_phone(record["phone"]),
        "email": None,
        "website": None,
        "description": None,
    }

    locations = [{
        "address": record["place_of_business"],
        "suburb": addr["suburb"],
        "state": "VIC",
        "postcode": record["postcode"] or addr["postcode"],
        "lat": None,
        "lng": None,
        "phone": normalize_phone(record["phone"]),
    }]

    source_id = make_source_id(record)
    return to_intermediate(
        source=SOURCE_NAME,
        source_id=source_id,
        source_url=f"{BASE_URL}?Letter={record['name'][0].upper()}",
        business=business,
        locations=locations,
    )


def run():
    """Run the full VIC register crawl."""
    db = get_db()
    log_id = start_crawl_log(db, SOURCE_NAME)
    print(f"[{SOURCE_NAME}] Starting crawl (log_id={log_id})")

    all_records = []
    found = 0
    new = 0
    skipped = 0

    try:
        for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
            print(f" Crawling letter {letter}...", end=" ", flush=True)
            records = crawl_letter(letter)
            print(f"{len(records)} records")
            all_records.extend(records)
            found += len(records)

            if letter != "Z":
                time.sleep(CRAWL_DELAY)

        # Store and normalize
        for record in all_records:
            source_id = make_source_id(record)
            row_id = store_source_record(
                db, SOURCE_NAME, source_id,
                f"{BASE_URL}?Letter={record['name'][0].upper()}",
                record, log_id
            )
            if row_id:
                normalized = to_normalized(record)
                db.execute(
                    "UPDATE source_record SET normalized_data = ? WHERE id = ?",
                    (json.dumps(normalized), row_id)
                )
                new += 1
            else:
                skipped += 1

        db.commit()
        finish_crawl_log(db, log_id, found, new, 0, skipped)
        print(f"[{SOURCE_NAME}] Done: {found} found, {new} new, {skipped} skipped")

    except Exception as e:
        finish_crawl_log(db, log_id, found, new, 0, skipped, "failed", str(e))
        raise
    finally:
        db.close()

    return all_records


if __name__ == "__main__":
    run()
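The address heuristic in `parse_address` relies on a trailing four-digit postcode and, where present, a comma before the suburb. The sketch below shows only the comma-separated branch on made-up addresses. `split_suburb_postcode` is a hypothetical name and the sample addresses are illustrative, not taken from the register.

```python
import re


def split_suburb_postcode(address):
    """Return (suburb, postcode) from an address ending in a 4-digit postcode."""
    text = address.strip()
    match = re.search(r"\b(\d{4})\s*$", text)
    if not match:
        return None, None  # no trailing postcode, so nothing to anchor on
    before = text[:match.start()].strip().rstrip(",").strip()
    # Simplification: only the comma-separated form is handled here.
    suburb = before.split(",")[-1].strip() if "," in before else None
    return suburb or None, match.group(1)


for sample in ["12 High St, Kew 3101", "12 High St, Kew, 3101", "12 High St Kew"]:
    print(sample, "->", split_suburb_postcode(sample))
```

Addresses without a comma need the word-count fallback used in the crawler, which is necessarily a guess for multi-word suburbs.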
425
crawlers/dedup.py
Normal file
@@ -0,0 +1,425 @@
"""Deduplication and merge engine.

Processes source_records → funeral_brand + location + package entries.
Handles cross-source matching and field-level merging.

Matching hierarchy (strongest to weakest):
  1. source_key match — same record from same source (skip/update)
  2. ABN match — same business entity
  3. Name + Postcode exact match — likely same business
  4. Fuzzy name match (>=85%) + same state — probable match, flag for review

Merge priority (higher = preferred):
  vic_register > funerals_australia > nfda > gathered_here

Never overwrite verified provider data.
"""

import json
import re
import sqlite3
from difflib import SequenceMatcher

from base import get_db, generate_slug, normalize_state

# Source priority for merge conflicts (higher number = more authoritative)
SOURCE_PRIORITY = {
    "vic_register": 40,
    "funerals_australia": 30,
    "nfda": 20,
    "gathered_here": 10,
}


def normalize_name(name: str) -> str:
    """Normalize a business name for comparison."""
    name = name.strip().upper()
    # Remove common suffixes
    for suffix in [" PTY LTD", " PTY. LTD.", " P/L", " LIMITED",
                   " PROPRIETARY LIMITED", " INC", " LLC",
                   " FUNERAL DIRECTORS", " FUNERAL SERVICES",
                   " FUNERALS", " FUNERAL HOME"]:
        name = name.removesuffix(suffix)
    # Remove punctuation
    name = re.sub(r"[''`\".,&()-]", " ", name)
    name = re.sub(r"\s+", " ", name).strip()
    return name


def fuzzy_match(name1: str, name2: str) -> float:
    """Return similarity ratio between two names (0.0 to 1.0)."""
    n1 = normalize_name(name1)
    n2 = normalize_name(name2)
    return SequenceMatcher(None, n1, n2).ratio()


def find_existing_brand(db: sqlite3.Connection, record: dict) -> tuple[int | None, str]:
    """Find a matching funeral_brand for a source record.

    Returns (brand_id, match_type) or (None, 'new').
    """
    biz = record.get("business", {})
    locs = record.get("locations", [])
    name = biz.get("name", "")
    abn = biz.get("abn")
    source = record.get("source", "")
    source_id = record.get("sourceId", "")
    source_key = f"{source}:{source_id}"

    postcode = None
    state = None
    if locs:
        postcode = locs[0].get("postcode")
        state = locs[0].get("state")

    # 1. Source key match (exact same record from same source)
    row = db.execute(
        "SELECT id FROM funeral_brand WHERE source_key = ?",
        (source_key,)
    ).fetchone()
    if row:
        return row["id"], "source_key"

    # 2. ABN match
    if abn:
        row = db.execute(
            "SELECT id FROM funeral_brand WHERE abn = ?",
            (abn,)
        ).fetchone()
        if row:
            return row["id"], "abn"

    # 3. Exact name + postcode match
    if name and postcode:
        norm = normalize_name(name)
        # Names are compared after normalization, so fetch every brand in
        # the postcode and compare in Python rather than in SQL
        rows = db.execute(
            "SELECT id, title FROM funeral_brand WHERE business_postcode = ?",
            (postcode,)
        ).fetchall()
        for row in rows:
            if normalize_name(row["title"]) == norm:
                return row["id"], "name_postcode"

    # 4. Fuzzy name + same state (returns the first brand at or above 0.85)
    if name and state:
        rows = db.execute(
            "SELECT id, title FROM funeral_brand WHERE business_state = ?",
            (state,)
        ).fetchall()
        for row in rows:
            score = fuzzy_match(name, row["title"])
            if score >= 0.85:
                return row["id"], "fuzzy"

    return None, "new"


def merge_field(existing: str | None, new_val: str | None,
                existing_priority: int, new_priority: int) -> str | None:
    """Merge a single field, preferring non-null and higher-priority."""
    if not new_val:
        return existing
    if not existing:
        return new_val
    # Both have values — prefer higher priority source
    if new_priority > existing_priority:
        return new_val
    return existing


def create_brand(db: sqlite3.Connection, record: dict) -> int:
    """Create a new funeral_brand from a source record."""
    biz = record.get("business", {})
    locs = record.get("locations", [])
    source = record.get("source", "")
    source_id = record.get("sourceId", "")
    source_key = f"{source}:{source_id}"

    loc = locs[0] if locs else {}
    slug = generate_slug(biz.get("name", "unknown"))

    # Ensure unique slug
    base_slug = slug
    counter = 1
    while True:
        existing = db.execute(
            "SELECT id FROM funeral_brand WHERE code = ?", (slug,)
        ).fetchone()
        if not existing:
            break
        slug = f"{base_slug}-{counter}"
        counter += 1

    cur = db.execute(
        """INSERT INTO funeral_brand (
            title, description, email, phone, website, abn, code,
            hidden, verified, source_key, source_url, enrichment_status,
            business_address, business_suburb, business_state, business_postcode
        ) VALUES (?, ?, ?, ?, ?, ?, ?, 1, 0, ?, ?, 'pending', ?, ?, ?, ?)""",
        (
            biz.get("name"),
            biz.get("description"),
            biz.get("email"),
            biz.get("phone"),
            biz.get("website"),
            biz.get("abn"),
            slug,
            source_key,
            record.get("sourceUrl"),
            loc.get("address"),
            loc.get("suburb"),
            loc.get("state"),
            loc.get("postcode"),
        )
    )
    brand_id = cur.lastrowid

    # Create locations
    for loc_data in locs:
        title_parts = [loc_data.get("suburb", ""), loc_data.get("state", "")]
        loc_title = ", ".join(p for p in title_parts if p) or biz.get("name", "")

        db.execute(
            """INSERT INTO location (
                title, address, suburb, state, postcode, lat, lng, brand_id
            ) VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
            (
                loc_title,
                loc_data.get("address"),
                loc_data.get("suburb"),
                loc_data.get("state"),
                loc_data.get("postcode"),
                loc_data.get("lat"),
                loc_data.get("lng"),
                brand_id,
            )
        )

    # Create packages (from Gathered Here pricing)
    packages = record.get("packages", [])
    for pkg in packages:
        if not pkg.get("price"):
            continue
        cur = db.execute(
            """INSERT INTO package (
                title, funeral_type, brand_id, source_url, extraction_confidence
            ) VALUES (?, ?, ?, ?, ?)""",
            (
                pkg.get("name"),
                pkg.get("funeralType"),
                brand_id,
                record.get("sourceUrl"),
                0.8,  # Gathered Here pricing is structured, fairly reliable
            )
        )
        pkg_id = cur.lastrowid

        # Create inclusions if available
        for inc in pkg.get("inclusions", []):
            db.execute(
                """INSERT INTO package_inclusion (
                    price, optional, complimentary, inclusion_type_title, package_id
                ) VALUES (?, ?, ?, ?, ?)""",
                (
                    inc.get("price", 0),
                    1 if inc.get("optional") else 0,
                    1 if inc.get("complimentary") else 0,
                    inc.get("item", "Unknown"),
                    pkg_id,
                )
            )

    return brand_id


def update_brand(db: sqlite3.Connection, brand_id: int,
                 record: dict, match_type: str) -> bool:
    """Merge new data into an existing brand.

    Returns True only if a funeral_brand field changed; location and
    package inserts alone do not count as an update.
    """
    biz = record.get("business", {})
    locs = record.get("locations", [])
    source = record.get("source", "")
    new_priority = SOURCE_PRIORITY.get(source, 0)

    # Never overwrite verified providers
    brand = db.execute(
        "SELECT * FROM funeral_brand WHERE id = ?", (brand_id,)
    ).fetchone()
    if brand["verified"]:
        return False

    # Determine existing source priority
    existing_source = ""
    if brand["source_key"]:
        existing_source = brand["source_key"].split(":")[0]
    existing_priority = SOURCE_PRIORITY.get(existing_source, 0)

    # Field-level merge: only fill blanks or upgrade from higher priority
    updates = {}
    field_map = {
        "description": biz.get("description"),
        "email": biz.get("email"),
        "phone": biz.get("phone"),
        "website": biz.get("website"),
        "abn": biz.get("abn"),
    }

    for field, new_val in field_map.items():
        merged = merge_field(brand[field], new_val, existing_priority, new_priority)
        if merged != brand[field]:
            updates[field] = merged

    # Update location data if we have coords and existing doesn't
    if locs:
        loc = locs[0]
        existing_locs = db.execute(
            "SELECT * FROM location WHERE brand_id = ?", (brand_id,)
        ).fetchall()

        if not existing_locs and loc.get("suburb"):
            title_parts = [loc.get("suburb", ""), loc.get("state", "")]
            loc_title = ", ".join(p for p in title_parts if p)
            db.execute(
                """INSERT INTO location (
                    title, address, suburb, state, postcode, lat, lng, brand_id
                ) VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
                (
                    loc_title, loc.get("address"), loc.get("suburb"),
                    loc.get("state"), loc.get("postcode"),
                    loc.get("lat"), loc.get("lng"), brand_id,
                )
            )
        elif existing_locs:
            # Update first location with coords if missing
            eloc = existing_locs[0]
            if not eloc["lat"] and loc.get("lat"):
                db.execute(
                    "UPDATE location SET lat = ?, lng = ? WHERE id = ?",
                    (loc.get("lat"), loc.get("lng"), eloc["id"])
                )

    # Add packages if we have them and brand doesn't yet
    packages = record.get("packages", [])
    if packages:
        existing_pkgs = db.execute(
            "SELECT COUNT(*) as n FROM package WHERE brand_id = ?", (brand_id,)
        ).fetchone()["n"]

        if existing_pkgs == 0:
            for pkg in packages:
                if not pkg.get("price"):
                    continue
                cur = db.execute(
                    """INSERT INTO package (
                        title, funeral_type, brand_id, source_url
                    ) VALUES (?, ?, ?, ?)""",
                    (pkg.get("name"), pkg.get("funeralType"),
                     brand_id, record.get("sourceUrl"))
                )

    if updates:
        set_clause = ", ".join(f"{k} = ?" for k in updates)
        values = list(updates.values()) + [brand_id]
        db.execute(
            f"UPDATE funeral_brand SET {set_clause}, updated_at = datetime('now') WHERE id = ?",
            values
        )
        return True

    return False


def process_all():
    """Process all source_records through deduplication and create brand entries.

    Order matters: process higher-priority sources first so their data
    forms the base record that lower-priority sources merge into.
    """
    db = get_db()

    # Process in priority order (highest first)
    sources_ordered = sorted(SOURCE_PRIORITY.keys(),
                             key=lambda s: SOURCE_PRIORITY[s], reverse=True)

    stats = {"new": 0, "updated": 0, "skipped": 0, "matched": 0}

    print("=" * 60)
    print("DEDUPLICATION ENGINE")
    print("=" * 60)

    for source in sources_ordered:
        records = db.execute(
            """SELECT id, normalized_data FROM source_record
               WHERE source_name = ? AND normalized_data IS NOT NULL""",
            (source,)
        ).fetchall()

        if not records:
            continue

        print(f"\n Processing {source}: {len(records)} records")
        source_stats = {"new": 0, "updated": 0, "skipped": 0, "matched": 0}

        for row in records:
            record = json.loads(row["normalized_data"])
            brand_id, match_type = find_existing_brand(db, record)

            if match_type == "new":
                brand_id = create_brand(db, record)
                source_stats["new"] += 1
            elif match_type == "source_key":
                source_stats["skipped"] += 1
            else:
                # Matched to existing brand: merge
                updated = update_brand(db, brand_id, record, match_type)
                if updated:
                    source_stats["updated"] += 1
                else:
                    source_stats["matched"] += 1

            # Update source_record with match info
            db.execute(
                """UPDATE source_record
                   SET matched_brand_id = ?, match_type = ?, processed_at = datetime('now')
                   WHERE id = ?""",
                (brand_id, match_type, row["id"])
            )

        db.commit()
        print(f" New: {source_stats['new']}, Updated: {source_stats['updated']}, "
              f"Matched: {source_stats['matched']}, Skipped: {source_stats['skipped']}")

        for k, v in source_stats.items():
            stats[k] += v

    # Final summary
    total_brands = db.execute("SELECT COUNT(*) as n FROM funeral_brand").fetchone()["n"]
    total_locations = db.execute("SELECT COUNT(*) as n FROM location").fetchone()["n"]
    total_packages = db.execute("SELECT COUNT(*) as n FROM package").fetchone()["n"]

    print(f"\n{'=' * 60}")
    print("DEDUP RESULTS")
    print(f"{'=' * 60}")
    print(f" New brands created: {stats['new']}")
    print(f" Existing updated: {stats['updated']}")
    print(f" Matched (no change): {stats['matched']}")
    print(f" Skipped (source_key): {stats['skipped']}")
    print(f"\n Total brands in DB: {total_brands}")
    print(f" Total locations in DB: {total_locations}")
    print(f" Total packages in DB: {total_packages}")

    # Show match type breakdown
    print("\n Match type breakdown:")
    rows = db.execute(
        """SELECT match_type, COUNT(*) as n
           FROM source_record WHERE processed_at IS NOT NULL
           GROUP BY match_type ORDER BY n DESC"""
    ).fetchall()
    for row in rows:
        print(f" {row['match_type']:15s} {row['n']:5d}")

    db.close()


if __name__ == "__main__":
    process_all()
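The field-level merge rule in `merge_field` can be checked in isolation. The following is a minimal sketch, not part of the commit: `merge_field` is copied from the module above, while the `SOURCE_PRIORITY` keys and values are hypothetical, since the real mapping is defined in a part of the file not shown here.

```python
from __future__ import annotations

# Hypothetical priorities for illustration only; the real SOURCE_PRIORITY
# mapping lives elsewhere in the dedup module and may differ.
SOURCE_PRIORITY = {"vic_register": 3, "funerals_australia": 2, "nfda": 1}


def merge_field(existing: str | None, new_val: str | None,
                existing_priority: int, new_priority: int) -> str | None:
    """Same rule as above: fill blanks, otherwise the higher priority wins."""
    if not new_val:
        return existing
    if not existing:
        return new_val
    if new_priority > existing_priority:
        return new_val
    return existing


low = SOURCE_PRIORITY["nfda"]
high = SOURCE_PRIORITY["vic_register"]

# A blank field is filled regardless of priority.
filled = merge_field(None, "03 9000 0000", high, low)
# A lower-priority source never overwrites an existing value.
kept = merge_field("a@example.com", "b@example.com", high, low)
# A higher-priority source replaces the existing value.
upgraded = merge_field("a@example.com", "b@example.com", low, high)
# Equal priority keeps the existing value because the comparison is strict.
tie = merge_field("a@example.com", "b@example.com", low, low)
```

Because ties keep the existing value, processing sources in descending priority order (as `process_all` does) means the first writer of a field generally wins.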
320
crawlers/discover_websites.py
Normal file
@@ -0,0 +1,320 @@
"""Website discovery module.

For each provider without a website URL, attempts to find their website
using multiple strategies (tried in order):

1. Serper.dev (2,500 free Google searches, no credit card needed)
2. DuckDuckGo lite (free fallback, rate-limited)
3. URL pattern guessing (businessname.com.au)

Also validates discovered URLs to confirm they belong to the business.

Configuration:
    Set SERPER_API_KEY env var or in config.json to enable Serper.dev.
    Without it, falls back to DuckDuckGo.
"""

import json
import os
import re
import time
import urllib.parse
import urllib.request
import urllib.error
from pathlib import Path

from base import (
    fetch_url, get_db, normalize_phone, CRAWL_DELAY,
)

# Load Serper API key from env or config
SERPER_API_KEY = os.environ.get("SERPER_API_KEY")
if not SERPER_API_KEY:
    config_path = Path(__file__).parent / "config.json"
    if config_path.exists():
        with open(config_path) as f:
            config = json.load(f)
        SERPER_API_KEY = config.get("serper_api_key")

# Domains to skip when extracting search results
SKIP_DOMAINS = [
    "yellowpages", "whitepages", "truelocal", "yelp", "cylex",
    "australia247", "showmelocal", "hotfrog", "localsearch",
    "facebook.com", "linkedin.com", "instagram.com", "twitter.com",
    "gatheredhere", "ezifunerals", "funeralocity", "funeraldirectory",
    "deathsandfunerals", "mytributes", "obits.com",
    "duckduckgo.com", "google.com", "bing.com",
    "nfda.com.au", "funeralsaustralia.org",
    "wikipedia.org", "youtube.com",
]


def search_serper(query: str) -> list[str]:
    """Search via Serper.dev (Google results as JSON). 2,500 free queries."""
    if not SERPER_API_KEY:
        return []

    url = "https://google.serper.dev/search"
    data = json.dumps({"q": query, "gl": "au", "num": 10}).encode("utf-8")
    req = urllib.request.Request(url, data=data, headers={
        "X-API-KEY": SERPER_API_KEY,
        "Content-Type": "application/json",
    })

    try:
        with urllib.request.urlopen(req, timeout=15) as resp:
            result = json.loads(resp.read().decode("utf-8"))
    except Exception:
        return []

    results = []
    for item in result.get("organic", []):
        link = item.get("link", "")
        if not link:
            continue
        if any(d in link.lower() for d in SKIP_DOMAINS):
            continue
        results.append(link)

    return results


def search_ddg(query: str) -> list[str]:
    """Search DuckDuckGo lite and return result URLs (filtered)."""
    encoded = urllib.parse.quote(query)
    url = f"https://lite.duckduckgo.com/lite/?q={encoded}"

    try:
        html = fetch_url(url)
    except Exception:
        return []

    # Extract redirect URLs from DDG lite format
    raw_links = re.findall(
        r'href="//duckduckgo\.com/l/\?uddg=([^&"]+)', html
    )

    results = []
    for link in raw_links:
        decoded = urllib.parse.unquote(link)
        # Skip ads
        if "ad_domain" in decoded or "ad_provider" in decoded:
            continue
        # Skip directory/aggregator sites
        if any(d in decoded.lower() for d in SKIP_DOMAINS):
            continue
        results.append(decoded)

    return results


def validate_url(url: str, business_name: str) -> dict:
    """Validate that a URL is a real website belonging to this business.

    Returns: {valid: bool, confidence: str, reason: str}
    """
    try:
        html = fetch_url(url, timeout=15)
    except urllib.error.HTTPError as e:
        return {"valid": False, "confidence": "none", "reason": f"HTTP {e.code}"}
    except Exception as e:
        return {"valid": False, "confidence": "none", "reason": str(e)[:100]}

    html_lower = html.lower()

    # Check if it's a parked/for-sale domain
    parked_signals = ["domain is for sale", "buy this domain",
                      "parked domain", "this domain", "godaddy",
                      "domain parking"]
    if any(s in html_lower for s in parked_signals):
        return {"valid": False, "confidence": "none", "reason": "parked domain"}

    # Check if the page mentions the business name
    name_parts = business_name.lower().split()
    # Require at least 2 name parts to match (or all if name is 1-2 words)
    min_matches = min(2, len(name_parts))
    matches = sum(1 for part in name_parts
                  if len(part) > 2 and part in html_lower)

    if matches >= min_matches:
        return {"valid": True, "confidence": "confirmed", "reason": "name found in page"}

    # Check title tag
    title_match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if title_match:
        title = title_match.group(1).lower()
        if any(part in title for part in name_parts if len(part) > 2):
            return {"valid": True, "confidence": "probable",
                    "reason": "partial name in title"}

    # Check for funeral-related content (it's at least a funeral business)
    funeral_signals = ["funeral", "cremation", "burial", "memorial",
                       "chapel", "obituar", "condolence"]
    if any(s in html_lower for s in funeral_signals):
        return {"valid": True, "confidence": "probable",
                "reason": "funeral content found, name not confirmed"}

    return {"valid": False, "confidence": "low",
            "reason": "business name not found on page"}


def guess_urls(business_name: str) -> list[str]:
    """Generate candidate URLs from a business name."""
    # Clean name for domain guessing
    slug = business_name.lower().strip()
    slug = re.sub(r"[''`]", "", slug)
    slug = re.sub(r"\b(pty|ltd|limited|proprietary|inc)\b", "", slug)
    slug = re.sub(r"[^a-z0-9]+", "", slug)

    # Also try hyphenated version
    slug_hyphen = business_name.lower().strip()
    slug_hyphen = re.sub(r"[''`]", "", slug_hyphen)
    slug_hyphen = re.sub(r"\b(pty|ltd|limited|proprietary|inc)\b", "", slug_hyphen)
    slug_hyphen = re.sub(r"[^a-z0-9]+", "-", slug_hyphen).strip("-")

    candidates = []
    for s in [slug, slug_hyphen]:
        if s:
            candidates.append(f"https://www.{s}.com.au")
            candidates.append(f"https://{s}.com.au")

    return candidates


def discover_website(name: str, suburb: str | None, state: str | None,
                     phone: str | None = None) -> dict | None:
    """Attempt to discover a business website.

    Returns: {url, confidence, method, validation} or None.
    """
    # Build search query
    query_parts = [name]
    if suburb:
        query_parts.append(suburb)
    if state:
        query_parts.append(state)
    query = " ".join(query_parts)

    # Strategy 1: Serper.dev (Google results, 2500 free)
    results = search_serper(query)

    # Strategy 2: DuckDuckGo fallback
    if not results:
        results = search_ddg(query)

    for url in results[:3]:
        validation = validate_url(url, name)
        if validation["valid"]:
            return {
                "url": url.rstrip("/"),
                "confidence": validation["confidence"],
                "method": "search",
                "validation": validation,
            }
        time.sleep(0.5)

    # Strategy 3: URL guessing
    candidates = guess_urls(name)
    for url in candidates:
        try:
            validation = validate_url(url, name)
            if validation["valid"]:
                return {
                    "url": url.rstrip("/"),
                    "confidence": validation["confidence"],
                    "method": "guess",
                    "validation": validation,
                }
        except Exception:
            continue
        time.sleep(0.3)

    return None


def run(limit: int | None = None, state_filter: str | None = None):
    """Discover websites for all providers without one.

    Args:
        limit: Max providers to process (for testing).
        state_filter: Only process providers in this state.
    """
    db = get_db()

    query = """
        SELECT id, title, business_suburb, business_state, phone
        FROM funeral_brand
        WHERE website IS NULL AND verified = 0
    """
    params = []

    if state_filter:
        query += " AND business_state = ?"
        params.append(state_filter)

    query += " ORDER BY id"

    if limit:
        # Bind LIMIT as a parameter instead of interpolating it into the SQL
        query += " LIMIT ?"
        params.append(limit)

    providers = db.execute(query, params).fetchall()
    print(f"Providers without websites: {len(providers)}")

    found = 0
    not_found = 0

    for i, prov in enumerate(providers):
        name = prov["title"]
        suburb = prov["business_suburb"]
        state = prov["business_state"]
        phone = prov["phone"]

        if (i + 1) % 10 == 0 or i == 0:
            print(f" [{i+1}/{len(providers)}] Processing: {name}")

        result = discover_website(name, suburb, state, phone)

        if result:
            db.execute(
                """UPDATE funeral_brand
                   SET website = ?, updated_at = datetime('now')
                   WHERE id = ?""",
                (result["url"], prov["id"])
            )
            found += 1
            if (i + 1) <= 20 or result["confidence"] == "confirmed":
                print(f" FOUND ({result['confidence']}, {result['method']}): "
                      f"{result['url']}")
        else:
            not_found += 1

        if (i + 1) % 20 == 0:
            db.commit()

        # Rate limit: ~2s between providers (DDG + validation requests)
        time.sleep(CRAWL_DELAY * 2)

    db.commit()
    print(f"\nDone: {found} websites found, {not_found} not found")
    if found + not_found > 0:
        print(f" Success rate: {found/(found+not_found)*100:.1f}%")

    db.close()


if __name__ == "__main__":
    import sys
    limit = None
    state = None

    for arg in sys.argv[1:]:
        if arg.startswith("--state="):
            state = arg.split("=", 1)[1]
        elif arg.startswith("--limit="):
            limit = int(arg.split("=", 1)[1])
        else:
            # A bare integer argument is treated as the limit
            try:
                limit = int(arg)
            except ValueError:
                pass

    run(limit=limit, state_filter=state)
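To see what the pattern-guessing strategy produces, here is a condensed, standalone version of `guess_urls`. It is a sketch for illustration rather than the module's code: the two slug pipelines are folded into one shared cleaning step, and the business name is made up.

```python
import re


def guess_urls(business_name: str) -> list[str]:
    """Condensed copy of guess_urls above, for illustration only."""
    cleaned = business_name.lower().strip()
    cleaned = re.sub(r"['`]", "", cleaned)
    # Drop company suffixes that rarely appear in domain names.
    cleaned = re.sub(r"\b(pty|ltd|limited|proprietary|inc)\b", "", cleaned)
    slug = re.sub(r"[^a-z0-9]+", "", cleaned)
    slug_hyphen = re.sub(r"[^a-z0-9]+", "-", cleaned).strip("-")

    candidates = []
    for s in [slug, slug_hyphen]:
        if s:
            candidates.append(f"https://www.{s}.com.au")
            candidates.append(f"https://{s}.com.au")
    return candidates


# "Smith & Sons Funerals Pty Ltd" is a made-up name.
urls = guess_urls("Smith & Sons Funerals Pty Ltd")
for candidate in urls:
    print(candidate)
```

For a one-word name the two slugs coincide, so the list contains duplicates; a caller that wants to avoid repeated requests may prefer to deduplicate the candidates first.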
393
crawlers/enrich_websites.py
Normal file
@@ -0,0 +1,393 @@
"""Website enrichment module.

For each provider with a website but no packages yet, crawls their site
to find pricing/packages pages and extracts structured data.

Two extraction modes:
1. Direct HTML parsing (for sites with clear pricing structure)
2. AI extraction via API call (for complex/varied layouts)

This module handles the crawling and page discovery.
AI extraction is delegated to the n8n workflow (Claude Haiku node).
"""

import json
import re
import time
import urllib.parse
import urllib.error
from pathlib import Path

from base import fetch_url, get_db, CRAWL_DELAY

# Common URL patterns for pricing/packages pages
PRICING_PATHS = [
    "/pricing",
    "/prices",
    "/our-prices",
    "/packages",
    "/funeral-packages",
    "/services",
    "/our-services",
    "/funeral-costs",
    "/funeral-services",
    "/service-options",
    "/price-list",
    "/transparency",
    "/funeral-pricing",
    "/costs",
    "/cremation",
    "/cremation-packages",
    "/burial",
    "/plan-a-funeral",
    "/arrange",
]

# Keywords that suggest a link leads to pricing
PRICING_KEYWORDS = [
    "pric", "cost", "packag", "service", "plan",
    "cremation", "burial", "funeral",
    "transparency", "disclosure",
]


def find_pricing_page(base_url: str, homepage_html: str) -> str | None:
    """Try to find the pricing/packages page URL.

    Strategy:
    1. Try common URL patterns
    2. Parse homepage links for pricing-related keywords
    """
    base = base_url.rstrip("/")

    # Strategy 1: Try common paths
    for path in PRICING_PATHS:
        test_url = base + path
        try:
            html = fetch_url(test_url, timeout=10)
            # Guard against soft 404s: require some pricing content on the page
            if len(html) > 1000 and ("$" in html or "price" in html.lower()):
                return test_url
        except Exception:
            # Covers HTTPError and URLError as well as anything else
            continue
        time.sleep(0.3)

    # Strategy 2: Parse homepage links
    link_pattern = re.compile(
        r'<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
        re.IGNORECASE | re.DOTALL
    )

    for match in link_pattern.finditer(homepage_html):
        href = match.group(1)
        text = re.sub(r"<[^>]+>", "", match.group(2)).lower().strip()
        href_lower = href.lower()

        # Skip links that are not fetchable pages
        if href_lower.startswith(("#", "mailto:", "tel:", "javascript:")):
            continue

        # Check if link text or URL contains pricing keywords
        if any(kw in text or kw in href_lower for kw in PRICING_KEYWORDS):
            # Resolve relative URLs
            if href.startswith("/"):
                full_url = base + href
            elif href.startswith("http"):
                # Only follow links to the same domain
                if urllib.parse.urlparse(base).netloc in href:
                    full_url = href
                else:
                    continue
            else:
                full_url = base + "/" + href

            try:
                html = fetch_url(full_url, timeout=10)
                if len(html) > 500:
                    return full_url
            except Exception:
                continue
            time.sleep(0.3)

    return None


def extract_description(html: str) -> str | None:
    """Extract a business description from homepage HTML."""
    # Try meta description first
    meta_match = re.search(
        r'<meta\s+(?:name="description"\s+content="([^"]+)"|content="([^"]+)"\s+name="description")',
        html, re.IGNORECASE
    )
    if meta_match:
        desc = meta_match.group(1) or meta_match.group(2)
        if desc and len(desc) > 20:
            return desc.strip()

    # Try OG description
    og_match = re.search(
        r'<meta\s+property="og:description"\s+content="([^"]+)"',
        html, re.IGNORECASE
    )
    if og_match and len(og_match.group(1)) > 20:
        return og_match.group(1).strip()

    return None


def extract_contact_info(html: str) -> dict:
    """Extract contact details from HTML."""
    info = {}

    # Phone
    phone_match = re.search(r'href="tel:([^"]+)"', html)
    if phone_match:
        info["phone"] = phone_match.group(1).strip()

    # Email
    email_match = re.search(r'href="mailto:([^"?]+)"', html)
    if email_match:
        info["email"] = email_match.group(1).strip()

    # Address from JSON-LD
    addr_match = re.search(r'"streetAddress"\s*:\s*"([^"]*)"', html)
    if addr_match:
        info["address"] = addr_match.group(1)

    return info


def check_has_pricing(html: str) -> bool:
    """Quick check whether a page contains pricing information."""
    # Look for dollar signs near numbers
    price_pattern = re.compile(r'\$[\d,]+(?:\.\d{2})?')
    prices_found = price_pattern.findall(html)

    # Filter out tiny amounts (likely not funeral pricing)
    significant_prices = []
    for p in prices_found:
        cleaned = p.replace("$", "").replace(",", "").strip()
        if not cleaned:
            continue
        try:
            amount = float(cleaned)
        except ValueError:
            continue
        if amount >= 100:
            significant_prices.append(amount)

    return len(significant_prices) >= 1


def prepare_for_ai_extraction(html: str) -> str:
    """Clean HTML for AI extraction: remove noise, keep content."""
    # Remove script and style tags
    cleaned = re.sub(r"<script[^>]*>.*?</script>", "", html,
                     flags=re.DOTALL | re.IGNORECASE)
    cleaned = re.sub(r"<style[^>]*>.*?</style>", "", cleaned,
                     flags=re.DOTALL | re.IGNORECASE)

    # Remove HTML comments
    cleaned = re.sub(r"<!--.*?-->", "", cleaned, flags=re.DOTALL)

    # Remove nav, header, footer elements
    for tag in ["nav", "header", "footer"]:
        cleaned = re.sub(
            rf"<{tag}[^>]*>.*?</{tag}>", "", cleaned,
            flags=re.DOTALL | re.IGNORECASE
        )

    # Strip remaining tags but keep text
    text = re.sub(r"<[^>]+>", " ", cleaned)
    # Collapse whitespace
    text = re.sub(r"\s+", " ", text).strip()

    # Truncate to ~8000 chars (fits well within Haiku context)
    if len(text) > 8000:
        text = text[:8000] + "..."

    return text


def enrich_provider(provider_id: int, website: str, db) -> dict:
    """Crawl a provider's website and extract enrichment data.

    Returns a dict with what was found.
    """
    result = {
        "homepage_fetched": False,
        "description": None,
        "contact_info": {},
        "pricing_page_url": None,
        "has_pricing": False,
        "pricing_page_text": None,  # cleaned text for AI extraction
        "pdf_links": [],
    }

    # Step 1: Fetch homepage
    try:
        homepage = fetch_url(website, timeout=15)
        result["homepage_fetched"] = True
    except Exception as e:
        result["error"] = str(e)[:200]
        return result

    # Step 2: Extract description and contact info
    result["description"] = extract_description(homepage)
    result["contact_info"] = extract_contact_info(homepage)

    # Step 3: Find pricing page
    time.sleep(CRAWL_DELAY)
    pricing_url = find_pricing_page(website, homepage)

    if pricing_url:
        result["pricing_page_url"] = pricing_url
        try:
            pricing_html = fetch_url(pricing_url, timeout=15)
            result["has_pricing"] = check_has_pricing(pricing_html)
            result["pricing_page_text"] = prepare_for_ai_extraction(pricing_html)

            # Check for PDF links
            pdf_links = re.findall(
                r'href="([^"]*\.pdf[^"]*)"', pricing_html, re.IGNORECASE
            )
            for pdf_href in pdf_links:
                if pdf_href.startswith("/"):
                    pdf_href = website.rstrip("/") + pdf_href
                elif not pdf_href.startswith("http"):
                    pdf_href = website.rstrip("/") + "/" + pdf_href
                result["pdf_links"].append(pdf_href)

        except Exception:
            pass
    else:
        # Check homepage itself for pricing
        if check_has_pricing(homepage):
            result["has_pricing"] = True
            result["pricing_page_url"] = website
            result["pricing_page_text"] = prepare_for_ai_extraction(homepage)

    return result


def run(limit: int | None = None, state_filter: str | None = None):
    """Enrich all providers that have a website but no packages."""
    db = get_db()

    query = """
        SELECT fb.id, fb.title, fb.website, fb.business_state
        FROM funeral_brand fb
        LEFT JOIN package p ON p.brand_id = fb.id
        WHERE fb.website IS NOT NULL
          AND fb.verified = 0
          AND p.id IS NULL
    """
    params = []

    if state_filter:
        query += " AND fb.business_state = ?"
        params.append(state_filter)

    query += " ORDER BY fb.id"

    if limit:
        # Bind LIMIT as a parameter instead of interpolating it into the SQL
        query += " LIMIT ?"
        params.append(limit)

    providers = db.execute(query, params).fetchall()
    print(f"Providers to enrich: {len(providers)}")

    enriched = 0
    pricing_found = 0
    failed = 0

    for i, prov in enumerate(providers):
        if (i + 1) % 5 == 0 or i == 0:
            print(f" [{i+1}/{len(providers)}] {prov['title']}")

        result = enrich_provider(prov["id"], prov["website"], db)

        if not result["homepage_fetched"]:
            failed += 1
            db.execute(
                """UPDATE funeral_brand
                   SET enrichment_status = 'failed', updated_at = datetime('now')
                   WHERE id = ?""",
                (prov["id"],)
            )
            continue

        enriched += 1

        # Update brand with discovered info, filling blanks only
        updates = {}
        brand = db.execute("SELECT * FROM funeral_brand WHERE id = ?",
                           (prov["id"],)).fetchone()
        if result["description"] and not brand["description"]:
            updates["description"] = result["description"]

        contact = result["contact_info"]
        if contact.get("email") and not brand["email"]:
            updates["email"] = contact["email"]
        if contact.get("phone") and not brand["phone"]:
            updates["phone"] = contact["phone"]

        if result["has_pricing"]:
            pricing_found += 1
        # Status is 'partial' in both cases: either pricing was found and
        # still needs AI extraction, or only the homepage was enriched.
        updates["enrichment_status"] = "partial"

        if updates:
            set_parts = [f"{k} = ?" for k in updates]
            values = list(updates.values()) + [prov["id"]]
            db.execute(
                f"UPDATE funeral_brand SET {', '.join(set_parts)}, "
                f"updated_at = datetime('now') WHERE id = ?",
                values
            )

        # Store pricing page text for later AI extraction
        if result["pricing_page_text"]:
            db.execute(
                """INSERT OR REPLACE INTO source_record
                   (source_name, source_id, source_url, raw_data,
                    matched_brand_id, match_type)
                   VALUES ('website_crawl', ?, ?, ?, ?, 'enrichment')""",
                (
                    f"brand_{prov['id']}",
                    result["pricing_page_url"],
                    json.dumps({
                        "pricing_text": result["pricing_page_text"],
                        "pdf_links": result["pdf_links"],
                        "has_pricing": result["has_pricing"],
                    }),
                    prov["id"],
                )
            )

        if (i + 1) % 10 == 0:
            db.commit()

        time.sleep(CRAWL_DELAY)

    db.commit()
    print(f"\nDone: {enriched} enriched, {pricing_found} with pricing, {failed} failed")

    db.close()


if __name__ == "__main__":
    import sys
    limit = None
    state = None

    for arg in sys.argv[1:]:
        if arg.startswith("--state="):
            state = arg.split("=", 1)[1]
        elif arg.startswith("--limit="):
            limit = int(arg.split("=", 1)[1])
        else:
            # A bare integer argument is treated as the limit
            try:
                limit = int(arg)
            except ValueError:
                pass

    run(limit=limit, state_filter=state)
199
crawlers/lookup_abn.py
Normal file
@@ -0,0 +1,199 @@
"""ABN Lookup module via the Australian Business Register (ABR) API.

Enriches providers with their ABN (strongest dedup key) and validates
that they are active registered businesses.

The ABR API is FREE. Requires a GUID (authentication token) from:
https://abr.business.gov.au/Tools/WebServices

Configuration:
    Set the ABR_GUID env var, or add "abr_guid" to config.json.
"""

import json
import os
import time
import xml.etree.ElementTree as ET

from base import fetch_url, get_db

# Load ABR GUID from env or config
ABR_GUID = os.environ.get("ABR_GUID")
if not ABR_GUID:
    config_path = os.path.join(os.path.dirname(__file__), "config.json")
    if os.path.exists(config_path):
        with open(config_path) as f:
            config = json.load(f)
            ABR_GUID = config.get("abr_guid")

ABR_BASE = "https://abr.business.gov.au/abrxmlsearch/AbrXmlSearch.asmx"

def search_by_name(name: str, state: str | None = None,
                   postcode: str | None = None) -> list[dict]:
    """Search ABR by business name. Returns matching records."""
    if not ABR_GUID:
        print(" WARNING: ABR_GUID not configured. Skipping ABN lookup.")
        return []

    params = {
        "name": name,
        "postcode": postcode or "",
        "legalName": "Y",
        "tradingName": "Y",
        "NSW": "Y", "SA": "Y", "ACT": "Y", "VIC": "Y",
        "WA": "Y", "NT": "Y", "QLD": "Y", "TAS": "Y",
        "authenticationGuid": ABR_GUID,
    }

    # If state specified, only search that state
    if state:
        for s in ["NSW", "SA", "ACT", "VIC", "WA", "NT", "QLD", "TAS"]:
            params[s] = "Y" if s == state else "N"

    url = f"{ABR_BASE}/ABRSearchByNameSimpleProtocol"
    try:
        text = fetch_url(url, method="GET", data=params, timeout=15)
    except Exception as e:
        # Treat a failed request as "no results", but say why
        print(f" WARNING: ABR lookup failed for {name!r}: {e}")
        return []

    # Parse XML response
    results = []
    try:
        root = ET.fromstring(text)
        # The ABR response uses a default namespace
        ns = {"abr": "http://abr.business.gov.au/ABRXMLSearch/"}

        for record in root.findall(".//abr:searchResultsRecord", ns):
            abn_elem = record.find(".//abr:ABN/abr:identifierValue", ns)
            status_elem = record.find(".//abr:ABN/abr:identifierStatus", ns)
            # Test each lookup against None explicitly: an Element with no
            # children is falsy, so chaining find() calls with `or` would
            # discard a name element that was actually found.
            name_elem = None
            for parent in ("mainName", "mainTradingName", "businessName"):
                name_elem = record.find(f".//abr:{parent}/abr:organisationName", ns)
                if name_elem is not None:
                    break
            state_elem = record.find(".//abr:mainBusinessPhysicalAddress/abr:stateCode", ns)
            postcode_elem = record.find(".//abr:mainBusinessPhysicalAddress/abr:postcode", ns)
            score_elem = record.find(".//abr:nameScore", ns)

            if abn_elem is not None:
                results.append({
                    "abn": abn_elem.text,
                    "status": status_elem.text if status_elem is not None else None,
                    "name": name_elem.text if name_elem is not None else None,
                    "state": state_elem.text if state_elem is not None else None,
                    "postcode": postcode_elem.text if postcode_elem is not None else None,
                    # An empty score element counts as 0 rather than raising
                    "score": int(score_elem.text) if score_elem is not None and score_elem.text else 0,
                })
    except ET.ParseError:
        return []

    return results


def find_best_match(name: str, state: str | None = None,
                    postcode: str | None = None) -> dict | None:
    """Find the best ABR match for a business name.

    Returns the highest-scoring active match, or None.
    """
    results = search_by_name(name, state, postcode)

    # Filter to active businesses
    active = [r for r in results if r.get("status") == "Active"]
    if not active:
        return None

    # Sort by score descending
    active.sort(key=lambda r: r.get("score", 0), reverse=True)

    # Return best match if score is reasonable
    best = active[0]
    if best.get("score", 0) >= 80:
        return best

    return None


def run(limit: int | None = None, state_filter: str | None = None):
    """Look up ABNs for all providers that don't have one."""
    db = get_db()

    query = """
        SELECT id, title, business_state, business_postcode
        FROM funeral_brand
        WHERE abn IS NULL AND verified = 0
    """
    params = []

    if state_filter:
        query += " AND business_state = ?"
        params.append(state_filter)

    query += " ORDER BY id"

    if limit:
        # Bind the limit as a parameter instead of formatting it into the SQL
        query += " LIMIT ?"
        params.append(limit)

    providers = db.execute(query, params).fetchall()
    print(f"Providers without ABN: {len(providers)}")

    if not ABR_GUID:
        print("ERROR: ABR_GUID not configured.")
        print(" Register at: https://abr.business.gov.au/Tools/WebServices")
        print(" Then set ABR_GUID env var or add 'abr_guid' to config.json")
        db.close()
        return

    found = 0
    not_found = 0

    for i, prov in enumerate(providers):
        if (i + 1) % 20 == 0 or i == 0:
            print(f" [{i+1}/{len(providers)}] {prov['title']}")

        match = find_best_match(
            prov["title"],
            prov["business_state"],
            prov["business_postcode"]
        )

        if match:
            db.execute(
                "UPDATE funeral_brand SET abn = ?, updated_at = datetime('now') WHERE id = ?",
                (match["abn"], prov["id"])
            )
            found += 1
        else:
            not_found += 1

        if (i + 1) % 50 == 0:
            db.commit()

        time.sleep(0.5)  # Be gentle with the government API

    db.commit()
    print(f"\nDone: {found} ABNs found, {not_found} not found")
    if found + not_found > 0:
        print(f" Success rate: {found/(found+not_found)*100:.1f}%")

    db.close()


if __name__ == "__main__":
    import sys
    limit = None
    state = None

    for arg in sys.argv[1:]:
        if arg.startswith("--state="):
            state = arg.split("=", 1)[1]
        elif arg.startswith("--limit="):
            limit = int(arg.split("=", 1)[1])
        else:
            # A bare integer is accepted as the limit; anything else is ignored
            try:
                limit = int(arg)
            except ValueError:
                pass

    run(limit=limit, state_filter=state)
111
crawlers/run_overnight.sh
Executable file
@@ -0,0 +1,111 @@
#!/bin/bash
# Full pipeline overnight run
# Usage: ./run_overnight.sh
#
# Before running:
#   1. Add your Serper API key to config.json
#   2. Optionally add your Anthropic API key for AI extraction
#
# This script runs all steps sequentially and logs everything.

# pipefail matters here: every step is piped through tee, and without it
# `set -e` only sees tee's exit status, so a failing step would not stop the run.
set -eo pipefail
cd "$(dirname "$0")"

LOG="../logs/overnight_$(date +%Y%m%d_%H%M%S).log"
mkdir -p ../logs

echo "=== OVERNIGHT PIPELINE RUN ===" | tee "$LOG"
echo "Started: $(date)" | tee -a "$LOG"
echo "" | tee -a "$LOG"

# Check config
SERPER_KEY=$(python3 -c "import json; c=json.load(open('config.json')); print(c.get('serper_api_key') or '')")
ANTHROPIC_KEY=$(python3 -c "import json; c=json.load(open('config.json')); print(c.get('anthropic_api_key') or '')")

if [ -z "$SERPER_KEY" ]; then
    echo "WARNING: No Serper API key — website discovery will use DDG (slower, lower hit rate)" | tee -a "$LOG"
else
    echo "Serper API key: configured" | tee -a "$LOG"
fi

if [ -z "$ANTHROPIC_KEY" ]; then
    echo "WARNING: No Anthropic API key — AI extraction will be skipped" | tee -a "$LOG"
else
    echo "Anthropic API key: configured" | tee -a "$LOG"
fi
echo "" | tee -a "$LOG"

# Step 1: Source crawlers
echo "=== STEP 1: Source Crawlers ===" | tee -a "$LOG"
echo "[$(date +%H:%M:%S)] Running VIC Register crawler..." | tee -a "$LOG"
python3 crawl_vic_register.py 2>&1 | tee -a "$LOG"

echo "[$(date +%H:%M:%S)] Running Funerals Australia crawler..." | tee -a "$LOG"
python3 crawl_funerals_australia.py 2>&1 | tee -a "$LOG"

echo "[$(date +%H:%M:%S)] Running NFDA crawler..." | tee -a "$LOG"
python3 crawl_nfda.py 2>&1 | tee -a "$LOG"

# Step 2: Deduplication
echo "" | tee -a "$LOG"
echo "=== STEP 2: Deduplication ===" | tee -a "$LOG"
echo "[$(date +%H:%M:%S)] Running dedup..." | tee -a "$LOG"
python3 dedup.py 2>&1 | tee -a "$LOG"

# Step 3: Website discovery (all providers without one)
echo "" | tee -a "$LOG"
echo "=== STEP 3: Website Discovery ===" | tee -a "$LOG"
NEED_WEBSITE=$(python3 -c "from base import get_db; db=get_db(); print(db.execute('SELECT COUNT(*) FROM funeral_brand WHERE website IS NULL AND verified=0').fetchone()[0])")
echo "[$(date +%H:%M:%S)] Providers needing websites: $NEED_WEBSITE" | tee -a "$LOG"

# Process in batches of 200 to avoid issues
BATCH=200
OFFSET=0
while [ "$OFFSET" -lt "$NEED_WEBSITE" ]; do
    REMAINING=$((NEED_WEBSITE - OFFSET))
    CURRENT=$((REMAINING < BATCH ? REMAINING : BATCH))
    echo "[$(date +%H:%M:%S)] Discovering websites batch $((OFFSET/BATCH + 1)) ($CURRENT providers)..." | tee -a "$LOG"
    python3 discover_websites.py --limit="$CURRENT" 2>&1 | tee -a "$LOG"
    OFFSET=$((OFFSET + BATCH))
    # Brief pause between batches
    sleep 5
done

# Step 4: Website enrichment (all with website, not yet enriched)
echo "" | tee -a "$LOG"
echo "=== STEP 4: Website Enrichment ===" | tee -a "$LOG"
NEED_ENRICH=$(python3 -c "from base import get_db; db=get_db(); print(db.execute(\"SELECT COUNT(*) FROM funeral_brand WHERE website IS NOT NULL AND enrichment_status='pending' AND verified=0\").fetchone()[0])")
echo "[$(date +%H:%M:%S)] Providers needing enrichment: $NEED_ENRICH" | tee -a "$LOG"
python3 enrich_websites.py --limit="$NEED_ENRICH" 2>&1 | tee -a "$LOG"

# Step 5: Compute tiers
echo "" | tee -a "$LOG"
echo "=== STEP 5: Compute Tiers ===" | tee -a "$LOG"
python3 compute_tiers.py 2>&1 | tee -a "$LOG"

# Final summary
# A quoted heredoc keeps the SQL free of shell escaping, and avoids
# backslashes inside f-string expressions (a syntax error before Python 3.12).
echo "" | tee -a "$LOG"
echo "=== FINAL SUMMARY ===" | tee -a "$LOG"
python3 - <<'PY' 2>&1 | tee -a "$LOG"
from base import get_db

db = get_db()


def count(sql):
    """Run a COUNT(*) query and return the single number it yields."""
    return db.execute(sql).fetchone()[0]


crawled = "FROM source_record WHERE source_name = 'website_crawl'"

print("Database Status:")
print(f"  Total providers:  {count('SELECT COUNT(*) FROM funeral_brand')}")
print(f"  With phone:       {count('SELECT COUNT(*) FROM funeral_brand WHERE phone IS NOT NULL')}")
print(f"  With email:       {count('SELECT COUNT(*) FROM funeral_brand WHERE email IS NOT NULL')}")
print(f"  With website:     {count('SELECT COUNT(*) FROM funeral_brand WHERE website IS NOT NULL')}")
print(f"  With description: {count('SELECT COUNT(*) FROM funeral_brand WHERE description IS NOT NULL')}")
print()
print("Listing Tiers:")
for row in db.execute("SELECT listing_tier, COUNT(*) as n FROM funeral_brand GROUP BY listing_tier ORDER BY n DESC"):
    print(f"  {row[0]:12s} {row[1]:>6d}")
print()
print("Pricing Pages:")
total = count(f"SELECT COUNT(*) {crawled}")
with_pricing = count(f"SELECT COUNT(*) {crawled} AND json_extract(raw_data, '$.has_pricing') = 1")
with_pdfs = count(f"SELECT COUNT(*) {crawled} AND json_extract(raw_data, '$.pdf_links') != '[]'")
print(f"  Total crawled:  {total}")
print(f"  With pricing:   {with_pricing}")
print(f"  With PDF links: {with_pdfs}")
PY

echo "" | tee -a "$LOG"
echo "Finished: $(date)" | tee -a "$LOG"
echo "Log saved to: $LOG"
69
database/IMAGE-MAPPING.md
Normal file
@@ -0,0 +1,69 @@
# Image Assets & Verified Provider Mapping

## Image Directory Structure

All images are downloaded locally in `images/` with the following structure:

```
images/
├── manifest.json                    # Full index mapping CMS IDs → local paths
├── providers/{slug}/                # 12 verified brands
│   ├── logo.{ext}                   # Rectangular/stacked logo
│   └── badge.{ext}                  # Circular/square badge (for cards)
├── funeral-homes/{slug}/            # 7 parent organisations
│   └── logo.{ext}
├── locations/{slug}/                # 20 physical offices
│   └── photo.{ext}                  # Building/staff hero photo
├── coffins/{category}/              # 201 coffins by range
│   └── {slug}/01.{ext}              # 1-4 images per coffin
├── venues/{slug}/                   # 1,678 service venues
│   └── 01.{ext}
└── crematoriums/{slug}/             # 38 crematoriums
    └── 01.{ext}
```

## Verified Brand → Image Mapping

These are the 12 existing verified brands from the CMS, with their image paths:

| CMS ID | Brand | Logo | Badge |
|--------|-------|------|-------|
| 1 | H.Parsons Funeral Directors | `providers/hparsons-funeral-directors/logo.png` | `providers/hparsons-funeral-directors/badge.png` |
| 3 | Rankins Funerals | `providers/rankins-funerals/logo.webp` | `providers/rankins-funerals/badge.png` |
| 4 | Parsons Ladies Funeral Directors | `providers/parsons-ladies-funeral-directors/logo.png` | `providers/parsons-ladies-funeral-directors/badge.png` |
| 5 | Wollongong City Funerals | `providers/wollongong-city-funerals/logo.webp` | `providers/wollongong-city-funerals/badge.png` |
| 6 | Easy Funerals | `providers/easy-funerals/logo.webp` | `providers/easy-funerals/badge.png` |
| 7 | Mackay Family Funerals | `providers/mackay-family-funerals/logo.webp` | `providers/mackay-family-funerals/badge.png` |
| 8 | H.Parsons Shoalhaven | `providers/hparsons-funeral-directors-shoalhaven/logo.png` | `providers/hparsons-funeral-directors-shoalhaven/badge.png` |
| 9 | Killick Family Funerals | `providers/killick-family-funerals/logo.webp` | `providers/killick-family-funerals/badge.png` |
| 10 | Kenneally's Funerals | `providers/kenneallys-funerals/logo.webp` | `providers/kenneallys-funerals/badge.png` |
| 11 | Lady Anne Funerals | `providers/lady-anne-funerals/logo.webp` | `providers/lady-anne-funerals/badge.png` |
| 12 | Mannings Funerals | `providers/mannings-funerals/logo.webp` | `providers/mannings-funerals/badge.png` |
| 13 | Botanical Funerals | `providers/botanical-funerals-by-ian-allison/logo.webp` | `providers/botanical-funerals-by-ian-allison/badge.png` |

## How to Use on the Demo Site

### For verified providers:
- Serve images from `images/providers/{slug}/` for logos and badges
- Serve location photos from `images/locations/{slug}/`
- Serve product images from `images/coffins/`, `images/venues/`, `images/crematoriums/`
- The `manifest.json` contains the full mapping from CMS record IDs to local file paths

### For unverified providers:
- **No images** — they have no logo, badge, or photos
- Use a generic placeholder or text-based display (business name initials, etc.)
- Images are only added when a provider signs up to become verified
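
One possible way to build the text-based placeholder mentioned above is to derive initials from the business name. The sketch below is illustrative only: `placeholder_initials` and its two-letter default are assumptions, not part of the demo site.

```python
import re


def placeholder_initials(title: str, max_letters: int = 2) -> str:
    """Return up to `max_letters` uppercase initials for a business name.

    Hypothetical helper: the name and the two-letter default are assumptions.
    """
    # Drop apostrophes first so "Kenneally's" stays one word
    cleaned = title.replace("'", "").replace("\u2019", "")
    # Split on anything that is not a letter or digit ("H.Parsons" -> "H", "Parsons")
    words = [w for w in re.split(r"[^A-Za-z0-9]+", cleaned) if w]
    return "".join(w[0].upper() for w in words[:max_letters])


print(placeholder_initials("Kenneally's Funerals"))
print(placeholder_initials("H.Parsons Funeral Directors"))
```

An empty or punctuation-only title yields an empty string, so the caller would still need a generic fallback icon for that case.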

### Importing verified brands:
The 12 verified brands need to be imported into the database with their full data from
`schemas/brands-full.json` (brand details, locations, packages, inclusions) and linked
to their images. Some of these brands were also discovered by the crawler and already
exist in `providers.db` as unverified — they should be **upgraded** (set `verified = true`,
add images) rather than duplicated.

### Product images:
- 201 coffins with 1-4 images each, organised by range (solid-timber, custom-board, etc.)
- 1,678 venue photos
- 38 crematorium photos
- These are only relevant for verified provider flows (arrangement booking)
- The `manifest.json` maps each product's CMS ID to its local image path
209
database/PROVIDER-SCHEMA-SPEC.md
Normal file
@@ -0,0 +1,209 @@
# Provider Data Model — Verified & Unverified Providers

This document extends the CMS schema (`schemas/cms-schema-spec.md`) with support for
unverified (auto-discovered) providers alongside the existing verified (signed-up) providers.

---

## Overview

The platform lists funeral directors in two categories:

- **Verified providers** — Signed up to the platform. Full branding (logo, badge, colours),
  complete package configuration, and online arrangement booking enabled.
- **Unverified providers** — Auto-discovered from public registries and their own websites.
  Listed with whatever public information is available. Can apply to become verified.

All providers share the same `funeral_brand` table and schema. The difference is driven
by data completeness and the `verified` / `listing_tier` fields.

---

## Schema Changes to FuneralBrand

These fields are **added** to the existing FuneralBrand collection from `cms-schema-spec.md`:

| Field | Type | Default | Purpose |
|-------|------|---------|---------|
| `verified` | Boolean | `false` | `true` for signed-up partners, `false` for auto-discovered |
| `listing_tier` | Enum | `'listed'` | Display tier, computed from data quality (see below) |
| `hidden` | Boolean | `true` | Unverified providers start hidden until admin-reviewed |
| `source_key` | String (unique) | `null` | Provenance identifier, e.g. `"nfda:1234"` |
| `source_url` | String (URL) | `null` | Where this record was discovered |
| `last_enriched_at` | DateTime | `null` | When data was last refreshed from provider's website |
| `enrichment_status` | Enum | `'pending'` | `pending` / `partial` / `complete` / `failed` |
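
The `source_key` example above, `"nfda:1234"`, follows a `{source}:{externalId}` pattern. A small sketch of building and splitting such keys is shown below; both helper names are hypothetical, and lowercasing the source name is an assumption made so that keys compare consistently.

```python
def make_source_key(source: str, external_id) -> str:
    """Build a provenance key of the form "{source}:{externalId}".

    Hypothetical helper; lowercasing the source name is an assumption.
    """
    return f"{source.strip().lower()}:{str(external_id).strip()}"


def split_source_key(source_key: str) -> tuple[str, str]:
    """Split a key at the first colon only, so external ids may contain colons."""
    source, _, external_id = source_key.partition(":")
    return source, external_id


print(make_source_key("NFDA", 1234))
```

Because `source_key` is unique, re-crawling the same registry entry produces the same key and can be turned into an update instead of a second row.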

### Fields that become optional for unverified providers

These fields are **required** for verified providers but **nullable** for unverified:

| Field | Verified | Unverified |
|-------|----------|------------|
| `logo` | Required (brand logo image) | `null` — no images until they sign up |
| `badge` | Required (card badge image) | `null` — no images until they sign up |
| `description` | Required | Optional (extracted from their website if available) |
| `backgroundColour` | Set (brand theme) | `null` — use platform default |
| `foregroundColour` | Set (brand theme) | `null` — use platform default |
| `modalDescription` | Set | `null` |
| `code` | Set (URL slug) | Auto-generated from business name |
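
The table says `code` is auto-generated from the business name but does not give the rule. The sketch below is one plausible approach, not the pipeline's actual implementation: `slugify` and `unique_code` are hypothetical names, and the normalisation steps are assumptions. Since `code` is unique in the schema, the sketch also appends a numeric suffix on collision.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Turn a business name into a lowercase, hyphen-separated slug (assumed rule)."""
    # Fold accented characters to ASCII and drop anything that cannot be folded
    text = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    # Remove apostrophes and full stops so they do not become separators
    text = re.sub(r"['.]", "", text.lower())
    # Collapse every remaining run of non-alphanumerics into a single hyphen
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")


def unique_code(title: str, taken: set[str]) -> str:
    """Return a slug not present in `taken`, adding -2, -3, ... if needed."""
    base = slugify(title) or "provider"
    code, n = base, 2
    while code in taken:
        code = f"{base}-{n}"
        n += 1
    return code


print(unique_code("Easy Funerals", {"easy-funerals"}))
```

Verified brands keep the `code` they were given in the CMS, so this would only apply to auto-discovered rows.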

### Fields present for both verified and unverified

| Field | Notes |
|-------|-------|
| `title` | Business name (always present) |
| `phone` | Contact phone (present for ~94% of providers) |
| `email` | Contact email (present for ~66%) |
| `website` | External website URL (present for ~68%) |
| `abn` | Australian Business Number (strongest dedup key) |
| `businessAddress/Suburb/State/Postcode` | Business location |
| `availableFuneralTypes` | Comma-separated funeral type IDs |

---

## Listing Tiers

Every provider is assigned a `listing_tier` that determines how they appear on the platform.
The tier is **computed from data quality** — specifically from what package/pricing data exists.

| Tier | Value | Criteria | UI Treatment |
|------|-------|----------|-------------|
| **Verified** | `'verified'` | `verified = true` | Full branding, package selection, online arrangements, custom images |
| **Priced** | `'priced'` | Unverified + 2 or more packages with itemized inclusion prices | Show packages with line-item breakdowns, no arrangements |
| **Estimated** | `'estimated'` | Unverified + at least 1 package with a total price | Show package prices, "Contact for full details" on breakdowns |
| **Listed** | `'listed'` | Unverified + no pricing data | Show contact info only, "Contact for pricing" CTA |

### Tier computation logic

```
if brand.verified:
    tier = 'verified'
elif brand has 2+ packages, each with 2+ priced inclusions:
    tier = 'priced'
elif brand has 1+ packages with any price:
    tier = 'estimated'
else:
    tier = 'listed'
```

### Upgrade incentive

Each tier below verified creates a natural CTA for the provider:
- `listed` → "Publish your pricing to help families compare"
- `estimated` → "Add detailed breakdowns to stand out"
- `priced` → "Sign up to enable online arrangements and add your branding"

---

## Data Relationships (unchanged from CMS spec, but applied to both verified and unverified providers)

```
FuneralBrand (verified or unverified)
├── Location[]          (physical offices — at least 1 per provider)
├── Package[]           (funeral plan bundles — 0 for 'listed' tier)
│   └── PackageInclusion[]  (fee line items — 0 for 'estimated' tier)
├── KnownFor[]          (feature badges — verified only typically)
└── FuneralArea[]       (service regions — M:N)
```

### Package (same schema as CMS spec, with additions)

| Field | Type | Notes |
|-------|------|-------|
| `id` | PK | |
| `title` | String | e.g. "Direct Cremation", "Chapel Service" |
| `description` | Text | What's included |
| `funeral_type` | Enum | `Service & Cremation`, `Service & Burial`, `Cremation Only`, `Graveside Burial`, `Water Cremation` |
| `brand_id` | FK → FuneralBrand | |
| `source_url` | String | Where this pricing was found (provider's website) |
| `extraction_confidence` | Float 0-1 | How reliable the extracted data is (0.7 = HTML, 0.6 = PDF) |
| `sort` | Integer | Display order |
| `hidden` | Boolean | |

### PackageInclusion (same schema as CMS spec)

| Field | Type | Notes |
|-------|------|-------|
| `id` | PK | |
| `price` | Decimal | Dollar amount |
| `optional` | Boolean | User can opt in/out |
| `complimentary` | Boolean | Included free |
| `display` | Boolean | Whether shown to user |
| `inclusion_type_title` | String | Category label (see standard types below) |
| `package_id` | FK → Package | |

### Standard inclusion type names

These are the consistent labels used across all providers:

**Standard fees:** Professional Service Fee, Transportation Service Fee, Professional Mortuary Care, Death Registration Certificate, Cremation Certificate/Permit, Government Levy, Accommodation

**Products:** Coffin, Cremation Fee, Cemetery Fee, Celebrant Fee

**Optional extras:** Saturday Service Fee, Twilight Service Surcharge, Viewing Fee, After Hours Transfer Surcharge, Dressing Fee, Embalming, Digital Recording, Webstreaming, Coffin Bearing by Funeral Directors

---

## Current Data

The database (`database/providers.db`, SQLite) contains:

| Metric | Count |
|--------|-------|
| Total providers | 1,463 |
| With phone | 1,380 (94%) |
| With email | 972 (66%) |
| With website | 994 (68%) |
| With description | 618 (42%) |
| Total packages | 416 |
| Total inclusions | 388 |

### Tier distribution

| Tier | Providers |
|------|-----------|
| Verified | 0 (existing 12 brands not yet imported as verified) |
| Priced | 10 |
| Estimated | 111 |
| Listed | 1,342 |

### State distribution

| State | Providers | With Pricing |
|-------|-----------|-------------|
| VIC | 701 | 77 |
| NSW | 269 | 8 |
| QLD | 151 | 21 |
| SA | 85 | 1 |
| WA | 79 | 12 |
| TAS | 25 | 0 |
| NT | 7 | 0 |
| ACT | 9 | 0 |

---

## Database Schema Files

- **`database/schema.sql`** — Full Postgres schema (production-ready)
- **`database/schema_sqlite.sql`** — SQLite schema (dev/demo)
- **`database/providers.db`** — Live SQLite database with 1,463 providers
- **`database/seed_verified.sql`** — Script to mark imported CMS brands as verified

The schema is designed to be **additive** to the existing CMS schema from `schemas/cms-schema-spec.md`.
The original 12 verified brands and their packages/products should be imported first, then
`seed_verified.sql` marks them as `verified = true, listing_tier = 'verified'`.

---

## Verified Provider Upgrade Path

When an unverified provider applies to become verified:

1. They claim their listing (email verification or ABN match)
2. They fill in missing fields: description, logo, badge, brand colours
3. They configure packages with full inclusion breakdowns
4. They enable arrangement booking
5. Admin approves → `verified = true, listing_tier = 'verified'`

The backend should support this flow — updating an existing unverified brand
record rather than creating a new one.
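
Step 5 can be sketched as an in-place update on the existing row. This is an admin-side illustration only (the discovery pipeline itself must never write `verified`), and `approve_provider` is a hypothetical function written against the SQLite dev database, where booleans are assumed to be stored as 0/1.

```python
import sqlite3


def approve_provider(conn: sqlite3.Connection, brand_id: int) -> bool:
    """Mark an existing unverified brand as verified; return True if a row changed.

    Hypothetical admin-side helper. It updates the existing record rather
    than inserting a new one, and leaves `hidden` untouched.
    """
    cur = conn.execute(
        """UPDATE funeral_brand
           SET verified = 1, listing_tier = 'verified', updated_at = datetime('now')
           WHERE id = ? AND verified = 0""",
        (brand_id,),
    )
    conn.commit()
    # rowcount is 0 when the id is unknown or the brand was already verified
    return cur.rowcount == 1
```

The `verified = 0` guard makes the call idempotent: approving the same brand twice changes nothing the second time and returns `False`.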
BIN
database/providers.db
Normal file
Binary file not shown.
285
database/schema.sql
Normal file
@@ -0,0 +1,285 @@
-- Provider Discovery Pipeline - Database Schema
-- Designed for Postgres. Compatible with SilverStripe CMS adaptation.
--
-- This schema covers the provider-facing tables needed for both
-- verified (signed-up) and unverified (auto-discovered) providers.
-- Product catalog tables (coffins, venues, etc.) are NOT included here —
-- those only apply to verified providers and live in the main CMS.

BEGIN;

-- ============================================================
-- ENUMS
-- ============================================================

CREATE TYPE enrichment_status AS ENUM ('pending', 'partial', 'complete', 'failed');

-- Listing tier determines how a provider appears on the platform.
-- Computed from data quality: verified status + packages + inclusions.
CREATE TYPE listing_tier AS ENUM (
    'verified',    -- Tier 1: Signed up, full branding, arrangements enabled
    'priced',      -- Tier 2: Unverified, 2+ packages with itemized inclusion prices
    'estimated',   -- Tier 3: Unverified, at least one total package price
    'listed'       -- Tier 4: Unverified, contact info only, no pricing
);

CREATE TYPE funeral_type_enum AS ENUM (
    'Service & Cremation',
    'Service & Burial',
    'Cremation Only',
    'Graveside Burial',
    'Water Cremation'
);

-- ============================================================
-- 1. FUNERAL HOME (parent organisation)
-- ============================================================

CREATE TABLE funeral_home (
    id          SERIAL PRIMARY KEY,
    title       TEXT NOT NULL,
    website     TEXT,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- ============================================================
-- 2. FUNERAL BRAND (customer-facing provider)
-- ============================================================

CREATE TABLE funeral_brand (
    id                      SERIAL PRIMARY KEY,
    title                   TEXT NOT NULL,
    description             TEXT,
    modal_description       TEXT,
    email                   TEXT,
    phone                   TEXT,
    website                 TEXT,
    abn                     TEXT,
    code                    TEXT UNIQUE,              -- URL slug (e.g. "hparsons")
    sort                    INTEGER DEFAULT 0,
    hidden                  BOOLEAN NOT NULL DEFAULT TRUE,  -- unverified start hidden

    -- Address
    business_address        TEXT,
    business_suburb         TEXT,
    business_state          TEXT,
    business_postcode       TEXT,

    -- Branding (nullable — unverified providers have no images)
    background_colour       TEXT,
    foreground_colour       TEXT,

    -- Organisation
    funeral_home_id         INTEGER REFERENCES funeral_home(id) ON DELETE SET NULL,

    -- Verified vs auto-discovered
    verified                BOOLEAN NOT NULL DEFAULT FALSE,

    -- Provenance tracking
    source_key              TEXT UNIQUE,              -- "{source}:{externalId}" for dedup
    source_url              TEXT,                     -- where this record was found
    last_enriched_at        TIMESTAMPTZ,
    enrichment_status       enrichment_status NOT NULL DEFAULT 'pending',

    -- Listing tier (computed from data quality)
    listing_tier            listing_tier NOT NULL DEFAULT 'listed',

    -- Funeral types offered (comma-separated IDs, same as existing CMS)
    available_funeral_types TEXT,

    created_at              TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at              TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Deduplication indexes
CREATE INDEX idx_brand_abn ON funeral_brand(abn) WHERE abn IS NOT NULL;
CREATE INDEX idx_brand_listing_tier ON funeral_brand(listing_tier);
CREATE INDEX idx_brand_source_key ON funeral_brand(source_key) WHERE source_key IS NOT NULL;
CREATE INDEX idx_brand_name_postcode ON funeral_brand(title, business_postcode);
CREATE INDEX idx_brand_verified ON funeral_brand(verified);
CREATE INDEX idx_brand_hidden ON funeral_brand(hidden);
CREATE INDEX idx_brand_enrichment ON funeral_brand(enrichment_status) WHERE verified = FALSE;

-- ============================================================
-- 3. LOCATION (physical office/chapel)
-- ============================================================

CREATE TABLE location (
    id                SERIAL PRIMARY KEY,
    title             TEXT NOT NULL,              -- display name (e.g. "Kingaroy, QLD")
    address           TEXT,
    suburb            TEXT,
    state             TEXT,
    postcode          TEXT,
    country           TEXT DEFAULT 'Australia',
    lat               DOUBLE PRECISION,
    lng               DOUBLE PRECISION,
    rating            REAL,                       -- Google rating 0-5
    rating_num        INTEGER,                    -- number of Google reviews
    google_place_key  TEXT,                       -- Google Places ID

    brand_id          INTEGER NOT NULL REFERENCES funeral_brand(id) ON DELETE CASCADE,

    created_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at        TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_location_brand ON location(brand_id);
CREATE INDEX idx_location_state ON location(state);
CREATE INDEX idx_location_postcode ON location(postcode);
CREATE INDEX idx_location_coords ON location(lat, lng);
CREATE INDEX idx_location_google ON location(google_place_key) WHERE google_place_key IS NOT NULL;

-- ============================================================
-- 4. FUNERAL AREA (service region)
-- ============================================================

CREATE TABLE funeral_area (
    id           SERIAL PRIMARY KEY,
    title        TEXT NOT NULL,
    code         TEXT,
    description  TEXT,
    postcodes    TEXT,                            -- comma-separated postcode list
    sort         INTEGER DEFAULT 0,
    hidden       BOOLEAN DEFAULT FALSE,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Junction: brand <-> funeral_area
CREATE TABLE brand_funeral_area (
    brand_id         INTEGER NOT NULL REFERENCES funeral_brand(id) ON DELETE CASCADE,
    funeral_area_id  INTEGER NOT NULL REFERENCES funeral_area(id) ON DELETE CASCADE,
    PRIMARY KEY (brand_id, funeral_area_id)
);

-- ============================================================
|
||||
-- 5. PACKAGE (funeral plan bundle)
|
||||
-- ============================================================
|
||||
|
||||
CREATE TABLE package (
|
||||
id SERIAL PRIMARY KEY,
|
||||
title TEXT NOT NULL,
|
||||
description TEXT,
|
||||
sort INTEGER DEFAULT 0,
|
||||
hidden BOOLEAN DEFAULT FALSE,
|
||||
for_whom TEXT, -- 'myself' / 'someone' / null (both)
|
||||
religion TEXT, -- comma-separated supported religions
|
||||
funeral_type funeral_type_enum,
|
||||
|
||||
brand_id INTEGER NOT NULL REFERENCES funeral_brand(id) ON DELETE CASCADE,
|
||||
|
||||
-- Provenance (for AI-extracted packages)
|
||||
source_url TEXT, -- page this was extracted from
|
||||
extraction_confidence REAL, -- 0-1 confidence score from AI
|
||||
|
||||
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
|
||||
);
|
||||
|
||||
CREATE INDEX idx_package_brand ON package(brand_id);
|
||||
CREATE INDEX idx_package_type ON package(funeral_type);
|
||||
|
||||
-- Junction: package <-> funeral_area
|
||||
CREATE TABLE package_funeral_area (
|
||||
package_id INTEGER NOT NULL REFERENCES package(id) ON DELETE CASCADE,
|
||||
funeral_area_id INTEGER NOT NULL REFERENCES funeral_area(id) ON DELETE CASCADE,
|
||||
PRIMARY KEY (package_id, funeral_area_id)
|
||||
);
|
||||
|
||||
-- ============================================================
|
||||
-- 6. PACKAGE INCLUSION (fee line item within a package)
|
||||
-- ============================================================
|
||||
|
||||
CREATE TABLE package_inclusion (
|
||||
id SERIAL PRIMARY KEY,
|
||||
price NUMERIC(10,2) NOT NULL,
|
||||
optional BOOLEAN NOT NULL DEFAULT FALSE,
|
||||
complimentary BOOLEAN NOT NULL DEFAULT FALSE,
|
||||
display BOOLEAN NOT NULL DEFAULT TRUE,
|
||||
description TEXT,
|
||||
sort INTEGER DEFAULT 0,
|
||||
inclusion_type_title TEXT NOT NULL, -- category label (e.g. "Professional Service Fee")
|
||||
|
||||
package_id INTEGER NOT NULL REFERENCES package(id) ON DELETE CASCADE,
|
||||
|
||||
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
|
||||
);
|
||||
|
||||
CREATE INDEX idx_inclusion_package ON package_inclusion(package_id);
|
||||
|
||||
-- ============================================================
|
||||
-- 7. KNOWN FOR (feature badges on provider cards)
|
||||
-- ============================================================
|
||||
|
||||
CREATE TABLE known_for (
|
||||
id SERIAL PRIMARY KEY,
|
||||
title TEXT NOT NULL,
|
||||
brand_id INTEGER NOT NULL REFERENCES funeral_brand(id) ON DELETE CASCADE
|
||||
);
|
||||
|
||||
CREATE INDEX idx_known_for_brand ON known_for(brand_id);
|
||||
|
||||
-- ============================================================
|
||||
-- 8. SOURCE LOG (audit trail of scrape runs)
|
||||
-- ============================================================
|
||||
|
||||
CREATE TABLE source_log (
|
||||
id SERIAL PRIMARY KEY,
|
||||
source_name TEXT NOT NULL, -- 'vic_register', 'gathered_here', 'nfda', 'funerals_australia'
|
||||
run_started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||
run_finished_at TIMESTAMPTZ,
|
||||
records_found INTEGER DEFAULT 0,
|
||||
records_new INTEGER DEFAULT 0,
|
||||
records_updated INTEGER DEFAULT 0,
|
||||
records_skipped INTEGER DEFAULT 0,
|
||||
status TEXT DEFAULT 'running', -- 'running', 'completed', 'failed'
|
||||
error_message TEXT,
|
||||
metadata JSONB -- any extra run info
|
||||
);
|
||||
|
||||
-- ============================================================
|
||||
-- 9. SOURCE RECORD (raw scraped data, kept for audit)
|
||||
-- ============================================================
|
||||
|
||||
CREATE TABLE source_record (
|
||||
id SERIAL PRIMARY KEY,
|
||||
source_name TEXT NOT NULL,
|
||||
source_id TEXT NOT NULL, -- external ID from the source
|
||||
source_url TEXT,
|
||||
raw_data JSONB NOT NULL, -- original scraped data
|
||||
normalized_data JSONB, -- mapped to intermediate format
|
||||
matched_brand_id INTEGER REFERENCES funeral_brand(id) ON DELETE SET NULL,
|
||||
match_type TEXT, -- 'source_key', 'abn', 'name_postcode', 'fuzzy', 'new'
|
||||
processed_at TIMESTAMPTZ,
|
||||
log_id INTEGER REFERENCES source_log(id) ON DELETE SET NULL,
|
||||
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||
|
||||
UNIQUE(source_name, source_id)
|
||||
);
|
||||
|
||||
CREATE INDEX idx_source_record_source ON source_record(source_name, source_id);
|
||||
CREATE INDEX idx_source_record_brand ON source_record(matched_brand_id) WHERE matched_brand_id IS NOT NULL;
|
||||
|
||||
-- ============================================================
|
||||
-- UPDATED_AT TRIGGER
|
||||
-- ============================================================
|
||||
|
||||
CREATE OR REPLACE FUNCTION update_updated_at()
|
||||
RETURNS TRIGGER AS $$
|
||||
BEGIN
|
||||
NEW.updated_at = NOW();
|
||||
RETURN NEW;
|
||||
END;
|
||||
$$ LANGUAGE plpgsql;
|
||||
|
||||
CREATE TRIGGER trg_funeral_home_updated BEFORE UPDATE ON funeral_home FOR EACH ROW EXECUTE FUNCTION update_updated_at();
|
||||
CREATE TRIGGER trg_funeral_brand_updated BEFORE UPDATE ON funeral_brand FOR EACH ROW EXECUTE FUNCTION update_updated_at();
|
||||
CREATE TRIGGER trg_location_updated BEFORE UPDATE ON location FOR EACH ROW EXECUTE FUNCTION update_updated_at();
|
||||
CREATE TRIGGER trg_funeral_area_updated BEFORE UPDATE ON funeral_area FOR EACH ROW EXECUTE FUNCTION update_updated_at();
|
||||
CREATE TRIGGER trg_package_updated BEFORE UPDATE ON package FOR EACH ROW EXECUTE FUNCTION update_updated_at();
|
||||
CREATE TRIGGER trg_package_inclusion_updated BEFORE UPDATE ON package_inclusion FOR EACH ROW EXECUTE FUNCTION update_updated_at();
|
||||
|
||||
COMMIT;
|
||||
221
database/schema_sqlite.sql
Normal file
@@ -0,0 +1,221 @@
|
||||
-- Provider Discovery Pipeline - SQLite Schema (for local dev/testing)
|
||||
-- Production uses Postgres (see schema.sql)
|
||||
|
||||
-- ============================================================
|
||||
-- FUNERAL HOME
|
||||
-- ============================================================
|
||||
|
||||
CREATE TABLE IF NOT EXISTS funeral_home (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
title TEXT NOT NULL,
|
||||
website TEXT,
|
||||
created_at TEXT NOT NULL DEFAULT (datetime('now')),
|
||||
updated_at TEXT NOT NULL DEFAULT (datetime('now'))
|
||||
);
|
||||
|
||||
-- ============================================================
|
||||
-- FUNERAL BRAND
|
||||
-- ============================================================
|
||||
|
||||
CREATE TABLE IF NOT EXISTS funeral_brand (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
title TEXT NOT NULL,
|
||||
description TEXT,
|
||||
modal_description TEXT,
|
||||
email TEXT,
|
||||
phone TEXT,
|
||||
website TEXT,
|
||||
abn TEXT,
|
||||
code TEXT UNIQUE,
|
||||
sort INTEGER DEFAULT 0,
|
||||
hidden INTEGER NOT NULL DEFAULT 1,
|
||||
|
||||
business_address TEXT,
|
||||
business_suburb TEXT,
|
||||
business_state TEXT,
|
||||
business_postcode TEXT,
|
||||
|
||||
background_colour TEXT,
|
||||
foreground_colour TEXT,
|
||||
|
||||
funeral_home_id INTEGER REFERENCES funeral_home(id) ON DELETE SET NULL,
|
||||
|
||||
verified INTEGER NOT NULL DEFAULT 0,
|
||||
source_key TEXT UNIQUE,
|
||||
source_url TEXT,
|
||||
last_enriched_at TEXT,
|
||||
enrichment_status TEXT NOT NULL DEFAULT 'pending' CHECK(enrichment_status IN ('pending','partial','complete','failed')),
|
||||
|
||||
-- Listing tier: verified | priced | estimated | listed
|
||||
listing_tier TEXT NOT NULL DEFAULT 'listed'
|
||||
CHECK(listing_tier IN ('verified','priced','estimated','listed')),
|
||||
|
||||
available_funeral_types TEXT,
|
||||
|
||||
created_at TEXT NOT NULL DEFAULT (datetime('now')),
|
||||
updated_at TEXT NOT NULL DEFAULT (datetime('now'))
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_brand_abn ON funeral_brand(abn);
|
||||
CREATE INDEX IF NOT EXISTS idx_brand_source_key ON funeral_brand(source_key);
|
||||
CREATE INDEX IF NOT EXISTS idx_brand_listing_tier ON funeral_brand(listing_tier);
|
||||
CREATE INDEX IF NOT EXISTS idx_brand_name_postcode ON funeral_brand(title, business_postcode);
|
||||
CREATE INDEX IF NOT EXISTS idx_brand_verified ON funeral_brand(verified);
|
||||
CREATE INDEX IF NOT EXISTS idx_brand_hidden ON funeral_brand(hidden);
|
||||
|
||||
-- ============================================================
|
||||
-- LOCATION
|
||||
-- ============================================================
|
||||
|
||||
CREATE TABLE IF NOT EXISTS location (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
title TEXT NOT NULL,
|
||||
address TEXT,
|
||||
suburb TEXT,
|
||||
state TEXT,
|
||||
postcode TEXT,
|
||||
country TEXT DEFAULT 'Australia',
|
||||
lat REAL,
|
||||
lng REAL,
|
||||
rating REAL,
|
||||
rating_num INTEGER,
|
||||
google_place_key TEXT,
|
||||
|
||||
brand_id INTEGER NOT NULL REFERENCES funeral_brand(id) ON DELETE CASCADE,
|
||||
|
||||
created_at TEXT NOT NULL DEFAULT (datetime('now')),
|
||||
updated_at TEXT NOT NULL DEFAULT (datetime('now'))
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_location_brand ON location(brand_id);
|
||||
CREATE INDEX IF NOT EXISTS idx_location_postcode ON location(postcode);
|
||||
|
||||
-- ============================================================
|
||||
-- FUNERAL AREA
|
||||
-- ============================================================
|
||||
|
||||
CREATE TABLE IF NOT EXISTS funeral_area (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
title TEXT NOT NULL,
|
||||
code TEXT,
|
||||
description TEXT,
|
||||
postcodes TEXT,
|
||||
sort INTEGER DEFAULT 0,
|
||||
hidden INTEGER DEFAULT 0,
|
||||
created_at TEXT NOT NULL DEFAULT (datetime('now')),
|
||||
updated_at TEXT NOT NULL DEFAULT (datetime('now'))
|
||||
);
|
||||
|
||||
CREATE TABLE IF NOT EXISTS brand_funeral_area (
|
||||
brand_id INTEGER NOT NULL REFERENCES funeral_brand(id) ON DELETE CASCADE,
|
||||
funeral_area_id INTEGER NOT NULL REFERENCES funeral_area(id) ON DELETE CASCADE,
|
||||
PRIMARY KEY (brand_id, funeral_area_id)
|
||||
);
|
||||
|
||||
-- ============================================================
|
||||
-- PACKAGE
|
||||
-- ============================================================
|
||||
|
||||
CREATE TABLE IF NOT EXISTS package (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
title TEXT NOT NULL,
|
||||
description TEXT,
|
||||
sort INTEGER DEFAULT 0,
|
||||
hidden INTEGER DEFAULT 0,
|
||||
for_whom TEXT,
|
||||
religion TEXT,
|
||||
funeral_type TEXT CHECK(funeral_type IN (
|
||||
'Service & Cremation','Service & Burial','Cremation Only',
|
||||
'Graveside Burial','Water Cremation'
|
||||
)),
|
||||
|
||||
brand_id INTEGER NOT NULL REFERENCES funeral_brand(id) ON DELETE CASCADE,
|
||||
|
||||
source_url TEXT,
|
||||
extraction_confidence REAL,
|
||||
|
||||
created_at TEXT NOT NULL DEFAULT (datetime('now')),
|
||||
updated_at TEXT NOT NULL DEFAULT (datetime('now'))
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_package_brand ON package(brand_id);
|
||||
|
||||
CREATE TABLE IF NOT EXISTS package_funeral_area (
|
||||
package_id INTEGER NOT NULL REFERENCES package(id) ON DELETE CASCADE,
|
||||
funeral_area_id INTEGER NOT NULL REFERENCES funeral_area(id) ON DELETE CASCADE,
|
||||
PRIMARY KEY (package_id, funeral_area_id)
|
||||
);
|
||||
|
||||
-- ============================================================
|
||||
-- PACKAGE INCLUSION
|
||||
-- ============================================================
|
||||
|
||||
CREATE TABLE IF NOT EXISTS package_inclusion (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
price REAL NOT NULL,
|
||||
optional INTEGER NOT NULL DEFAULT 0,
|
||||
complimentary INTEGER NOT NULL DEFAULT 0,
|
||||
display INTEGER NOT NULL DEFAULT 1,
|
||||
description TEXT,
|
||||
sort INTEGER DEFAULT 0,
|
||||
inclusion_type_title TEXT NOT NULL,
|
||||
|
||||
package_id INTEGER NOT NULL REFERENCES package(id) ON DELETE CASCADE,
|
||||
|
||||
created_at TEXT NOT NULL DEFAULT (datetime('now')),
|
||||
updated_at TEXT NOT NULL DEFAULT (datetime('now'))
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_inclusion_package ON package_inclusion(package_id);
|
||||
|
||||
-- ============================================================
|
||||
-- KNOWN FOR
|
||||
-- ============================================================
|
||||
|
||||
CREATE TABLE IF NOT EXISTS known_for (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
title TEXT NOT NULL,
|
||||
brand_id INTEGER NOT NULL REFERENCES funeral_brand(id) ON DELETE CASCADE
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_known_for_brand ON known_for(brand_id);
|
||||
|
||||
-- ============================================================
|
||||
-- SOURCE LOG
|
||||
-- ============================================================
|
||||
|
||||
CREATE TABLE IF NOT EXISTS source_log (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
source_name TEXT NOT NULL,
|
||||
run_started_at TEXT NOT NULL DEFAULT (datetime('now')),
|
||||
run_finished_at TEXT,
|
||||
records_found INTEGER DEFAULT 0,
|
||||
records_new INTEGER DEFAULT 0,
|
||||
records_updated INTEGER DEFAULT 0,
|
||||
records_skipped INTEGER DEFAULT 0,
|
||||
status TEXT DEFAULT 'running',
|
||||
error_message TEXT,
|
||||
metadata TEXT -- JSON string
|
||||
);
|
||||
|
||||
-- ============================================================
|
||||
-- SOURCE RECORD
|
||||
-- ============================================================
|
||||
|
||||
CREATE TABLE IF NOT EXISTS source_record (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
source_name TEXT NOT NULL,
|
||||
source_id TEXT NOT NULL,
|
||||
source_url TEXT,
|
||||
raw_data TEXT NOT NULL, -- JSON string
|
||||
normalized_data TEXT, -- JSON string
|
||||
matched_brand_id INTEGER REFERENCES funeral_brand(id) ON DELETE SET NULL,
|
||||
match_type TEXT,
|
||||
processed_at TEXT,
|
||||
log_id INTEGER REFERENCES source_log(id) ON DELETE SET NULL,
|
||||
created_at TEXT NOT NULL DEFAULT (datetime('now')),
|
||||
|
||||
UNIQUE(source_name, source_id)
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_source_record_source ON source_record(source_name, source_id);
|
||||
24
database/seed_verified.sql
Normal file
@@ -0,0 +1,24 @@
|
||||
-- Seed script: Mark existing brands as verified
|
||||
-- Run after importing existing CMS data into the new schema.
|
||||
--
|
||||
-- This updates all pre-existing brands (imported from brands-full.json)
|
||||
-- to verified=true, hidden=false, enrichment_status='complete', listing_tier='verified'.
|
||||
|
||||
UPDATE funeral_brand
|
||||
SET verified = TRUE,
|
||||
hidden = FALSE,
|
||||
enrichment_status = 'complete',
|
||||
listing_tier = 'verified',
|
||||
updated_at = NOW()
|
||||
WHERE id IN (
|
||||
-- IDs from the existing 12 brands in brands-full.json
|
||||
-- These will be populated during the initial CMS data import.
|
||||
-- Update this list to match actual imported IDs.
|
||||
SELECT id FROM funeral_brand WHERE source_key IS NULL
|
||||
);
|
||||
|
||||
-- Alternatively, if importing with known codes:
|
||||
-- UPDATE funeral_brand SET verified = TRUE, hidden = FALSE, enrichment_status = 'complete'
|
||||
-- WHERE code IN ('hparsons', 'parsons-ladies', 'rankins', 'killick', 'botanical',
|
||||
-- 'easy', 'wollongong-city', 'kenneallys', 'lady-anne',
|
||||
-- 'mackay', 'mannings', 'guardian');
|
||||
196
n8n/PROCESS.md
Normal file
@@ -0,0 +1,196 @@
|
||||
# Provider Discovery Pipeline — End-to-End Process
|
||||
|
||||
Plain-English walkthrough of what the n8n workflows do, in what order, and how the data they produce lands in the database.
|
||||
|
||||
The four workflows in `workflows/` together form a continuous pipeline:
|
||||
**Discover → Find websites → Enrich with pricing → Refresh periodically.**
|
||||
Each workflow is an n8n schedule that shells out to Python scripts in `/opt/crawlers` (the `crawlers/` folder, mounted into the n8n container).
|
||||
|
||||
---
|
||||
|
||||
## The big picture
|
||||
|
||||
We're trying to populate the site with every funeral director in Australia, even before they've signed up with us. A provider starts life as a name and phone number from a public register and progressively gets enriched — website, description, packages, prices — until it either has enough data to be useful, or we've exhausted what's publicly available.
|
||||
|
||||
All discovered providers are **hidden by default** (`funeral_brand.hidden = 1`) and **unverified** (`verified = 0`) until an admin reviews them. The pipeline never modifies a provider that has signed up (`verified = 1`) — those are treated as authoritative.
|
||||
|
||||
A provider's data quality is summarised by a `listing_tier`:
|
||||
|
||||
| Tier | Means |
|------|-------|
| `listed` | Contact details only — we know the business exists |
| `estimated` | At least one package with a total price |
| `priced` | Two or more packages with itemised line items |
| `verified` | Signed-up partner (set manually, not by the pipeline) |
|
||||
|
||||
The tier is recomputed after every enrichment pass and drives what the frontend shows.
|
||||
|
||||
---
|
||||
|
||||
## Workflow 1 — Weekly Discovery
|
||||
**Runs:** Mondays at 02:00 AEST
|
||||
**File:** `workflows/1_weekly_discovery.json`
|
||||
|
||||
### What it does
|
||||
Three source crawlers run in parallel against public registers:
|
||||
|
||||
1. **VIC Consumer Affairs Register** (`crawl_vic_register.py`) — ~796 Victorian funeral directors, scraped from the government register HTML.
|
||||
2. **Funerals Australia** (`crawl_funerals_australia.py`) — ~997 members, fetched from their AJAX member-search API.
|
||||
3. **NFDA** (`crawl_nfda.py`) — ~209 records from their WordPress store-locator API.
|
||||
|
||||
Each crawler writes its raw response to `source_record` and logs the run to `source_log`. A merge step waits for all three to finish, then `dedup.py` runs. This is the interesting part: it matches records across sources by fuzzy name plus postcode and, when available, ABN; merges duplicates into a single `funeral_brand` row; and attaches the per-source records to it.
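
The matching order can be pictured as a strongest-key-first cascade. The sketch below is illustrative only and is not the code in `dedup.py`: the `normalise` helper, the 0.85 similarity threshold and the dict field names are assumptions. Only the `match_type` labels come from the schema comments.

```python
from difflib import SequenceMatcher

# Assumed noise words; the real normalisation rules may differ.
NOISE_WORDS = {"pty", "ltd"}


def normalise(name):
    """Lower-case, drop punctuation and company suffixes before comparing."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(word for word in cleaned.split() if word not in NOISE_WORDS)


def match_record(record, brands, threshold=0.85):
    """Return (brand_id, match_type), trying the most reliable key first."""
    name = normalise(record.get("title", ""))
    for brand in brands:
        if record.get("source_key") and record["source_key"] == brand.get("source_key"):
            return brand["id"], "source_key"
    for brand in brands:
        if record.get("abn") and record["abn"] == brand.get("abn"):
            return brand["id"], "abn"
    # Name comparisons are only attempted within the same postcode.
    nearby = [
        b for b in brands
        if record.get("postcode") and b.get("business_postcode") == record["postcode"]
    ]
    for brand in nearby:
        if normalise(brand["title"]) == name:
            return brand["id"], "name_postcode"
    for brand in nearby:
        if SequenceMatcher(None, name, normalise(brand["title"])).ratio() >= threshold:
            return brand["id"], "fuzzy"
    return None, "new"
```

Restricting the fuzzy comparison to records that share a postcode keeps same-named businesses in different towns from being merged, at the cost of missing duplicates whose postcodes disagree between sources.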
|
||||
|
||||
Finally n8n queries how many new `listed`-tier providers appeared in the last 7 days and emits a summary.
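
A summary count of this kind could be reproduced against the SQLite dev database roughly as follows. This is a sketch, not the query stored in the workflow JSON: the function name is hypothetical, and it assumes `created_at` holds text in the `datetime('now')` format used by `schema_sqlite.sql`.

```python
import sqlite3


def count_new_listed(conn, days=7):
    """Count `listed`-tier brands created within the last `days` days."""
    row = conn.execute(
        "SELECT COUNT(*) FROM funeral_brand "
        "WHERE listing_tier = 'listed' AND created_at >= datetime('now', ?)",
        (f"-{days} days",),  # SQLite date modifier, e.g. '-7 days'
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    # Path taken from the repo layout; adjust if the database lives elsewhere.
    with sqlite3.connect("database/providers.db") as conn:
        print(count_new_listed(conn))
```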
|
||||
|
||||
### Where the data lands
|
||||
- `source_log` — one row per crawler run (start/finish, counts, errors).
|
||||
- `source_record` — one row per raw record pulled from each source (e.g. a VIC Register entry). `raw_data` is the JSON as retrieved; `normalized_data` is the cleaned version.
|
||||
- `funeral_brand` — one row per unique business (post-dedup). Receives `title`, `phone`, `email`, `website` (if the source provided one), `business_address`, `business_suburb`, `business_state`, `business_postcode`, `source_key`, `source_url`. `hidden = 1`, `verified = 0`, `enrichment_status = 'pending'`, `listing_tier = 'listed'`.
|
||||
- `location` — one or more rows per brand (multi-location providers). Receives `title`, `address`, `suburb`, `state`, `postcode`, `lat`/`lng` where the source provides them.
|
||||
- `source_record.matched_brand_id` — back-pointer to the `funeral_brand` row that each raw record was merged into, with `match_type` indicating how (`source_key`, `abn`, `name_postcode`, `fuzzy`, or `new`).
|
||||
|
||||
---
|
||||
|
||||
## Workflow 2 — Daily Website Discovery
|
||||
**Runs:** Every day at 04:00 AEST
|
||||
**File:** `workflows/2_daily_website_discovery.json`
|
||||
|
||||
### What it does
|
||||
For providers where `funeral_brand.website IS NULL`, tries to find a website in two passes:
|
||||
|
||||
1. **ABN Lookup** (`lookup_abn.py`) — calls the free Australian Business Register API to validate the business is real and attach a verified ABN + registered state/postcode. This doesn't find websites, but it strengthens the dedup key and marks the business as active.
2. **Website discovery** (`discover_websites.py`) — uses three strategies in order:
   - **Serper.dev** — Google-backed search ("{business name} {suburb} {state}"), takes the first non-directory result. 2,500 free queries.
   - **DuckDuckGo lite** — free fallback when Serper isn't configured or exhausted.
   - **URL guessing** — generates plausible domains from the business name (e.g. `smithfunerals.com.au`) and checks if they're live.
|
||||
|
||||
Each candidate URL is fetched and validated: the page must load, the title/body must mention the business name, and the domain must not be a known directory (Yellow Pages, True Local, etc.). A confidence level (`confirmed`/`probable`/`unverified`) is recorded.
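
To make the URL-guessing and directory-rejection steps concrete, here is a minimal sketch. It is not the code in `discover_websites.py`: the function names, the TLD list and the two directory domains are assumptions chosen for illustration, and the liveness check (fetching each candidate) is left out.

```python
import re
from urllib.parse import urlparse

# Assumed blocklist; the real list of directory sites is likely longer.
DIRECTORY_DOMAINS = {"yellowpages.com.au", "truelocal.com.au"}


def guess_domains(business_name, tlds=(".com.au", ".au", ".com")):
    """Generate plausible homepage URLs from a business name."""
    lowered = business_name.lower()
    joined = re.sub(r"[^a-z0-9]", "", lowered)                   # smithfunerals
    hyphenated = re.sub(r"[^a-z0-9]+", "-", lowered).strip("-")  # smith-funerals
    candidates = []
    for stem in dict.fromkeys([joined, hyphenated]):  # de-duplicate, keep order
        if not stem:
            continue
        for tld in tlds:
            candidates.append(f"https://www.{stem}{tld}")
    return candidates


def is_directory(url):
    """True when the URL belongs to a known business directory."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return any(host == d or host.endswith("." + d) for d in DIRECTORY_DOMAINS)
```

For example, `guess_domains("Smith Funerals")` starts with `https://www.smithfunerals.com.au`, matching the example in the list above. Each candidate would still need to be fetched and checked for the business name before being stored.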
|
||||
|
||||
Each run processes a batch of 100 providers. With ~469 needing websites, a fresh dataset is worked through in about 5 daily runs.
|
||||
|
||||
### Where the data lands
|
||||
- `funeral_brand.abn` — from ABR lookup.
|
||||
- `funeral_brand.website` — the validated URL, if found.
|
||||
- `funeral_brand.business_state` / `business_postcode` — overwritten with ABR values if they were missing or lower-quality.
|
||||
- `source_record` — a new row with `source_name = 'website_discovery'` capturing the search query, all candidates considered, and why each was rejected. Useful for audit.
|
||||
|
||||
---
|
||||
|
||||
## Workflow 3 — Daily Enrichment
|
||||
**Runs:** Every day at 06:00 AEST
|
||||
**File:** `workflows/3_daily_enrichment.json`
|
||||
|
||||
This is the most complex workflow and the one that produces pricing data. It has two phases.
|
||||
|
||||
### Phase A — Crawl websites (Python)
|
||||
`enrich_websites.py --limit=50` runs first, picking up providers where `website IS NOT NULL AND enrichment_status = 'pending'`. For each:
|
||||
|
||||
1. Fetch the homepage; extract meta description into `funeral_brand.description`.
|
||||
2. Try ~20 common pricing URL patterns (`/pricing`, `/packages`, `/funeral-costs`, `/transparency`, etc.), parse the sitemap, and follow any link whose text contains "pric", "packag", "cost", or "service".
|
||||
3. If a pricing page is found, save the cleaned body text. If a pricing PDF is linked, record its URL.
|
||||
4. Write the result to `source_record` as `source_name = 'website_crawl'` — `raw_data` includes `pricing_text`, `pricing_url`, `pdf_links`, `has_pricing` flag.
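
The link-following rule in step 2 can be sketched with the standard library alone. This is an illustration rather than the code in `enrich_websites.py`: the class and function names are hypothetical, and only the four keyword fragments come from the description above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

KEYWORDS = ("pric", "packag", "cost", "service")


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_pricing_links(html, base_url):
    """Return (page_urls, pdf_urls) for links whose text looks pricing-related."""
    parser = LinkCollector()
    parser.feed(html)
    pages, pdfs = [], []
    for href, text in parser.links:
        if not any(keyword in text.lower() for keyword in KEYWORDS):
            continue
        url = urljoin(base_url, href)  # resolve relative links
        (pdfs if url.lower().endswith(".pdf") else pages).append(url)
    return pages, pdfs
```

Matching on word fragments such as "pric" catches "price", "prices" and "pricing" in one test, but "service" is broad and will also match links like "Our services", so the fetched page still has to be checked for actual prices.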
|
||||
|
||||
At this point we have raw pricing text but no structured packages yet.
|
||||
|
||||
### Phase B — AI extraction (n8n + Claude Haiku)
|
||||
n8n then queries `source_record` for unprocessed website crawls that have pricing text (>100 chars):
|
||||
|
||||
1. For each, it pulls the full pricing text (up to 5000 chars).
|
||||
2. Sends it to Claude Haiku with a strict JSON schema prompt asking for packages, funeral types, prices, and inclusions. The prompt constrains `funeralType` to the five allowed enum values and nudges toward the 16 standard inclusion type names.
|
||||
3. Parses the JSON response (tolerant of markdown wrapping).
|
||||
4. Inserts the packages and inclusions back into the DB.
|
||||
5. Marks the source record processed and the brand as `enrichment_status = 'complete'`.
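
Steps 2 and 3 hinge on parsing a reply that may or may not be wrapped in a Markdown code fence. In production this happens inside the n8n workflow; the Python below is only a sketch of the same idea. The top-level `packages` key and the function names are assumptions. The five funeral types are the values allowed by the `funeral_type` check in `schema_sqlite.sql`.

```python
import json
import re

ALLOWED_FUNERAL_TYPES = {
    "Service & Cremation", "Service & Burial", "Cremation Only",
    "Graveside Burial", "Water Cremation",
}

FENCE = "`" * 3  # built at runtime so this example contains no literal fence
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_model_json(reply):
    """Extract the JSON object from a reply, with or without a code fence."""
    text = reply.strip()
    fenced = FENCED_BLOCK.search(text)
    if fenced:
        text = fenced.group(1)
    else:
        # Fall back to the outermost braces if there is surrounding chatter.
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            text = text[start:end + 1]
    return json.loads(text)


def valid_packages(payload):
    """Keep only packages whose funeralType is one of the allowed values."""
    return [
        p for p in payload.get("packages", [])
        if p.get("funeralType") in ALLOWED_FUNERAL_TYPES
    ]
```

Filtering on `funeralType` before inserting matters because the database rejects any other value, so one bad package would otherwise fail the whole insert.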
|
||||
|
||||
Finally `compute_tiers.py` runs and promotes brands whose new data now meets the `estimated` or `priced` thresholds.
|
||||
|
||||
Batch size is 20 AI extractions per run. At ~$0.002 per call, a full 469-provider pass costs ~$1.
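
The arithmetic behind those figures is worth spelling out, because it also shows how long a full pass takes at this batch size. The helper below is a hypothetical back-of-envelope calculation that assumes one AI call per provider, which is an upper bound since not every site yields pricing text.

```python
import math


def estimate_pass(providers, batch_size, cost_per_call):
    """Return (runs needed, total cost in dollars rounded to cents)."""
    runs = math.ceil(providers / batch_size)
    cost = round(providers * cost_per_call, 2)
    return runs, cost


runs, cost = estimate_pass(providers=469, batch_size=20, cost_per_call=0.002)
print(f"{runs} daily runs, about ${cost:.2f}")
```

At 20 extractions per run, 469 providers need 24 daily runs, so a complete pass takes a little over three weeks even though it costs only about a dollar.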
|
||||
|
||||
### Where the data lands
|
||||
- `funeral_brand.description` — from meta tags on the homepage.
|
||||
- `funeral_brand.enrichment_status` — `'complete'` on success, `'partial'` or `'failed'` otherwise.
|
||||
- `funeral_brand.last_enriched_at` — timestamp, used by Workflow 4.
|
||||
- `source_record` — `source_name = 'website_crawl'` with `raw_data.pricing_text`, `pricing_url`, `pdf_links`, `has_pricing`. `processed_at` is set once AI extraction completes.
|
||||
- `package` — one row per package found. `title`, `funeral_type` (constrained enum), `brand_id`, `source_url = 'ai_extraction'`, `extraction_confidence = 0.7`.
|
||||
- `package_inclusion` — one row per line item inside each package. `price`, `optional`, `complimentary`, `inclusion_type_title`, `package_id`.
|
||||
- `funeral_brand.listing_tier` — recomputed by `compute_tiers.py`.
|
||||
|
||||
### How the listing tier gets computed
|
||||
`compute_tiers.py` looks at each brand's packages:
|
||||
- 2+ packages, each with at least one priced inclusion → `priced`.
|
||||
- 1+ packages with a total price → `estimated`.
|
||||
- Everything else → `listed`.
|
||||
- `verified = 1` always beats the computed tier.
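
These rules can be expressed as a small pure function. The sketch below is not the code in `compute_tiers.py`: it assumes packages arrive as dicts with an `inclusions` list, that "priced" means a price above zero, and that a package's total is the sum of its non-optional inclusions, since the `package` table has no total column.

```python
def package_total(package):
    """Sum the prices of the non-optional inclusions (assumed definition)."""
    return sum(
        item["price"]
        for item in package.get("inclusions", [])
        if not item.get("optional")
    )


def compute_tier(verified, packages):
    """Map a brand's packages to a listing tier using the rules above."""
    if verified:
        return "verified"
    itemised = [
        p for p in packages
        if any(item["price"] > 0 for item in p.get("inclusions", []))
    ]
    if len(packages) >= 2 and len(itemised) == len(packages):
        return "priced"
    if any(package_total(p) > 0 for p in packages):
        return "estimated"
    return "listed"
```

Checking the rules from strongest to weakest means a brand with two packages, only one of which has priced items, falls through to `estimated` rather than `priced`.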
|
||||
|
||||
---
|
||||
|
||||
## Workflow 4 — Monthly Refresh
|
||||
**Runs:** 1st of each month at 03:00 AEST
|
||||
**File:** `workflows/4_monthly_refresh.json`
|
||||
|
||||
### What it does
|
||||
Pricing changes. Providers update their sites, add packages, drop services. This workflow keeps the dataset fresh:
|
||||
|
||||
1. Find providers where `verified = 0`, `website IS NOT NULL`, and `last_enriched_at` is more than 30 days old.
|
||||
2. Set their `enrichment_status` back to `'pending'`.
|
||||
3. Re-run `enrich_websites.py --limit=200` against them — this re-crawls pricing pages and writes fresh `source_record` rows (old ones are kept for audit/history).
|
||||
4. Workflow 3 will then pick them up over the following days for AI re-extraction.
|
||||
5. `compute_tiers.py` runs to catch any tier changes.
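
Steps 1 and 2 amount to a single `UPDATE`. A minimal sketch against the SQLite dev schema follows. The function name is hypothetical, and it assumes `last_enriched_at` is stored as text in the `datetime('now')` format so that string comparison orders correctly.

```python
import sqlite3


def reset_stale(conn, max_age_days=30):
    """Flag stale, unverified brands for re-enrichment; return rows changed."""
    cursor = conn.execute(
        "UPDATE funeral_brand SET enrichment_status = 'pending' "
        "WHERE verified = 0 AND website IS NOT NULL "
        "AND last_enriched_at < datetime('now', ?)",
        (f"-{max_age_days} days",),
    )
    conn.commit()
    return cursor.rowcount
```

Rows whose `last_enriched_at` is `NULL` are not matched by the comparison, which is the desired behaviour here: they have never been enriched and should already be `pending`.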
|
||||
|
||||
New packages are inserted alongside old ones; `compute_tiers` looks at the current set. (A cleanup of stale packages isn't wired up yet — noted in `crawlers/PIPELINE.md` as a future improvement.)
|
||||
|
||||
### Where the data lands
|
||||
Same tables as Workflow 3, but you'll see multiple `source_record` rows per brand over time, which forms a change history.
|
||||
|
||||
---
|
||||
|
||||
## Schema summary
|
||||
|
||||
```
funeral_brand                 (the provider — one per business)
  ├─ location                 (1..n — physical premises with lat/lng)
  ├─ package                  (0..n — a pricing offering)
  │    └─ package_inclusion   (0..n — line items inside the package)
  ├─ known_for                (0..n — descriptive tags, not yet populated by pipeline)
  └─ brand_funeral_area       (many-to-many → funeral_area — service coverage, not yet populated)

source_log                    (one per crawler run)
source_record                 (one per raw record from a source, linked back to funeral_brand)
```
|
||||
|
||||
The pipeline never touches `funeral_home` (the parent corporation, e.g. InvoCare) or `funeral_area` (service area definitions) — those are populated manually or from other processes.
|
||||
|
||||
### Columns the pipeline writes vs. leaves alone
|
||||
|
||||
| Column | Written by | Notes |
|--------|------------|-------|
| `funeral_brand.title` | WF1 | From source registries |
| `funeral_brand.phone`, `email` | WF1 | From source registries |
| `funeral_brand.website` | WF1 or WF2 | Source registry if given, else discovered |
| `funeral_brand.abn` | WF2 | From ABR |
| `funeral_brand.description` | WF3 | Meta tags |
| `funeral_brand.business_*` | WF1/WF2 | Preferring ABR values where available |
| `funeral_brand.enrichment_status` | WF3/WF4 | State machine: `pending → partial → complete`, `failed` on error |
| `funeral_brand.last_enriched_at` | WF3 | Used by WF4 for staleness check |
| `funeral_brand.listing_tier` | `compute_tiers.py` | After WF3/WF4 |
| `funeral_brand.source_key`, `source_url` | WF1 | Immutable once set |
| `funeral_brand.verified`, `hidden` | **Never written by pipeline** | Admin-only |
| `funeral_brand.background_colour`, `foreground_colour`, `modal_description`, `funeral_home_id` | **Never written by pipeline** | Admin/branding concern |
| `package.*` | WF3 (Claude Haiku) | `source_url = 'ai_extraction'`, confidence 0.7 |
| `package_inclusion.*` | WF3 (Claude Haiku) | `inclusion_type_title` pulled from a 16-item vocabulary |
| `location.*` | WF1 | `lat`/`lng` only when source provides; `google_place_key`/`rating` require Places API (not yet wired) |
|
||||
|
||||
### The admin review flow (out of pipeline scope)
|
||||
|
||||
A provider stays `hidden = 1` until an admin reviews it. The intended flow (not yet built — listed under "What's left to do" in the memory) is:
|
||||
1. Admin UI lists newly enriched brands, sorted by tier.
|
||||
2. Admin sets `hidden = 0` to publish. They can also set `verified = 1` if the provider has signed on as a partner — this protects them from future pipeline updates.
|
||||
|
||||
---
|
||||
|
||||
## Running manually vs. via n8n
|
||||
|
||||
Everything n8n does can be reproduced with shell commands. The `crawlers/run_overnight.sh` script is effectively a single-pass equivalent of Workflows 1–3 back-to-back, useful for local testing or if n8n isn't available.
|
||||
|
||||
The n8n workflows are the production scheduler — they batch smaller chunks, run them at sensible hours (keeping server load and external API rate limits in mind), and handle the Claude Haiku HTTP calls natively (the Python scripts don't do AI extraction; they only prepare the text for n8n to send).
|
||||
|
||||
See `README.md` in this folder for setup.
|
||||
110
n8n/README.md
Normal file
@@ -0,0 +1,110 @@
|
||||
# n8n Workflow Setup
|
||||
|
||||
For a plain-English walkthrough of what the pipeline does end-to-end and how
|
||||
its output conforms to the database schema, see [`PROCESS.md`](./PROCESS.md).
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Docker & Docker Compose
|
||||
- API keys (see below)
|
||||
|
||||
## API Keys
|
||||
|
||||
Create `crawlers/config.json` from the template:
|
||||
|
||||
```bash
|
||||
cp crawlers/config.example.json crawlers/config.json
|
||||
```
|
||||
|
||||
| Key | Service | Cost | Get it at |
|-----|---------|------|-----------|
| `serper_api_key` | Serper.dev (Google search) | 2,500 free queries | https://serper.dev |
| `abr_guid` | ABR (ABN lookup) | Free | https://abr.business.gov.au/Tools/WebServices |
| `anthropic_api_key` | Claude Haiku (AI extraction) | ~$2/full run | https://console.anthropic.com |
|
||||
|
||||
Also set `ANTHROPIC_API_KEY` as an n8n credential or environment variable.
|
||||
|
||||
## Start n8n
|
||||
|
||||
```bash
|
||||
cd n8n/
|
||||
docker compose up -d
|
||||
```
|
||||
|
||||
N8N will be available at http://localhost:5678
|
||||
|
||||
## Import Workflows
|
||||
|
||||
In the N8N UI:
|
||||
|
||||
1. Go to **Workflows** → **Import from File**
|
||||
2. Import each file from `n8n/workflows/`:
|
||||
- `1_weekly_discovery.json` — discovers new providers from registries
|
||||
- `2_daily_website_discovery.json` — finds provider websites
|
||||
- `3_daily_enrichment.json` — crawls sites & AI-extracts pricing
|
||||
- `4_monthly_refresh.json` — re-checks pricing for stale data
|
||||
3. Activate each workflow
|
||||
|
||||
## Workflow Schedule

| # | Workflow | Schedule (Australia/Sydney time) | What It Does |
|---|---------|----------|-------------|
| 1 | Weekly Discovery | Mon 2am | Crawls VIC Register, Funerals AU, NFDA → dedup |
| 2 | Daily Website Discovery | Daily 4am | Finds websites for 100 providers/day |
| 3 | Daily Enrichment | Daily 6am | Crawls 50 websites/day → AI extracts pricing |
| 4 | Monthly Refresh | 1st of month, 3am | Re-checks pricing older than 30 days |
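
The daily batch sizes set how long a backlog takes to clear. A rough worked example, using the 1,463 seeded providers from the commit message and assuming (as an upper bound) that every one of them is still queued for both steps. `days_to_drain` is a hypothetical helper.

```python
import math


def days_to_drain(queue_size, per_day):
    """Whole days needed to work through a queue at a fixed daily batch size."""
    return math.ceil(queue_size / per_day)


# Upper-bound estimate: treats all 1,463 seeded providers as still queued.
print(days_to_drain(1463, 100))  # website discovery, 100 providers/day
print(days_to_drain(1463, 50))   # enrichment, 50 websites/day
```

Under those assumptions, website discovery clears in about 15 days and enrichment in about 30.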

## Workflow Flow

```
  Mon 2am          Daily 4am        Daily 6am         Monthly
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│ Registry │     │   ABN    │     │  Crawl   │     │  Reset   │
│ Crawlers │     │  Lookup  │     │ Websites │     │  Stale   │
│ (VIC,FA, │     │  (free)  │     │ (50/day) │     │Providers │
│  NFDA)   │     │          │     │          │     │          │
└────┬─────┘     └────┬─────┘     └────┬─────┘     └────┬─────┘
     │                │                │                │
     ▼                ▼                ▼                ▼
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  Dedup   │     │  Serper  │     │  Claude  │     │Re-enrich │
│ & Merge  │     │  Search  │     │ Haiku AI │     │  Batch   │
│          │     │(100/day) │     │ Extract  │     │          │
└────┬─────┘     └────┬─────┘     └────┬─────┘     └────┬─────┘
     │                │                │                │
     ▼                ▼                ▼                ▼
New providers    Websites found   Packages &       Updated tiers
queued           in DB            tiers updated
```

## Manual Run

You can also run the pipeline manually without n8n:

```bash
cd crawlers/

# Full pipeline
python3 crawl_all.py
python3 dedup.py
python3 lookup_abn.py --limit=100
python3 discover_websites.py --limit=100
python3 enrich_websites.py --limit=50
python3 compute_tiers.py

# Test mode
python3 crawl_all.py --test
python3 discover_websites.py --limit=5 --state=VIC
python3 enrich_websites.py --limit=3
```
```
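
The full-pipeline commands above run unconditionally, one after another. A minimal sketch of a wrapper that stops at the first failing step, assuming each script exits non-zero on failure (not confirmed here). The script names and flags are copied from the commands above; `run_pipeline` is a hypothetical helper, and it uses the current interpreter (`sys.executable`) rather than `python3`.

```python
import subprocess
import sys

# Steps copied from the "Full pipeline" commands above.
FULL_PIPELINE = [
    ["crawl_all.py"],
    ["dedup.py"],
    ["lookup_abn.py", "--limit=100"],
    ["discover_websites.py", "--limit=100"],
    ["enrich_websites.py", "--limit=50"],
    ["compute_tiers.py"],
]


def run_pipeline(steps=FULL_PIPELINE, cwd="crawlers", runner=subprocess.run):
    """Run each step in order; return the failing step's argv, or None if all pass."""
    for step in steps:
        argv = [sys.executable, *step]
        result = runner(argv, cwd=cwd)
        if result.returncode != 0:
            return argv  # stop here so later steps don't run on bad data
    return None
```

The `runner` parameter exists only so the sequencing can be exercised without launching the real crawlers.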

## Database

The pipeline uses SQLite at `database/providers.db` for the demo.
A Postgres schema is at `database/schema.sql` for production.

To reset:
```bash
rm database/providers.db
sqlite3 database/providers.db < database/schema_sqlite.sql
```
```
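
The same reset can be done from Python when the `sqlite3` command-line tool isn't installed. A minimal sketch using only the standard library; `reset_database` is a hypothetical helper, and the paths are passed in rather than assumed.

```python
import sqlite3
from pathlib import Path


def reset_database(db_path, schema_path):
    """Delete the SQLite file if present, re-create it from the schema file,
    and return the names of the tables that now exist."""
    db_file = Path(db_path)
    if db_file.exists():
        db_file.unlink()
    schema_sql = Path(schema_path).read_text(encoding="utf-8")
    db = sqlite3.connect(db_file)
    try:
        # executescript runs every statement in the schema file in one call.
        db.executescript(schema_sql)
        rows = db.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        db.close()
    return [name for (name,) in rows]
```

For example, `reset_database("database/providers.db", "database/schema_sqlite.sql")` run from the repository root mirrors the two shell commands above.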
53
n8n/docker-compose.yml
Normal file
@@ -0,0 +1,53 @@
version: "3.8"

services:
  n8n:
    image: n8nio/n8n:latest
    restart: unless-stopped
    ports:
      - "5678:5678"
    environment:
      - N8N_HOST=localhost
      - N8N_PORT=5678
      - N8N_PROTOCOL=http
      - WEBHOOK_URL=http://localhost:5678/
      - N8N_ENCRYPTION_KEY=${N8N_ENCRYPTION_KEY:-change-me-in-production}
      # Database
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=postgres
      - DB_POSTGRESDB_PORT=5432
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=n8n
      - DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD:-n8n_password}
      # Allow running shell commands (needed for our Python crawlers)
      - N8N_ALLOW_EXEC=true
      # Timezone
      - GENERIC_TIMEZONE=Australia/Sydney
      - TZ=Australia/Sydney
    volumes:
      - n8n_data:/home/node/.n8n
      # Mount our crawler code so N8N can execute it
      - ../crawlers:/opt/crawlers:ro
      - ../database:/opt/database
    depends_on:
      postgres:
        condition: service_healthy

  postgres:
    image: postgres:16-alpine
    restart: unless-stopped
    environment:
      - POSTGRES_DB=n8n
      - POSTGRES_USER=n8n
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD:-n8n_password}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U n8n"]
      interval: 5s
      timeout: 5s
      retries: 5

volumes:
  n8n_data:
  postgres_data:
142
n8n/workflows/1_weekly_discovery.json
Normal file
@@ -0,0 +1,142 @@
{
  "name": "1. Weekly Provider Discovery",
  "nodes": [
    {
      "parameters": {
        "rule": {
          "interval": [{ "field": "weeks", "weeksInterval": 1, "triggerAtDay": 1, "triggerAtHour": 2 }]
        }
      },
      "id": "schedule",
      "name": "Weekly Schedule",
      "type": "n8n-nodes-base.scheduleTrigger",
      "typeVersion": 1.2,
      "position": [200, 300]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 crawl_vic_register.py 2>&1"
      },
      "id": "crawl_vic",
      "name": "Crawl VIC Register",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [450, 140]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 crawl_funerals_australia.py 2>&1"
      },
      "id": "crawl_fa",
      "name": "Crawl Funerals Australia",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [450, 300]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 crawl_nfda.py 2>&1"
      },
      "id": "crawl_nfda",
      "name": "Crawl NFDA",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [450, 460]
    },
    {
      "parameters": {
        "mode": "passthrough"
      },
      "id": "merge_crawls",
      "name": "Wait for Crawlers",
      "type": "n8n-nodes-base.merge",
      "typeVersion": 3,
      "position": [700, 300]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 dedup.py 2>&1"
      },
      "id": "dedup",
      "name": "Deduplicate & Merge",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [950, 300]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 -c \"from base import get_db; db=get_db(); r=db.execute('SELECT COUNT(*) as n FROM funeral_brand WHERE listing_tier=\\'listed\\' AND created_at > datetime(\\'now\\', \\'-7 days\\')').fetchone(); print(r['n'])\" 2>&1"
      },
      "id": "count_new",
      "name": "Count New Providers",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [1200, 300]
    },
    {
      "parameters": {
        "conditions": {
          "options": { "caseSensitive": true, "leftValue": "", "typeValidation": "strict" },
          "conditions": [
            {
              "id": "new_check",
              "leftValue": "={{ $json.stdout.trim() }}",
              "rightValue": "0",
              "operator": { "type": "string", "operation": "notEquals" }
            }
          ]
        }
      },
      "id": "has_new",
      "name": "Any New Providers?",
      "type": "n8n-nodes-base.if",
      "typeVersion": 2.2,
      "position": [1450, 300]
    },
    {
      "parameters": {
        "jsCode": "const count = $input.first().json.stdout.trim();\nreturn [{ json: { message: `Weekly discovery complete. ${count} new providers added to the database. They are queued for website discovery and enrichment.`, count: parseInt(count) } }];"
      },
      "id": "summary",
      "name": "Build Summary",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [1700, 240]
    },
    {
      "parameters": {
        "jsCode": "return [{ json: { message: 'Weekly discovery complete. No new providers found.' } }];"
      },
      "id": "no_new",
      "name": "No New Providers",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [1700, 420]
    }
  ],
  "connections": {
    "Weekly Schedule": {
      "main": [
        [
          { "node": "Crawl VIC Register", "type": "main", "index": 0 },
          { "node": "Crawl Funerals Australia", "type": "main", "index": 0 },
          { "node": "Crawl NFDA", "type": "main", "index": 0 }
        ]
      ]
    },
    "Crawl VIC Register": { "main": [[ { "node": "Wait for Crawlers", "type": "main", "index": 0 } ]] },
    "Crawl Funerals Australia": { "main": [[ { "node": "Wait for Crawlers", "type": "main", "index": 0 } ]] },
    "Crawl NFDA": { "main": [[ { "node": "Wait for Crawlers", "type": "main", "index": 0 } ]] },
    "Wait for Crawlers": { "main": [[ { "node": "Deduplicate & Merge", "type": "main", "index": 0 } ]] },
    "Deduplicate & Merge": { "main": [[ { "node": "Count New Providers", "type": "main", "index": 0 } ]] },
    "Count New Providers": { "main": [[ { "node": "Any New Providers?", "type": "main", "index": 0 } ]] },
    "Any New Providers?": {
      "main": [
        [{ "node": "Build Summary", "type": "main", "index": 0 }],
        [{ "node": "No New Providers", "type": "main", "index": 0 }]
      ]
    }
  },
  "settings": { "executionOrder": "v1" },
  "tags": [{ "name": "funeral-arranger" }]
}
100
n8n/workflows/2_daily_website_discovery.json
Normal file
@@ -0,0 +1,100 @@
{
  "name": "2. Daily Website Discovery",
  "nodes": [
    {
      "parameters": {
        "rule": {
          "interval": [{ "field": "days", "daysInterval": 1, "triggerAtHour": 4 }]
        }
      },
      "id": "schedule",
      "name": "Daily Schedule",
      "type": "n8n-nodes-base.scheduleTrigger",
      "typeVersion": 1.2,
      "position": [200, 300]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 -c \"from base import get_db; db=get_db(); n=db.execute('SELECT COUNT(*) as n FROM funeral_brand WHERE website IS NULL AND verified=0').fetchone()['n']; print(n)\" 2>&1"
      },
      "id": "check_queue",
      "name": "Check Queue Size",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [450, 300]
    },
    {
      "parameters": {
        "conditions": {
          "conditions": [
            {
              "id": "has_work",
              "leftValue": "={{ parseInt($json.stdout.trim()) }}",
              "rightValue": 0,
              "operator": { "type": "number", "operation": "gt" }
            }
          ]
        }
      },
      "id": "has_work",
      "name": "Providers Need Websites?",
      "type": "n8n-nodes-base.if",
      "typeVersion": 2.2,
      "position": [700, 300]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 lookup_abn.py --limit=100 2>&1"
      },
      "id": "abn_lookup",
      "name": "ABN Lookup (batch 100)",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [950, 200]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 discover_websites.py --limit=100 2>&1"
      },
      "id": "discover",
      "name": "Discover Websites (batch 100)",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [1250, 200]
    },
    {
      "parameters": {
        "jsCode": "const output = $input.first().json.stdout || '';\nconst foundMatch = output.match(/(\\d+) websites found/);\nconst found = foundMatch ? parseInt(foundMatch[1]) : 0;\nreturn [{ json: { message: `Website discovery batch complete. ${found} websites found.`, output } }];"
      },
      "id": "summary",
      "name": "Build Summary",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [1500, 200]
    },
    {
      "parameters": {
        "jsCode": "return [{ json: { message: 'No providers need website discovery.' } }];"
      },
      "id": "skip",
      "name": "Skip",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [950, 420]
    }
  ],
  "connections": {
    "Daily Schedule": { "main": [[ { "node": "Check Queue Size", "type": "main", "index": 0 } ]] },
    "Check Queue Size": { "main": [[ { "node": "Providers Need Websites?", "type": "main", "index": 0 } ]] },
    "Providers Need Websites?": {
      "main": [
        [{ "node": "ABN Lookup (batch 100)", "type": "main", "index": 0 }],
        [{ "node": "Skip", "type": "main", "index": 0 }]
      ]
    },
    "ABN Lookup (batch 100)": { "main": [[ { "node": "Discover Websites (batch 100)", "type": "main", "index": 0 } ]] },
    "Discover Websites (batch 100)": { "main": [[ { "node": "Build Summary", "type": "main", "index": 0 } ]] }
  },
  "settings": { "executionOrder": "v1" },
  "tags": [{ "name": "funeral-arranger" }]
}
146
n8n/workflows/3_daily_enrichment.json
Normal file
@@ -0,0 +1,146 @@
{
  "name": "3. Daily Website Enrichment",
  "nodes": [
    {
      "parameters": {
        "rule": {
          "interval": [{ "field": "days", "daysInterval": 1, "triggerAtHour": 6 }]
        }
      },
      "id": "schedule",
      "name": "Daily Schedule",
      "type": "n8n-nodes-base.scheduleTrigger",
      "typeVersion": 1.2,
      "position": [200, 300]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 enrich_websites.py --limit=50 2>&1"
      },
      "id": "enrich",
      "name": "Crawl & Extract (batch 50)",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [450, 300],
      "executeOnce": true
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 -c \"\nimport json, sqlite3\ndb = sqlite3.connect('/opt/database/providers.db')\ndb.row_factory = sqlite3.Row\nrows = db.execute('''\n SELECT sr.id, sr.source_url, sr.matched_brand_id,\n json_extract(sr.raw_data, \\\"$.pricing_text\\\") as pricing_text,\n json_extract(sr.raw_data, \\\"$.has_pricing\\\") as has_pricing\n FROM source_record sr\n WHERE sr.source_name = 'website_crawl'\n AND sr.processed_at IS NULL\n AND json_extract(sr.raw_data, \\\"$.has_pricing\\\") = 1\n LIMIT 20\n''').fetchall()\nresult = [{'id': r['id'], 'brand_id': r['matched_brand_id'], 'url': r['source_url'], 'text_length': len(r['pricing_text'] or '')} for r in rows]\nprint(json.dumps(result))\n\" 2>&1"
      },
      "id": "get_queue",
      "name": "Get Pricing Pages Queue",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [700, 300]
    },
    {
      "parameters": {
        "jsCode": "const output = $input.first().json.stdout.trim();\ntry {\n const items = JSON.parse(output);\n return items.map(item => ({ json: item }));\n} catch(e) {\n return [{ json: { error: 'No pricing pages to process', raw: output } }];\n}"
      },
      "id": "parse_queue",
      "name": "Parse Queue Items",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [950, 300]
    },
    {
      "parameters": {
        "conditions": {
          "conditions": [
            {
              "id": "has_text",
              "leftValue": "={{ $json.text_length }}",
              "rightValue": 100,
              "operator": { "type": "number", "operation": "gt" }
            }
          ]
        }
      },
      "id": "has_text",
      "name": "Has Pricing Text?",
      "type": "n8n-nodes-base.if",
      "typeVersion": 2.2,
      "position": [1200, 300]
    },
    {
      "parameters": {
        "command": "={{ 'cd /opt/crawlers && python3 -c \"import json, sqlite3; db=sqlite3.connect(\\'/opt/database/providers.db\\'); r=db.execute(\\'SELECT json_extract(raw_data, \\\\\\\"$.pricing_text\\\\\\\") as t FROM source_record WHERE id=' + $json.id + '\\').fetchone(); print(r[0][:6000] if r and r[0] else \\'\\')\"' }}"
      },
      "id": "get_text",
      "name": "Get Pricing Text",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [1450, 240]
    },
    {
      "parameters": {
        "url": "https://api.anthropic.com/v1/messages",
        "sendHeaders": true,
        "headerParameters": {
          "parameters": [
            { "name": "x-api-key", "value": "={{ $env.ANTHROPIC_API_KEY }}" },
            { "name": "anthropic-version", "value": "2023-06-01" },
            { "name": "content-type", "value": "application/json" }
          ]
        },
        "sendBody": true,
        "specifyBody": "json",
        "jsonBody": "={{ JSON.stringify({ model: 'claude-haiku-4-5-20251001', max_tokens: 2048, messages: [{ role: 'user', content: 'Extract funeral packages and pricing from this funeral director\\'s pricing page. Return ONLY valid JSON matching this schema:\\n\\n{\\n \"packages\": [\\n {\\n \"name\": \"Package name\",\\n \"funeralType\": \"one of: Service & Cremation, Service & Burial, Cremation Only, Graveside Burial\",\\n \"price\": 0,\\n \"inclusions\": [\\n {\"item\": \"Inclusion name\", \"price\": 0, \"optional\": false, \"complimentary\": false}\\n ]\\n }\\n ]\\n}\\n\\nUse these inclusion type names where possible: Professional Service Fee, Transportation Service Fee, Professional Mortuary Care, Death Registration Certificate, Cremation Certificate/Permit, Government Levy, Accommodation, Viewing Fee, Coffin, Cremation Fee, Saturday Service Fee, Dressing Fee, Embalming, Digital Recording, Webstreaming, After Hours Transfer Surcharge.\\n\\nIf a price cannot be determined, use null. If no packages/pricing found, return {\"packages\": []}.\\n\\nPricing page text:\\n' + $json.stdout.substring(0, 5000) }] }) }}"
      },
      "id": "ai_extract",
      "name": "AI Extract (Claude Haiku)",
      "type": "n8n-nodes-base.httpRequest",
      "typeVersion": 4.2,
      "position": [1700, 240]
    },
    {
      "parameters": {
        "jsCode": "// Process every AI response and pair it back to its queue item.\nconst results = [];\nconst responses = $input.all();\nfor (let i = 0; i < responses.length; i++) {\n  const queueItem = $('Parse Queue Items').itemMatching(i).json;\n  let packages = [];\n  try {\n    const content = responses[i].json.content[0].text;\n    // Extract JSON from the response (may be wrapped in markdown)\n    const jsonMatch = content.match(/\\{[\\s\\S]*\\}/);\n    if (jsonMatch) {\n      const parsed = JSON.parse(jsonMatch[0]);\n      packages = parsed.packages || [];\n    }\n  } catch (e) {\n    // AI response wasn't valid JSON\n  }\n  results.push({ json: { sourceId: queueItem.id, brandId: queueItem.brand_id, packages, packageCount: packages.length } });\n}\nreturn results;"
      },
      "id": "parse_ai",
      "name": "Parse AI Response",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [1950, 240]
    },
    {
      "parameters": {
        "command": "={{ 'cd /opt/crawlers && python3 -c \"\\nimport json, sqlite3\\ndb = sqlite3.connect(\\'/opt/database/providers.db\\')\\npackages = ' + JSON.stringify(JSON.stringify($json.packages)) + '\\npackages = json.loads(packages)\\nbrand_id = ' + $json.brandId + '\\nsource_id = ' + $json.sourceId + '\\n\\nfor pkg in packages:\\n    if not pkg.get(\\'price\\'):\\n        continue\\n    cur = db.execute(\\n        \\'INSERT INTO package (title, funeral_type, brand_id, source_url, extraction_confidence) VALUES (?, ?, ?, ?, ?)\\',\\n        (pkg[\\'name\\'], pkg.get(\\'funeralType\\'), brand_id, \\'ai_extraction\\', 0.7)\\n    )\\n    pkg_id = cur.lastrowid\\n    for inc in pkg.get(\\'inclusions\\', []):\\n        if inc.get(\\'price\\') is not None:\\n            db.execute(\\n                \\'INSERT INTO package_inclusion (price, optional, complimentary, inclusion_type_title, package_id) VALUES (?, ?, ?, ?, ?)\\',\\n                (inc[\\'price\\'], 1 if inc.get(\\'optional\\') else 0, 1 if inc.get(\\'complimentary\\') else 0, inc[\\'item\\'], pkg_id)\\n            )\\n\\ndb.execute(\\'UPDATE source_record SET processed_at=datetime(\\\\\\'now\\\\\\') WHERE id=?\\', (source_id,))\\ndb.execute(\\'UPDATE funeral_brand SET enrichment_status=\\\\\\'complete\\\\\\', last_enriched_at=datetime(\\\\\\'now\\\\\\') WHERE id=?\\', (brand_id,))\\ndb.commit()\\nprint(f\\'{len(packages)} packages saved for brand {brand_id}\\')\\n\" 2>&1' }}"
      },
      "id": "save_packages",
      "name": "Save Packages to DB",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [2200, 240]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 compute_tiers.py 2>&1"
      },
      "id": "recompute_tiers",
      "name": "Recompute Listing Tiers",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [2450, 300]
    }
  ],
  "connections": {
    "Daily Schedule": { "main": [[ { "node": "Crawl & Extract (batch 50)", "type": "main", "index": 0 } ]] },
    "Crawl & Extract (batch 50)": { "main": [[ { "node": "Get Pricing Pages Queue", "type": "main", "index": 0 } ]] },
    "Get Pricing Pages Queue": { "main": [[ { "node": "Parse Queue Items", "type": "main", "index": 0 } ]] },
    "Parse Queue Items": { "main": [[ { "node": "Has Pricing Text?", "type": "main", "index": 0 } ]] },
    "Has Pricing Text?": {
      "main": [
        [{ "node": "Get Pricing Text", "type": "main", "index": 0 }],
        [{ "node": "Recompute Listing Tiers", "type": "main", "index": 0 }]
      ]
    },
    "Get Pricing Text": { "main": [[ { "node": "AI Extract (Claude Haiku)", "type": "main", "index": 0 } ]] },
    "AI Extract (Claude Haiku)": { "main": [[ { "node": "Parse AI Response", "type": "main", "index": 0 } ]] },
    "Parse AI Response": { "main": [[ { "node": "Save Packages to DB", "type": "main", "index": 0 } ]] },
    "Save Packages to DB": { "main": [[ { "node": "Recompute Listing Tiers", "type": "main", "index": 0 } ]] }
  },
  "settings": { "executionOrder": "v1" },
  "tags": [{ "name": "funeral-arranger" }]
}
65
n8n/workflows/4_monthly_refresh.json
Normal file
@@ -0,0 +1,65 @@
{
  "name": "4. Monthly Re-enrichment",
  "nodes": [
    {
      "parameters": {
        "rule": {
          "interval": [{ "field": "months", "monthsInterval": 1, "triggerAtDayOfMonth": 1, "triggerAtHour": 3 }]
        }
      },
      "id": "schedule",
      "name": "Monthly Schedule",
      "type": "n8n-nodes-base.scheduleTrigger",
      "typeVersion": 1.2,
      "position": [200, 300]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 -c \"\nimport sqlite3\ndb = sqlite3.connect('/opt/database/providers.db')\n# Reset enrichment for providers last checked > 30 days ago\nupdated = db.execute('''\n UPDATE funeral_brand\n SET enrichment_status = 'pending',\n updated_at = datetime('now')\n WHERE verified = 0\n AND website IS NOT NULL\n AND last_enriched_at < datetime('now', '-30 days')\n''').rowcount\ndb.commit()\nprint(f'{updated} providers queued for re-enrichment')\n\" 2>&1"
      },
      "id": "reset_stale",
      "name": "Queue Stale Providers",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [450, 300]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 enrich_websites.py --limit=200 2>&1"
      },
      "id": "re_enrich",
      "name": "Re-enrich (batch 200)",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [700, 300]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 compute_tiers.py 2>&1"
      },
      "id": "recompute",
      "name": "Recompute Tiers",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [950, 300]
    },
    {
      "parameters": {
        "jsCode": "const output = $input.first().json.stdout || '';\nreturn [{ json: { message: 'Monthly re-enrichment complete.', output } }];"
      },
      "id": "summary",
      "name": "Summary",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [1200, 300]
    }
  ],
  "connections": {
    "Monthly Schedule": { "main": [[ { "node": "Queue Stale Providers", "type": "main", "index": 0 } ]] },
    "Queue Stale Providers": { "main": [[ { "node": "Re-enrich (batch 200)", "type": "main", "index": 0 } ]] },
    "Re-enrich (batch 200)": { "main": [[ { "node": "Recompute Tiers", "type": "main", "index": 0 } ]] },
    "Recompute Tiers": { "main": [[ { "node": "Summary", "type": "main", "index": 0 } ]] }
  },
  "settings": { "executionOrder": "v1" },
  "tags": [{ "name": "funeral-arranger" }]
}