Provider Crawl

Automated pipeline for discovering Australian funeral directors from public registries, finding their websites, and extracting pricing data. Feeds the Funeral Arranger platform with a seed of unverified providers that an admin reviews before publishing.

What's in this repo

crawlers/    Python modules — one per data source + shared pipeline utilities
n8n/         n8n workflows that orchestrate the crawlers on a schedule
database/    SQLite + Postgres schemas, a seed dump, and a pre-populated dev DB

Four documents explain how it works, in increasing depth:

  1. ONBOARDING.md — Human context: what the project is for, ground rules, open work, how to get unblocked. Start here if you're new to the project.
  2. n8n/PROCESS.md — Plain-English end-to-end walkthrough of the four workflows and how their output maps to database tables.
  3. crawlers/PIPELINE.md — Architecture of the Python modules, source-by-source notes, listing-tier logic.
  4. n8n/README.md — How to stand up n8n locally with Docker and import the workflow JSONs.

Quick start (local, no n8n)

# 1. Install Python deps (requests, beautifulsoup4, rapidfuzz, pdfplumber)
python3 -m venv .venv && source .venv/bin/activate
pip install requests beautifulsoup4 rapidfuzz pdfplumber

# 2. Configure API keys
cp crawlers/config.example.json crawlers/config.json
# Edit crawlers/config.json and add your Serper key (free 2,500/month from serper.dev)
# Optionally add an Anthropic API key for AI pricing extraction

# 3. Inspect the included dev database (1,463 providers, 121 with pricing)
sqlite3 database/providers.db ".tables"
sqlite3 database/providers.db "SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier"

# 4. Or start fresh and run the pipeline end-to-end
rm database/providers.db
sqlite3 database/providers.db < database/schema_sqlite.sql
cd crawlers
./run_overnight.sh       # equivalent to workflows 1-3 back to back
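
The inspection query from step 3 can also be run from Python, which is handy when you want the counts inside a script rather than on the terminal. This is a minimal sketch using only the standard library; the table and column names (`funeral_brand`, `listing_tier`) come from the query above, while the helper name `tier_counts` is hypothetical and not part of the repo.

```python
import sqlite3


def tier_counts(db_path):
    """Return {listing_tier: provider_count} read from the funeral_brand table."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier"
        ).fetchall()
    finally:
        # Always release the file handle, even if the query fails.
        conn.close()
    # Each row is a (listing_tier, count) pair, so dict() maps tier -> count.
    return dict(rows)


# Example usage (run from the repo root):
#   for tier, count in tier_counts("database/providers.db").items():
#       print(tier, count)
```

Note that `sqlite3.connect` creates an empty file if the path does not exist, so a typo in the path shows up as a "no such table" error rather than a missing-file error.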

API keys needed

| Service | Purpose | Cost | Where |
|---|---|---|---|
| Serper.dev | Website discovery via Google search | 2,500 free/mo | https://serper.dev |
| ABR (Australian Business Register) | ABN validation | Free (GUID required) | https://abr.business.gov.au/Tools/WebServices |
| Anthropic | AI pricing extraction (Claude Haiku) | ~$2 / full run | https://console.anthropic.com |

Only the Serper key is required to run the full pipeline. The Anthropic key is needed only for the AI extraction step in Workflow 3. That step can be skipped: the Python crawl still writes pricing text into source_record for manual review.
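
To make the required/optional split concrete, here is a small sketch of a pre-flight check on the loaded config. The key names (`serper_api_key`, `abr_guid`, `anthropic_api_key`) and both helper names are assumptions made for illustration; check crawlers/config.example.json for the names the crawlers actually read.

```python
import json

# Hypothetical key names, not taken from the repo.
REQUIRED_KEYS = ("serper_api_key",)
OPTIONAL_KEYS = ("abr_guid", "anthropic_api_key")


def check_config(config):
    """Return (missing_required, missing_optional) lists of key names."""
    def is_missing(key):
        value = config.get(key)
        # Treat absent, non-string, and blank values as "not configured".
        return not isinstance(value, str) or not value.strip()

    missing_required = [k for k in REQUIRED_KEYS if is_missing(k)]
    missing_optional = [k for k in OPTIONAL_KEYS if is_missing(k)]
    return missing_required, missing_optional


def load_config(path="crawlers/config.json"):
    """Read the JSON config file created in the quick start."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

A caller could abort when the first list is non-empty and only log a warning for the second, which mirrors the rule that the pipeline runs with Serper alone.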

Database

The included database/providers.db is a SQLite snapshot taken after the last overnight run: 1,463 unique providers, 994 with websites, 121 with pricing.

Schema files:

  • database/schema_sqlite.sql — SQLite schema, used for local dev
  • database/schema.sql — Postgres schema (production target)
  • database/seed_verified.sql — seed data for partner (verified = 1) providers
  • database/PROVIDER-SCHEMA-SPEC.md — schema commentary and design notes
  • database/IMAGE-MAPPING.md — asset conventions for verified providers

Per n8n/PROCESS.md, the pipeline writes only to a defined subset of columns; verified, hidden, branding, and funeral_home_id are admin-only.
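
One way to enforce that rule in code is to filter every update through a guard before it reaches the database. The sketch below is illustrative only: the four column names come from the sentence above, but the helper `pipeline_safe_update` and the example `website` column are hypothetical, and the repo may enforce the rule differently.

```python
# Columns that only an admin may change (listed in n8n/PROCESS.md).
ADMIN_ONLY_COLUMNS = frozenset({"verified", "hidden", "branding", "funeral_home_id"})


def pipeline_safe_update(fields):
    """Return a copy of a column -> value dict with admin-only columns removed."""
    return {col: val for col, val in fields.items() if col not in ADMIN_ONLY_COLUMNS}


# Example: the crawler found a website but must not touch the verified flag.
#   pipeline_safe_update({"website": "https://example.com", "verified": 1})
#   keeps only the "website" entry.
```

Dropping the columns silently keeps a crawl running; raising an error instead would surface bugs sooner, so which behaviour fits is a design choice.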

Status

What's built and working:

  • All three source crawlers (VIC Register, Funerals Australia, NFDA)
  • Cross-source deduplication with fuzzy name/postcode/ABN matching
  • Website discovery via Serper + DuckDuckGo (DDG) + URL guessing
  • ABN lookup via ABR
  • Website enrichment: meta descriptions, pricing page discovery, PDF detection
  • Listing tier computation
  • Four n8n workflow JSONs ready to import
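
As an illustration of the cross-source deduplication item above, the sketch below compares two provider records by ABN, then postcode, then fuzzy name similarity. It is not the repo's implementation: it uses `difflib` from the standard library as a stand-in for rapidfuzz (which the quick start installs), and the record fields (`name`, `postcode`, `abn`), the 0.85 threshold, and the function names are all assumptions.

```python
import re
from difflib import SequenceMatcher


def normalise_name(name):
    """Lowercase, strip punctuation and common company suffixes."""
    name = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    name = re.sub(r"\b(pty|ltd|limited)\b", " ", name)
    return " ".join(name.split())


def same_provider(a, b, threshold=0.85):
    """Decide whether two records (dicts) describe the same provider."""
    abn_a, abn_b = a.get("abn"), b.get("abn")
    if abn_a and abn_b:
        # When both sources carry an ABN, let it decide the match.
        return abn_a.replace(" ", "") == abn_b.replace(" ", "")
    # Without ABNs, require a known, identical postcode before comparing names.
    if not a.get("postcode") or a.get("postcode") != b.get("postcode"):
        return False
    ratio = SequenceMatcher(
        None, normalise_name(a.get("name", "")), normalise_name(b.get("name", ""))
    ).ratio()
    return ratio >= threshold
```

Checking the postcode before the name keeps similarly named businesses in different suburbs apart, at the cost of missing a provider whose postcode differs between registries.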

What's not yet built (open to pick up):

  • Google Places API integration for richer location data (ratings, place_id)
  • Playwright-based enrichment for JS-rendered sites (~37% of sites)
  • Admin review UI for approving hidden providers before they go live
  • Stale package cleanup in the monthly refresh

Contact

Richie — richie@tensordesign.com.au