# Provider Crawl

Automated pipeline for discovering Australian funeral directors from public registries, finding their websites, and extracting pricing data. Feeds the Funeral Arranger platform with a seed of unverified providers that an admin reviews before publishing.

## What's in this repo

```
crawlers/     Python modules: one per data source + shared pipeline utilities
n8n/          n8n workflows that orchestrate the crawlers on a schedule
database/     SQLite + Postgres schemas, a seed dump, and a pre-populated dev DB
```

Four documents explain how it works, in increasing depth:

1. **[ONBOARDING.md](ONBOARDING.md)**: Human context (what the project is for, ground rules, open work, how to get unblocked). **Start here if you're new to the project.**
2. **[n8n/PROCESS.md](n8n/PROCESS.md)**: Plain-English end-to-end walkthrough of the four workflows and how their output maps to database tables.
3. **[crawlers/PIPELINE.md](crawlers/PIPELINE.md)**: Architecture of the Python modules, source-by-source notes, listing-tier logic.
4. **[n8n/README.md](n8n/README.md)**: How to stand up n8n locally with Docker and import the workflow JSONs.

## Quick start (local, no n8n)

```bash
# 1. Install Python deps (requests, beautifulsoup4, rapidfuzz, pdfplumber)
python3 -m venv .venv && source .venv/bin/activate
pip install requests beautifulsoup4 rapidfuzz pdfplumber

# 2. Configure API keys
cp crawlers/config.example.json crawlers/config.json
# Edit crawlers/config.json and add your Serper key (free 2,500/month from serper.dev)
# Optionally add an Anthropic API key for AI pricing extraction

# 3. Inspect the included dev database (1,463 providers, 121 with pricing)
sqlite3 database/providers.db ".tables"
sqlite3 database/providers.db "SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier"

# 4. Or start fresh and run the pipeline end-to-end
rm database/providers.db
sqlite3 database/providers.db < database/schema_sqlite.sql
cd crawlers
./run_overnight.sh   # equivalent to workflows 1–3 back to back
```

## API keys needed

| Service | Purpose | Cost | Where |
|---|---|---|---|
| Serper.dev | Website discovery via Google search | 2,500 free/mo | https://serper.dev |
| ABR (Australian Business Register) | ABN validation | Free (GUID required) | https://abr.business.gov.au/Tools/WebServices |
| Anthropic | AI pricing extraction (Claude Haiku) | ~$2 / full run | https://console.anthropic.com |

Only Serper is required to run the full pipeline. Anthropic is only needed for the AI extraction step in Workflow 3. That step can be skipped: the Python crawl still populates pricing text into `source_record` for manual review.

## Database

The included `database/providers.db` is a live SQLite snapshot from the last overnight run: 1,463 unique providers, 994 with websites, 121 with pricing.

Schema files:

- `database/schema_sqlite.sql`: SQLite schema, used for local dev
- `database/schema.sql`: Postgres schema (production target)
- `database/seed_verified.sql`: seed data for partner (`verified = 1`) providers
- `database/PROVIDER-SCHEMA-SPEC.md`: schema commentary and design notes
- `database/IMAGE-MAPPING.md`: asset conventions for verified providers

Per `n8n/PROCESS.md`, the pipeline only writes to a defined subset of columns; `verified`, `hidden`, branding, and `funeral_home_id` are admin-only.
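
The dev database can also be inspected from Python rather than the `sqlite3` CLI. The following is a minimal sketch, not part of the repo: it reuses the `funeral_brand` table and `listing_tier` column from the quick-start query, while the helper name `tier_counts` is hypothetical and the relative path assumes you run it from the repo root.

```python
import sqlite3
from pathlib import Path


def tier_counts(conn: sqlite3.Connection) -> dict:
    """Return {listing_tier: provider_count} from the funeral_brand table.

    Hypothetical helper for illustration; it runs the same query as the
    quick-start sqlite3 command above.
    """
    rows = conn.execute(
        "SELECT listing_tier, COUNT(*) FROM funeral_brand "
        "GROUP BY listing_tier ORDER BY listing_tier"
    ).fetchall()
    return {tier: count for tier, count in rows}


if __name__ == "__main__":
    # Assumption: the script is run from the repo root.
    db_path = Path("database/providers.db")
    if db_path.exists():
        conn = sqlite3.connect(db_path)
        try:
            for tier, count in tier_counts(conn).items():
                print(f"{tier}: {count}")
        finally:
            conn.close()
    else:
        print(f"{db_path} not found; run this from the repo root")
```

Because the helper only reads, it is safe to run against the included snapshot and does not touch any of the admin-only columns.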

## Status

What's built and working:

- All three source crawlers (VIC Register, Funerals Australia, NFDA)
- Cross-source deduplication with fuzzy name/postcode/ABN matching
- Website discovery via Serper + DDG + URL guessing
- ABN lookup via ABR
- Website enrichment: meta descriptions, pricing page discovery, PDF detection
- Listing tier computation
- Four n8n workflow JSONs ready to import

What's not yet built (open to pick up):

- Google Places API integration for richer location data (ratings, place_id)
- Playwright-based enrichment for JS-rendered sites (~37% of sites)
- Admin review UI for approving hidden providers before they go live
- Stale package cleanup in the monthly refresh

## Contact

Richie, richie@tensordesign.com.au