Initial commit: funeral provider discovery pipeline
- Python crawlers for VIC Register, Funerals Australia, NFDA
- n8n workflows for scheduled discovery and enrichment
- SQLite schema and seeded dev database (1,463 providers)
- End-to-end process documentation in n8n/PROCESS.md
98
README.md
Normal file
@@ -0,0 +1,98 @@
# Provider Crawl

Automated pipeline for discovering Australian funeral directors from public
registries, finding their websites, and extracting pricing data. Feeds the
Funeral Arranger platform with a seed of unverified providers that an admin
reviews before publishing.

## What's in this repo

```
crawlers/    Python modules — one per data source + shared pipeline utilities
n8n/         n8n workflows that orchestrate the crawlers on a schedule
database/    SQLite + Postgres schemas, a seed dump, and a pre-populated dev DB
```

Three documents explain how it works, in increasing depth:

1. **[n8n/PROCESS.md](n8n/PROCESS.md)** — Plain-English end-to-end walkthrough
   of the four workflows and how their output maps to database tables.
   **Start here.**
2. **[crawlers/PIPELINE.md](crawlers/PIPELINE.md)** — Architecture of the
   Python modules, source-by-source notes, listing-tier logic.
3. **[n8n/README.md](n8n/README.md)** — How to stand up n8n locally with
   Docker and import the workflow JSONs.

## Quick start (local, no n8n)

```bash
# 1. Install Python deps (requests, beautifulsoup4, rapidfuzz, pdfplumber)
python3 -m venv .venv && source .venv/bin/activate
pip install requests beautifulsoup4 rapidfuzz pdfplumber

# 2. Configure API keys
cp crawlers/config.example.json crawlers/config.json
# Edit crawlers/config.json and add your Serper key (free 2,500/month from serper.dev)
# Optionally add an Anthropic API key for AI pricing extraction

# 3. Inspect the included dev database (1,463 providers, 121 with pricing)
sqlite3 database/providers.db ".tables"
sqlite3 database/providers.db "SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier"

# 4. Or start fresh and run the pipeline end-to-end
rm database/providers.db
sqlite3 database/providers.db < database/schema_sqlite.sql
cd crawlers
./run_overnight.sh  # equivalent to workflows 1–3 back to back
```
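
The tier query in step 3 can also be run from Python with the standard-library `sqlite3` module. The sketch below is only an illustration: the table and column names come from the query above, while the `demo_connection` helper and its tier values are made up so the example runs without the repo checked out.

```python
import sqlite3


def tier_counts(conn: sqlite3.Connection) -> dict:
    """Return {listing_tier: provider_count} from the funeral_brand table."""
    rows = conn.execute(
        "SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier"
    ).fetchall()
    return {tier: count for tier, count in rows}


def demo_connection() -> sqlite3.Connection:
    """In-memory stand-in for database/providers.db.

    The single-table layout and the tier names are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE funeral_brand (name TEXT, listing_tier TEXT)")
    conn.executemany(
        "INSERT INTO funeral_brand VALUES (?, ?)",
        [("A", "basic"), ("B", "basic"), ("C", "enriched")],
    )
    return conn


if __name__ == "__main__":
    # Against the real snapshot, use sqlite3.connect("database/providers.db").
    print(tier_counts(demo_connection()))
```

Pointing `sqlite3.connect` at `database/providers.db` instead of the in-memory stand-in should give the same breakdown as the shell command.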

## API keys needed

| Service | Purpose | Cost | Where |
|---|---|---|---|
| Serper.dev | Website discovery via Google search | 2,500 free/mo | https://serper.dev |
| ABR (Australian Business Register) | ABN validation | Free (GUID required) | https://abr.business.gov.au/Tools/WebServices |
| Anthropic | AI pricing extraction (Claude Haiku) | ~$2 / full run | https://console.anthropic.com |

Only Serper is required to run the full pipeline. Anthropic is only needed for
the AI extraction step in Workflow 3. That step can be skipped: the Python crawl
still populates pricing text into `source_record` for manual review.
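
A small pre-flight check can confirm which services are configured before a run. This is a sketch under assumptions, not code from the repo: the key names `serper_api_key`, `abr_guid`, and `anthropic_api_key` are hypothetical, so check `crawlers/config.example.json` for the real ones.

```python
import json
from pathlib import Path

# Hypothetical key names; the real ones live in crawlers/config.example.json.
REQUIRED_KEYS = ("serper_api_key",)
OPTIONAL_KEYS = ("abr_guid", "anthropic_api_key")


def check_config(config: dict) -> dict:
    """Raise if a required key is missing; report which optional keys are set."""
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"missing required config keys: {', '.join(missing)}")
    return {key: bool(config.get(key)) for key in OPTIONAL_KEYS}


def load_config(path: str = "crawlers/config.json") -> dict:
    """Read the JSON config file created in step 2 of the quick start."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Demo with an inline dict; use check_config(load_config()) in the repo.
    print(check_config({"serper_api_key": "demo-key"}))
```

Treating only the Serper key as required mirrors the note above that the other two services are optional.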

## Database

The included `database/providers.db` is a live SQLite snapshot from the last
overnight run: 1,463 unique providers, 994 with websites, 121 with pricing.

Schema files:

- `database/schema_sqlite.sql` — SQLite schema, used for local dev
- `database/schema.sql` — Postgres schema (production target)
- `database/seed_verified.sql` — seed data for partner (`verified = 1`) providers
- `database/PROVIDER-SCHEMA-SPEC.md` — schema commentary and design notes
- `database/IMAGE-MAPPING.md` — asset conventions for verified providers

Per `n8n/PROCESS.md`, the pipeline only writes to a defined subset of columns —
`verified`, `hidden`, branding, and `funeral_home_id` are admin-only.
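
One way to enforce the admin-only rule in code is to strip protected columns before building an `UPDATE`. The following is a minimal sketch, not the pipeline's actual write path: `verified`, `hidden`, and `funeral_home_id` come from the note above, while the branding column names (`logo_url`, `brand_colour`), the `website` field, and the `id` primary key are hypothetical.

```python
# verified, hidden and funeral_home_id are named in the README; the two
# branding columns are hypothetical placeholders for the real ones.
ADMIN_ONLY_COLUMNS = frozenset(
    {"verified", "hidden", "funeral_home_id", "logo_url", "brand_colour"}
)


def pipeline_safe_update(fields: dict) -> dict:
    """Drop any admin-only columns from a proposed pipeline write."""
    return {col: val for col, val in fields.items() if col not in ADMIN_ONLY_COLUMNS}


def build_update(table: str, row_id: int, fields: dict):
    """Build a parameterised UPDATE, or return None if nothing is writable.

    Assumes an integer primary key called `id` (hypothetical). Table and
    column names must come from trusted code, never from crawled input.
    """
    safe = pipeline_safe_update(fields)
    if not safe:
        return None
    assignments = ", ".join(f"{col} = ?" for col in safe)
    return f"UPDATE {table} SET {assignments} WHERE id = ?", [*safe.values(), row_id]


if __name__ == "__main__":
    print(build_update("funeral_brand", 7, {"website": "https://example.com", "hidden": 0}))
```

Filtering in one place keeps each crawler from having to know which columns belong to the admin.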

## Status

What's built and working:

- All three source crawlers (VIC Register, Funerals Australia, NFDA)
- Cross-source deduplication with fuzzy name/postcode/ABN matching
- Website discovery via Serper + DDG + URL guessing
- ABN lookup via ABR
- Website enrichment: meta descriptions, pricing page discovery, PDF detection
- Listing tier computation
- Four n8n workflow JSONs ready to import
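
To give a feel for the cross-source deduplication listed above, here is a rough sketch of a name/postcode/ABN comparison. It is not the repo's implementation: it uses the standard-library `difflib` in place of `rapidfuzz` so it runs without dependencies, and the record fields, suffix list, and 0.85 threshold are all assumptions.

```python
import re
from difflib import SequenceMatcher

# Tokens ignored when comparing names (hypothetical list).
_IGNORED_TOKENS = {"pty", "ltd", "the"}


def normalise(name: str) -> str:
    """Lower-case, expand '&', drop punctuation and ignored tokens."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower().replace("&", " and "))
    return " ".join(tok for tok in cleaned.split() if tok not in _IGNORED_TOKENS)


def same_provider(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Decide whether two source records describe the same provider."""
    # A shared ABN is treated as conclusive.
    if a.get("abn") and a.get("abn") == b.get("abn"):
        return True
    # Otherwise require the same postcode and similar normalised names.
    if a.get("postcode") != b.get("postcode"):
        return False
    ratio = SequenceMatcher(None, normalise(a["name"]), normalise(b["name"])).ratio()
    return ratio >= threshold


if __name__ == "__main__":
    vic = {"name": "Smith & Sons Funerals Pty Ltd", "postcode": "3000"}
    nfda = {"name": "Smith and Sons Funerals", "postcode": "3000"}
    print(same_provider(vic, nfda))
```

Checking the ABN first means two records can merge even when one source lists a trading name and the other a legal name.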

What's not yet built (open to pick up):

- Google Places API integration for richer location data (ratings, place_id)
- Playwright-based enrichment for JS-rendered sites (~37% of sites)
- Admin review UI for approving hidden providers before they go live
- Stale package cleanup in the monthly refresh

## Contact

Richie — richie@tensordesign.com.au