Provider-Crawl/README.md
2026-04-24 10:29:40 +10:00

# Provider Crawl
Automated pipeline for discovering Australian funeral directors from public
registries, finding their websites, and extracting pricing data. Feeds the
Funeral Arranger platform with a seed of unverified providers that an admin
reviews before publishing.
## What's in this repo
```
crawlers/   Python modules: one per data source, plus shared pipeline utilities
n8n/        n8n workflows that orchestrate the crawlers on a schedule
database/   SQLite + Postgres schemas, a seed dump, and a pre-populated dev DB
```
Four documents explain how it works, in increasing depth:
1. **[ONBOARDING.md](ONBOARDING.md)** — Human context: what the project is
for, ground rules, open work, how to get unblocked. **Start here if you're
new to the project.**
2. **[n8n/PROCESS.md](n8n/PROCESS.md)** — Plain-English end-to-end walkthrough
of the four workflows and how their output maps to database tables.
3. **[crawlers/PIPELINE.md](crawlers/PIPELINE.md)** — Architecture of the
Python modules, source-by-source notes, listing-tier logic.
4. **[n8n/README.md](n8n/README.md)** — How to stand up n8n locally with
Docker and import the workflow JSONs.
## Quick start (local, no n8n)
```bash
# 1. Install Python deps (requests, beautifulsoup4, rapidfuzz, pdfplumber)
python3 -m venv .venv && source .venv/bin/activate
pip install requests beautifulsoup4 rapidfuzz pdfplumber
# 2. Configure API keys
cp crawlers/config.example.json crawlers/config.json
# Edit crawlers/config.json and add your Serper key (free 2,500/month from serper.dev)
# Optionally add an Anthropic API key for AI pricing extraction
# 3. Inspect the included dev database (1,463 providers, 121 with pricing)
sqlite3 database/providers.db ".tables"
sqlite3 database/providers.db "SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier"
# 4. Or start fresh and run the pipeline end-to-end
rm database/providers.db
sqlite3 database/providers.db < database/schema_sqlite.sql
cd crawlers
./run_overnight.sh # equivalent to running workflows 1-3 back to back
```
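
The tier-count query from step 3 can also be run from Python, which is handy in a notebook or a test. This is a minimal sketch using only the standard library. The function name `tier_counts` is illustrative and not part of the repo; the table and column names are taken from the `sqlite3` command above.

```python
import sqlite3


def tier_counts(conn: sqlite3.Connection) -> dict:
    """Return {listing_tier: provider_count} from the funeral_brand table."""
    rows = conn.execute(
        "SELECT listing_tier, COUNT(*) FROM funeral_brand GROUP BY listing_tier"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    # Path is relative to the repo root, as in the shell commands above.
    conn = sqlite3.connect("database/providers.db")
    try:
        for tier, count in tier_counts(conn).items():
            print(tier, count)
    finally:
        conn.close()
```

Passing the connection in, rather than a file path, lets the same function run against an in-memory database in tests.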
## API keys needed
| Service | Purpose | Cost | Where |
|---|---|---|---|
| Serper.dev | Website discovery via Google search | 2,500 free/mo | https://serper.dev |
| ABR (Australian Business Register) | ABN validation | Free (GUID required) | https://abr.business.gov.au/Tools/WebServices |
| Anthropic | AI pricing extraction (Claude Haiku) | ~$2 / full run | https://console.anthropic.com |

Only Serper is required to run the full pipeline. Anthropic is needed only for
the AI extraction step in Workflow 3. If you skip that step, the Python crawl
still writes pricing text to `source_record` for manual review.
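
Because only one key is mandatory, a pre-flight check can fail fast when it is missing and simply report the optional ones. The sketch below assumes `crawlers/config.json` is a flat JSON object. The key names `serper_api_key`, `abr_guid`, and `anthropic_api_key` are guesses for illustration, so check `crawlers/config.example.json` for the real ones.

```python
import json
from pathlib import Path

# Hypothetical key names; confirm against crawlers/config.example.json.
REQUIRED_KEYS = ["serper_api_key"]
OPTIONAL_KEYS = ["abr_guid", "anthropic_api_key"]


def check_config(config: dict) -> list:
    """Raise if a required key is missing or empty; return unset optional keys."""
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"Missing required config keys: {', '.join(missing)}")
    return [key for key in OPTIONAL_KEYS if not config.get(key)]


if __name__ == "__main__":
    config = json.loads(Path("crawlers/config.json").read_text())
    for key in check_config(config):
        print(f"note: optional key {key} is not set")
```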
## Database
The included `database/providers.db` is a SQLite snapshot taken after the last
overnight run: 1,463 unique providers, 994 with websites, 121 with pricing.

Schema files:
- `database/schema_sqlite.sql` — SQLite schema, used for local dev
- `database/schema.sql` — Postgres schema (production target)
- `database/seed_verified.sql` — seed data for partner (`verified = 1`) providers
- `database/PROVIDER-SCHEMA-SPEC.md` — schema commentary and design notes
- `database/IMAGE-MAPPING.md` — asset conventions for verified providers

Per `n8n/PROCESS.md`, the pipeline writes only to a defined subset of columns.
`verified`, `hidden`, the branding columns, and `funeral_home_id` are admin-only.
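
One way to enforce that boundary in code is to pass every update through a deny-list before it reaches SQL. This is a sketch, not the repo's actual mechanism: `strip_admin_columns` is a hypothetical helper, and `logo_url`, `brand_colour`, and `website` are placeholder column names, because this README does not list the branding or pipeline-writable columns.

```python
# Columns the README names as admin-only. The two branding names are
# placeholders; the authoritative list is in n8n/PROCESS.md.
ADMIN_ONLY = {"verified", "hidden", "funeral_home_id", "logo_url", "brand_colour"}


def strip_admin_columns(update: dict) -> dict:
    """Return a copy of `update` with any admin-only columns removed."""
    return {col: val for col, val in update.items() if col not in ADMIN_ONLY}


if __name__ == "__main__":
    # "website" is an assumed pipeline-writable column, for illustration only.
    update = {"website": "https://example.com", "verified": 1}
    print(strip_admin_columns(update))
```

A deny-list keeps the guard short, though an allow-list of pipeline-writable columns would be the stricter choice if that list is available.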
## Status
What's built and working:
- All three source crawlers (VIC Register, Funerals Australia, NFDA)
- Cross-source deduplication with fuzzy name/postcode/ABN matching
- Website discovery via Serper + DDG + URL guessing
- ABN lookup via ABR
- Website enrichment: meta descriptions, pricing page discovery, PDF detection
- Listing tier computation
- Four n8n workflow JSONs ready to import
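
To illustrate the cross-source deduplication idea, the sketch below treats two records as the same provider when their ABNs match, or when their postcodes match and their normalized names are similar. It uses the standard library's `difflib` in place of `rapidfuzz` so it runs without extra installs. The 0.85 threshold, the record fields, and the matching rules are all assumptions, not the logic in `crawlers/`.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(name.lower().split())


def same_provider(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Guess whether two source records describe the same provider."""
    # An exact ABN match is treated as decisive.
    if a.get("abn") and a.get("abn") == b.get("abn"):
        return True
    # Otherwise require the same postcode plus a similar name.
    if a.get("postcode") != b.get("postcode"):
        return False
    ratio = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return ratio >= threshold
```

Checking the ABN first avoids a fuzzy comparison when a strong identifier is available, and the postcode gate keeps similarly named providers in different suburbs apart.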

What's not yet built (open to pick up):
- Google Places API integration for richer location data (ratings, place_id)
- Playwright-based enrichment for JS-rendered sites (~37% of sites)
- Admin review UI for approving hidden providers before they go live
- Stale package cleanup in the monthly refresh
## Contact
Richie — richie@tensordesign.com.au