Initial commit: funeral provider discovery pipeline

Python crawlers for VIC Register, Funerals Australia, NFDA
n8n workflows for scheduled discovery and enrichment
SQLite schema and seeded dev database (1,463 providers)
End-to-end process documentation in n8n/PROCESS.md
Richie
2026-04-24 10:27:08 +10:00
commit cc91427789
30 changed files with 4706 additions and 0 deletions

n8n/README.md

@@ -0,0 +1,110 @@
# N8N Workflow Setup
For a plain-English walkthrough of what the pipeline does end-to-end and how
its output conforms to the database schema, see [`PROCESS.md`](./PROCESS.md).
## Prerequisites
- Docker & Docker Compose
- API keys (see below)
## API Keys
Create `crawlers/config.json` from the template:
```bash
cp crawlers/config.example.json crawlers/config.json
```
| Key | Service | Cost | Get it at |
|-----|---------|------|-----------|
| `serper_api_key` | Serper.dev (Google search) | 2,500 free queries | https://serper.dev |
| `abr_guid` | ABR (ABN lookup) | Free | https://abr.business.gov.au/Tools/WebServices |
| `anthropic_api_key` | Claude Haiku (AI extraction) | ~$2/full run | https://console.anthropic.com |
Also set `ANTHROPIC_API_KEY` as an N8N credential/environment variable.
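A missing or blank key tends to surface only part-way through a run, so it can help to check the config up front. The sketch below is illustrative and not part of the repository: the key names come from the table above, while the flat JSON layout and the helper names `missing_keys` and `load_config` are assumptions.

```python
import json
from pathlib import Path

# Key names are taken from the table above.
# Assumption: config.json is a flat JSON object keyed by these names.
REQUIRED_KEYS = ("serper_api_key", "abr_guid", "anthropic_api_key")


def missing_keys(config: dict) -> list[str]:
    """Return the required keys that are absent or blank in the config."""
    return [k for k in REQUIRED_KEYS if not str(config.get(k) or "").strip()]


def load_config(path: str = "crawlers/config.json") -> dict:
    """Load the config file and fail early if any key has no value."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = missing_keys(config)
    if missing:
        raise ValueError(f"{path} is missing values for: {', '.join(missing)}")
    return config
```

Calling `load_config()` from the repository root before a manual run would raise a `ValueError` that names the keys still to be filled in.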
## Start N8N
```bash
cd n8n/
docker compose up -d
```
N8N will be available at http://localhost:5678
## Import Workflows
In the N8N UI:
1. Go to **Workflows** → **Import from File**
2. Import each file from `n8n/workflows/`:
   - `1_weekly_discovery.json` — discovers new providers from registries
   - `2_daily_website_discovery.json` — finds provider websites
   - `3_daily_enrichment.json` — crawls sites & AI-extracts pricing
   - `4_monthly_refresh.json` — re-checks pricing for stale data
3. Activate each workflow
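Before importing, a quick sanity check on the four export files can catch a missing or truncated file. This is a minimal sketch, not a tool shipped with the repository: the file names are those listed above, the helper `check_workflows` is hypothetical, and the check assumes an n8n export is a JSON object containing a `nodes` list.

```python
import json
from pathlib import Path

# File names come from the import list above.
WORKFLOW_FILES = [
    "1_weekly_discovery.json",
    "2_daily_website_discovery.json",
    "3_daily_enrichment.json",
    "4_monthly_refresh.json",
]


def check_workflows(directory: str = "n8n/workflows") -> dict[str, str]:
    """Map each expected workflow file to 'ok' or a short problem description."""
    results = {}
    for name in WORKFLOW_FILES:
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
            continue
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            results[name] = f"invalid JSON: {exc.msg}"
            continue
        # Assumption: an n8n export is a JSON object with a "nodes" list.
        if isinstance(data, dict) and isinstance(data.get("nodes"), list):
            results[name] = "ok"
        else:
            results[name] = "no 'nodes' list"
    return results
```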
## Workflow Schedule
| # | Workflow | Schedule | What It Does |
|---|---------|----------|-------------|
| 1 | Weekly Discovery | Mon 2am AEST | Crawls VIC Register, Funerals AU, NFDA → dedup |
| 2 | Daily Website Discovery | 4am AEST | Finds websites for 100 providers/day |
| 3 | Daily Enrichment | 6am AEST | Crawls 50 websites/day → AI extracts pricing |
| 4 | Monthly Refresh | 1st of month, 3am AEST | Re-checks pricing older than 30 days |
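The daily batch sizes determine how long a first full pass takes. As a rough worked example, the sketch below uses the 1,463 providers in the seeded dev database and assumes (as an upper bound) that every provider needs both a website lookup and an enrichment crawl.

```python
import math

PROVIDERS = 1463      # seeded dev database size, per the commit message
WEBSITE_BATCH = 100   # workflow 2: providers per day
ENRICH_BATCH = 50     # workflow 3: websites per day


def days_to_process(total: int, per_day: int) -> int:
    """Number of daily runs needed to cover `total` items."""
    return math.ceil(total / per_day)


website_days = days_to_process(PROVIDERS, WEBSITE_BATCH)
enrich_days = days_to_process(PROVIDERS, ENRICH_BATCH)
print(website_days, enrich_days)  # 15 30
```

Under these assumptions, website discovery covers the seeded set in about 15 daily runs and enrichment in about 30, which is roughly the same window the monthly refresh uses to decide that pricing is stale.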
## Workflow Flow
```
  Mon 2am       Daily 4am       Daily 6am        Monthly
┌─────────┐    ┌──────────┐    ┌──────────┐    ┌─────────┐
│Registry │    │   ABN    │    │  Crawl   │    │  Reset  │
│Crawlers │    │  Lookup  │    │ Websites │    │  Stale  │
│(VIC,FA, │    │  (free)  │    │ (50/day) │    │Providers│
│ NFDA)   │    │          │    │          │    │         │
└────┬────┘    └────┬─────┘    └────┬─────┘    └────┬────┘
     │              │               │               │
     ▼              ▼               ▼               ▼
┌─────────┐    ┌──────────┐    ┌──────────┐    ┌─────────┐
│ Dedup   │    │  Serper  │    │  Claude  │    │Re-enrich│
│& Merge  │    │  Search  │    │ Haiku AI │    │  Batch  │
│         │    │(100/day) │    │ Extract  │    │         │
└────┬────┘    └────┬─────┘    └────┬─────┘    └────┬────┘
     │              │               │               │
     ▼              ▼               ▼               ▼
New providers  Websites found  Packages &      Updated tiers
queued in DB                   tiers updated
```
## Manual Run
You can also run the pipeline manually without N8N:
```bash
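cd crawlers/

# Full pipeline
python3 crawl_all.py                       # crawl the registries (VIC Register, Funerals Australia, NFDA)
python3 dedup.py                           # dedup and merge the crawled providers
python3 lookup_abn.py --limit=100          # ABN lookup via ABR
python3 discover_websites.py --limit=100   # find provider websites via Serper search
python3 enrich_websites.py --limit=50      # crawl websites and AI-extract pricing
python3 compute_tiers.py                   # update tiers

# Test mode
python3 crawl_all.py --test
python3 discover_websites.py --limit=5 --state=VIC
python3 enrich_websites.py --limit=3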
```
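Run as a plain shell listing, the commands above keep going even if an earlier step fails. One way to stop at the first failure is a small Python wrapper, sketched below. The script names and flags are copied from the listing; the wrapper itself (`run_pipeline`) is hypothetical and assumes each script exits with a non-zero code on failure.

```python
import subprocess
import sys

# Script names and flags are copied from the commands above.
PIPELINE = [
    ["crawl_all.py"],
    ["dedup.py"],
    ["lookup_abn.py", "--limit=100"],
    ["discover_websites.py", "--limit=100"],
    ["enrich_websites.py", "--limit=50"],
    ["compute_tiers.py"],
]


def run_pipeline(steps=PIPELINE, cwd="crawlers", run=subprocess.run) -> list[str]:
    """Run each step in order and stop at the first non-zero exit code.

    Returns the names of the scripts that completed successfully.
    Assumption: a failing script signals failure through its exit code.
    """
    completed = []
    for step in steps:
        result = run([sys.executable, *step], cwd=cwd)
        if result.returncode != 0:
            print(f"{step[0]} failed with exit code {result.returncode}", file=sys.stderr)
            break
        completed.append(step[0])
    return completed
```

Calling `run_pipeline()` from the repository root runs the full sequence; the `run` parameter exists only so the function can be exercised without launching the real crawlers.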
## Database
The pipeline uses SQLite at `database/providers.db` for the demo.
A Postgres schema is at `database/schema.sql` for production.
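To reset the SQLite database, delete the file and recreate it from the SQLite schema (paths are relative to the repository root):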
```bash
rm database/providers.db
sqlite3 database/providers.db < database/schema_sqlite.sql
```
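Where the `sqlite3` command-line tool is not installed, the same reset can be done with Python's standard library. This is a sketch under the assumption that `database/schema_sqlite.sql` is a plain SQL script; the function name `reset_database` is illustrative and not part of the repository.

```python
import sqlite3
from pathlib import Path


def reset_database(db_path: str = "database/providers.db",
                   schema_path: str = "database/schema_sqlite.sql") -> None:
    """Delete the SQLite file and recreate it from the schema script."""
    # Read the schema first so a missing schema file leaves the old DB intact.
    schema_sql = Path(schema_path).read_text(encoding="utf-8")
    Path(db_path).unlink(missing_ok=True)
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(schema_sql)
        conn.commit()
    finally:
        conn.close()
```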