Initial commit: funeral provider discovery pipeline

- Python crawlers for VIC Register, Funerals Australia, NFDA
- n8n workflows for scheduled discovery and enrichment
- SQLite schema and seeded dev database (1,463 providers)
- End-to-end process documentation in n8n/PROCESS.md
2026-04-24 10:27:08 +10:00
commit cc91427789
30 changed files with 4706 additions and 0 deletions
--- a/n8n/README.md
+++ b/n8n/README.md
@@ -0,0 +1,110 @@
+# N8N Workflow Setup
+
+For a plain-English walkthrough of what the pipeline does end-to-end and how
+its output conforms to the database schema, see [`PROCESS.md`](./PROCESS.md).
+
+## Prerequisites
+
+- Docker & Docker Compose
+- API keys (see below)
+
+## API Keys
+
+Create `crawlers/config.json` from the template:
+
+```bash
+cp crawlers/config.example.json crawlers/config.json
+```
+
+| Key | Service | Cost | Get it at |
+|-----|---------|------|-----------|
+| `serper_api_key` | Serper.dev (Google search) | 2,500 free queries | https://serper.dev |
+| `abr_guid` | ABR (ABN lookup) | Free | https://abr.business.gov.au/Tools/WebServices |
+| `anthropic_api_key` | Claude Haiku (AI extraction) | ~$2/full run | https://console.anthropic.com |
+
+Also set `ANTHROPIC_API_KEY` as an N8N credential/environment variable.
+
+## Start N8N
+
+```bash
+cd n8n/
+docker compose up -d
+```
+
+N8N will be available at http://localhost:5678
+
+## Import Workflows
+
+In the N8N UI:
+
+1. Go to **Workflows** → **Import from File**
+2. Import each file from `n8n/workflows/`:
+   - `1_weekly_discovery.json` — discovers new providers from registries
+   - `2_daily_website_discovery.json` — finds provider websites
+   - `3_daily_enrichment.json` — crawls sites & AI-extracts pricing
+   - `4_monthly_refresh.json` — re-checks pricing for stale data
+3. Activate each workflow
+
+## Workflow Schedule
+
+| # | Workflow | Schedule | What It Does |
+|---|---------|----------|-------------|
+| 1 | Weekly Discovery | Mon 2am AEST | Crawls VIC Register, Funerals AU, NFDA → dedup |
+| 2 | Daily Website Discovery | 4am AEST | Finds websites for 100 providers/day |
+| 3 | Daily Enrichment | 6am AEST | Crawls 50 websites/day → AI extracts pricing |
+| 4 | Monthly Refresh | 1st of month, 3am | Re-checks pricing older than 30 days |
+
+## Workflow Flow
+
+```
+  Mon 2am          Daily 4am           Daily 6am         Monthly
+  ┌─────────┐      ┌──────────┐       ┌──────────┐      ┌─────────┐
+  │Registry │      │  ABN     │       │ Crawl    │      │ Reset   │
+  │Crawlers │      │  Lookup  │       │ Websites │      │ Stale   │
+  │(VIC,FA, │      │  (free)  │       │ (50/day) │      │Providers│
+  │ NFDA)   │      │          │       │          │      │         │
+  └────┬────┘      └────┬─────┘       └────┬─────┘      └────┬────┘
+       │                │                  │                 │
+       ▼                ▼                  ▼                 ▼
+  ┌─────────┐      ┌──────────┐       ┌──────────┐      ┌─────────┐
+  │ Dedup   │      │ Serper   │       │ Claude   │      │Re-enrich│
+  │ & Merge │      │ Search   │       │ Haiku AI │      │  Batch  │
+  │         │      │(100/day) │       │ Extract  │      │         │
+  └────┬────┘      └────┬─────┘       └────┬─────┘      └────┬────┘
+       │                │                  │                 │
+       ▼                ▼                  ▼                 ▼
+  New providers    Websites found     Packages &       Updated tiers
+  queued           in DB              tiers updated
+```
+
+## Manual Run
+
+You can also run the pipeline manually without N8N:
+
+```bash
+cd crawlers/
+
+# Full pipeline
+python3 crawl_all.py
+python3 dedup.py
+python3 lookup_abn.py --limit=100
+python3 discover_websites.py --limit=100
+python3 enrich_websites.py --limit=50
+python3 compute_tiers.py
+
+# Test mode
+python3 crawl_all.py --test
+python3 discover_websites.py --limit=5 --state=VIC
+python3 enrich_websites.py --limit=3
+```
+
+## Database
+
+The pipeline uses SQLite at `database/providers.db` for the demo.
+A Postgres schema is at `database/schema.sql` for production.
+
+To reset the SQLite database, delete it and recreate it from the schema:
+```bash
+rm database/providers.db
+sqlite3 database/providers.db < database/schema_sqlite.sql
+```