# N8N Workflow Setup

For a plain-English walkthrough of what the pipeline does end-to-end and how
its output conforms to the database schema, see [`PROCESS.md`](./PROCESS.md).

## Prerequisites

- Docker & Docker Compose
- API keys (see below)

## API Keys

Create `crawlers/config.json` from the template, then fill in the keys listed in the table:

```bash
cp crawlers/config.example.json crawlers/config.json
```

| Key | Service | Cost | Get it at |
|-----|---------|------|-----------|
| `serper_api_key` | Serper.dev (Google search) | 2,500 free queries | https://serper.dev |
| `abr_guid` | ABR (ABN lookup) | Free | https://abr.business.gov.au/Tools/WebServices |
| `anthropic_api_key` | Claude Haiku (AI extraction) | ~$2/full run | https://console.anthropic.com |

Also set `ANTHROPIC_API_KEY` as an N8N credential or environment variable.
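
Before starting the workflows, it can help to confirm that all three keys are actually filled in. The following is a minimal sketch, not part of the repository: it assumes `crawlers/config.json` is a flat JSON object whose top-level keys match the table, and the helper names `missing_keys` and `load_config` are hypothetical.

```python
import json
from pathlib import Path

# Key names come from the table above; the helpers themselves are illustrative.
REQUIRED_KEYS = ("serper_api_key", "abr_guid", "anthropic_api_key")


def missing_keys(config: dict) -> list[str]:
    """Return the required keys that are absent or blank in the config."""
    return [k for k in REQUIRED_KEYS if not str(config.get(k) or "").strip()]


def load_config(path: str = "crawlers/config.json") -> dict:
    """Read the config file (assumed here to be a flat JSON object)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Calling `missing_keys(load_config())` from the repository root returns an empty list when every key has a value.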

## Start N8N

```bash
cd n8n/
docker compose up -d
```

N8N will be available at http://localhost:5678
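
The container can take a few seconds to start listening. As a rough sketch (the function name `n8n_is_up` is hypothetical, and it only checks that some HTTP server answers at the URL, not that N8N is fully initialised), a reachability check could look like this:

```python
import urllib.error
import urllib.request


def n8n_is_up(url: str = "http://localhost:5678", timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status.
        return True
    except OSError:
        # Connection refused, timeout, DNS failure (URLError is an OSError).
        return False
```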

## Import Workflows

In the N8N UI:

1. Go to **Workflows** → **Import from File**
2. Import each file from `n8n/workflows/`:
   - `1_weekly_discovery.json`: discovers new providers from registries
   - `2_daily_website_discovery.json`: finds provider websites
   - `3_daily_enrichment.json`: crawls sites & AI-extracts pricing
   - `4_monthly_refresh.json`: re-checks pricing for stale data
3. Activate each workflow
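
A quick pre-import check can catch a missing or corrupted workflow file before you open the UI. This sketch uses the four file names listed above; the function `check_workflows` is hypothetical and only verifies that each file exists and parses as JSON, not that it is a valid N8N workflow.

```python
import json
from pathlib import Path

# File names as listed in the import steps above.
WORKFLOW_FILES = (
    "1_weekly_discovery.json",
    "2_daily_website_discovery.json",
    "3_daily_enrichment.json",
    "4_monthly_refresh.json",
)


def check_workflows(directory: str = "n8n/workflows") -> dict[str, str]:
    """Map each expected file name to 'ok', 'missing', or 'invalid JSON'."""
    status = {}
    for name in WORKFLOW_FILES:
        path = Path(directory) / name
        if not path.is_file():
            status[name] = "missing"
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
            status[name] = "ok"
        except json.JSONDecodeError:
            status[name] = "invalid JSON"
    return status
```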

## Workflow Schedule

| # | Workflow | Schedule | What It Does |
|---|---------|----------|-------------|
| 1 | Weekly Discovery | Mon 2am AEST | Crawls VIC Register, Funerals Australia, NFDA → dedup |
| 2 | Daily Website Discovery | 4am AEST | Finds websites for 100 providers/day |
| 3 | Daily Enrichment | 6am AEST | Crawls 50 websites/day → AI extracts pricing |
| 4 | Monthly Refresh | 1st of month, 3am | Re-checks pricing older than 30 days |
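
For reference, the schedules in the table map onto standard five-field cron expressions as sketched here. This is an illustration only: the workflow files may configure their triggers differently, and the expressions assume the scheduler's timezone is already set to AEST. The helper `parse_cron` is hypothetical.

```python
# Five-field cron: minute hour day-of-month month day-of-week (1 = Monday).
SCHEDULES = {
    "Weekly Discovery": "0 2 * * 1",         # Mondays at 02:00
    "Daily Website Discovery": "0 4 * * *",  # every day at 04:00
    "Daily Enrichment": "0 6 * * *",         # every day at 06:00
    "Monthly Refresh": "0 3 1 * *",          # 1st of each month at 03:00
}


def parse_cron(expr: str) -> dict[str, str]:
    """Split a five-field cron expression into named fields."""
    minute, hour, day_of_month, month, day_of_week = expr.split()
    return {
        "minute": minute,
        "hour": hour,
        "day_of_month": day_of_month,
        "month": month,
        "day_of_week": day_of_week,
    }


for name, expr in SCHEDULES.items():
    print(f"{name}: {expr}")
```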

## Workflow Flow

```
  Mon 2am       Daily 4am      Daily 6am       Monthly
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│ Registry │   │   ABN    │   │  Crawl   │   │  Reset   │
│ Crawlers │   │  Lookup  │   │ Websites │   │  Stale   │
│ (VIC,FA, │   │  (free)  │   │ (50/day) │   │Providers │
│  NFDA)   │   │          │   │          │   │          │
└────┬─────┘   └────┬─────┘   └────┬─────┘   └────┬─────┘
     │              │              │              │
     ▼              ▼              ▼              ▼
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│  Dedup   │   │  Serper  │   │  Claude  │   │Re-enrich │
│ & Merge  │   │  Search  │   │ Haiku AI │   │  Batch   │
│          │   │(100/day) │   │ Extract  │   │          │
└────┬─────┘   └────┬─────┘   └────┬─────┘   └────┬─────┘
     │              │              │              │
     ▼              ▼              ▼              ▼
New providers  Websites found Packages &     Updated tiers
queued in DB                  tiers updated
```
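
The daily limits in the diagram determine how long a backlog takes to work through. The arithmetic is sketched here for a hypothetical backlog of 1,000 providers; the backlog size and the function name `days_to_clear` are assumptions for illustration, while the per-day limits are the ones shown in the diagram.

```python
import math

WEBSITE_DISCOVERY_PER_DAY = 100  # Serper search limit shown in the diagram
ENRICHMENT_PER_DAY = 50          # website crawl limit shown in the diagram


def days_to_clear(backlog: int, per_day: int) -> int:
    """Number of daily runs needed to process the whole backlog."""
    return math.ceil(backlog / per_day)


backlog = 1000  # hypothetical number of providers waiting to be processed
print(days_to_clear(backlog, WEBSITE_DISCOVERY_PER_DAY))  # 10
print(days_to_clear(backlog, ENRICHMENT_PER_DAY))         # 20
```

Because enrichment handles half as many providers per day as website discovery, it is the slower stage for any given backlog.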

## Manual Run

You can also run the pipeline manually without N8N:

```bash
cd crawlers/

# Full pipeline
python3 crawl_all.py
python3 dedup.py
python3 lookup_abn.py --limit=100
python3 discover_websites.py --limit=100
python3 enrich_websites.py --limit=50
python3 compute_tiers.py

# Test mode
python3 crawl_all.py --test
python3 discover_websites.py --limit=5 --state=VIC
python3 enrich_websites.py --limit=3
```
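
If you run the full pipeline often, a small wrapper that stops at the first failing step can save a wasted run. This is a sketch rather than a script shipped with the repository: the step list and flags are copied from the "Full pipeline" commands above, while `build_commands` and `run_pipeline` are hypothetical names.

```python
import subprocess
import sys

# Script names and flags copied from the "Full pipeline" commands above.
PIPELINE = [
    ["crawl_all.py"],
    ["dedup.py"],
    ["lookup_abn.py", "--limit=100"],
    ["discover_websites.py", "--limit=100"],
    ["enrich_websites.py", "--limit=50"],
    ["compute_tiers.py"],
]


def build_commands(python: str = sys.executable) -> list[list[str]]:
    """Prefix every step with the Python interpreter to use."""
    return [[python, *step] for step in PIPELINE]


def run_pipeline(cwd: str = "crawlers") -> None:
    """Run each step in order; check=True raises on the first non-zero exit."""
    for cmd in build_commands():
        print("Running:", " ".join(cmd[1:]))
        subprocess.run(cmd, cwd=cwd, check=True)
```

Call `run_pipeline()` from the repository root. A failing step raises `subprocess.CalledProcessError`, so later steps never run against incomplete data.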

## Database

For the demo, the pipeline uses SQLite at `database/providers.db`.
A Postgres schema for production is at `database/schema.sql`.

To reset the SQLite database:

```bash
rm database/providers.db
sqlite3 database/providers.db < database/schema_sqlite.sql
```
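
The same reset can be done from Python when the `sqlite3` command-line tool is not installed. This is a sketch using the paths from the commands above; the function name `reset_database` is hypothetical. Unlike the shell version, it reads the schema before deleting anything, so a missing schema file leaves the existing database untouched.

```python
import sqlite3
from pathlib import Path


def reset_database(
    db_path: str = "database/providers.db",
    schema_path: str = "database/schema_sqlite.sql",
) -> None:
    """Delete the SQLite file and recreate it from the schema script."""
    # Read the schema first so a bad path fails before any data is removed.
    schema_sql = Path(schema_path).read_text(encoding="utf-8")
    Path(db_path).unlink(missing_ok=True)
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(schema_sql)
        conn.commit()
    finally:
        conn.close()
```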