# N8N Workflow Setup

For a plain-English walkthrough of what the pipeline does end-to-end and how
its output conforms to the database schema, see [`PROCESS.md`](./PROCESS.md).

## Prerequisites

- Docker & Docker Compose
- API keys (see below)

## API Keys

Create `crawlers/config.json` from the template:

```bash
cp crawlers/config.example.json crawlers/config.json
```

| Key | Service | Cost | Get it at |
|-----|---------|------|-----------|
| `serper_api_key` | Serper.dev (Google search) | 2,500 free queries | https://serper.dev |
| `abr_guid` | ABR (ABN lookup) | Free | https://abr.business.gov.au/Tools/WebServices |
| `anthropic_api_key` | Claude Haiku (AI extraction) | ~$2/full run | https://console.anthropic.com |

Also set `ANTHROPIC_API_KEY` in N8N, either as a credential or as an environment variable.
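Before starting the workflows, it can help to confirm that every key in the table has a value. The following is a minimal sketch, assuming `crawlers/config.json` is a flat JSON object keyed by the names above; the helper names `missing_keys` and `check_config` are illustrative and not part of the repository.

```python
import json
from pathlib import Path

# Key names come from the table above; the flat-object layout is an assumption.
REQUIRED_KEYS = ("serper_api_key", "abr_guid", "anthropic_api_key")


def missing_keys(config: dict) -> list[str]:
    """Return the required keys that are absent or blank, in table order."""
    return [k for k in REQUIRED_KEYS if not str(config.get(k) or "").strip()]


def check_config(path: str = "crawlers/config.json") -> list[str]:
    """Load the JSON config file and report which required keys still need values."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    return missing_keys(config)
```

Calling `check_config()` from the repository root returns an empty list when all three keys are filled in.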

## Start N8N

```bash
cd n8n/
docker compose up -d
```

N8N will be available at http://localhost:5678
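To confirm the container is up without opening a browser, a small health probe can be used. This sketch assumes the N8N version in use exposes a `/healthz` endpoint on the same port; the function name `n8n_is_up` is hypothetical.

```python
import urllib.request


def n8n_is_up(base_url: str = "http://localhost:5678", timeout: float = 3.0) -> bool:
    """Return True if GET <base_url>/healthz answers with HTTP 200."""
    # An empty ProxyHandler bypasses any system proxy, which suits a localhost check.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(f"{base_url}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, timeouts and refused connections all derive from OSError.
        return False
```

If the probe returns `False`, `docker compose logs` in `n8n/` is the usual next place to look.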

## Import Workflows

In the N8N UI:

1. Go to **Workflows** → **Import from File**
2. Import each file from `n8n/workflows/`:
   - `1_weekly_discovery.json` — discovers new providers from registries
   - `2_daily_website_discovery.json` — finds provider websites
   - `3_daily_enrichment.json` — crawls sites & AI-extracts pricing
   - `4_monthly_refresh.json` — re-checks pricing for stale data
3. Activate each workflow

## Workflow Schedule

| # | Workflow | Schedule | What It Does |
|---|---------|----------|-------------|
| 1 | Weekly Discovery | Mon 2am AEST | Crawls VIC Register, Funerals AU, NFDA → dedup |
| 2 | Daily Website Discovery | 4am AEST | Finds websites for 100 providers/day |
| 3 | Daily Enrichment | 6am AEST | Crawls 50 websites/day → AI extracts pricing |
| 4 | Monthly Refresh | 1st of month, 3am | Re-checks pricing older than 30 days |
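The schedule in the table can be expressed as trigger rules. The sketch below is illustrative only: the cron strings are assumed equivalents of the table rows (the workflow files may configure their triggers differently), and all times are taken to be local to the N8N instance, which is assumed to run in Australian Eastern time.

```python
from datetime import datetime

# (workflow name, assumed cron equivalent, predicate over a local datetime)
SCHEDULE = [
    ("Weekly Discovery", "0 2 * * 1", lambda t: t.weekday() == 0 and t.hour == 2),
    ("Daily Website Discovery", "0 4 * * *", lambda t: t.hour == 4),
    ("Daily Enrichment", "0 6 * * *", lambda t: t.hour == 6),
    ("Monthly Refresh", "0 3 1 * *", lambda t: t.day == 1 and t.hour == 3),
]


def due_workflows(now: datetime) -> list[str]:
    """Names of the workflows whose trigger fires at `now` (top of the hour only)."""
    if now.minute != 0:
        return []
    return [name for name, _cron, fires in SCHEDULE if fires(now)]
```

Because the four start times differ, no two workflows start in the same hour, even when the 1st of the month falls on a Monday.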

## Workflow Flow

```
  Mon 2am          Daily 4am          Daily 6am         Monthly
  ┌─────────┐      ┌──────────┐       ┌──────────┐      ┌─────────┐
  │Registry │      │   ABN    │       │  Crawl   │      │  Reset  │
  │Crawlers │      │  Lookup  │       │ Websites │      │  Stale  │
  │(VIC, FA,│      │  (free)  │       │ (50/day) │      │Providers│
  │ NFDA)   │      │          │       │          │      │         │
  └────┬────┘      └────┬─────┘       └────┬─────┘      └────┬────┘
       │                │                  │                 │
       ▼                ▼                  ▼                 ▼
  ┌─────────┐      ┌──────────┐       ┌──────────┐      ┌─────────┐
  │  Dedup  │      │  Serper  │       │  Claude  │      │Re-enrich│
  │ & Merge │      │  Search  │       │ Haiku AI │      │  Batch  │
  │         │      │(100/day) │       │ Extract  │      │         │
  └────┬────┘      └────┬─────┘       └────┬─────┘      └────┬────┘
       │                │                  │                 │
       ▼                ▼                  ▼                 ▼
  New providers    Websites found     Packages &        Updated tiers
  queued           in DB              tiers updated
```
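The Monthly column depends on a staleness rule: pricing older than 30 days is re-checked. A minimal sketch of that rule follows; the input shape (a mapping from provider ID to last-checked timestamp) and both function names are assumptions for illustration, not the pipeline's actual data model.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=30)  # "older than 30 days" from the schedule table


def is_stale(last_checked: datetime, now: datetime) -> bool:
    """True when the pricing was last checked more than 30 days before `now`."""
    return now - last_checked > MAX_AGE


def stale_provider_ids(last_checked_by_id: dict[int, datetime], now: datetime) -> list[int]:
    """IDs whose pricing is stale, oldest first (hypothetical input shape)."""
    stale = [(ts, pid) for pid, ts in last_checked_by_id.items() if is_stale(ts, now)]
    return [pid for _ts, pid in sorted(stale)]
```

Note that a record checked exactly 30 days ago is not yet stale under this reading of "older than".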

## Manual Run

You can also run the pipeline manually without N8N:

```bash
cd crawlers/

# Full pipeline
python3 crawl_all.py
python3 dedup.py
python3 lookup_abn.py --limit=100
python3 discover_websites.py --limit=100
python3 enrich_websites.py --limit=50
python3 compute_tiers.py

# Test mode
python3 crawl_all.py --test
python3 discover_websites.py --limit=5 --state=VIC
python3 enrich_websites.py --limit=3
```
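The full-pipeline commands above run independently, so a failure in one step does not stop the next when they are pasted as a batch. The sketch below shows one way to chain them so that the run aborts on the first error. Script names and flags are copied from the commands above; the wrapper itself (`build_commands`, `run_pipeline`) is hypothetical and not part of the repository.

```python
import subprocess
import sys

# Script names and flags are taken from the commands above, in the same order.
PIPELINE = [
    ["crawl_all.py"],
    ["dedup.py"],
    ["lookup_abn.py", "--limit=100"],
    ["discover_websites.py", "--limit=100"],
    ["enrich_websites.py", "--limit=50"],
    ["compute_tiers.py"],
]


def build_commands(python: str = sys.executable) -> list[list[str]]:
    """Prefix each step with the interpreter to produce argv lists."""
    return [[python, *step] for step in PIPELINE]


def run_pipeline(cwd: str = "crawlers", runner=subprocess.run) -> None:
    """Run every step in order; check=True raises on the first non-zero exit."""
    for cmd in build_commands():
        print("Running:", " ".join(cmd[1:]))
        runner(cmd, cwd=cwd, check=True)
```

Passing a stub as `runner` allows a dry run that only lists the steps without executing them.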

## Database

The pipeline uses SQLite at `database/providers.db` for the demo.
A Postgres schema is at `database/schema.sql` for production.

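To reset the demo database, delete the SQLite file and recreate it from the SQLite schema (`database/schema_sqlite.sql`):

```bash
rm database/providers.db
sqlite3 database/providers.db < database/schema_sqlite.sql
```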
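Where the `sqlite3` command-line tool is not installed, the same reset can be done from Python's standard library. This is a sketch under the assumption that `database/schema_sqlite.sql` is a plain SQL script; the function name `reset_database` is illustrative.

```python
import sqlite3
from pathlib import Path


def reset_database(db_path: str = "database/providers.db",
                   schema_path: str = "database/schema_sqlite.sql") -> None:
    """Delete the SQLite file and rebuild it from the schema script."""
    # Read the schema first so a missing schema file fails before anything is deleted.
    schema_sql = Path(schema_path).read_text(encoding="utf-8")
    Path(db_path).unlink(missing_ok=True)
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(schema_sql)
        conn.commit()
    finally:
        conn.close()
```

Unlike the `rm` command, this version does not fail when the database file is already absent.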