
Richie cc91427789 Initial commit: funeral provider discovery pipeline

Python crawlers for VIC Register, Funerals Australia, NFDA
n8n workflows for scheduled discovery and enrichment
SQLite schema and seeded dev database (1,463 providers)
End-to-end process documentation in n8n/PROCESS.md

2026-04-24 10:27:08 +10:00

4.0 KiB


n8n Workflow Setup

For a plain-English walkthrough of what the pipeline does end-to-end and how its output conforms to the database schema, see PROCESS.md.

Prerequisites

- Docker and Docker Compose
- API keys (see below)

API Keys

Create crawlers/config.json from the template:

cp crawlers/config.example.json crawlers/config.json

| Key | Service | Cost | Get it at |
|-----|---------|------|-----------|
| `serper_api_key` | Serper.dev (Google search) | 2,500 free queries | https://serper.dev |
| `abr_guid` | ABR (ABN lookup) | Free | https://abr.business.gov.au/Tools/WebServices |
| `anthropic_api_key` | Claude Haiku (AI extraction) | ~$2 per full run | https://console.anthropic.com |

Also set ANTHROPIC_API_KEY as an n8n credential or environment variable.
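
A missing key is easier to catch before a scheduled run than during one. The following is a minimal sketch of such a check, assuming crawlers/config.json is a flat JSON object keyed by the names in the table above. The helper names `missing_keys` and `load_config` are hypothetical and are not part of the crawlers.

```python
import json
from pathlib import Path

# Key names come from the API Keys table; the flat-JSON layout is an assumption.
REQUIRED_KEYS = ("serper_api_key", "abr_guid", "anthropic_api_key")


def missing_keys(config: dict) -> list[str]:
    """Return the required keys that are absent or blank in `config`."""
    return [k for k in REQUIRED_KEYS if not str(config.get(k) or "").strip()]


def load_config(path: str = "crawlers/config.json") -> dict:
    """Load the config file and raise if any required key has no value."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = missing_keys(config)
    if missing:
        raise ValueError(f"{path} is missing values for: {', '.join(missing)}")
    return config
```

Calling `load_config()` from the repository root either returns the parsed config or raises a `ValueError` naming the keys that still need a value.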

Start n8n

cd n8n/
docker compose up -d

n8n will be available at http://localhost:5678
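
The container may take a little while to start accepting connections after `docker compose up -d` returns. As a rough sketch, a script can poll the URL above until it answers. The function name `wait_for_http` and the timeout values are illustrative assumptions, not part of this repository.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url: str = "http://localhost:5678", timeout_s: float = 60.0,
                  interval_s: float = 2.0) -> bool:
    """Poll `url` until it returns any HTTP response or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval_s):
                return True
        except urllib.error.HTTPError:
            # The server answered, just not with a 2xx status (for example 401).
            return True
        except (urllib.error.URLError, OSError):
            pass  # Not accepting connections yet.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

`wait_for_http()` returns `True` as soon as something answers on port 5678 and `False` if nothing does within the timeout.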

Import Workflows

In the n8n UI:

1. Go to Workflows → Import from File.
2. Import each file from n8n/workflows/:
   - 1_weekly_discovery.json: discovers new providers from registries
   - 2_daily_website_discovery.json: finds provider websites
   - 3_daily_enrichment.json: crawls sites and AI-extracts pricing
   - 4_monthly_refresh.json: re-checks pricing for stale data
3. Activate each workflow.
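
Before importing, it can help to confirm that all four workflow files are present and parse as JSON. The sketch below assumes it is run from the repository root. `check_workflow_files` is a hypothetical helper, and it only checks that each file exists and is valid JSON, not that n8n will accept it.

```python
import json
from pathlib import Path

# File names are taken from the import list above.
WORKFLOW_FILES = (
    "1_weekly_discovery.json",
    "2_daily_website_discovery.json",
    "3_daily_enrichment.json",
    "4_monthly_refresh.json",
)


def check_workflow_files(directory: str = "n8n/workflows") -> list[str]:
    """Return a list of problems; an empty list means every file exists and parses."""
    problems = []
    for name in WORKFLOW_FILES:
        path = Path(directory) / name
        if not path.is_file():
            problems.append(f"missing: {path}")
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"invalid JSON in {path}: {exc}")
    return problems
```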

Workflow Schedule

| # | Workflow | Schedule | What It Does |
|---|----------|----------|--------------|
| 1 | Weekly Discovery | Mon 2am AEST | Crawls VIC Register, Funerals Australia, NFDA, then deduplicates |
| 2 | Daily Website Discovery | 4am AEST | Finds websites for 100 providers/day |
| 3 | Daily Enrichment | 6am AEST | Crawls 50 websites/day, then AI-extracts pricing |
| 4 | Monthly Refresh | 1st of month, 3am | Re-checks pricing older than 30 days |
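
The daily limits determine how long a fresh database takes to work through. As a back-of-the-envelope sketch, assuming every one of the 1,463 seeded providers mentioned in the initial commit needs both website discovery and enrichment (which may overstate the real backlog):

```python
import math

SEEDED_PROVIDERS = 1463          # size of the seeded dev database
WEBSITE_DISCOVERY_PER_DAY = 100  # workflow 2 limit
ENRICHMENT_PER_DAY = 50          # workflow 3 limit


def days_to_process(total: int, per_day: int) -> int:
    """Number of daily runs needed to get through `total` items."""
    return math.ceil(total / per_day)


print(days_to_process(SEEDED_PROVIDERS, WEBSITE_DISCOVERY_PER_DAY))  # 15
print(days_to_process(SEEDED_PROVIDERS, ENRICHMENT_PER_DAY))         # 30
```

Under that assumption, enrichment is the slower stage: a full first pass takes about 30 daily runs, which is roughly the same as the 30-day staleness window used by the monthly refresh.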

Workflow Flow

  Mon 2am          Daily 4am           Daily 6am         Monthly
  ┌────────┐      ┌──────────┐       ┌──────────┐      ┌─────────┐
  │Registry │      │  ABN     │       │ Crawl    │      │ Reset   │
  │Crawlers │      │  Lookup  │       │ Websites │      │ Stale   │
  │(VIC,FA, │      │  (free)  │       │ (50/day) │      │Providers│
  │ NFDA)   │      │          │       │          │      │         │
  └────┬───┘      └────┬────┘       └────┬────┘      └────┬────┘
       │               │                 │                 │
       ▼               ▼                 ▼                 ▼
  ┌────────┐      ┌──────────┐       ┌──────────┐      ┌─────────┐
  │ Dedup  │      │ Serper   │       │ Claude   │      │Re-enrich│
  │& Merge │      │ Search   │       │ Haiku AI │      │  Batch  │
  │        │      │(100/day) │       │ Extract  │      │         │
  └────┬───┘      └────┬────┘       └────┬────┘      └────┬────┘
       │               │                 │                 │
       ▼               ▼                 ▼                 ▼
  New providers    Websites found     Packages &       Updated tiers
  queued           in DB              tiers updated
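
The monthly column resets providers whose pricing is older than 30 days. One way to express that selection is sketched below. The table name `providers` and the column `pricing_checked_at` are hypothetical and may not match the real schema, and the sketch assumes timestamps are stored as ISO-8601 UTC strings so that string comparison matches chronological order.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def stale_provider_ids(conn: sqlite3.Connection, now: datetime,
                       max_age_days: int = 30) -> list[int]:
    """Return ids of providers whose pricing was last checked before the cutoff."""
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    rows = conn.execute(
        # `providers` and `pricing_checked_at` are hypothetical names.
        "SELECT id FROM providers WHERE pricing_checked_at < ? ORDER BY id",
        (cutoff,),
    )
    return [row[0] for row in rows]
```

For example, `stale_provider_ids(conn, datetime.now(timezone.utc))` would list the providers to re-enrich in the next batch.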

Manual Run

You can also run the pipeline manually, without n8n:

cd crawlers/

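# Full pipeline (run in this order)
python3 crawl_all.py                      # crawl the provider registries
python3 dedup.py                          # deduplicate and merge crawled providers
python3 lookup_abn.py --limit=100         # ABN lookup for up to 100 providers
python3 discover_websites.py --limit=100  # find websites for up to 100 providers
python3 enrich_websites.py --limit=50     # crawl up to 50 websites and AI-extract pricing
python3 compute_tiers.py                  # compute tiers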

# Test mode
python3 crawl_all.py --test
python3 discover_websites.py --limit=5 --state=VIC
python3 enrich_websites.py --limit=3
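
Because each step of the full pipeline depends on the one before it, a manual run should stop at the first failure. The following is a minimal sketch of a runner that does this, using the same scripts and limits as the full pipeline commands above. `run_pipeline` is a hypothetical wrapper and is not part of the crawlers; it assumes it is started from the repository root.

```python
import subprocess
import sys

# Same order and limits as the "Full pipeline" commands.
PIPELINE_STEPS = [
    ["crawl_all.py"],
    ["dedup.py"],
    ["lookup_abn.py", "--limit=100"],
    ["discover_websites.py", "--limit=100"],
    ["enrich_websites.py", "--limit=50"],
    ["compute_tiers.py"],
]


def run_pipeline(steps=PIPELINE_STEPS, cwd="crawlers", run=subprocess.run) -> int:
    """Run each step in order; return the first non-zero exit code, or 0 on success."""
    for step in steps:
        result = run([sys.executable, *step], cwd=cwd)
        if result.returncode != 0:
            print(f"step failed: {' '.join(step)} (exit {result.returncode})",
                  file=sys.stderr)
            return result.returncode
    return 0
```

The `run` parameter exists only so the runner can be exercised with a stub instead of launching real crawls.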

Database

The pipeline uses SQLite at database/providers.db for the demo. A Postgres schema is at database/schema.sql for production.

To reset the SQLite database, delete it and recreate it from the SQLite schema:

rm database/providers.db
sqlite3 database/providers.db < database/schema_sqlite.sql
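
On machines without the `sqlite3` command-line tool, the same reset can be done from Python's standard library. This is a sketch under the assumption that database/schema_sqlite.sql is a plain SQL script. `reset_database` is a hypothetical helper name.

```python
import sqlite3
from pathlib import Path


def reset_database(db_path: str = "database/providers.db",
                   schema_path: str = "database/schema_sqlite.sql") -> None:
    """Delete the SQLite file if present and recreate it from the schema script."""
    # Read the schema first so a missing file fails before anything is deleted.
    schema_sql = Path(schema_path).read_text(encoding="utf-8")
    Path(db_path).unlink(missing_ok=True)
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(schema_sql)
        conn.commit()
    finally:
        conn.close()
```

As with the shell commands, this discards whatever data the existing database file contained.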
