Initial commit: funeral provider discovery pipeline
- Python crawlers for VIC Register, Funerals Australia, NFDA
- n8n workflows for scheduled discovery and enrichment
- SQLite schema and seeded dev database (1,463 providers)
- End-to-end process documentation in n8n/PROCESS.md
196
n8n/PROCESS.md
Normal file
@@ -0,0 +1,196 @@
# Provider Discovery Pipeline: End-to-End Process

Plain-English walkthrough of what the n8n workflows do, in what order, and how the data they produce lands in the database.

The four workflows in `workflows/` together form a continuous pipeline:
**Discover → Find websites → Enrich with pricing → Refresh periodically.**
Each workflow is an n8n schedule that shells out to Python scripts in `/opt/crawlers` (the `crawlers/` folder, mounted into the n8n container).

---

## The big picture

We're trying to populate the site with every funeral director in Australia, even before they've signed up with us. A provider starts life as a name and phone number from a public register and is progressively enriched (website, description, packages, prices) until it either has enough data to be useful or we've exhausted what's publicly available.

All discovered providers are **hidden by default** (`funeral_brand.hidden = 1`) and **unverified** (`verified = 0`) until an admin reviews them. The pipeline never modifies a provider that has signed up (`verified = 1`); those are treated as authoritative.

A provider's data quality is summarised by a `listing_tier`:

| Tier | Means |
|------|-------|
| `listed` | Contact details only; we know the business exists |
| `estimated` | At least one package with a total price |
| `priced` | Two or more packages with itemised line items |
| `verified` | Signed-up partner (set manually, not by the pipeline) |

The tier is recomputed after every enrichment pass and drives what the frontend shows.

---

## Workflow 1: Weekly Discovery
**Runs:** Mondays at 02:00 AEST
**File:** `workflows/1_weekly_discovery.json`

### What it does
Three source crawlers fan out from the schedule trigger, one branch per public register. With `executionOrder: v1`, n8n works through the branches one after another rather than truly in parallel:

1. **VIC Consumer Affairs Register** (`crawl_vic_register.py`): ~796 Victorian funeral directors, scraped from the government register HTML.
2. **Funerals Australia** (`crawl_funerals_australia.py`): ~997 members, fetched from their AJAX member-search API.
3. **NFDA** (`crawl_nfda.py`): ~209 records from their WordPress store-locator API.

Each crawler writes its raw response to `source_record` and logs the run to `source_log`. Once all three have finished, the merge step hands over to `dedup.py`, which is the interesting part. It matches records across sources on a combination of fuzzy name, postcode and (when available) ABN, merges duplicates into a single `funeral_brand` row, and attaches the per-source records to it.
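
The matching cascade can be sketched roughly as follows. This is an illustration only: `dedup.py` itself is not shown in this commit view, so the record shape, the `normalise_name` helper, the stop-word list and the 0.85 similarity threshold are all assumptions. Only the three `match_type` labels come from this document.

```python
import re
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.85  # assumed cut-off, not taken from dedup.py
STOP_WORDS = {"pty", "ltd", "limited", "the"}  # assumed noise words


def normalise_name(name):
    """Lowercase, drop punctuation and common company suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(w for w in cleaned.split() if w not in STOP_WORDS)


def match_type(a, b):
    """Return 'abn', 'name_postcode', 'fuzzy_name', or None for two records.

    Records are assumed to be dicts with 'name' and optional 'postcode'/'abn'.
    """
    # Strongest key first: an identical ABN identifies the same legal entity.
    if a.get("abn") and a.get("abn") == b.get("abn"):
        return "abn"

    name_a, name_b = normalise_name(a["name"]), normalise_name(b["name"])
    same_postcode = bool(a.get("postcode")) and a.get("postcode") == b.get("postcode")

    if same_postcode and name_a == name_b:
        return "name_postcode"

    # Fuzzy fallback, only trusted within the same postcode (an assumption).
    similarity = SequenceMatcher(None, name_a, name_b).ratio()
    if same_postcode and similarity >= FUZZY_THRESHOLD:
        return "fuzzy_name"
    return None
```

Checking the exact keys first and allowing the fuzzy comparison only within one postcode keeps similarly named businesses in different suburbs apart, at the cost of missing duplicates whose postcodes differ between sources.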

Finally n8n queries how many new `listed`-tier providers appeared in the last 7 days and emits a summary.

### Where the data lands
- `source_log`: one row per crawler run (start/finish, counts, errors).
- `source_record`: one row per raw record pulled from each source (e.g. a VIC Register entry). `raw_data` is the JSON as retrieved; `normalized_data` is the cleaned version.
- `funeral_brand`: one row per unique business (post-dedup). Receives `title`, `phone`, `email`, `website` (if the source provided one), `business_address`, `business_suburb`, `business_state`, `business_postcode`, `source_key`, `source_url`. `hidden = 1`, `verified = 0`, `enrichment_status = 'pending'`, `listing_tier = 'listed'`.
- `location`: one or more rows per brand (multi-location providers). Receives `title`, `address`, `suburb`, `state`, `postcode`, `lat`/`lng` where the source provides them.
- `source_record.matched_brand_id`: back-pointer to the `funeral_brand` row that each raw record was merged into, with `match_type` indicating how (e.g. `abn`, `name_postcode`, `fuzzy_name`).

---

## Workflow 2: Daily Website Discovery
**Runs:** Every day at 04:00 AEST
**File:** `workflows/2_daily_website_discovery.json`

### What it does
For providers where `funeral_brand.website IS NULL`, the workflow tries to find a website in two passes:

1. **ABN Lookup** (`lookup_abn.py`): calls the free Australian Business Register API to validate that the business is real and attach a verified ABN plus registered state/postcode. This doesn't find websites, but it strengthens the dedup key and marks the business as active.
2. **Website discovery** (`discover_websites.py`): uses three strategies in order:
   - **Serper.dev**: Google-backed search ("{business name} {suburb} {state}"), takes the first non-directory result. 2,500 free queries.
   - **DuckDuckGo lite**: free fallback when Serper isn't configured or its quota is exhausted.
   - **URL guessing**: generates plausible domains from the business name (e.g. `smithfunerals.com.au`) and checks if they're live.

Each candidate URL is fetched and validated: the page must load, the title/body must mention the business name, and the domain must not be a known directory (Yellow Pages, True Local, etc.). A confidence level (`confirmed`/`probable`/`unverified`) is recorded.
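
A minimal sketch of the URL-guessing and validation rules might look like this. `discover_websites.py` is not shown in this commit view, so every function name here is hypothetical, the directory list is an illustrative subset, and the mapping from checks to `confirmed`/`probable`/`unverified` is a guess at how the three levels could be assigned. Fetching the page is left out; the sketch takes the already-downloaded title and body.

```python
import re
from urllib.parse import urlparse

# Illustrative subset; the real directory blocklist is assumed to be longer.
DIRECTORY_DOMAINS = {"yellowpages.com.au", "truelocal.com.au"}


def guess_domains(business_name):
    """Generate plausible domains, e.g. 'Smith Funerals' -> smithfunerals.com.au."""
    slug = re.sub(r"[^a-z0-9]", "", business_name.lower())
    return [f"{slug}.com.au", f"{slug}.com"]


def is_directory(url):
    """True when the URL's host is a known directory or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in DIRECTORY_DOMAINS)


def classify(url, business_name, title, body):
    """Return a confidence label, or None when the candidate is rejected."""
    if is_directory(url):
        return None
    name = business_name.lower()
    in_title = name in title.lower()
    in_body = name in body.lower()
    if in_title and in_body:
        return "confirmed"   # assumed rule: name appears in both places
    if in_title or in_body:
        return "probable"    # assumed rule: name appears in one place
    return "unverified"      # assumed rule: page loaded, name not found
```

For example, `guess_domains("Smith Funerals")` starts with `smithfunerals.com.au`, the example domain given above, and any Yellow Pages or True Local URL is rejected before the name check runs.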

Each run processes a batch of 100 providers. With ~469 needing websites, a fresh dataset is worked through in ~5 daily runs.

### Where the data lands
- `funeral_brand.abn`: from ABR lookup.
- `funeral_brand.website`: the validated URL, if found.
- `funeral_brand.business_state` / `business_postcode`: overwritten with ABR values if they were missing or lower-quality.
- `source_record`: a new row with `source_name = 'website_discovery'` capturing the search query, all candidates considered, and why each was rejected. Useful for audit.

---

## Workflow 3: Daily Enrichment
**Runs:** Every day at 06:00 AEST
**File:** `workflows/3_daily_enrichment.json`

This is the most complex workflow and the one that produces pricing data. It has two phases.

### Phase A: Crawl websites (Python)
`enrich_websites.py --limit=50` runs first, picking up providers where `website IS NOT NULL AND enrichment_status = 'pending'`. For each:

1. Fetch the homepage; extract meta description into `funeral_brand.description`.
2. Try ~20 common pricing URL patterns (`/pricing`, `/packages`, `/funeral-costs`, `/transparency`, etc.), parse the sitemap, and follow any link whose text contains "pric", "packag", "cost", or "service".
3. If a pricing page is found, save the cleaned body text. If a pricing PDF is linked, record its URL.
4. Write the result to `source_record` as `source_name = 'website_crawl'`; `raw_data` includes `pricing_text`, `pricing_url`, `pdf_links` and a `has_pricing` flag.
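
The link-following part of step 2 could be sketched with the standard library alone. The class and function names below are hypothetical and `enrich_websites.py` may well use a different HTML parser; only the four keyword stems come from the text above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

KEYWORDS = ("pric", "packag", "cost", "service")  # stems quoted in step 2


class PricingLinkFinder(HTMLParser):
    """Collect absolute URLs of <a> tags whose link text hints at pricing."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).lower()
            if any(keyword in text for keyword in KEYWORDS):
                self.links.append(urljoin(self.base_url, self._href))
            self._href = None


def find_pricing_links(html, base_url):
    """Return candidate pricing URLs found in one page of HTML."""
    finder = PricingLinkFinder(base_url)
    finder.feed(html)
    return finder.links
```

Matching on stems such as "pric" catches "Prices", "Pricing" and "Price list" with one substring test, but "service" is broad and will also match links like "Memorial services", so a real crawler would probably need to rank or filter the candidates further.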

At this point we have raw pricing text but no structured packages yet.

### Phase B: AI extraction (n8n + Claude Haiku)
n8n then queries `source_record` for unprocessed website crawls that have pricing text (>100 chars):

1. For each, it pulls the stored pricing text, truncated to 5000 chars before it is sent.
2. Sends it to Claude Haiku with a strict JSON schema prompt asking for packages, funeral types, prices, and inclusions. The prompt constrains `funeralType` to a fixed set of allowed enum values (the prompt in `3_daily_enrichment.json` lists four: Service & Cremation, Service & Burial, Cremation Only, Graveside Burial) and nudges toward the 16 standard inclusion type names.
3. Parses the JSON response (tolerant of markdown wrapping).
4. Inserts the packages and inclusions back into the DB.
5. Marks the source record processed and the brand as `enrichment_status = 'complete'`.
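
Step 3 is done in an n8n Code node in the workflow itself. For local testing, the same tolerant parsing could be mirrored in Python roughly like this. The function name is hypothetical; the regular expression is the same greedy "first `{` to last `}`" pattern the workflow uses.

```python
import json
import re


def extract_packages(reply_text):
    """Pull the outermost {...} span out of a model reply and return its packages.

    Returns an empty list when no JSON object is found, the JSON is invalid,
    or the object has no usable "packages" list.
    """
    match = re.search(r"\{[\s\S]*\}", reply_text)
    if not match:
        return []
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    packages = parsed.get("packages") if isinstance(parsed, dict) else None
    return packages if isinstance(packages, list) else []
```

Because the pattern is greedy, a reply wrapped in a Markdown code fence or preceded by a sentence of explanation still parses, while a reply containing two separate JSON objects would be rejected as invalid and yield an empty list.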

Finally `compute_tiers.py` runs and promotes brands whose new data now meets the `estimated` or `priced` thresholds.

Batch size is 20 AI extractions per run. At ~$0.002 per call, a full 469-provider pass costs ~$1.
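
The arithmetic behind those figures, worked through with the numbers quoted above (the per-call cost is the document's own approximation):

```python
import math

providers = 469        # providers in a full pass
batch_per_run = 20     # AI extractions per daily run
cost_per_call = 0.002  # approximate USD per Claude Haiku call

runs_needed = math.ceil(providers / batch_per_run)
total_cost = providers * cost_per_call

print(runs_needed, round(total_cost, 2))  # prints: 24 0.94
```

So the cost is indeed about $1, but at 20 extractions per daily run a full pass takes roughly 24 days, which makes the AI step, not the 50-per-day crawl, the slowest stage of the pipeline.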

### Where the data lands
- `funeral_brand.description`: from meta tags on the homepage.
- `funeral_brand.enrichment_status`: `'complete'` on success, `'partial'` or `'failed'` otherwise.
- `funeral_brand.last_enriched_at`: timestamp, used by Workflow 4.
- `source_record`: `source_name = 'website_crawl'` with `raw_data.pricing_text`, `pricing_url`, `pdf_links`, `has_pricing`. `processed_at` is set once AI extraction completes.
- `package`: one row per package found. `title`, `funeral_type` (constrained enum), `brand_id`, `source_url = 'ai_extraction'`, `extraction_confidence = 0.7`.
- `package_inclusion`: one row per line item inside each package. `price`, `optional`, `complimentary`, `inclusion_type_title`, `package_id`.
- `funeral_brand.listing_tier`: recomputed by `compute_tiers.py`.

### How the listing tier gets computed
`compute_tiers.py` looks at each brand's packages:
- 2+ packages, each with at least one priced inclusion → `priced`.
- 1+ packages with a total price → `estimated`.
- Everything else → `listed`.
- `verified = 1` always beats the computed tier.
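
Expressed as code, the rules above come out roughly as follows. This is a sketch of the decision order, not the contents of `compute_tiers.py`: the function name and the dict shape used for packages and inclusions are assumptions made for illustration.

```python
def compute_tier(packages, verified=False):
    """Apply the tier rules to one brand's packages.

    `packages` is assumed to be a list of dicts such as
    {"price": 4500, "inclusions": [{"price": 900}, {"price": None}]}.
    """
    if verified:
        return "verified"  # signed-up partners always beat the computed tier

    def has_priced_inclusion(package):
        return any(i.get("price") is not None for i in package.get("inclusions", []))

    # Two or more packages, every one of them itemised with at least one price.
    if len(packages) >= 2 and all(has_priced_inclusion(p) for p in packages):
        return "priced"
    # At least one package carrying a total price.
    if any(p.get("price") is not None for p in packages):
        return "estimated"
    return "listed"
```

The checks run from strongest to weakest, so a brand with two itemised packages is `priced` even if neither package has a total, and a brand with a single fully itemised package still falls back to `estimated` or `listed`.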

---

## Workflow 4: Monthly Refresh
**Runs:** 1st of each month at 03:00 AEST
**File:** `workflows/4_monthly_refresh.json`

### What it does
Pricing changes. Providers update their sites, add packages, drop services. This workflow keeps the dataset fresh:

1. Find providers where `verified = 0 AND website IS NOT NULL` and `last_enriched_at` is more than 30 days old.
2. Set their `enrichment_status` back to `'pending'`.
3. Re-run `enrich_websites.py --limit=200` against them. This re-crawls pricing pages and writes fresh `source_record` rows (old ones are kept for audit/history).
4. Workflow 3 will then pick them up over the following days for AI re-extraction.
5. `compute_tiers.py` runs to catch any tier changes.
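
Steps 1 and 2 amount to a single `UPDATE`. A minimal sketch against SQLite, assuming `last_enriched_at` is stored as a `YYYY-MM-DD HH:MM:SS` text timestamp (the same style the weekly count query compares against with `datetime('now', '-7 days')`); the function name is hypothetical and the actual SQL in `4_monthly_refresh.json` is not shown here:

```python
import sqlite3

RESET_STALE_SQL = """
UPDATE funeral_brand
   SET enrichment_status = 'pending'
 WHERE verified = 0
   AND website IS NOT NULL
   AND last_enriched_at < datetime('now', '-30 days')
"""


def reset_stale(db):
    """Re-queue stale, unverified brands; return how many rows were reset."""
    cursor = db.execute(RESET_STALE_SQL)
    db.commit()
    return cursor.rowcount
```

Note that rows whose `last_enriched_at` is `NULL` never satisfy the comparison, so brands that have not been enriched yet are left to Workflow 3 rather than being reset here.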

New packages are inserted alongside old ones; `compute_tiers` looks at the current set. (A cleanup of stale packages isn't wired up yet; it is noted in `crawlers/PIPELINE.md` as a future improvement.)

### Where the data lands
Same tables as Workflow 3, but you'll see multiple `source_record` rows per brand over time, which forms a change history.

---

## Schema summary

```
funeral_brand (the provider, one per business)
├─ location (1..n: physical premises with lat/lng)
├─ package (0..n: a pricing offering)
│  └─ package_inclusion (0..n: line items inside the package)
├─ known_for (0..n: descriptive tags, not yet populated by pipeline)
└─ brand_funeral_area (many-to-many → funeral_area: service coverage, not yet populated)

source_log (one per crawler run)
source_record (one per raw record from a source, linked back to funeral_brand)
```

The pipeline never touches `funeral_home` (the parent corporation, e.g. InvoCare) or `funeral_area` (service area definitions); those are populated manually or from other processes.

### Columns the pipeline writes vs. leaves alone

| Column | Written by | Notes |
|--------|------------|-------|
| `funeral_brand.title` | WF1 | From source registries |
| `funeral_brand.phone`, `email` | WF1 | From source registries |
| `funeral_brand.website` | WF1 or WF2 | Source registry if given, else discovered |
| `funeral_brand.abn` | WF2 | From ABR |
| `funeral_brand.description` | WF3 | Meta tags |
| `funeral_brand.business_*` | WF1/WF2 | Preferring ABR values where available |
| `funeral_brand.enrichment_status` | WF3/WF4 | State machine: `pending → partial → complete`, `failed` on error |
| `funeral_brand.last_enriched_at` | WF3 | Used by WF4 for staleness check |
| `funeral_brand.listing_tier` | `compute_tiers.py` | After WF3/WF4 |
| `funeral_brand.source_key`, `source_url` | WF1 | Immutable once set |
| `funeral_brand.verified`, `hidden` | **Never written by pipeline** | Admin-only |
| `funeral_brand.background_colour`, `foreground_colour`, `modal_description`, `funeral_home_id` | **Never written by pipeline** | Admin/branding concern |
| `package.*` | WF3 (Claude Haiku) | `source_url = 'ai_extraction'`, confidence 0.7 |
| `package_inclusion.*` | WF3 (Claude Haiku) | `inclusion_type_title` pulled from a 16-item vocabulary |
| `location.*` | WF1 | `lat`/`lng` only when source provides; `google_place_key`/`rating` require Places API (not yet wired) |

### The admin review flow (out of pipeline scope)

A provider stays `hidden = 1` until an admin reviews it. The intended flow (not yet built; listed under "What's left to do" in the memory) is:
1. Admin UI lists newly enriched brands, sorted by tier.
2. Admin sets `hidden = 0` to publish. They can also set `verified = 1` if the provider has signed on as a partner, which protects that provider from future pipeline updates.

---

## Running manually vs. via n8n

Everything n8n does can be reproduced with shell commands. The `crawlers/run_overnight.sh` script is effectively a single-pass equivalent of Workflows 1–3 back-to-back, useful for local testing or if n8n isn't available.

The n8n workflows are the production scheduler. They batch smaller chunks, run them at sensible hours (keeping server load and external API rate limits in mind), and handle the Claude Haiku HTTP calls natively (the Python scripts don't do AI extraction; they only prepare the text for n8n to send).

See `README.md` in this folder for setup.
110
n8n/README.md
Normal file
@@ -0,0 +1,110 @@
# N8N Workflow Setup

For a plain-English walkthrough of what the pipeline does end-to-end and how
its output conforms to the database schema, see [`PROCESS.md`](./PROCESS.md).

## Prerequisites

- Docker & Docker Compose
- API keys (see below)

## API Keys

Create `crawlers/config.json` from the template:

```bash
cp crawlers/config.example.json crawlers/config.json
```

| Key | Service | Cost | Get it at |
|-----|---------|------|-----------|
| `serper_api_key` | Serper.dev (Google search) | 2,500 free queries | https://serper.dev |
| `abr_guid` | ABR (ABN lookup) | Free | https://abr.business.gov.au/Tools/WebServices |
| `anthropic_api_key` | Claude Haiku (AI extraction) | ~$2/full run | https://console.anthropic.com |

Also set `ANTHROPIC_API_KEY` as an N8N credential/environment variable. The enrichment workflow reads it as `$env.ANTHROPIC_API_KEY`, so it must be present in the n8n container's environment.

## Start N8N

```bash
cd n8n/
docker compose up -d
```

N8N will be available at http://localhost:5678

## Import Workflows

In the N8N UI:

1. Go to **Workflows** → **Import from File**
2. Import each file from `n8n/workflows/`:
   - `1_weekly_discovery.json`: discovers new providers from registries
   - `2_daily_website_discovery.json`: finds provider websites
   - `3_daily_enrichment.json`: crawls sites & AI-extracts pricing
   - `4_monthly_refresh.json`: re-checks pricing for stale data
3. Activate each workflow

## Workflow Schedule

| # | Workflow | Schedule | What It Does |
|---|---------|----------|-------------|
| 1 | Weekly Discovery | Mon 2am AEST | Crawls VIC Register, Funerals AU, NFDA → dedup |
| 2 | Daily Website Discovery | 4am AEST | Finds websites for 100 providers/day |
| 3 | Daily Enrichment | 6am AEST | Crawls 50 websites/day → AI extracts pricing (20/day) |
| 4 | Monthly Refresh | 1st of month, 3am AEST | Re-checks pricing older than 30 days |

## Workflow Flow

```
  Mon 2am        Daily 4am        Daily 6am         Monthly
┌─────────┐     ┌──────────┐     ┌──────────┐     ┌─────────┐
│Registry │     │   ABN    │     │  Crawl   │     │  Reset  │
│Crawlers │     │  Lookup  │     │ Websites │     │  Stale  │
│(VIC,FA, │     │  (free)  │     │ (50/day) │     │Providers│
│ NFDA)   │     │          │     │          │     │         │
└────┬────┘     └────┬─────┘     └────┬─────┘     └────┬────┘
     │               │                │                │
     ▼               ▼                ▼                ▼
┌─────────┐     ┌──────────┐     ┌──────────┐     ┌─────────┐
│ Dedup   │     │  Serper  │     │  Claude  │     │Re-enrich│
│& Merge  │     │  Search  │     │ Haiku AI │     │  Batch  │
│         │     │(100/day) │     │ Extract  │     │         │
└────┬────┘     └────┬─────┘     └────┬─────┘     └────┬────┘
     │               │                │                │
     ▼               ▼                ▼                ▼
New providers   Websites found   Packages &       Updated tiers
queued in DB                     tiers updated
```

## Manual Run

You can also run the pipeline manually without N8N:

```bash
cd crawlers/

# Full pipeline
python3 crawl_all.py
python3 dedup.py
python3 lookup_abn.py --limit=100
python3 discover_websites.py --limit=100
python3 enrich_websites.py --limit=50
python3 compute_tiers.py

# Test mode
python3 crawl_all.py --test
python3 discover_websites.py --limit=5 --state=VIC
python3 enrich_websites.py --limit=3
```

## Database

The pipeline uses SQLite at `database/providers.db` for the demo.
A Postgres schema is at `database/schema.sql` for production.

To reset:
```bash
rm database/providers.db
sqlite3 database/providers.db < database/schema_sqlite.sql
```
53
n8n/docker-compose.yml
Normal file
@@ -0,0 +1,53 @@
version: "3.8"

services:
  n8n:
    image: n8nio/n8n:latest
    restart: unless-stopped
    ports:
      - "5678:5678"
    environment:
      - N8N_HOST=localhost
      - N8N_PORT=5678
      - N8N_PROTOCOL=http
      - WEBHOOK_URL=http://localhost:5678/
      - N8N_ENCRYPTION_KEY=${N8N_ENCRYPTION_KEY:-change-me-in-production}
      # Database
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=postgres
      - DB_POSTGRESDB_PORT=5432
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=n8n
      - DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD:-n8n_password}
      # Allow running shell commands (needed for our Python crawlers)
      - N8N_ALLOW_EXEC=true
      # Timezone
      - GENERIC_TIMEZONE=Australia/Sydney
      - TZ=Australia/Sydney
    volumes:
      - n8n_data:/home/node/.n8n
      # Mount our crawler code so N8N can execute it
      - ../crawlers:/opt/crawlers:ro
      - ../database:/opt/database
    depends_on:
      postgres:
        condition: service_healthy

  postgres:
    image: postgres:16-alpine
    restart: unless-stopped
    environment:
      - POSTGRES_DB=n8n
      - POSTGRES_USER=n8n
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD:-n8n_password}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U n8n"]
      interval: 5s
      timeout: 5s
      retries: 5

volumes:
  n8n_data:
  postgres_data:
142
n8n/workflows/1_weekly_discovery.json
Normal file
@@ -0,0 +1,142 @@
{
  "name": "1. Weekly Provider Discovery",
  "nodes": [
    {
      "parameters": {
        "rule": {
          "interval": [{ "field": "weeks", "weeksInterval": 1, "triggerAtDay": 1, "triggerAtHour": 2 }]
        }
      },
      "id": "schedule",
      "name": "Weekly Schedule",
      "type": "n8n-nodes-base.scheduleTrigger",
      "typeVersion": 1.2,
      "position": [200, 300]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 crawl_vic_register.py 2>&1"
      },
      "id": "crawl_vic",
      "name": "Crawl VIC Register",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [450, 140]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 crawl_funerals_australia.py 2>&1"
      },
      "id": "crawl_fa",
      "name": "Crawl Funerals Australia",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [450, 300]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 crawl_nfda.py 2>&1"
      },
      "id": "crawl_nfda",
      "name": "Crawl NFDA",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [450, 460]
    },
    {
      "parameters": {
        "mode": "passthrough"
      },
      "id": "merge_crawls",
      "name": "Wait for Crawlers",
      "type": "n8n-nodes-base.merge",
      "typeVersion": 3,
      "position": [700, 300]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 dedup.py 2>&1"
      },
      "id": "dedup",
      "name": "Deduplicate & Merge",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [950, 300]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 -c \"from base import get_db; db=get_db(); r=db.execute('SELECT COUNT(*) as n FROM funeral_brand WHERE listing_tier=\\'listed\\' AND created_at > datetime(\\'now\\', \\'-7 days\\')').fetchone(); print(r['n'])\" 2>&1"
      },
      "id": "count_new",
      "name": "Count New Providers",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [1200, 300]
    },
    {
      "parameters": {
        "conditions": {
          "options": { "caseSensitive": true, "leftValue": "", "typeValidation": "strict" },
          "conditions": [
            {
              "id": "new_check",
              "leftValue": "={{ $json.stdout.trim() }}",
              "rightValue": "0",
              "operator": { "type": "string", "operation": "notEquals" }
            }
          ]
        }
      },
      "id": "has_new",
      "name": "Any New Providers?",
      "type": "n8n-nodes-base.if",
      "typeVersion": 2.2,
      "position": [1450, 300]
    },
    {
      "parameters": {
        "jsCode": "const count = $input.first().json.stdout.trim();\nreturn [{ json: { message: `Weekly discovery complete. ${count} new providers added to the database. They are queued for website discovery and enrichment.`, count: parseInt(count) } }];"
      },
      "id": "summary",
      "name": "Build Summary",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [1700, 240]
    },
    {
      "parameters": {
        "jsCode": "return [{ json: { message: 'Weekly discovery complete. No new providers found.' } }];"
      },
      "id": "no_new",
      "name": "No New Providers",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [1700, 420]
    }
  ],
  "connections": {
    "Weekly Schedule": {
      "main": [
        [
          { "node": "Crawl VIC Register", "type": "main", "index": 0 },
          { "node": "Crawl Funerals Australia", "type": "main", "index": 0 },
          { "node": "Crawl NFDA", "type": "main", "index": 0 }
        ]
      ]
    },
    "Crawl VIC Register": { "main": [[ { "node": "Wait for Crawlers", "type": "main", "index": 0 } ]] },
    "Crawl Funerals Australia": { "main": [[ { "node": "Wait for Crawlers", "type": "main", "index": 0 } ]] },
    "Crawl NFDA": { "main": [[ { "node": "Wait for Crawlers", "type": "main", "index": 0 } ]] },
    "Wait for Crawlers": { "main": [[ { "node": "Deduplicate & Merge", "type": "main", "index": 0 } ]] },
    "Deduplicate & Merge": { "main": [[ { "node": "Count New Providers", "type": "main", "index": 0 } ]] },
    "Count New Providers": { "main": [[ { "node": "Any New Providers?", "type": "main", "index": 0 } ]] },
    "Any New Providers?": {
      "main": [
        [{ "node": "Build Summary", "type": "main", "index": 0 }],
        [{ "node": "No New Providers", "type": "main", "index": 0 }]
      ]
    }
  },
  "settings": { "executionOrder": "v1" },
  "tags": [{ "name": "funeral-arranger" }]
}
100
n8n/workflows/2_daily_website_discovery.json
Normal file
@@ -0,0 +1,100 @@
{
  "name": "2. Daily Website Discovery",
  "nodes": [
    {
      "parameters": {
        "rule": {
          "interval": [{ "field": "days", "daysInterval": 1, "triggerAtHour": 4 }]
        }
      },
      "id": "schedule",
      "name": "Daily Schedule",
      "type": "n8n-nodes-base.scheduleTrigger",
      "typeVersion": 1.2,
      "position": [200, 300]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 -c \"from base import get_db; db=get_db(); n=db.execute('SELECT COUNT(*) as n FROM funeral_brand WHERE website IS NULL AND verified=0').fetchone()['n']; print(n)\" 2>&1"
      },
      "id": "check_queue",
      "name": "Check Queue Size",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [450, 300]
    },
    {
      "parameters": {
        "conditions": {
          "conditions": [
            {
              "id": "has_work",
              "leftValue": "={{ parseInt($json.stdout.trim()) }}",
              "rightValue": 0,
              "operator": { "type": "number", "operation": "gt" }
            }
          ]
        }
      },
      "id": "has_work",
      "name": "Providers Need Websites?",
      "type": "n8n-nodes-base.if",
      "typeVersion": 2.2,
      "position": [700, 300]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 lookup_abn.py --limit=100 2>&1"
      },
      "id": "abn_lookup",
      "name": "ABN Lookup (batch 100)",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [950, 200]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 discover_websites.py --limit=100 2>&1"
      },
      "id": "discover",
      "name": "Discover Websites (batch 100)",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [1250, 200]
    },
    {
      "parameters": {
        "jsCode": "const output = $input.first().json.stdout || '';\nconst foundMatch = output.match(/(\\d+) websites found/);\nconst found = foundMatch ? parseInt(foundMatch[1]) : 0;\nreturn [{ json: { message: `Website discovery batch complete. ${found} websites found.`, output } }];"
      },
      "id": "summary",
      "name": "Build Summary",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [1500, 200]
    },
    {
      "parameters": {
        "jsCode": "return [{ json: { message: 'No providers need website discovery.' } }];"
      },
      "id": "skip",
      "name": "Skip",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [950, 420]
    }
  ],
  "connections": {
    "Daily Schedule": { "main": [[ { "node": "Check Queue Size", "type": "main", "index": 0 } ]] },
    "Check Queue Size": { "main": [[ { "node": "Providers Need Websites?", "type": "main", "index": 0 } ]] },
    "Providers Need Websites?": {
      "main": [
        [{ "node": "ABN Lookup (batch 100)", "type": "main", "index": 0 }],
        [{ "node": "Skip", "type": "main", "index": 0 }]
      ]
    },
    "ABN Lookup (batch 100)": { "main": [[ { "node": "Discover Websites (batch 100)", "type": "main", "index": 0 } ]] },
    "Discover Websites (batch 100)": { "main": [[ { "node": "Build Summary", "type": "main", "index": 0 } ]] }
  },
  "settings": { "executionOrder": "v1" },
  "tags": [{ "name": "funeral-arranger" }]
}
146
n8n/workflows/3_daily_enrichment.json
Normal file
@@ -0,0 +1,146 @@
{
  "name": "3. Daily Website Enrichment",
  "nodes": [
    {
      "parameters": {
        "rule": {
          "interval": [{ "field": "days", "daysInterval": 1, "triggerAtHour": 6 }]
        }
      },
      "id": "schedule",
      "name": "Daily Schedule",
      "type": "n8n-nodes-base.scheduleTrigger",
      "typeVersion": 1.2,
      "position": [200, 300]
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 enrich_websites.py --limit=50 2>&1"
      },
      "id": "enrich",
      "name": "Crawl & Extract (batch 50)",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [450, 300],
      "executeOnce": true
    },
    {
      "parameters": {
        "command": "cd /opt/crawlers && python3 -c \"\nimport json, sqlite3\ndb = sqlite3.connect('/opt/database/providers.db')\ndb.row_factory = sqlite3.Row\nrows = db.execute('''\n SELECT sr.id, sr.source_url, sr.matched_brand_id,\n json_extract(sr.raw_data, \\\"$.pricing_text\\\") as pricing_text,\n json_extract(sr.raw_data, \\\"$.has_pricing\\\") as has_pricing\n FROM source_record sr\n WHERE sr.source_name = 'website_crawl'\n AND sr.processed_at IS NULL\n AND json_extract(sr.raw_data, \\\"$.has_pricing\\\") = 1\n LIMIT 20\n''').fetchall()\nresult = [{'id': r['id'], 'brand_id': r['matched_brand_id'], 'url': r['source_url'], 'text_length': len(r['pricing_text'] or '')} for r in rows]\nprint(json.dumps(result))\n\" 2>&1"
      },
      "id": "get_queue",
      "name": "Get Pricing Pages Queue",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [700, 300]
    },
    {
      "parameters": {
        "jsCode": "const output = $input.first().json.stdout.trim();\ntry {\n const items = JSON.parse(output);\n return items.map(item => ({ json: item }));\n} catch(e) {\n return [{ json: { error: 'No pricing pages to process', raw: output } }];\n}"
      },
      "id": "parse_queue",
      "name": "Parse Queue Items",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [950, 300]
    },
    {
      "parameters": {
        "conditions": {
          "conditions": [
            {
              "id": "has_text",
              "leftValue": "={{ $json.text_length }}",
              "rightValue": 100,
              "operator": { "type": "number", "operation": "gt" }
            }
          ]
        }
      },
      "id": "has_text",
      "name": "Has Pricing Text?",
      "type": "n8n-nodes-base.if",
      "typeVersion": 2.2,
      "position": [1200, 300]
    },
    {
      "parameters": {
        "command": "={{ 'cd /opt/crawlers && python3 -c \"import json, sqlite3; db=sqlite3.connect(\\'/opt/database/providers.db\\'); r=db.execute(\\'SELECT json_extract(raw_data, \\\\\\\"$.pricing_text\\\\\\\") as t FROM source_record WHERE id=' + $json.id + '\\').fetchone(); print(r[0][:6000] if r and r[0] else \\'\\')\"' }}"
      },
      "id": "get_text",
      "name": "Get Pricing Text",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [1450, 240]
    },
    {
      "parameters": {
        "url": "https://api.anthropic.com/v1/messages",
        "sendHeaders": true,
        "headerParameters": {
          "parameters": [
            { "name": "x-api-key", "value": "={{ $env.ANTHROPIC_API_KEY }}" },
            { "name": "anthropic-version", "value": "2023-06-01" },
            { "name": "content-type", "value": "application/json" }
          ]
        },
        "sendBody": true,
        "specifyBody": "json",
|
||||
"jsonBody": "={{ JSON.stringify({ model: 'claude-haiku-4-5-20251001', max_tokens: 2048, messages: [{ role: 'user', content: 'Extract funeral packages and pricing from this funeral director\\'s pricing page. Return ONLY valid JSON matching this schema:\\n\\n{\\n \"packages\": [\\n {\\n \"name\": \"Package name\",\\n \"funeralType\": \"one of: Service & Cremation, Service & Burial, Cremation Only, Graveside Burial\",\\n \"price\": 0,\\n \"inclusions\": [\\n {\"item\": \"Inclusion name\", \"price\": 0, \"optional\": false, \"complimentary\": false}\\n ]\\n }\\n ]\\n}\\n\\nUse these inclusion type names where possible: Professional Service Fee, Transportation Service Fee, Professional Mortuary Care, Death Registration Certificate, Cremation Certificate/Permit, Government Levy, Accommodation, Viewing Fee, Coffin, Cremation Fee, Saturday Service Fee, Dressing Fee, Embalming, Digital Recording, Webstreaming, After Hours Transfer Surcharge.\\n\\nIf a price cannot be determined, use null. If no packages/pricing found, return {\"packages\": []}.\\n\\nPricing page text:\\n' + $json.stdout.substring(0, 5000) }] }) }}"
},
"id": "ai_extract",
"name": "AI Extract (Claude Haiku)",
"type": "n8n-nodes-base.httpRequest",
"typeVersion": 4.2,
"position": [1700, 240]
},
{
"parameters": {
"jsCode": "// Process every response, not just the first, and pair each with its own queue item\nconst results = [];\n$input.all().forEach((item, i) => {\n const queueItem = $('Parse Queue Items').itemMatching(i).json;\n let packages = [];\n try {\n const content = item.json.content[0].text;\n // Extract JSON from the response (may be wrapped in markdown)\n const jsonMatch = content.match(/\\{[\\s\\S]*\\}/);\n if (jsonMatch) {\n const parsed = JSON.parse(jsonMatch[0]);\n packages = Array.isArray(parsed.packages) ? parsed.packages : [];\n }\n } catch (e) {\n // AI response wasn't valid JSON; keep packages empty\n }\n results.push({ json: { sourceId: queueItem.id, brandId: queueItem.brand_id, packages, packageCount: packages.length } });\n});\nreturn results;"
},
"id": "parse_ai",
"name": "Parse AI Response",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [1950, 240]
},
{
"parameters": {
"command": "={{ 'cd /opt/crawlers && python3 -c \"\\nimport json, sqlite3\\ndb = sqlite3.connect(\\'/opt/database/providers.db\\')\\npackages = ' + JSON.stringify(JSON.stringify($json.packages)) + '\\npackages = json.loads(packages)\\nbrand_id = ' + $json.brandId + '\\nsource_id = ' + $json.sourceId + '\\n\\nfor pkg in packages:\\n if not pkg.get(\\'price\\'):\\n continue\\n cur = db.execute(\\n \\'INSERT INTO package (title, funeral_type, brand_id, source_url, extraction_confidence) VALUES (?, ?, ?, ?, ?)\\',\\n (pkg[\\'name\\'], pkg.get(\\'funeralType\\'), brand_id, \\'ai_extraction\\', 0.7)\\n )\\n pkg_id = cur.lastrowid\\n for inc in pkg.get(\\'inclusions\\', []):\\n if inc.get(\\'price\\') is not None:\\n db.execute(\\n \\'INSERT INTO package_inclusion (price, optional, complimentary, inclusion_type_title, package_id) VALUES (?, ?, ?, ?, ?)\\',\\n (inc[\\'price\\'], 1 if inc.get(\\'optional\\') else 0, 1 if inc.get(\\'complimentary\\') else 0, inc[\\'item\\'], pkg_id)\\n )\\n\\ndb.execute(\\'UPDATE source_record SET processed_at=datetime(\\\\\\'now\\\\\\') WHERE id=?\\', (source_id,))\\ndb.execute(\\'UPDATE funeral_brand SET enrichment_status=\\\\\\'complete\\\\\\', last_enriched_at=datetime(\\\\\\'now\\\\\\') WHERE id=?\\', (brand_id,))\\ndb.commit()\\nprint(f\\'{len(packages)} packages saved for brand {brand_id}\\')\\n\" 2>&1' }}"
},
"id": "save_packages",
"name": "Save Packages to DB",
"type": "n8n-nodes-base.executeCommand",
"typeVersion": 1,
"position": [2200, 240]
},
{
"parameters": {
"command": "cd /opt/crawlers && python3 compute_tiers.py 2>&1"
},
"id": "recompute_tiers",
"name": "Recompute Listing Tiers",
"type": "n8n-nodes-base.executeCommand",
"typeVersion": 1,
"position": [2450, 300]
}
],
"connections": {
"Daily Schedule": { "main": [[ { "node": "Crawl & Extract (batch 50)", "type": "main", "index": 0 } ]] },
"Crawl & Extract (batch 50)": { "main": [[ { "node": "Get Pricing Pages Queue", "type": "main", "index": 0 } ]] },
"Get Pricing Pages Queue": { "main": [[ { "node": "Parse Queue Items", "type": "main", "index": 0 } ]] },
"Parse Queue Items": { "main": [[ { "node": "Has Pricing Text?", "type": "main", "index": 0 } ]] },
"Has Pricing Text?": {
"main": [
[{ "node": "Get Pricing Text", "type": "main", "index": 0 }],
[{ "node": "Recompute Listing Tiers", "type": "main", "index": 0 }]
]
},
"Get Pricing Text": { "main": [[ { "node": "AI Extract (Claude Haiku)", "type": "main", "index": 0 } ]] },
"AI Extract (Claude Haiku)": { "main": [[ { "node": "Parse AI Response", "type": "main", "index": 0 } ]] },
"Parse AI Response": { "main": [[ { "node": "Save Packages to DB", "type": "main", "index": 0 } ]] },
"Save Packages to DB": { "main": [[ { "node": "Recompute Listing Tiers", "type": "main", "index": 0 } ]] }
},
"settings": { "executionOrder": "v1" },
"tags": [{ "name": "funeral-arranger" }]
}
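
Note on the `Save Packages to DB` node in this workflow: it builds Python source by concatenating the AI output into a shell command, so double quotes, `$`, or backticks in an extracted package name may interact with the surrounding shell quoting. A minimal sketch of the same inserts as a plain function with parameterized queries follows. The function name `save_packages` and the idea of moving this logic into a script under `/opt/crawlers` are assumptions, not part of this commit; table and column names are copied from the node's command.

```python
import sqlite3


def save_packages(db, brand_id, source_id, packages):
    """Insert priced packages and their priced inclusions.

    Hypothetical helper (not in this commit). Returns the number of
    packages actually saved, which can be lower than len(packages).
    """
    saved = 0
    for pkg in packages:
        # Same rule as the workflow node: skip packages without a total price.
        if not pkg.get("price"):
            continue
        cur = db.execute(
            "INSERT INTO package"
            " (title, funeral_type, brand_id, source_url, extraction_confidence)"
            " VALUES (?, ?, ?, ?, ?)",
            (pkg["name"], pkg.get("funeralType"), brand_id, "ai_extraction", 0.7),
        )
        pkg_id = cur.lastrowid
        for inc in pkg.get("inclusions") or []:
            # Inclusions without a price are not stored, as in the node.
            if inc.get("price") is None:
                continue
            db.execute(
                "INSERT INTO package_inclusion"
                " (price, optional, complimentary, inclusion_type_title, package_id)"
                " VALUES (?, ?, ?, ?, ?)",
                (
                    inc["price"],
                    1 if inc.get("optional") else 0,
                    1 if inc.get("complimentary") else 0,
                    inc["item"],
                    pkg_id,
                ),
            )
        saved += 1
    # Mark the source record processed and the brand enriched.
    db.execute(
        "UPDATE source_record SET processed_at = datetime('now') WHERE id = ?",
        (source_id,),
    )
    db.execute(
        "UPDATE funeral_brand SET enrichment_status = 'complete',"
        " last_enriched_at = datetime('now') WHERE id = ?",
        (brand_id,),
    )
    db.commit()
    return saved
```

A wrapper script could read the packages JSON from stdin and take the brand and source ids as arguments, which would keep extracted text out of the shell command line entirely.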
65
n8n/workflows/4_monthly_refresh.json
Normal file
@@ -0,0 +1,65 @@
{
"name": "4. Monthly Re-enrichment",
"nodes": [
{
"parameters": {
"rule": {
"interval": [{ "field": "months", "monthsInterval": 1, "triggerAtDayOfMonth": 1, "triggerAtHour": 3 }]
}
},
"id": "schedule",
"name": "Monthly Schedule",
"type": "n8n-nodes-base.scheduleTrigger",
"typeVersion": 1.2,
"position": [200, 300]
},
{
"parameters": {
"command": "cd /opt/crawlers && python3 -c \"\nimport sqlite3\ndb = sqlite3.connect('/opt/database/providers.db')\n# Reset enrichment for providers last checked > 30 days ago\nupdated = db.execute('''\n UPDATE funeral_brand\n SET enrichment_status = 'pending',\n updated_at = datetime('now')\n WHERE verified = 0\n AND website IS NOT NULL\n AND last_enriched_at < datetime('now', '-30 days')\n''').rowcount\ndb.commit()\nprint(f'{updated} providers queued for re-enrichment')\n\" 2>&1"
},
"id": "reset_stale",
"name": "Queue Stale Providers",
"type": "n8n-nodes-base.executeCommand",
"typeVersion": 1,
"position": [450, 300]
},
{
"parameters": {
"command": "cd /opt/crawlers && python3 enrich_websites.py --limit=200 2>&1"
},
"id": "re_enrich",
"name": "Re-enrich (batch 200)",
"type": "n8n-nodes-base.executeCommand",
"typeVersion": 1,
"position": [700, 300]
},
{
"parameters": {
"command": "cd /opt/crawlers && python3 compute_tiers.py 2>&1"
},
"id": "recompute",
"name": "Recompute Tiers",
"type": "n8n-nodes-base.executeCommand",
"typeVersion": 1,
"position": [950, 300]
},
{
"parameters": {
"jsCode": "const output = $input.first().json.stdout || '';\nreturn [{ json: { message: 'Monthly re-enrichment complete.', output } }];"
},
"id": "summary",
"name": "Summary",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [1200, 300]
}
],
"connections": {
"Monthly Schedule": { "main": [[ { "node": "Queue Stale Providers", "type": "main", "index": 0 } ]] },
"Queue Stale Providers": { "main": [[ { "node": "Re-enrich (batch 200)", "type": "main", "index": 0 } ]] },
"Re-enrich (batch 200)": { "main": [[ { "node": "Recompute Tiers", "type": "main", "index": 0 } ]] },
"Recompute Tiers": { "main": [[ { "node": "Summary", "type": "main", "index": 0 } ]] }
},
"settings": { "executionOrder": "v1" },
"tags": [{ "name": "funeral-arranger" }]
}
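
Note on the `Queue Stale Providers` node in this workflow: its staleness rule can be exercised against an in-memory database to see which rows it selects. The sketch below reuses the node's `UPDATE` statement verbatim; the `queue_stale` helper name, the cut-down `funeral_brand` table, and the sample rows are assumptions for illustration only.

```python
import sqlite3

# The UPDATE statement used by the "Queue Stale Providers" node.
STALE_SQL = """
UPDATE funeral_brand
   SET enrichment_status = 'pending',
       updated_at = datetime('now')
 WHERE verified = 0
   AND website IS NOT NULL
   AND last_enriched_at < datetime('now', '-30 days')
"""


def queue_stale(db):
    """Run the staleness UPDATE and return how many providers were re-queued."""
    updated = db.execute(STALE_SQL).rowcount
    db.commit()
    return updated


# Hypothetical minimal schema: only the columns the UPDATE touches.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE funeral_brand (id INTEGER PRIMARY KEY, verified INTEGER,"
    " website TEXT, enrichment_status TEXT, last_enriched_at TEXT, updated_at TEXT)"
)
rows = [
    (1, 0, "https://a.example", "complete", "2000-01-01 00:00:00"),  # stale: re-queued
    (2, 1, "https://b.example", "complete", "2000-01-01 00:00:00"),  # verified: untouched
    (3, 0, None, "complete", "2000-01-01 00:00:00"),                 # no website: untouched
    (4, 0, "https://d.example", "complete", None),                   # NULL date: comparison is never true
]
db.executemany(
    "INSERT INTO funeral_brand (id, verified, website, enrichment_status, last_enriched_at)"
    " VALUES (?, ?, ?, ?, ?)",
    rows,
)
print(queue_stale(db))  # 1
```

Row 4 shows that a provider whose `last_enriched_at` is NULL is not matched by this rule, so such rows rely on already being `pending` from an earlier stage.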