Provider-Crawl/n8n/PROCESS.md
Richie cc91427789 Initial commit: funeral provider discovery pipeline
Python crawlers for VIC Register, Funerals Australia, NFDA
n8n workflows for scheduled discovery and enrichment
SQLite schema and seeded dev database (1,463 providers)
End-to-end process documentation in n8n/PROCESS.md
2026-04-24 10:27:08 +10:00


Provider Discovery Pipeline — End-to-End Process

Plain-English walkthrough of what the n8n workflows do, in what order, and how the data they produce lands in the database.

The four workflows in workflows/ together form a continuous pipeline: Discover → Find websites → Enrich with pricing → Refresh periodically. Each workflow is an n8n schedule that shells out to Python scripts in /opt/crawlers (the crawlers/ folder, mounted into the n8n container).


The big picture

We're trying to populate the site with every funeral director in Australia, even before they've signed up with us. A provider starts life as a name and phone number from a public register and progressively gets enriched — website, description, packages, prices — until it either has enough data to be useful, or we've exhausted what's publicly available.

All discovered providers are hidden by default (funeral_brand.hidden = 1) and unverified (verified = 0) until an admin reviews them. The pipeline never modifies a provider that has signed up (verified = 1) — those are treated as authoritative.

A provider's data quality is summarised by a listing_tier:

| Tier | Means |
| --- | --- |
| listed | Contact details only; we know the business exists |
| estimated | At least one package with a total price |
| priced | Two or more packages with itemised line items |
| verified | Signed-up partner (set manually, not by the pipeline) |

The tier is recomputed after every enrichment pass and drives what the frontend shows.


Workflow 1 — Weekly Discovery

Runs: Mondays at 02:00 AEST
File: workflows/1_weekly_discovery.json

What it does

Three source crawlers run in parallel against public registers:

  1. VIC Consumer Affairs Register (crawl_vic_register.py) — ~796 Victorian funeral directors, scraped from the government register HTML.
  2. Funerals Australia (crawl_funerals_australia.py) — ~997 members, fetched from their AJAX member-search API.
  3. NFDA (crawl_nfda.py) — ~209 records from their WordPress store-locator API.

Each crawler writes its raw response to source_record and logs the run to source_log. Once all three have finished, the merge step runs dedup.py. This is the interesting part: it matches records across sources by a combination of fuzzy name, postcode and (when available) ABN, merges duplicates into a single funeral_brand row, and attaches the per-source records to it.
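To make the matching order concrete, here is a minimal sketch of how two source records might be compared. It is not the code in dedup.py: the function names, the 0.85 similarity threshold, the suffix list, and the rule that a fuzzy match also requires the same postcode are all assumptions made for illustration. Only the three match_type labels come from this document.

```python
import re
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical list of words ignored when comparing names.
IGNORED_WORDS = {"pty", "ltd", "the"}


def normalise_name(name: str) -> str:
    """Lower-case, drop punctuation and a few common company suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(w for w in cleaned.split() if w not in IGNORED_WORDS)


def match_type(a: dict, b: dict, threshold: float = 0.85) -> Optional[str]:
    """Return 'abn', 'name_postcode', 'fuzzy_name', or None if no match."""
    # Strongest signal first: identical ABNs.
    if a.get("abn") and a.get("abn") == b.get("abn"):
        return "abn"
    name_a, name_b = normalise_name(a["name"]), normalise_name(b["name"])
    same_postcode = bool(a.get("postcode")) and a.get("postcode") == b.get("postcode")
    if same_postcode and name_a == name_b:
        return "name_postcode"
    # Assumption: a fuzzy name match only counts within the same postcode.
    if same_postcode and SequenceMatcher(None, name_a, name_b).ratio() >= threshold:
        return "fuzzy_name"
    return None


vic = {"name": "Smith Funerals Pty Ltd", "postcode": "3000", "abn": None}
nfda = {"name": "Smith Funerals", "postcode": "3000", "abn": None}
print(match_type(vic, nfda))  # name_postcode
```

Checking the ABN before the name keeps the cheapest and most reliable test first, so the fuzzy comparison only runs when the stronger keys are missing.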

Finally n8n queries how many new listed-tier providers appeared in the last 7 days and emits a summary.

Where the data lands

  • source_log — one row per crawler run (start/finish, counts, errors).
  • source_record — one row per raw record pulled from each source (e.g. a VIC Register entry). raw_data is the JSON as retrieved; normalized_data is the cleaned version.
  • funeral_brand — one row per unique business (post-dedup). Receives title, phone, email, website (if the source provided one), business_address, business_suburb, business_state, business_postcode, source_key, source_url. hidden = 1, verified = 0, enrichment_status = 'pending', listing_tier = 'listed'.
  • location — one or more rows per brand (multi-location providers). Receives title, address, suburb, state, postcode, lat/lng where the source provides them.
  • source_record.matched_brand_id — back-pointer to the funeral_brand row that each raw record was merged into, with match_type indicating how (e.g. abn, name_postcode, fuzzy_name).

Workflow 2 — Daily Website Discovery

Runs: Every day at 04:00 AEST
File: workflows/2_daily_website_discovery.json

What it does

For providers where funeral_brand.website IS NULL, tries to find a website in two passes:

  1. ABN Lookup (lookup_abn.py) — calls the free Australian Business Register API to validate the business is real and attach a verified ABN + registered state/postcode. This doesn't find websites, but it strengthens the dedup key and marks the business as active.
  2. Website discovery (discover_websites.py) — uses three strategies in order:
    • Serper.dev — Google-backed search ("{business name} {suburb} {state}"), takes the first non-directory result. 2,500 free queries.
    • DuckDuckGo lite — free fallback when Serper isn't configured or exhausted.
    • URL guessing — generates plausible domains from the business name (e.g. smithfunerals.com.au) and checks if they're live.
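The URL-guessing strategy can be sketched as a small pure function. This is an illustration only: guess_domains, the list of ignored words and the choice of TLDs are assumptions, not taken from discover_websites.py. The liveness check is left out so the sketch has no network dependency.

```python
import re

# Hypothetical: words dropped from the business name before building a domain.
IGNORED_WORDS = {"pty", "ltd", "the"}
# Hypothetical: TLDs to try, most likely first for an Australian business.
TLDS = (".com.au", ".au", ".com")


def guess_domains(business_name: str) -> list:
    """Generate plausible website URLs from a business name."""
    words = re.sub(r"[^a-z0-9 ]", "", business_name.lower()).split()
    words = [w for w in words if w not in IGNORED_WORDS]
    stems = ["".join(words), "-".join(words)]
    candidates = []
    for stem in stems:
        for tld in TLDS:
            url = f"https://www.{stem}{tld}"
            # Skip empty stems and duplicates (single-word names give equal stems).
            if stem and url not in candidates:
                candidates.append(url)
    return candidates


print(guess_domains("Smith Funerals")[0])  # https://www.smithfunerals.com.au
```

Each returned URL would still need to be fetched and validated before being stored.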

Each candidate URL is fetched and validated: the page must load, the title/body must mention the business name, and the domain must not be a known directory (Yellow Pages, True Local, etc.). A confidence level (confirmed/probable/unverified) is recorded.
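The validation step might look roughly like the sketch below, which takes an already-fetched title and body. How evidence maps to the three confidence levels is not specified in this document, so the mapping here (name in title, name in body, neither) is an assumption, as are the function name and the short directory list.

```python
from typing import Optional
from urllib.parse import urlparse

# Illustrative subset; the real list of excluded directories is not shown here.
DIRECTORY_DOMAINS = {"yellowpages.com.au", "truelocal.com.au"}


def classify_candidate(url: str, business_name: str, title: str, body: str) -> Optional[str]:
    """Return a confidence level, or None if the URL must be rejected."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Reject known directories, including their subdomains.
    if any(host == d or host.endswith("." + d) for d in DIRECTORY_DOMAINS):
        return None
    name = business_name.lower()
    if name in title.lower():
        return "confirmed"   # assumption: name in the page title
    if name in body.lower():
        return "probable"    # assumption: name only in the body text
    return "unverified"      # assumption: page loaded, name not found


print(classify_candidate(
    "https://www.smithfunerals.com.au", "Smith Funerals",
    "Smith Funerals | Melbourne", "Welcome"))  # confirmed
```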

Each run processes a batch of 100 providers. With ~469 needing websites, a fresh dataset fills up in ~5 days.

Where the data lands

  • funeral_brand.abn — from ABR lookup.
  • funeral_brand.website — the validated URL, if found.
  • funeral_brand.business_state / business_postcode — overwritten with ABR values if they were missing or lower-quality.
  • source_record — a new row with source_name = 'website_discovery' capturing the search query, all candidates considered, and why each was rejected. Useful for audit.

Workflow 3 — Daily Enrichment

Runs: Every day at 06:00 AEST
File: workflows/3_daily_enrichment.json

This is the most complex workflow and the one that produces pricing data. It has two phases.

Phase A — Crawl websites (Python)

enrich_websites.py --limit=50 runs first, picking up providers where website IS NOT NULL AND enrichment_status = 'pending'. For each:

  1. Fetch the homepage; extract meta description into funeral_brand.description.
  2. Try ~20 common pricing URL patterns (/pricing, /packages, /funeral-costs, /transparency, etc.), parse the sitemap, and follow any link whose text contains "pric", "packag", "cost", or "service".
  3. If a pricing page is found, save the cleaned body text. If a pricing PDF is linked, record its URL.
  4. Write the result to source_record with source_name = 'website_crawl'; raw_data includes pricing_text, pricing_url, pdf_links and a has_pricing flag.
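The link-following rule in step 2 can be illustrated with the standard library alone. The sketch below collects anchors from a homepage and keeps those whose text contains one of the four keyword stems listed above. The class and function names are hypothetical, and enrich_websites.py may well use a different HTML parser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Keyword stems taken from step 2 above.
KEYWORDS = ("pric", "packag", "cost", "service")


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_pricing_links(base_url: str, html: str) -> list:
    """Return absolute URLs of links whose text looks pricing-related."""
    parser = LinkCollector()
    parser.feed(html)
    found = []
    for href, text in parser.links:
        if any(k in text.lower() for k in KEYWORDS):
            url = urljoin(base_url, href)
            if url not in found:
                found.append(url)
    return found


page = '<a href="/pricing">Our Prices</a> <a href="/about">About us</a>'
print(find_pricing_links("https://example.com/", page))
# ['https://example.com/pricing']
```

Matching on stems such as "pric" rather than whole words catches "price", "prices" and "pricing" with one test. The "service" stem is broad, so it will also match links that are not about pricing.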

At this point we have raw pricing text but no structured packages yet.

Phase B — AI extraction (n8n + Claude Haiku)

n8n then queries source_record for unprocessed website crawls that have pricing text (>100 chars):

  1. For each, it pulls the full pricing text (up to 5000 chars).
  2. Sends it to Claude Haiku with a strict JSON schema prompt asking for packages, funeral types, prices, and inclusions. The prompt constrains funeralType to the five allowed enum values and nudges toward the 16 standard inclusion type names.
  3. Parses the JSON response (tolerant of markdown wrapping).
  4. Inserts the packages and inclusions back into the DB.
  5. Marks the source record processed and the brand as enrichment_status = 'complete'.
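Step 3's "tolerant of markdown wrapping" usually means stripping a code fence or surrounding chatter before calling the JSON parser. A minimal sketch of that idea follows. The function name is hypothetical, and in the real workflow this logic lives in an n8n node rather than in Python.

```python
import json
import re

# Built at runtime so this example contains no literal code fence.
FENCE = "`" * 3


def parse_model_json(reply: str):
    """Extract and parse a JSON object from a model reply."""
    text = reply.strip()
    # Case 1: the JSON is wrapped in a fenced block, optionally tagged json.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    else:
        # Case 2: prose before or after; keep the outermost {...} span.
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            text = text[start:end + 1]
    return json.loads(text)  # raises json.JSONDecodeError if still invalid


reply = "Here is the result:\n" + FENCE + 'json\n{"packages": []}\n' + FENCE
print(parse_model_json(reply))  # {'packages': []}
```

Letting json.loads raise on anything that is still malformed gives the caller a clear signal for recording a failed extraction instead of inserting partial data.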

Finally compute_tiers.py runs and promotes brands whose new data now meets the estimated or priced thresholds.

Batch size is 20 AI extractions per run. At ~$0.002 per call, a full 469-provider pass costs ~$1.
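The cost estimate above can be checked in a few lines, which also show how many daily runs a full pass would need at the stated batch size. This treats every one of the 469 providers as producing an AI call, so both figures are upper bounds.

```python
import math

providers = 469        # providers in a full pass (figure from this document)
cost_per_call = 0.002  # approximate USD per extraction call
batch_size = 20        # AI extractions per daily run

total_cost = providers * cost_per_call
runs_needed = math.ceil(providers / batch_size)

print(f"~${total_cost:.2f} over {runs_needed} daily runs")
# ~$0.94 over 24 daily runs
```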

Where the data lands

  • funeral_brand.description — from meta tags on the homepage.
  • funeral_brand.enrichment_status = 'complete' on success, 'partial' or 'failed' otherwise.
  • funeral_brand.last_enriched_at — timestamp, used by Workflow 4.
  • source_record (source_name = 'website_crawl') with raw_data.pricing_text, pricing_url, pdf_links, has_pricing. processed_at is set once AI extraction completes.
  • package — one row per package found. title, funeral_type (constrained enum), brand_id, source_url = 'ai_extraction', extraction_confidence = 0.7.
  • package_inclusion — one row per line item inside each package. price, optional, complimentary, inclusion_type_title, package_id.
  • funeral_brand.listing_tier — recomputed by compute_tiers.py.

How the listing tier gets computed

compute_tiers.py looks at each brand's packages:

  • 2+ packages, each with at least one priced inclusion → priced.
  • 1+ packages with a total price → estimated.
  • Everything else → listed.
  • verified = 1 always beats the computed tier.
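These rules can be written as one small function. The sketch below is an illustration, not the contents of compute_tiers.py. It assumes each package is a dict with an optional total_price and a list of inclusions, both hypothetical field names. It also reads the first rule as "at least two packages that each have a priced inclusion".

```python
def compute_tier(verified: bool, packages: list) -> str:
    """Apply the tier rules in priority order."""
    # A signed-up partner always keeps the verified tier.
    if verified:
        return "verified"
    # Packages with at least one inclusion that carries a price.
    itemised = [
        p for p in packages
        if any(i.get("price") is not None for i in p.get("inclusions", []))
    ]
    if len(itemised) >= 2:
        return "priced"
    if any(p.get("total_price") is not None for p in packages):
        return "estimated"
    return "listed"


packages = [
    {"total_price": 4500, "inclusions": [{"price": 1200}]},
    {"total_price": 7900, "inclusions": [{"price": 2500}, {"price": None}]},
]
print(compute_tier(False, packages))  # priced
```

Checking the rules from strongest to weakest means a brand with both itemised packages and a total price lands in the higher tier.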

Workflow 4 — Monthly Refresh

Runs: 1st of each month at 03:00 AEST
File: workflows/4_monthly_refresh.json

What it does

Pricing changes. Providers update their sites, add packages, drop services. This workflow keeps the dataset fresh:

  1. Find providers where verified = 0 AND website IS NOT NULL AND last_enriched_at is more than 30 days in the past.
  2. Set their enrichment_status back to 'pending'.
  3. Re-run enrich_websites.py --limit=200 against them — this re-crawls pricing pages and writes fresh source_record rows (old ones are kept for audit/history).
  4. Workflow 3 will then pick them up over the following days for AI re-extraction.
  5. compute_tiers.py runs to catch any tier changes.
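Steps 1 and 2 amount to a single UPDATE. The sketch below runs it against an in-memory SQLite table containing only the columns involved. The column types, the text timestamp format (UTC, YYYY-MM-DD HH:MM:SS) and the reset_stale helper are assumptions; the real schema is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down, hypothetical version of funeral_brand for demonstration.
conn.execute("""
    CREATE TABLE funeral_brand (
        id INTEGER PRIMARY KEY,
        verified INTEGER,
        website TEXT,
        enrichment_status TEXT,
        last_enriched_at TEXT
    )
""")
conn.executemany(
    "INSERT INTO funeral_brand VALUES (?, ?, ?, ?, ?)",
    [
        (1, 0, "https://a.example", "complete", "2000-01-01 00:00:00"),  # stale
        (2, 1, "https://b.example", "complete", "2000-01-01 00:00:00"),  # verified
        (3, 0, None, "pending", None),                                   # no website
    ],
)
# A freshly enriched brand that must not be reset.
conn.execute(
    "INSERT INTO funeral_brand VALUES "
    "(4, 0, 'https://d.example', 'complete', datetime('now'))"
)


def reset_stale(conn, max_age_days=30):
    """Mark stale, unverified brands with a website as pending again."""
    cur = conn.execute(
        """
        UPDATE funeral_brand
           SET enrichment_status = 'pending'
         WHERE verified = 0
           AND website IS NOT NULL
           AND last_enriched_at < datetime('now', ?)
        """,
        (f"-{max_age_days} days",),
    )
    conn.commit()
    return cur.rowcount


reset_count = reset_stale(conn)
print(reset_count)  # 1
```

Only brand 1 is reset: brand 2 is protected by verified = 1, brand 3 has no website, and brand 4 was enriched too recently.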

New packages are inserted alongside old ones; compute_tiers looks at the current set. (A cleanup of stale packages isn't wired up yet — noted in crawlers/PIPELINE.md as a future improvement.)

Where the data lands

Same tables as Workflow 3, but you'll see multiple source_record rows per brand over time, which forms a change history.


Schema summary

funeral_brand (the provider — one per business)
  ├─ location (1..n — physical premises with lat/lng)
  ├─ package (0..n — a pricing offering)
  │    └─ package_inclusion (0..n — line items inside the package)
  ├─ known_for (0..n — descriptive tags, not yet populated by pipeline)
  └─ brand_funeral_area (many-to-many → funeral_area — service coverage, not yet populated)

source_log (one per crawler run)
source_record (one per raw record from a source, linked back to funeral_brand)

The pipeline never touches funeral_home (the parent corporation, e.g. InvoCare) or funeral_area (service area definitions); those are populated manually or by other processes.

Columns the pipeline writes vs. leaves alone

| Column | Written by | Notes |
| --- | --- | --- |
| funeral_brand.title | WF1 | From source registries |
| funeral_brand.phone, email | WF1 | From source registries |
| funeral_brand.website | WF1 or WF2 | Source registry if given, else discovered |
| funeral_brand.abn | WF2 | From ABR |
| funeral_brand.description | WF3 | Meta tags |
| funeral_brand.business_* | WF1/WF2 | Preferring ABR values where available |
| funeral_brand.enrichment_status | WF3/WF4 | State machine: pending → partial → complete, failed on error |
| funeral_brand.last_enriched_at | WF3 | Used by WF4 for staleness check |
| funeral_brand.listing_tier | compute_tiers.py | After WF3/WF4 |
| funeral_brand.source_key, source_url | WF1 | Immutable once set |
| funeral_brand.verified, hidden | Never written by pipeline | Admin-only |
| funeral_brand.background_colour, foreground_colour, modal_description, funeral_home_id | Never written by pipeline | Admin/branding concern |
| package.* | WF3 (Claude Haiku) | source_url = 'ai_extraction', confidence 0.7 |
| package_inclusion.* | WF3 (Claude Haiku) | inclusion_type_title pulled from a 16-item vocabulary |
| location.* | WF1 | lat/lng only when source provides; google_place_key/rating require Places API (not yet wired) |

The admin review flow (out of pipeline scope)

A provider stays hidden = 1 until an admin reviews it. The intended flow (not yet built — listed under "What's left to do" in the memory) is:

  1. Admin UI lists newly enriched brands, sorted by tier.
  2. Admin sets hidden = 0 to publish. They can also set verified = 1 if the provider has signed on as a partner — this protects them from future pipeline updates.

Running manually vs. via n8n

Everything n8n does can be reproduced with shell commands. The crawlers/run_overnight.sh script is effectively a single-pass equivalent of Workflows 1–3 run back-to-back, useful for local testing or if n8n isn't available.

The n8n workflows are the production scheduler — they batch smaller chunks, run them at sensible hours (keeping server load and external API rate limits in mind), and handle the Claude Haiku HTTP calls natively (the Python scripts don't do AI extraction; they only prepare the text for n8n to send).

See README.md in this folder for setup.