# Provider Discovery & Enrichment Pipeline

## Architecture: Multi-Step Enrichment

The pipeline builds provider profiles progressively, never relying on competitor data.
Each step adds richer detail from more authoritative sources.

```
STEP 1: DISCOVER                STEP 2: FIND WEBSITE          STEP 3: ENRICH
─────────────────               ────────────────────          ──────────────
VIC Register ─────┐                                           ┌─ Fetch homepage
NFDA Directory ───┼─▶ Basic     Google Places API ──┐         │  Find /pricing page
Funerals AU ──────┘   Provider  ABN Lookup ─────────┼─▶ URL ──┤  Download PDFs
                      Record    Search engines ─────┘         │  AI extract packages
                                                              └─▶ Structured data

name                            website URL                   description
address                         Google rating                 packages[]
phone                           Google reviews                inclusions[]
email                           place_id                      pricing
state                           ABN (validated)
```

## Step 1: Discovery (DONE: all modules built and tested)

Sources:
- VIC Consumer Affairs Register (796 records, VIC only) → `crawl_vic_register.py`
- Funerals Australia AJAX API (997 records, national) → `crawl_funerals_australia.py`
- NFDA WPSL API (209 records, national) → `crawl_nfda.py`

Supporting modules and output:
- Orchestrator: `crawl_all.py`
- Deduplication: `dedup.py` (fuzzy name + postcode + ABN matching)
- Output: ~1,463 unique providers with basic contact info
- Stored in: `funeral_brand` and `location` tables in `database/providers.db`

## Step 2: Website Discovery (DONE: module built and tested)

Module: `discover_websites.py`

Test result: 50% success rate on the initial batch (DDG search + URL guessing).
Adding the Google Places API may raise the hit rate.

For each provider that lacks a website URL:

### 2a. Serper.dev: Google search API (PRIMARY)
- Input: "{business name} {suburb} {state}"
- Returns: Google organic search results as JSON (title, link, snippet)
- Cost: **2,500 free queries** (no credit card needed), then $1/1K
- Covers our entire 1,463 providers for $0
- Filters out directories/aggregators, validates first result
- Module: `discover_websites.py` with `search_serper()`

### 2b. DuckDuckGo lite (FALLBACK)
- Free, no API key, but aggressive rate limiting
- Used when the Serper key is not configured or the quota is exhausted
- Module: `discover_websites.py` with `search_ddg()`

### 2c. URL pattern guessing (SUPPLEMENTARY)
- Generates candidate domains from business name (e.g. smithfunerals.com.au)
- HTTP HEAD to check if live, then validate content
- Module: `discover_websites.py` with `guess_urls()`

### 2d. ABN Lookup: Australian Business Register (ENRICHMENT)
- Input: business name + state
- Returns: ABN, entity status, registered state/postcode
- Cost: **FREE** (government API, requires GUID registration)
- Validates that the business is active, gives the strongest dedup key
- Does NOT return website URLs
- Module: `lookup_abn.py`
- Register for GUID: https://abr.business.gov.au/Tools/WebServices

### 2e. Google Places API (OPTIONAL PREMIUM)
- Input: "{business name}, {suburb} {state}"
- Returns: website, rating, review count, place_id, formatted phone
- Cost: 1,000 free/month (Enterprise tier), then ~$25/1K
- Best data quality but most expensive
- Not yet implemented; add when budget allows

### 2f. URL validation
- Fetch discovered URL, verify it loads
- Check page title/content mentions the business name
- Reject generic directories (yellowpages, truelocal, etc.)
- Mark confidence level: confirmed / probable / unverified

## Step 3: Website Enrichment (DONE: module built and tested)

Module: `enrich_websites.py`
- Finds pricing pages via 20+ URL patterns + link following
- Extracts description from meta tags
- Extracts contact info (phone, email, address)
- Stores cleaned pricing page text for AI extraction
- Detects PDF links for PDF-based pricing extraction

For each provider with a confirmed website:

### 3a. Homepage crawl
- Fetch homepage HTML
- Extract: description/about text, contact details
- Look for links to pricing/services pages

### 3b. Pricing page discovery
Try common URL patterns:
`/pricing`, `/prices`, `/packages`, `/services`, `/our-services`,
`/funeral-costs`, `/funeral-packages`, `/service-options`,
`/price-list`, `/transparency`

Also:
- Parse sitemap.xml if available
- Follow links containing "pric", "packag", "cost", "service"
- Check for PDF links on pricing pages

### 3c. AI extraction (Claude Haiku)
- Send pricing page HTML to Haiku
- Extract: package names, funeral types, prices, inclusions
- Map to known inclusion types where possible
- Return confidence score

### 3d. PDF extraction (for InvoCare-type sites)
- Download compliance PDFs
- Extract text (pdftotext or similar)
- Send to Haiku for structured extraction
- ~25% of sites are PDF-only for pricing

## Listing Tiers

Providers are assigned a `listing_tier` based on data quality.
Computed automatically by `compute_tiers.py` after each enrichment run.

| Tier | Label | Criteria | Display |
|------|-------|----------|---------|
| `verified` | Full partner | `verified = true` (signed up) | Full branding, packages, arrangements |
| `priced` | Full pricing | 2+ packages with itemized inclusion prices | Package comparison, line-item detail |
| `estimated` | Some pricing | At least 1 package with a total price | Package prices shown, "Contact for details" on breakdowns |
| `listed` | Contact only | Name + location + phone, no pricing | "Contact for pricing" CTA, upgrade prompt |

Each tier below `verified` motivates the provider to sign up:
- `listed` → "Publish your pricing to attract more families"
- `estimated` → "Add detailed breakdowns to stand out"
- `priced` → "Sign up to enable online arrangements"

## Enrichment Status Flow

```
pending ──▶ website_found ──▶ partial ──▶ complete
   │              │              │
   └──▶ no_website_found      failed (retry later)
```

## N8N Workflow Design

### Workflow 1: Weekly Discovery
Cron → Run all source crawlers → Dedup into DB → Queue new providers

### Workflow 2: Daily Website Discovery
Cron → Fetch providers with no website → Google Places lookup
→ ABN lookup → Search fallback → Update DB

### Workflow 3: Daily Enrichment
Cron → Fetch providers with website but no packages
→ Crawl website → AI extract → Update DB

### Workflow 4: Monthly Re-check
Cron → Re-crawl enriched providers → Update pricing if changed

---

## Module Inventory

| Module | Purpose | N8N Workflow |
|--------|---------|-------------|
| `base.py` | Shared HTTP, DB, normalization utils | Used by all |
| `crawl_vic_register.py` | VIC government register (796 records) | Workflow 1 |
| `crawl_funerals_australia.py` | Funerals Australia API (997 records) | Workflow 1 |
| `crawl_nfda.py` | NFDA directory API (209 records) | Workflow 1 |
| `crawl_all.py` | Orchestrates all source crawlers | Workflow 1 |
| `dedup.py` | Cross-source dedup & merge engine | Workflow 1 |
| `discover_websites.py` | Find provider websites (Serper/DDG/guess) | Workflow 2 |
| `lookup_abn.py` | ABN validation via ABR API (free) | Workflow 2 |
| `enrich_websites.py` | Crawl provider sites, find pricing pages | Workflow 3 |
| `compute_tiers.py` | Compute listing_tier from data quality | After enrichment |
| `config.example.json` | API key template | n/a |

## API Keys Required

| Service | Key | Cost | Register |
|---------|-----|------|----------|
| Serper.dev | `serper_api_key` | 2,500 free, then $1/1K | https://serper.dev |
| ABR (ABN Lookup) | `abr_guid` | Free | https://abr.business.gov.au/Tools/WebServices |
| Anthropic (Haiku) | `anthropic_api_key` | ~$2/full run | https://console.anthropic.com |

## Quick Start

```bash
# 1. Configure API keys
cp config.example.json config.json
# Edit config.json with your keys

# 2. Reset database
cd ../database
sqlite3 providers.db < schema_sqlite.sql

# 3. Run full discovery pipeline
cd ../crawlers
python3 crawl_all.py           # Step 1: Discover from registries
python3 dedup.py               # Deduplicate across sources
python3 lookup_abn.py          # Step 2d: Get ABNs (free)
python3 discover_websites.py   # Steps 2a-2c: Find websites
python3 enrich_websites.py     # Step 3: Crawl for pricing
python3 compute_tiers.py       # Assign listing tiers

# Test mode (limited records)
python3 crawl_all.py --test
python3 discover_websites.py --limit=10 --state=VIC
python3 enrich_websites.py --limit=5
```
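
The criteria in the Listing Tiers table can be read as a first-match cascade from the strictest tier down. The sketch below is one way to express that cascade in Python; it is not the contents of `compute_tiers.py`. The `Package` and `Provider` dataclasses, their field names, and the `compute_tier` function are hypothetical stand-ins for whatever the real schema in `database/providers.db` uses.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Package:
    # Hypothetical shape: total price plus any itemized inclusion prices.
    total_price: Optional[float] = None
    inclusion_prices: List[float] = field(default_factory=list)


@dataclass
class Provider:
    # Hypothetical shape: only the fields the tier table refers to.
    name: str
    verified: bool = False
    packages: List[Package] = field(default_factory=list)


def compute_tier(provider: Provider) -> str:
    """Return the listing tier, checking the strictest criteria first."""
    if provider.verified:
        return "verified"
    # "priced": 2+ packages that carry itemized inclusion prices.
    itemized = [p for p in provider.packages if p.inclusion_prices]
    if len(itemized) >= 2:
        return "priced"
    # "estimated": at least one package with a total price.
    if any(p.total_price is not None for p in provider.packages):
        return "estimated"
    # Everything else is a contact-only listing.
    return "listed"


if __name__ == "__main__":
    demo = Provider("Example Funerals", packages=[Package(total_price=4500.0)])
    print(compute_tier(demo))
```

Under these assumptions, a provider with a single itemized package but no total price falls back to `listed`, because the table requires a total price for `estimated`. Whether the real module treats that case the same way is not stated in this document.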
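
For the URL pattern guessing described in step 2c, a candidate generator might look roughly like the following. This is an illustrative sketch, not the `guess_urls()` implementation: the function name `guess_candidate_urls`, the list of TLDs, and the set of stripped words are all assumptions. Each candidate would still need the HTTP HEAD check and the content validation from steps 2c and 2f before being stored.

```python
import re
from typing import List

# Assumed TLDs to try; the notes above only show a .com.au example.
CANDIDATE_TLDS = (".com.au", ".net.au", ".com")

# Assumed filler words to drop before building a domain stem.
STOP_WORDS = {"pty", "ltd", "limited", "the", "and"}


def guess_candidate_urls(business_name: str) -> List[str]:
    """Build candidate homepage URLs from a business name (hypothetical helper)."""
    # Keep lowercase alphanumeric runs; punctuation such as "&" acts as a
    # separator and is dropped.
    words = [
        w for w in re.findall(r"[a-z0-9]+", business_name.lower())
        if w not in STOP_WORDS
    ]
    if not words:
        return []
    # Try the words joined directly, then hyphenated when there are several.
    stems = ["".join(words)]
    if len(words) > 1:
        stems.append("-".join(words))
    return [f"https://www.{stem}{tld}" for stem in stems for tld in CANDIDATE_TLDS]


if __name__ == "__main__":
    for url in guess_candidate_urls("Smith Funerals Pty Ltd"):
        print(url)
```

Generating the joined stem first puts the most common naming pattern at the front of the list, so a caller that stops at the first live, validated URL would issue fewer requests.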