AI Event Ingestion Pipeline

The single most time-consuming manual task today is finding and building events from partner venue pages, which makes it the highest-ROI place to prove out AI automation. The design below is independent of the legacy stack: vee.py is kept as a reference for what data sources exist and what's hard (JS-heavy calendars needing browser automation), not as a template to port.

Event sources, per venue

A venue can have zero or more configured sources:

Own website calendar (venues.calendar_url) — scraped
Manual — curator entry via the admin portal
User-submitted — the public "Submit an Event" form (see homepage design in nextjs-frontend.md). Lands as source='user_submitted', status='draft' — same review-before-publish treatment as scraped events, not published directly just because a human typed it; public submissions are exactly the kind of input worth a spam/quality check.

Facebook is not a planned ongoing source — see below.

Grouped-venue dedup: if a venue has venue_group_id set (same real-world place listed on multiple sites — confirmed for 51% of 6thStreet's directory, see data-audit.md), the scheduler scrapes that venue's calendar once per group, not once per site. The resulting ingested_events are then offered for review independently to each linked venue's own curator — one site's editor can approve an event while another's rejects it, since publishing is still per-site — but the scrape and LLM-extraction cost is paid once, not duplicated across every site the venue happens to appear on.
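
A minimal sketch of the scrape-once-per-group planning described above. The `Venue` and `ScrapeJob` shapes and the `plan_scrape_jobs` helper are hypothetical names for illustration; only `calendar_url` and `venue_group_id` come from this design. It assumes every venue in a group shares one calendar, so the first configured URL in the group is used.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Venue:
    id: int
    site: str
    calendar_url: Optional[str]
    venue_group_id: Optional[int] = None  # set when one place is listed on several sites


@dataclass
class ScrapeJob:
    calendar_url: str
    venue_ids: list[int]  # every linked venue whose curator gets the results for review


def plan_scrape_jobs(venues: list[Venue]) -> list[ScrapeJob]:
    """Return one scrape job per venue group, plus one per ungrouped venue."""
    grouped: dict[tuple[str, int], list[Venue]] = defaultdict(list)
    for venue in venues:
        # Grouped venues share a key; ungrouped venues are keyed by their own id.
        if venue.venue_group_id is not None:
            key = ("group", venue.venue_group_id)
        else:
            key = ("venue", venue.id)
        grouped[key].append(venue)

    jobs = []
    for members in grouped.values():
        # Assumption: the first configured calendar_url in the group is the one to scrape.
        url = next((v.calendar_url for v in members if v.calendar_url), None)
        if url is None:
            continue  # no website calendar configured, nothing to scrape
        jobs.append(ScrapeJob(url, [v.id for v in members]))
    return jobs


venues = [
    Venue(1, "heyaustin", "https://example.com/cal", venue_group_id=7),
    Venue(2, "6thstreet", "https://example.com/cal", venue_group_id=7),
    Venue(3, "heyaustin", "https://example.org/events"),
    Venue(4, "heyaustin", None),
]
print([job.venue_ids for job in plan_scrape_jobs(venues)])  # [[1, 2], [3]]
```

Each job carries all linked venue ids, so the extraction result can be fanned out to every site's review queue while the scrape and LLM call happen once.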

Correction (updated): an earlier draft described the current legacy Facebook Events importer as pulling "structured data via the Facebook Graph API" and treated it as near-auto-publish-trustworthy. Wrong on two counts. First, I hadn't verified the legacy mechanism (still unconfirmed — its fbevents/v1 REST namespace and page source reveal nothing about the actual plugin). Second, and more importantly for the new build: Graph API access for pulling arbitrary public event data is not something a new integration can rely on at all — Meta locked down broad public Events API access starting around 2018-2020, and this isn't a "maybe it still works" situation for anything built from scratch today.

Practical consequence: the new ingestion pipeline does not plan a Facebook source. Whatever Facebook-sourced events exist on the legacy sites migrate once, as historical data, during each site's ETL — they are not a live, ongoing ingestion path in the new system. If a venue only posts events to its Facebook page and nowhere else, that venue falls back to manual entry (or, if truly necessary later, scraping the public Facebook page directly like any other website source — with the caveats that carries under Facebook's terms of service, not attempted by default).

Pipeline flow

Orchestrated as a Cloudflare Workflow (multi-step, durable, retryable — no separate always-on host needed):

scheduled trigger (Cron Trigger, per venue, e.g. nightly) → enqueues a
Workflow run, concurrency managed via Queues
        │
        ▼
  Browser Run: /markdown quick action on calendar_url
  (clean pre-rendered text, not raw HTML — better LLM input than a
  plain HTTP GET would give even when JS isn't the issue)
        │
   ┌────┴─────┐
   │ succeeds,│ JS-heavy SPA calendar widgets needing real interaction —
   │ content  │ the exact failure mode vee.py already hit: "may require
   │ looks    │ browser automation for month navigation" (and what
   │ complete │ hotel_vegas_scraper's 8+ Emo's probe scripts were for:
   └────┬─────┘ clicking through slideout panels, not just rendering JS)
        │              │
        │              ▼
        │       Browser Run + Playwright/Puppeteer: scripted interaction
        │       (click month-nav, expand panels) then extract
        │              │
        └──────┬───────┘
               ▼
     LLM structured extraction
     (title, start/end datetime, description, price, ticket_url)
     — one general-purpose extraction prompt/schema, not a
     hand-written CSS-selector scraper per venue. This is the
     part that actually scales past a handful of venues.
               │
               ▼
     validation pass (dates parse + are in the future,
     required fields present, confidence score)
               │
               ▼
     dedup/match against existing `events` for this venue
     (title similarity + date proximity)
               │
               ├── new: write to ingested_events, review_status=pending
               ├── matches_existing: update the existing event's fields if changed
               ├── needs_review: low confidence / ambiguous match
               │
               ▼
     curator reviews in admin portal → approve (creates/updates
     real `events` row, source='ai_ingested') or reject
               │
               ▼
     SEO generation step (see below) runs on newly-approved events
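
The validation and dedup/match steps in the flow above could look roughly like this. It is a sketch under stated assumptions, not the pipeline's actual code: the thresholds, the `ExtractedEvent` / `ExistingEvent` shapes, and the `failed_validation` outcome are hypothetical (the design does not say what happens to events that fail validation), and title similarity uses the standard library's `difflib.SequenceMatcher` as a stand-in for whatever matcher is eventually chosen. Datetimes are assumed to be naive and in the venue's local time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class ExtractedEvent:
    title: str
    start: Optional[datetime]  # None when the LLM output did not parse as a date
    end: Optional[datetime] = None
    description: str = ""
    price: Optional[str] = None
    ticket_url: Optional[str] = None
    confidence: float = 0.0  # assumed to be supplied by the extraction step


@dataclass
class ExistingEvent:
    id: int
    title: str
    start: datetime


# Assumed thresholds, to be tuned against real curator decisions.
MIN_CONFIDENCE = 0.7
STRONG_MATCH = 0.85
WEAK_MATCH = 0.6
MAX_START_GAP = timedelta(hours=3)


def validate(event: ExtractedEvent, now: datetime) -> list[str]:
    """Return a list of problems; an empty list means the event passed."""
    problems = []
    if not event.title.strip():
        problems.append("missing title")
    if event.start is None:
        problems.append("missing or unparseable start")
    elif event.start <= now:
        problems.append("start is not in the future")
    if event.start and event.end and event.end < event.start:
        problems.append("end before start")
    return problems


def title_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def classify(
    event: ExtractedEvent, existing: list[ExistingEvent], now: datetime
) -> tuple[str, Optional[int]]:
    """Return (outcome, best_matching_event_id) for one extracted event."""
    if validate(event, now):
        return "failed_validation", None
    best_id, best_score = None, 0.0
    for candidate in existing:
        # Date proximity gate: only compare titles for events starting close together.
        if abs(candidate.start - event.start) > MAX_START_GAP:
            continue
        score = title_similarity(event.title, candidate.title)
        if score > best_score:
            best_id, best_score = candidate.id, score
    if event.confidence < MIN_CONFIDENCE:
        return "needs_review", best_id  # low confidence
    if best_score >= STRONG_MATCH:
        return "matches_existing", best_id
    if best_score >= WEAK_MATCH:
        return "needs_review", best_id  # ambiguous match
    return "new", None
```

Keeping the outcome a plain string that mirrors the three branches makes it easy to store alongside the ingested_events row and to audit later.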

SEO generation

Runs once per new/updated event, post, or venue: an LLM drafts seo_title / seo_description (and, for posts, suggests category tags) from the content. The output is always stored as a suggestion a curator can edit before publishing; it never silently overwrites human-authored content.

Trust model — human-in-the-loop, graduating over time

Start conservative: every scraped/LLM-extracted event requires curator approval. Track per-venue-source approval history; once a source has a long enough clean streak (e.g. 20 consecutive approvals with no edits), offer the curator a per-venue "auto-publish" toggle. No source gets a shorter trust runway by default — with Facebook off the table as a live source, every ongoing source in the new pipeline is a scrape, and all of them earn trust the same way. This directly avoids the risk flagged earlier: bad extractions publishing straight to a live, SEO-indexed site.
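
One way to express the clean-streak rule, as a sketch. `ReviewOutcome`, `clean_streak` and `can_offer_auto_publish` are hypothetical names, the history is assumed to be ordered oldest-first for a single venue source, and 20 is the example figure from the text rather than a settled value.

```python
from dataclasses import dataclass

REQUIRED_STREAK = 20  # the "e.g. 20" from the text; a tunable assumption


@dataclass(frozen=True)
class ReviewOutcome:
    approved: bool
    edited: bool  # True if the curator changed any field before approving


def clean_streak(history: list[ReviewOutcome]) -> int:
    """Count the most recent consecutive approvals that needed no edits."""
    streak = 0
    for outcome in reversed(history):  # walk back from the newest review
        if outcome.approved and not outcome.edited:
            streak += 1
        else:
            break  # a rejection or an edited approval resets the streak
    return streak


def can_offer_auto_publish(
    history: list[ReviewOutcome], required: int = REQUIRED_STREAK
) -> bool:
    """True once the source has earned the per-venue auto-publish offer."""
    return clean_streak(history) >= required
```

The function only decides whether to offer the toggle; turning it on stays a curator decision, as described above.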

Real-world validation: `hotel_vegas_scraper`

A working, hand-built prior-art project found on the build machine (~/hotel_vegas_scraper/) confirms both the problem and two techniques worth keeping. It's per-venue Playwright automation for exactly 3 HeyAustin venues:

| Venue | Events scraped | Notes |
| --- | --- | --- |
| Hotel Vegas | 33 | ai1ec calendar plugin, CSS-selector based |
| Emo's | 51 | needed 8+ iterative "probe" scripts for JS slideout panels/lineup widgets |
| Friends Bar | 147 | separate custom scraper |

No LLM involved — it's deterministic scraping (Playwright + BeautifulSoup) with two techniques worth carrying into the real pipeline:

Template-based SEO text as the cheap default — seo_title built as "{event title} | {venue} Austin {year} | Hey Austin", meta_description as "{title} live at {venue} in Austin on {date}.". No LLM call needed for the common case; reserve LLM generation for richer copy (blog posts, venue descriptions) where simple templating can't do it.
Web-research fallback for thin content — when a venue's own event page text is under ~50 words, it falls back to a DuckDuckGo search to enrich the description rather than publishing something threadbare. Worth adopting as a fallback step in the validation stage of the real pipeline.
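
Both techniques are small enough to sketch directly. The template strings are the ones quoted above; the function names, the date format in the description, and counting words by whitespace splitting are assumptions for illustration, not what hotel_vegas_scraper necessarily does.

```python
from datetime import date

THIN_CONTENT_WORDS = 50  # the "~50 words" cutoff from the text


def build_seo_title(title: str, venue: str, year: int) -> str:
    """Template-based seo_title, no LLM call."""
    return f"{title} | {venue} Austin {year} | Hey Austin"


def build_meta_description(title: str, venue: str, on: date) -> str:
    """Template-based meta_description; the date format is an assumption."""
    return f"{title} live at {venue} in Austin on {on:%B} {on.day}, {on.year}."


def is_thin(description: str, min_words: int = THIN_CONTENT_WORDS) -> bool:
    """True when the page text is short enough to warrant the enrichment fallback."""
    return len(description.split()) < min_words


print(build_seo_title("Band X", "Hotel Vegas", 2026))
# Band X | Hotel Vegas Austin 2026 | Hey Austin
```

`is_thin` only flags the event; which search provider runs the enrichment is left open here.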

The cautionary half: this is exactly the per-venue hand-crafted scraper pattern the LLM-extraction design above is meant to avoid. Three venues took dozens of iteration files and real engineering time (Emo's especially). There are 700+ venues across the 4 sites — this approach doesn't scale, which is the whole reason for the generic-extraction design rather than a scraper per venue.

Cost/ops controls

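Only re-scrape a venue if the calendar page's content has changed since the last run: do a cheap check first (a HEAD request comparing ETag/Last-Modified where the server sends them, otherwise a hash of the fetched page text) before spending an LLM call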
Use a small/cheap model for extraction by default; escalate to a stronger model only when the cheap pass returns low confidence or fails validation
Ingestion runs and their outcomes are logged (ingestion_runs table) so cost and failure rate per venue are visible, not opaque
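
The first two controls can be sketched as plain functions. Everything here is illustrative: `content_hash`, `should_extract` and `extract_with_escalation` are hypothetical names, whitespace normalization before hashing is an assumption (it avoids re-extracting on pure formatting changes), and the model calls are passed in as callables so that no particular LLM client is implied.

```python
import hashlib
from typing import Any, Callable, Optional


def content_hash(page_text: str) -> str:
    """Stable hash of the page text, ignoring whitespace-only differences."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def should_extract(page_text: str, last_hash: Optional[str]) -> bool:
    """Skip the LLM call when the page is unchanged since the last run."""
    return content_hash(page_text) != last_hash


def extract_with_escalation(
    page_text: str,
    cheap: Callable[[str], Any],
    strong: Callable[[str], Any],
    is_acceptable: Callable[[Any], bool],
) -> tuple[Any, str]:
    """Try the cheap model first; escalate only if its result is not acceptable.

    Returns the result plus which tier produced it, so the tier can be logged
    with the run.
    """
    result = cheap(page_text)
    if is_acceptable(result):
        return result, "cheap"
    return strong(page_text), "strong"
```

Returning the tier alongside the result gives the ingestion_runs log what it needs to show how often each venue forces the more expensive model.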