AI Event Ingestion Pipeline

The single most time-consuming manual task today is finding and building events from partner venue pages, which makes it the highest-ROI place to prove out AI automation. The design below is independent of the legacy stack: vee.py is kept as a reference for what data sources exist and what's hard (JS-heavy calendars needing browser automation), not as a template to port.

Event sources, per venue

A venue can have zero or more configured sources:

  1. Own website calendar (venues.calendar_url) — scraped
  2. Manual — curator entry via the admin portal
  3. User-submitted — the public "Submit an Event" form (see homepage design in nextjs-frontend.md). Lands as source='user_submitted', status='draft' — same review-before-publish treatment as scraped events, not published directly just because a human typed it; public submissions are exactly the kind of input worth a spam/quality check.

Facebook is not a planned ongoing source — see below.

Grouped-venue dedup: if a venue has venue_group_id set (same real-world place listed on multiple sites — confirmed for 51% of 6thStreet's directory, see data-audit.md), the scheduler scrapes that venue's calendar once per group, not once per site. The resulting ingested_events are then offered for review independently to each linked venue's own curator — one site's editor can approve an event while another's rejects it, since publishing is still per-site — but the scrape and LLM-extraction cost is paid once, not duplicated across every site the venue happens to appear on.
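A minimal sketch of the once-per-group planning described above, assuming venues are loaded as simple records. The `Venue` class, its `site` field, and the `plan_scrapes` helper are hypothetical names for illustration; only `calendar_url` and `venue_group_id` come from the design. It also assumes that any member's `calendar_url` is representative of the whole group.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Venue:
    id: int
    site: str                       # hypothetical: which site lists this venue
    calendar_url: Optional[str]
    venue_group_id: Optional[int]   # None means the venue is not grouped


def plan_scrapes(venues: list[Venue]) -> list[tuple[str, list[int]]]:
    """Return (calendar_url, venue_ids) pairs: one scrape per group.

    Results of each scrape are offered for review to every venue id in
    the pair, so the scrape and extraction cost is paid once per group.
    """
    buckets: dict[tuple[str, int], list[Venue]] = defaultdict(list)
    for v in venues:
        if v.venue_group_id is not None:
            key = ("group", v.venue_group_id)
        else:
            key = ("venue", v.id)
        buckets[key].append(v)

    plan = []
    for members in buckets.values():
        # Assumption: the first configured URL in a group stands in for all.
        url = next((m.calendar_url for m in members if m.calendar_url), None)
        if url is None:
            continue  # no website calendar configured, nothing to scrape
        plan.append((url, [m.id for m in members]))
    return plan
```

Each returned pair maps to one Workflow run; the fan-out to per-venue review happens after extraction, so approval stays independent per site.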

Correction (updated): an earlier draft described the current legacy Facebook Events importer as pulling "structured data via the Facebook Graph API" and treated it as near-auto-publish-trustworthy. Wrong on two counts. First, I hadn't verified the legacy mechanism (still unconfirmed — its fbevents/v1 REST namespace and page source reveal nothing about the actual plugin). Second, and more importantly for the new build: Graph API access for pulling arbitrary public event data is not something a new integration can rely on at all — Meta locked down broad public Events API access starting around 2018-2020, and this isn't a "maybe it still works" situation for anything built from scratch today.

Practical consequence: the new ingestion pipeline does not plan a Facebook source. Whatever Facebook-sourced events exist on the legacy sites migrate once, as historical data, during each site's ETL — they are not a live, ongoing ingestion path in the new system. If a venue only posts events to its Facebook page and nowhere else, that venue falls back to manual entry (or, if truly necessary later, scraping the public Facebook page directly like any other website source — with the caveats that carries under Facebook's terms of service, not attempted by default).

Pipeline flow

Orchestrated as a Cloudflare Workflow (multi-step, durable, retryable — no separate always-on host needed):

scheduled trigger (Cron Trigger, per venue, e.g. nightly) → enqueues a
Workflow run, concurrency managed via Queues
        │
        ▼
  Browser Run: /markdown quick action on calendar_url
  (clean pre-rendered text, not raw HTML — better LLM input than a
  plain HTTP GET would give even when JS isn't the issue)
        │
   ┌────┴─────┐
   │ succeeds,│ JS-heavy SPA calendar widgets needing real interaction —
   │ content  │ the exact failure mode vee.py already hit: "may require
   │ looks    │ browser automation for month navigation" (and what
   │ complete │ hotel_vegas_scraper's 8+ Emo's probe scripts were for:
   └────┬─────┘ clicking through slideout panels, not just rendering JS)
        │              │
        │              ▼
        │       Browser Run + Playwright/Puppeteer: scripted interaction
        │       (click month-nav, expand panels) then extract
        │              │
        └──────┬───────┘
               ▼
     LLM structured extraction
     (title, start/end datetime, description, price, ticket_url)
     — one general-purpose extraction prompt/schema, not a
     hand-written CSS-selector scraper per venue. This is the
     part that actually scales past a handful of venues.
               │
               ▼
     validation pass (dates parse + are in the future,
     required fields present, confidence score)
               │
               ▼
     dedup/match against existing `events` for this venue
     (title similarity + date proximity)
               │
      ┌────────┴──────────────┬───────────────────────┐
      ▼                       ▼                       ▼
     new              matches_existing          needs_review
  (write to               (update existing      (low confidence /
  ingested_events,        event's fields if     ambiguous match)
  review_status=pending)  changed)
               │
               ▼
     curator reviews in admin portal → approve (creates/updates
     real `events` row, source='ai_ingested') or reject
               │
               ▼
     SEO generation step (see below) runs on newly-approved events
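The validation and dedup/match steps in the flow above could be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the `ExtractedEvent` shape mirrors the fields listed in the diagram, but the `confidence` range, the thresholds, the three-hour window, and the use of `difflib.SequenceMatcher` for title similarity are all placeholder choices. Datetimes are assumed to be naive, venue-local ISO 8601 strings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class ExtractedEvent:
    title: str
    start: str                        # ISO 8601 string as returned by the LLM
    end: Optional[str] = None
    description: Optional[str] = None
    price: Optional[str] = None
    ticket_url: Optional[str] = None
    confidence: float = 0.0           # assumed to be in the range 0..1


def validate(ev: ExtractedEvent, now: datetime) -> list[str]:
    """Return a list of problems; an empty list means the event passed."""
    problems = []
    if not ev.title.strip():
        problems.append("missing title")
    try:
        start = datetime.fromisoformat(ev.start)
    except ValueError:
        problems.append("start does not parse")
    else:
        if start <= now:
            problems.append("start is not in the future")
    return problems


def classify(
    ev: ExtractedEvent,
    existing: list[tuple[str, datetime]],   # (title, start) of this venue's events
    min_confidence: float = 0.7,            # hypothetical threshold
    match_threshold: float = 0.85,          # hypothetical threshold
    ambiguous_threshold: float = 0.6,       # hypothetical threshold
    window: timedelta = timedelta(hours=3),
) -> str:
    """Route an already-validated event to one of the three branches."""
    if ev.confidence < min_confidence:
        return "needs_review"
    start = datetime.fromisoformat(ev.start)
    # Title similarity is only scored against events close in time.
    scores = [
        SequenceMatcher(None, ev.title.lower(), title.lower()).ratio()
        for title, when in existing
        if abs(when - start) <= window
    ]
    best = max(scores, default=0.0)
    if best >= match_threshold:
        return "matches_existing"
    if best >= ambiguous_threshold:
        return "needs_review"
    return "new"
```

Events that fail `validate` would not reach `classify`; whether they are dropped or surfaced to the curator as `needs_review` is a design choice this sketch leaves open.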

SEO generation

Runs once per new/updated event, post, or venue: an LLM drafts seo_title / seo_description (and, for posts, suggests category tags) from the content. The output is always written as a suggestion a curator can edit before publish; it never silently overwrites human-authored content.
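One way to enforce the "never overwrite human-authored content" rule is to track which SEO fields a curator has touched and skip those when merging an LLM suggestion. The sketch below assumes a hypothetical `human_edited_fields` list stored alongside the record; that field and the `merge_seo_suggestion` helper are illustrative, not part of the existing schema.

```python
SEO_FIELDS = ("seo_title", "seo_description")


def merge_seo_suggestion(record: dict, suggestion: dict) -> dict:
    """Return a copy of record with LLM suggestions applied.

    Fields listed in record["human_edited_fields"] (hypothetical) are
    left untouched, so a curator's wording always wins.
    """
    human_edited = set(record.get("human_edited_fields", ()))
    merged = dict(record)
    for field in SEO_FIELDS:
        if field in suggestion and field not in human_edited:
            merged[field] = suggestion[field]
    return merged
```

The merged values would still be presented as a draft for the curator to accept or edit, not published directly.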

Trust model — human-in-the-loop, graduating over time

Start conservative: every scraped/LLM-extracted event requires curator approval. Track per-venue-source approval history; once a source has a long enough clean streak (e.g. 20 consecutive approvals with no edits), offer the curator a per-venue "auto-publish" toggle. No source gets a shorter trust runway by default — with Facebook off the table as a live source, every ongoing source in the new pipeline is a scrape, and all of them earn trust the same way. This directly avoids the risk flagged earlier: bad extractions publishing straight to a live, SEO-indexed site.
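The clean-streak rule could be computed from a source's review history along these lines. This is a sketch: the `Review` record and function names are hypothetical, and the default of 20 is the example figure from the paragraph above, not a settled value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    approved: bool
    edited: bool   # curator changed at least one field before approving


def clean_streak(history: list[Review]) -> int:
    """Count the most recent consecutive approvals that needed no edits.

    history is ordered oldest first; a rejection or an edited approval
    resets the streak.
    """
    streak = 0
    for review in reversed(history):
        if not review.approved or review.edited:
            break
        streak += 1
    return streak


def should_offer_auto_publish(history: list[Review], required: int = 20) -> bool:
    """True once the source has earned the per-venue auto-publish offer."""
    return clean_streak(history) >= required
```

Note that this only decides when to offer the toggle; turning it on remains the curator's choice.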

Real-world validation: hotel_vegas_scraper

A working, hand-built prior-art project found on the build machine (~/hotel_vegas_scraper/) confirms both the problem and two techniques worth keeping. It's per-venue Playwright automation for exactly 3 HeyAustin venues:

| Venue | Events scraped | Notes |
|---|---|---|
| Hotel Vegas | 33 | ai1ec calendar plugin, CSS-selector based |
| Emo's | 51 | needed 8+ iterative "probe" scripts for JS slideout panels/lineup widgets |
| Friends Bar | 147 | separate custom scraper |

No LLM involved — it's deterministic scraping (Playwright + BeautifulSoup) with two techniques worth carrying into the real pipeline:

The cautionary half: this is exactly the per-venue hand-crafted scraper pattern the LLM-extraction design above is meant to avoid. Three venues took dozens of iteration files and real engineering time (Emo's especially). There are 700+ venues across the 4 sites — this approach doesn't scale, which is the whole reason for the generic-extraction design rather than a scraper per venue.

Cost/ops controls