AI Event Ingestion Pipeline
The single most time-consuming manual task today: finding and building events
from partner venue pages. This is the highest-ROI place to prove out AI
automation. Design below is independent of the legacy stack — vee.py is
kept as a reference for what data sources exist and what's hard (JS-heavy
calendars needing browser automation), not as a template to port.
Event sources, per venue
A venue can have zero or more configured sources:
- Own website calendar (venues.calendar_url): scraped
- Manual: curator entry via the admin portal
- User-submitted: the public "Submit an Event" form (see homepage design in nextjs-frontend.md). Lands as source='user_submitted', status='draft', with the same review-before-publish treatment as scraped events. It is not published directly just because a human typed it; public submissions are exactly the kind of input worth a spam/quality check.
Facebook is not a planned ongoing source — see below.
Grouped-venue dedup: if a venue has venue_group_id set (same
real-world place listed on multiple sites — confirmed for 51% of
6thStreet's directory, see data-audit.md), the scheduler scrapes that
venue's calendar once per group, not once per site. The resulting
ingested_events are then offered for review independently to each linked
venue's own curator — one site's editor can approve an event while
another's rejects it, since publishing is still per-site — but the scrape
and LLM-extraction cost is paid once, not duplicated across every site the
venue happens to appear on.
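A minimal sketch of the once-per-group scheduling described above. The `Venue` and `ScrapeJob` shapes and the `plan_scrapes` helper are hypothetical names for illustration; it also assumes that any member's calendar_url is an acceptable scrape target for the whole group, and that every linked venue receives the results for its own curator's review.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Venue:
    id: int
    site: str
    calendar_url: Optional[str] = None
    venue_group_id: Optional[int] = None  # set when one real place is listed on several sites


@dataclass
class ScrapeJob:
    calendar_url: str     # scraped (and LLM-extracted) exactly once
    venue_ids: list[int]  # every linked venue whose curator is offered the results


def plan_scrapes(venues: list[Venue]) -> list[ScrapeJob]:
    """Collapse venues into one scrape job per group (or per ungrouped venue)."""
    groups: dict[tuple, list[Venue]] = defaultdict(list)
    for v in venues:
        # Ungrouped venues get a unique key, so each forms its own one-member group.
        key = ("group", v.venue_group_id) if v.venue_group_id is not None else ("venue", v.id)
        groups[key].append(v)

    jobs = []
    for members in groups.values():
        # Assumption: the first configured calendar_url in the group is the one to scrape.
        url = next((m.calendar_url for m in members if m.calendar_url), None)
        if url is None:
            continue  # no website source configured anywhere in this group
        jobs.append(ScrapeJob(url, [m.id for m in members]))
    return jobs
```

Each job's venue_ids list is what lets the review step fan out per site while the scrape and extraction cost is paid once.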
Correction (updated): an earlier draft described the current legacy Facebook Events importer as pulling "structured data via the Facebook Graph API" and treated it as near-auto-publish-trustworthy. Wrong on two counts. First, I hadn't verified the legacy mechanism (still unconfirmed: its fbevents/v1 REST namespace and page source reveal nothing about the actual plugin). Second, and more importantly for the new build: Graph API access for pulling arbitrary public event data is not something a new integration can rely on at all. Meta locked down broad public Events API access starting around 2018-2020, and this isn't a "maybe it still works" situation for anything built from scratch today.
Practical consequence: the new ingestion pipeline does not plan a Facebook source. Whatever Facebook-sourced events exist on the legacy sites migrate once, as historical data, during each site's ETL; they are not a live, ongoing ingestion path in the new system. If a venue only posts events to its Facebook page and nowhere else, that venue falls back to manual entry (or, if truly necessary later, scraping the public Facebook page directly like any other website source, with the caveats that carries under Facebook's terms of service; not attempted by default).
Pipeline flow
Orchestrated as a Cloudflare Workflow (multi-step, durable, retryable — no separate always-on host needed):
scheduled trigger (Cron Trigger, per venue, e.g. nightly) → enqueues a
Workflow run, concurrency managed via Queues
        │
        ▼
Browser Run: /markdown quick action on calendar_url
(clean pre-rendered text, not raw HTML; better LLM input than a
plain HTTP GET would give even when JS isn't the issue)
        │
   ┌────┴─────┐
   │ succeeds,│   JS-heavy SPA calendar widgets needing real interaction,
   │ content  │   the exact failure mode vee.py already hit: "may require
   │ looks    │   browser automation for month navigation" (and what
   │ complete │   hotel_vegas_scraper's 8+ Emo's probe scripts were for:
   └────┬─────┘   clicking through slideout panels, not just rendering JS)
        │              │
        │              ▼
        │         Browser Run + Playwright/Puppeteer: scripted interaction
        │         (click month-nav, expand panels) then extract
        │              │
        └──────┬───────┘
               ▼
LLM structured extraction
(title, start/end datetime, description, price, ticket_url)
- one general-purpose extraction prompt/schema, not a
  hand-written CSS-selector scraper per venue. This is the
  part that actually scales past a handful of venues.
               │
               ▼
validation pass (dates parse + are in the future,
required fields present, confidence score)
               │
               ▼
dedup/match against existing `events` for this venue
(title similarity + date proximity)
               │
      ┌────────┴────────┬──────────────────┐
      ▼                 ▼                  ▼
new                     matches_existing   needs_review (low confidence /
(write to               (update existing   ambiguous match)
ingested_events,        event's fields if
review_status=pending)  changed)
               │
               ▼
curator reviews in admin portal → approve (creates/updates
real `events` row, source='ai_ingested') or reject
               │
               ▼
SEO generation step (see below) runs on newly-approved events
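The validation and dedup/match steps in the flow above could look roughly like the following sketch. Everything here is an assumption made for illustration: the `ExtractedEvent` and `ExistingEvent` shapes, the three thresholds, the use of `difflib.SequenceMatcher` for title similarity, and the choice to route validation failures to review rather than drop them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from difflib import SequenceMatcher
from typing import Optional

# Illustrative thresholds, not tuned values.
MIN_CONFIDENCE = 0.7
TITLE_SIMILARITY = 0.85
DATE_WINDOW = timedelta(hours=3)


@dataclass
class ExtractedEvent:
    title: str
    start: Optional[datetime]  # None when the extracted date failed to parse
    confidence: float          # hypothetical score reported by the extraction step


@dataclass
class ExistingEvent:
    id: int
    title: str
    start: datetime


def validate(ev: ExtractedEvent, now: datetime) -> list[str]:
    """Return a list of problems; an empty list means the event passes."""
    problems = []
    if not ev.title.strip():
        problems.append("missing title")
    if ev.start is None:
        problems.append("missing or unparseable start")
    elif ev.start <= now:
        problems.append("start is not in the future")
    if ev.confidence < MIN_CONFIDENCE:
        problems.append("low confidence")
    return problems


def title_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.casefold().strip(), b.casefold().strip()).ratio()


def classify(ev: ExtractedEvent, existing: list[ExistingEvent],
             now: datetime) -> tuple[str, Optional[int]]:
    """Route an event to 'new', 'matches_existing', or 'needs_review'."""
    if validate(ev, now):
        return "needs_review", None  # assumption: failures go to a human, not the bin
    candidates = [
        e for e in existing
        if abs(e.start - ev.start) <= DATE_WINDOW
        and title_similarity(e.title, ev.title) >= TITLE_SIMILARITY
    ]
    if len(candidates) == 1:
        return "matches_existing", candidates[0].id
    if candidates:
        return "needs_review", None  # more than one plausible match is ambiguous
    return "new", None
```

The datetimes are assumed to be normalized to one timezone before comparison; mixing naive and aware values would raise in the subtraction.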
SEO generation
Runs once per new/updated event, post, or venue: an LLM drafts
seo_title / seo_description (and, for posts, suggests category tags) from
the content. Always written as a suggestion a curator can edit before
publish; it never silently overwrites human-authored content.
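One way to enforce the "never overwrite human-authored content" rule is to track authorship per field and let drafts fill only the fields no human has touched. This is a sketch under that assumption; the `SeoFields` shape, its `*_human_authored` flags, and `apply_suggestion` are hypothetical names, not an existing schema.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SeoFields:
    seo_title: Optional[str] = None
    seo_description: Optional[str] = None
    # Hypothetical flags: set once a curator has typed or edited the value.
    title_human_authored: bool = False
    description_human_authored: bool = False


def apply_suggestion(current: SeoFields, suggested_title: str,
                     suggested_description: str) -> SeoFields:
    """Return a copy with LLM drafts filled in only where no human value exists."""
    updates = {}
    if not current.title_human_authored:
        updates["seo_title"] = suggested_title
    if not current.description_human_authored:
        updates["seo_description"] = suggested_description
    return replace(current, **updates)
```

Returning a new object instead of mutating keeps the previous values available, which makes it easy to show the curator a before/after diff.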
Trust model — human-in-the-loop, graduating over time
Start conservative: every scraped/LLM-extracted event requires curator approval. Track per-venue-source approval history; once a source has a long enough clean streak (e.g. 20 consecutive approvals with no edits), offer the curator a per-venue "auto-publish" toggle. No source gets a shorter trust runway by default — with Facebook off the table as a live source, every ongoing source in the new pipeline is a scrape, and all of them earn trust the same way. This directly avoids the risk flagged earlier: bad extractions publishing straight to a live, SEO-indexed site.
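The clean-streak rule can be sketched as a small per-source counter. The `SourceTrust` class is a hypothetical shape; the threshold of 20 is the example figure from the paragraph above, and resetting the streak on any rejection or edited approval is an assumption drawn from the word "consecutive".

```python
from dataclasses import dataclass

AUTO_PUBLISH_STREAK = 20  # example threshold from the trust model above


@dataclass
class SourceTrust:
    clean_streak: int = 0
    auto_publish_enabled: bool = False  # flipped only by the curator, never by this code

    def record_review(self, approved: bool, edited: bool) -> None:
        """Count an approval with no edits; anything else resets the streak."""
        if approved and not edited:
            self.clean_streak += 1
        else:
            self.clean_streak = 0

    @property
    def can_offer_auto_publish(self) -> bool:
        """True once the source has earned the toggle being offered."""
        return self.clean_streak >= AUTO_PUBLISH_STREAK
```

Keeping `can_offer_auto_publish` separate from `auto_publish_enabled` reflects that the streak only unlocks the offer; the curator still makes the decision.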
Real-world validation: hotel_vegas_scraper
A working, hand-built prior-art project found on the build machine
(~/hotel_vegas_scraper/) confirms both the problem and two techniques worth
keeping. It's per-venue Playwright automation for exactly 3 HeyAustin venues:
| Venue | Events scraped | Notes |
|---|---|---|
| Hotel Vegas | 33 | ai1ec calendar plugin, CSS-selector based |
| Emo's | 51 | needed 8+ iterative "probe" scripts for JS slideout panels/lineup widgets |
| Friends Bar | 147 | separate custom scraper |
No LLM involved — it's deterministic scraping (Playwright + BeautifulSoup) with two techniques worth carrying into the real pipeline:
- Template-based SEO text as the cheap default: seo_title built as "{event title} | {venue} Austin {year} | Hey Austin", meta_description as "{title} live at {venue} in Austin on {date}." No LLM call needed for the common case; reserve LLM generation for richer copy (blog posts, venue descriptions) where simple templating can't do it.
- Web-research fallback for thin content: when a venue's own event page text is under ~50 words, it falls back to a DuckDuckGo search to enrich the description rather than publishing something threadbare. Worth adopting as a fallback step in the validation stage of the real pipeline.
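Both techniques are small enough to sketch directly. The two templates are the ones quoted above; the function names, the "Month D, YYYY" date format, and the whitespace-based word count are assumptions, since the source specifies none of them.

```python
from datetime import date

THIN_CONTENT_WORDS = 50  # the ~50-word cutoff used by the prior-art scraper


def seo_title(event_title: str, venue: str, year: int) -> str:
    """Template-based title; no LLM call involved."""
    return f"{event_title} | {venue} Austin {year} | Hey Austin"


def meta_description(event_title: str, venue: str, when: date) -> str:
    """Template-based description with an assumed 'Month D, YYYY' date format."""
    pretty = f"{when.strftime('%B')} {when.day}, {when.year}"
    return f"{event_title} live at {venue} in Austin on {pretty}."


def needs_enrichment(page_text: str) -> bool:
    """True when the venue's own event text is too thin to publish as-is."""
    return len(page_text.split()) < THIN_CONTENT_WORDS
```

A caller would run `needs_enrichment` on the venue's page text first and trigger the web-research fallback only when it returns True.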
The cautionary half: this is exactly the per-venue hand-crafted scraper pattern the LLM-extraction design above is meant to avoid. Three venues took dozens of iteration files and real engineering time (Emo's especially). There are 700+ venues across the 4 sites — this approach doesn't scale, which is the whole reason for the generic-extraction design rather than a scraper per venue.
Cost/ops controls
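- Only re-run extraction for a venue if the calendar page's content hash changed since the last run (a cheap check, e.g. an HTTP HEAD for ETag/Last-Modified where the server provides them, or a hash of the fetched page, before spending an LLM call)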
- Use a small/cheap model for extraction by default; escalate to a stronger model only when the cheap pass returns low confidence or fails validation
- Ingestion runs and their outcomes are logged (ingestion_runs table) so cost and failure rate per venue are visible, not opaque
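The first two controls can be sketched as a change gate plus a two-tier extraction call. This is an illustration only: the `cheap`, `strong`, and `is_valid` callables stand in for whatever model clients and validation the pipeline ends up using, the confidence threshold is arbitrary, and whitespace normalization before hashing is an assumption meant to ignore trivial reflows.

```python
import hashlib
from typing import Callable, Optional

MIN_CONFIDENCE = 0.7  # illustrative threshold

Extractor = Callable[[str], dict]


def content_hash(page_text: str) -> str:
    """SHA-256 of the page text with runs of whitespace collapsed."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def should_extract(page_text: str, last_hash: Optional[str]) -> bool:
    """Skip the LLM call when the page is unchanged since the last run."""
    return content_hash(page_text) != last_hash


def extract_with_escalation(page_text: str, cheap: Extractor, strong: Extractor,
                            is_valid: Callable[[dict], bool]) -> tuple[dict, str]:
    """Try the cheap model first; escalate on low confidence or failed validation."""
    result = cheap(page_text)
    if result.get("confidence", 0.0) >= MIN_CONFIDENCE and is_valid(result):
        return result, "cheap"
    return strong(page_text), "strong"
```

The second element of the returned tuple records which tier produced the result, which is the kind of per-run detail worth writing to the run log so escalation rates per venue stay visible.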