Data Audit — All 4 Sites

Live recon against production WordPress sites, 2026-07-05. Goal: know the real shape and scale of the data before building the migration ETL, rather than guessing.

Mobile apps — confirmed

HeyAustin, LakeTravis, and CrestedButte all have live mobile apps, running on the same Listar REST backend (listar/v1 + jwt-auth/v1). 6thStreet does not have an app and does not run this backend at all.

Content volume

| Site | Venues | Events (live, via API) | Blog Posts | Rentals | Listar backend |
|---|---|---|---|---|---|
| HeyAustin | 233 | 1,212 | 45 | 452 | yes |
| LakeTravis | 257 | 1,709 | 58 | 656 | yes |
| CrestedButte | 177 | 741 | 45 | 217 | yes |
| 6thStreet | 107 | not API-accessible (~6,087 URLs in sitemap, historical) | 14 | 543 | no |

Takeaways:

6thStreet: a genuinely different build, not just a smaller one

Migration consequence: HeyAustin/LakeTravis/CrestedButte event data migrates cleanly via REST API pulls. 6thStreet's event history can only be reached by scraping rendered HTML pages or getting direct database/export access from whoever hosts it — a heavier, separate migration path for that one site specifically. Worth deciding early whether 6thStreet's ~6,000 historical event pages are worth migrating at all, or whether only future-dated/recent ones matter (most are almost certainly expired, one-time past events with no ongoing value beyond incidental SEO).

Correction: Facebook Graph API is not usable for the new build

Earlier notes described the Facebook Events importer as pulling "structured data via the Facebook Graph API" and treated it as a trustworthy, near-auto-publish source. That is confirmed wrong and, more importantly, moot: Facebook Graph API access for pulling public event data is not an option for the new platform. Meta locked this down years ago (~2018-2020), and no new integration can be built to rely on it. The new ingestion pipeline plans zero Facebook dependency (see ai-ingestion-pipeline.md): whatever Facebook-sourced events exist on the legacy sites migrate once as historical data and are not an ongoing source. The legacy plugin's actual mechanism (scraping, a stale integration, or a paid proxy service) remains unconfirmed and is no longer relevant to the new design either way.

Category & region taxonomy audit

Pulled real taxonomy term lists from all 4 sites (job-categories for venues, job_listing_region for regions, event_type for events). Two very different pictures emerged.

Venue categories: clean, hierarchical, and portable as-is

Real hierarchy example (LakeTravis): 38 categories under parents like "Restaurants" (38 venues), "Boat Rentals" (37), "Event Venue" (22), "Marinas" (21) — sensible, well-maintained, clearly curated by hand over time. Same quality bar on HeyAustin, CrestedButte, and 6thStreet.

Structural finding: HeyAustin, CrestedButte, and 6thStreet share the same parent category IDs (3500 = dining-ish, 3501 = entertainment-ish, 3513 = services, 3527 = lodging, 3533 = shopping, 3542 = arts) — they were clearly built from one shared starting template. LakeTravis has entirely different parent IDs (3774-3805) with its own top-level structure shaped around its lake-recreation focus (Boat Service, Marine Services, Yacht Charters as top-level groups that don't exist elsewhere). CrestedButte adds its own ski/outdoor-specific top level (parent 7194: Mountain Biking, Backcountry Skiing, Downhill Skiing, Nordic Skiing, Snowmobiling).

Recommendation: port venue categories directly, but design the new schema's category set as a shared canonical vocabulary (Restaurants, Bars, Live Entertainment, Lodging, Shopping, Services, Arts) plus brand-specific extensions (LakeTravis's boating categories, CrestedButte's winter-sports categories) rather than 4 fully independent flat lists — 3 of 4 sites already prove this shared structure works, and it's what lets the AI search and cross-brand admin views reason about categories consistently.
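A minimal sketch of what "shared canonical vocabulary plus brand-specific extensions" could look like as data. The category names are taken from this audit; the variable names, the `categories_for` helper, and the exact choice of extension terms are illustrative assumptions, not the actual schema.

```python
# Illustrative only: category names come from this audit, but the structure
# and the helper name are assumptions, not the real schema.
SHARED_CATEGORIES = [
    "Restaurants", "Bars", "Live Entertainment", "Lodging",
    "Shopping", "Services", "Arts",
]

# Brand-specific top-level groups observed on the legacy sites.
BRAND_EXTENSIONS = {
    "LakeTravis": ["Boat Service", "Marine Services", "Yacht Charters"],
    "CrestedButte": [
        "Mountain Biking", "Backcountry Skiing", "Downhill Skiing",
        "Nordic Skiing", "Snowmobiling",
    ],
}


def categories_for(site: str) -> list[str]:
    """Return the shared vocabulary plus any extensions for one brand."""
    return SHARED_CATEGORIES + BRAND_EXTENSIONS.get(site, [])
```

With this shape, a brand with no extensions (HeyAustin, 6thStreet) simply gets the shared list, and cross-brand views can always rely on the shared terms being present.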

Regions: clean, flat, and reveal real geographic overlap

All region taxonomies are flat (no parent hierarchy): real neighborhoods/towns per brand's geography (HeyAustin: Dirty 6th Street, East Austin, Lake Travis, South Congress, etc.; LakeTravis: Lakeway, Volente, Spicewood, etc.; CrestedButte: Town of Crested Butte, Gunnison, Mount Crested Butte, etc.).

Notable overlap: 6thStreet's 4 regions (Old/East/West 6th Street, Red River District) are a literal subset of HeyAustin's regions of the same name. Strong signal that some venues are likely listed on both HeyAustin and 6thStreet — worth checking for actual duplicate venues during ETL (same address/phone across both sites) rather than assuming all 4 brands' content is fully disjoint.

Event categories (event_type): a mostly-unusable folksonomy — do not port as-is

Very different quality bar from venue categories. Pulled the full list (112 terms on HeyAustin/CrestedButte, 129 on LakeTravis) and most of it is noise: one-off auto-generated tags like "1138 studios", "gino barasa", "south congress bookstore", "murals in austin" sit alongside genuinely useful categories like "Festival" and "Entertainment", with no consistent curation. Spot-checking a sample event (LakeTravis's "Sunday Funday at Volente Beach") showed category: null. Most events likely have no category assigned at all, and the small set that do are inconsistently tagged. A field_reference note in the hotel_vegas_scraper prior art mentioned specific term IDs "7201=Food & Drink, 7202=Live Music". Those IDs don't exist in the current 112-term list (max ID found: 3763), so either that note is stale, those IDs live in a different taxonomy than what the public API exposes, or the terms were since deleted. Unresolved, but doesn't change the recommendation either way.

Recommendation: don't port event_type as historical categorization signal — it isn't reliable enough to trust. Instead, design a small, clean, curated event-category set for the new platform (Live Music, Food & Drink, Nightlife, Arts & Culture, Family, Sports, Festivals) and use the AI ingestion pipeline itself to backfill categories on migrated historical events via LLM classification against that clean list — a good, concrete first use of the classification capability the pipeline already needs to build anyway, rather than trusting data that was never consistently curated in the first place.
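One way the backfill step might be guarded, sketched under assumptions: the classifier is injected as a plain callable (the real pipeline's LLM interface is not described here), and `backfill_category` is a hypothetical helper name. The point of the sketch is only that whatever the model returns is accepted when it matches the curated list and discarded otherwise.

```python
from typing import Callable, Optional

# Curated event-category list proposed in this audit.
EVENT_CATEGORIES = [
    "Live Music", "Food & Drink", "Nightlife", "Arts & Culture",
    "Family", "Sports", "Festivals",
]


def backfill_category(
    title: str,
    description: str,
    classify: Callable[[str, list[str]], str],
) -> Optional[str]:
    """Ask the injected classifier for a label; keep it only if it is on the curated list.

    `classify` is a stand-in for the LLM call (an assumption): it receives the
    event text and the allowed labels and returns one label as a string.
    """
    text = f"{title}\n\n{description}".strip()
    label = classify(text, EVENT_CATEGORIES).strip()
    # Match case-insensitively, then return the canonical spelling.
    by_lower = {c.lower(): c for c in EVENT_CATEGORIES}
    return by_lower.get(label.lower())
```

Returning None for an off-list answer keeps the migrated data clean: an uncategorized event can be retried or reviewed, whereas a free-text label would recreate the folksonomy problem.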

Data completeness: geolocation and website coverage

Lat/lng: 100% complete. Checked every venue on all 3 Listar sites (667 total: 233 HeyAustin + 257 LakeTravis + 177 CrestedButte) via the list endpoint — zero missing coordinates anywhere. Mapping/geolocation features have a fully clean foundation to build on; no data-repair work needed here.
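For reference, a check of this kind can be sketched as below. The field names `latitude` and `longitude` are assumptions about the list-endpoint payload, not confirmed names, and `missing_coordinates` is a hypothetical helper.

```python
def missing_coordinates(venues: list[dict]) -> list[dict]:
    """Return venues whose latitude or longitude is absent, empty, or non-numeric."""

    def is_valid(value) -> bool:
        try:
            float(value)
        except (TypeError, ValueError):
            return False
        return True

    # "latitude"/"longitude" are assumed field names for the sketch.
    return [
        v for v in venues
        if not (is_valid(v.get("latitude")) and is_valid(v.get("longitude")))
    ]
```

An empty result over all venues of a site corresponds to the "zero missing coordinates" finding. A stricter variant could also flag 0,0 placeholders, which this sketch treats as valid.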

Website URL: ~4-8% missing, sampled (25 venues per site via the detail endpoint, since website only appears there, not in the list view): HeyAustin 2/25, LakeTravis 2/25, CrestedButte 1/25 missing (5 of 75 sampled overall, about 7%). Extrapolated across ~667 venues, that's roughly 35-55 venues with no website on file at all. Two things worth noting:

HeyAustin / 6thStreet venue overlap: confirmed, and larger than expected

Compared all 233 HeyAustin venues against all 107 6thStreet venues by normalized name. 55 venues (51% of 6thStreet's entire directory) match a HeyAustin venue by name, and spot-checking confirms these are genuinely the same real-world places, not coincidental name collisions — e.g. both sites have "Friends Bar" at matching slugs (friends-bar on both), "Hotel Vegas," "Flamingo Cantina," "24 Diner," "Vulcan Gas Company," etc. This lines up with the region-overlap finding (6thStreet's 4 regions are literally a subset of HeyAustin's) — 6thStreet functions largely as a curated subset of HeyAustin's downtown venues, not a fully independent directory.
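The name comparison can be sketched roughly as follows. The exact normalization rules used in the audit are not recorded, so the ones here (lowercasing, "&" to "and", dropping punctuation, collapsing whitespace) are assumptions, as are the function names.

```python
import re


def normalize_name(name: str) -> str:
    """Lowercase, turn '&' into 'and', drop punctuation, collapse whitespace."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9\s]", "", name)
    return re.sub(r"\s+", " ", name).strip()


def match_by_name(site_a: list[str], site_b: list[str]) -> set[str]:
    """Return the normalized names that appear on both sites."""
    return {normalize_name(n) for n in site_a} & {normalize_name(n) for n in site_b}
```

Name matching alone can produce false positives, which is why the spot-check on slugs and content matters. This sketch also strips non-ASCII letters, so accented venue names would need a gentler rule.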

Important nuance: this is not copy-paste duplication. Checked "Friends Bar" content on both sites — the descriptions are independently written (different opening lines, different framing), not the same text posted twice. So today, curators are maintaining two separately-authored listings for the same physical venue, and (relevant to the ingestion pipeline) that venue's event calendar would get scraped and processed twice — once per site — under the current one-venue-belongs-to-one-site schema design.

This changes a schema assumption. The current venues table models one venue as belonging to exactly one site (site_id FK, unique(site_id, slug)). Given ~55 real-world venues need to appear on two sites at once, worth deciding between:

  1. Many-to-many: venues become site-independent entities, with a join table (venue_sites) controlling which site(s) a venue appears on, plus optional per-site override fields (description, category, featured image) since editorial voice does genuinely differ today. One venue, one set of core facts (address/phone/lat-lng), one ingestion target — shared across sites it's linked to.
  2. Keep separate, but link: preserve today's independent-listing model (two rows, two independently-edited descriptions) but add a nullable self-referential linked_venue_id so the admin portal can surface "this venue also exists on HeyAustin" and, more importantly, the ingestion pipeline dedupes — scrape once, apply the resulting events to every linked venue row instead of scraping the same calendar twice.

Decided: option 2. Implemented as a venue_groups table + nullable venue_group_id on venues (see schema/001_initial_schema.sql) — rows stay separate and independently edited per site, linked only so the admin portal can surface "also listed on X" and the ingestion pipeline scrapes each real-world venue once, not once per site.
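A sketch of the "scrape once per group" behavior this decision implies, under stated assumptions: the `Venue` dataclass and `plan_scrapes` function are hypothetical, only `venue_group_id` and `calendar_url` are names taken from these notes, and picking the first member with a calendar URL as the scrape target is an illustrative rule rather than the pipeline's actual one.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Venue:
    id: int
    site: str
    calendar_url: Optional[str]
    venue_group_id: Optional[int]  # None means the venue is not linked


def plan_scrapes(venues: list[Venue]) -> list[tuple[str, list[int]]]:
    """Return (calendar_url, venue_ids) pairs: one scrape per group or unlinked venue."""
    buckets: dict[tuple[str, int], list[Venue]] = defaultdict(list)
    for v in venues:
        # Linked venues share a bucket; unlinked venues each get their own.
        if v.venue_group_id is not None:
            key = ("group", v.venue_group_id)
        else:
            key = ("venue", v.id)
        buckets[key].append(v)

    plan = []
    for members in buckets.values():
        # Assumption: the first member with a calendar URL is the scrape target.
        url = next((m.calendar_url for m in members if m.calendar_url), None)
        if url is not None:
            plan.append((url, [m.id for m in members]))
    return plan
```

Each entry in the plan is scraped once, and the resulting events are applied to every venue id listed with it, so a venue on both HeyAustin and 6thStreet costs one scrape rather than two.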

Rentals: much smaller, and likely coincidental overlap — not the same problem

Ran the identical name-match check against rentals: 44 of 543 6thStreet rentals (8.1%) match a HeyAustin rental, versus 51% for venues. Checked matched pairs' content directly (not just names this time) and it's literally identical on both sides: e.g. "Deluxe Studio" and "Brand new hideaway in the heart of the city..." both have trivial, auto-generated-looking content (just the title repeated, empty meta fields) on both sites.

This is a different situation from venues, not the same pattern at smaller scale:

Recommendation: rentals don't need a venue_groups-style linking mechanism — there's no editorial-duplication cost to solve for. The only relevant concern is migration-time dedup (don't import the same physical property as two unrelated rows if both sites' data get pulled into one system), which is a one-time ETL concern, not an ongoing admin-portal feature the way it was for venues.

Events: essentially no overlap, despite the shared venues

6thStreet's events aren't REST-accessible (confirmed earlier), so compared via sitemap slugs instead: extracted 6,083 event slugs from 6thStreet's facebook_events sitemap and checked them against all 1,212 of HeyAustin's live events (via Listar API). Result: 2 matches (about 0.2% of HeyAustin's 1,212 events), and both are Austin City Limits Music Festival weekends, the kind of event that would appear on nearly any Austin site regardless of any real connection between the two brands, not evidence of duplication.
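The slug comparison can be sketched as below, assuming the sitemap follows the standard sitemaps.org XML format and that an event's slug is the last path segment of its URL. Both are assumptions about the 6thStreet site, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Standard sitemap namespace (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def slugs_from_sitemap(xml_text: str) -> set[str]:
    """Extract the last path segment of every <loc> URL in a sitemap."""
    root = ET.fromstring(xml_text)
    slugs = set()
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        path = urlparse((loc.text or "").strip()).path
        slug = path.rstrip("/").rsplit("/", 1)[-1]
        if slug:
            slugs.add(slug)
    return slugs


def overlapping_slugs(sitemap_xml: str, api_slugs: set[str]) -> set[str]:
    """Return slugs present both in the sitemap and in the API-derived set."""
    return slugs_from_sitemap(sitemap_xml) & api_slugs
```

Slug equality is a coarse test: the same event imported separately on each site could receive different slugs, so a low match count bounds exact-slug duplication rather than proving the events are unrelated.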

This is genuinely surprising given the venue finding: 51% of 6thStreet's venues are the same physical places as HeyAustin's, yet essentially none of the events at those shared venues overlap between the two sites. The likely explanation is that each site runs its own Facebook Events import independently and incompletely — same venue, but each site's importer pulled a different (and largely non-overlapping) slice of that venue's Facebook activity over time, rather than either site doing a complete pull.

Does this change the venue_groups ingestion design? No — and it's worth being explicit about why. The new pipeline scrapes each venue's own website_url/calendar_url, not Facebook, so the "scrape once per group" design is based on there being one real calendar per real-world venue, which is still true regardless of what the legacy Facebook imports happened to capture. This finding is more a sign that the legacy Facebook automation was inconsistent/incomplete (further reason not to trust it as a data source, on top of the earlier Graph API correction) than a signal about how the new venue-linking design should work.

6thStreet's blog: real content, but abandoned since 2020

Pulled all 14 posts with dates and word counts. The writing itself is genuine, not filler — decent length (142-986 words) and real, topically substantive local-guide pieces: "Guide to Austin's 6th Street," "Best Austin Food Pub Tours," "Best Downtown Austin Arcade Bars." This was a real editorial effort at some point, not auto-generated placeholder content.

But the dates tell the actual story: the most recent post is from January 2020 — over 6 years with no new content. For contrast, checked the other 3 sites' most recent posts: HeyAustin (2026-06-25), LakeTravis (2026-05-30), CrestedButte (2026-05-28) — all actively publishing on a recurring pattern ("Things To Do This Weekend," seasonal guides, live music calendars) within the last few weeks. 6thStreet is the only one of the four with a fully dormant blog.

Consistent with everything else found about 6thStreet: no dedicated mobile app, no Listar backend, a different plugin stack, and now a blog that's been abandoned for 6+ years while its siblings actively publish. This reads as the deprioritized brand of the four — worth deciding explicitly whether 6thStreet's blog gets revived as part of the new platform or is treated as legacy content to migrate as-is without ongoing investment.

Open items for a deeper pass (not yet done)

None remaining from this audit round — all flagged items have been checked.