Data Audit — All 4 Sites

Live recon against production WordPress sites, 2026-07-05. Goal: know the real shape and scale of the data before building the migration ETL, rather than guessing.

Mobile apps — confirmed

HeyAustin, LakeTravis, and CrestedButte all have live mobile apps, running on the same Listar REST backend (listar/v1 + jwt-auth/v1). 6thStreet does not have an app and does not run this backend at all.

Content volume

Site	Venues	Events (live, via API)	Blog Posts	Rentals	Listar backend
HeyAustin	233	1,212	45	452	yes
LakeTravis	257	1,709	58	656	yes
CrestedButte	177	741	45	217	yes
6thStreet	107	not API-accessible (~6,087 URLs in sitemap, historical)	14	543	no

Takeaways:

6thStreet's blog is a fraction of the others' (14 vs 45-58) — a much smaller content operation on that brand, or content lives elsewhere
6thStreet's rental volume (543) is second-highest despite having the fewest venues — makes sense for a downtown entertainment-district brand (short-term condos/hotel suites) vs. lake-house rentals on LakeTravis
Rentals are the same kind of content everywhere (short-term lodging) but a different property style per brand's geography — the shared rentals table in the schema still holds

6thStreet: a genuinely different build, not just a smaller one

No listar/v1, jwt-auth/v1, or wpjm-internal/v1 — instead has gravityforms/v2, which no other site has
Venues use job-listings (hyphenated post type) vs. the other sites' job_listing (underscored) — likely a different WP Job Manager version/config
Extra taxonomies not seen elsewhere: tax_business_listing, tax_feature, resume_region — signs of a different plugin/theme lineage on this brand
Its event post type has show_in_rest disabled, so no REST route exists for events at all, despite the Facebook Events plugin clearly being active (fbevents/v1 namespace present, ~6,087 URLs in the facebook_events sitemap). Confirmed: the other 3 sites do not hide events this way; their event data is fully REST-accessible.
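The namespace comparison above can be reproduced mechanically. A WordPress site's REST index (/wp-json/) includes a "namespaces" array, so the check reduces to set arithmetic once those arrays are fetched. A minimal sketch, assuming the lists are already in hand; the function name and the sample lists are illustrative and contain only namespaces named in this audit, not each site's full index:

```python
def namespace_diff(site_namespaces):
    """For each site, report namespaces no other site has ("unique")
    and namespaces every other site has but this one lacks ("missing")."""
    report = {}
    for site, spaces in site_namespaces.items():
        own = set(spaces)
        others = [set(v) for k, v in site_namespaces.items() if k != site]
        seen_elsewhere = set().union(*others)
        on_every_other = set.intersection(*others) if others else set()
        report[site] = {
            "unique": sorted(own - seen_elsewhere),
            "missing": sorted(on_every_other - own),
        }
    return report


# Illustrative input only (assumption): a partial list per site.
listar_stack = ["wp/v2", "listar/v1", "jwt-auth/v1", "wpjm-internal/v1"]
sample = {
    "HeyAustin": listar_stack,
    "LakeTravis": listar_stack,
    "CrestedButte": listar_stack,
    "6thStreet": ["wp/v2", "gravityforms/v2"],
}

report = namespace_diff(sample)
print(report["6thStreet"])
```

Re-running this against fresh index pulls before ETL work starts is a cheap way to catch a plugin being added or removed on any one site.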

Migration consequence: HeyAustin/LakeTravis/CrestedButte event data migrates cleanly via REST API pulls. 6thStreet's event history can only be reached by scraping rendered HTML pages or getting direct database/export access from whoever hosts it — a heavier, separate migration path for that one site specifically. Worth deciding early whether 6thStreet's ~6,000 historical event pages are worth migrating at all, or whether only future-dated/recent ones matter (most are almost certainly expired, one-time past events with no ongoing value beyond incidental SEO).
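For the three REST-accessible sites, the pull itself is a plain pagination loop. A minimal sketch, assuming a caller-supplied fetch_page function (hypothetical) that returns one page of items plus the total page count. Core WordPress collection routes report that count in the X-WP-TotalPages response header; whether the listar/v1 routes do the same has not been checked here, so the HTTP layer is deliberately left out:

```python
def pull_all(fetch_page, max_pages=1000):
    """Collect every item from a paginated collection.

    fetch_page(page) is a hypothetical callable returning
    (items_on_that_page, total_pages). max_pages is a safety cap.
    """
    items = []
    page = 1
    while page <= max_pages:
        batch, total_pages = fetch_page(page)
        items.extend(batch)
        # Stop on the last page, or early if a page comes back empty.
        if page >= total_pages or not batch:
            break
        page += 1
    return items


# Stand-in for a real endpoint: 250 fake records, 100 per page.
fake_records = list(range(250))

def fake_fetch(page, per_page=100):
    start = (page - 1) * per_page
    return fake_records[start:start + per_page], 3

pulled = pull_all(fake_fetch)
print(len(pulled))
```

Keeping the fetch function injectable means the same loop can be pointed at venues, events, or posts, and tested without touching production.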

Correction: Facebook Graph API is not usable for the new build

Earlier notes described the Facebook Events importer as pulling "structured data via the Facebook Graph API" and treated it as a trustworthy, near-auto-publish source. Confirmed wrong, and more importantly, moot: Facebook Graph API access for pulling public event data is not an option for the new platform, period. Meta locked this down years ago (~2018-2020), and no new integration can be built to rely on it. The new ingestion pipeline plans zero Facebook dependency (see ai-ingestion-pipeline.md): whatever Facebook-sourced events exist on the legacy sites migrate once as historical data; they are not an ongoing source. The legacy plugin's actual mechanism (scraping, a stale integration, or a paid proxy service) remains unconfirmed and is no longer relevant to the new design either way.

Category & region taxonomy audit

Pulled real taxonomy term lists from all 4 sites (job-categories for venues, job_listing_region for regions, event_type for events). Two very different pictures emerged.

Venue categories: clean, hierarchical, and portable as-is

Real hierarchy example (LakeTravis): 38 categories under parents like "Restaurants" (38 venues), "Boat Rentals" (37), "Event Venue" (22), "Marinas" (21) — sensible, well-maintained, clearly curated by hand over time. Same quality bar on HeyAustin, CrestedButte, and 6thStreet.

Structural finding: HeyAustin, CrestedButte, and 6thStreet share the same parent category IDs (3500 = dining-ish, 3501 = entertainment-ish, 3513 = services, 3527 = lodging, 3533 = shopping, 3542 = arts) — they were clearly built from one shared starting template. LakeTravis has entirely different parent IDs (3774-3805) with its own top-level structure shaped around its lake-recreation focus (Boat Service, Marine Services, Yacht Charters as top-level groups that don't exist elsewhere). CrestedButte adds its own ski/outdoor-specific top level (parent 7194: Mountain Biking, Backcountry Skiing, Downhill Skiing, Nordic Skiing, Snowmobiling).

Recommendation: port venue categories directly, but design the new schema's category set as a shared canonical vocabulary (Restaurants, Bars, Live Entertainment, Lodging, Shopping, Services, Arts) plus brand-specific extensions (LakeTravis's boating categories, CrestedButte's winter-sports categories) rather than 4 fully independent flat lists — 3 of 4 sites already prove this shared structure works, and it's what lets the AI search and cross-brand admin views reason about categories consistently.
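One way to picture the "shared vocabulary plus brand extensions" shape is as a small lookup. This is a sketch only: the category names come from this audit, but the constant and function names are hypothetical, and treating CrestedButte's parent-7194 children as flat extensions is a simplification:

```python
# Shared canonical vocabulary, as listed in the recommendation above.
CANONICAL_CATEGORIES = (
    "Restaurants", "Bars", "Live Entertainment", "Lodging",
    "Shopping", "Services", "Arts",
)

# Brand-specific extensions (names taken from the audit findings).
BRAND_EXTENSIONS = {
    "LakeTravis": ("Boat Service", "Marine Services", "Yacht Charters"),
    "CrestedButte": (
        "Mountain Biking", "Backcountry Skiing", "Downhill Skiing",
        "Nordic Skiing", "Snowmobiling",
    ),
}


def categories_for(brand):
    """Canonical categories first, then any extensions for this brand."""
    return list(CANONICAL_CATEGORIES) + list(BRAND_EXTENSIONS.get(brand, ()))


def is_shared(category):
    """True if every brand understands this category."""
    return category in CANONICAL_CATEGORIES


print(categories_for("LakeTravis"))
```

Cross-brand features (AI search, admin rollups) would filter on is_shared, while each site's own navigation uses the full categories_for list.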

Regions: clean, flat, and reveal real geographic overlap

All region taxonomies are flat (no parent hierarchy): real neighborhoods/towns per brand's geography (HeyAustin: Dirty 6th Street, East Austin, Lake Travis, South Congress, etc.; LakeTravis: Lakeway, Volente, Spicewood, etc.; CrestedButte: Town of Crested Butte, Gunnison, Mount Crested Butte, etc.).

Notable overlap: 6thStreet's 4 regions (Old/East/West 6th Street, Red River District) are a literal subset of HeyAustin's regions of the same name. Strong signal that some venues are likely listed on both HeyAustin and 6thStreet — worth checking for actual duplicate venues during ETL (same address/phone across both sites) rather than assuming all 4 brands' content is fully disjoint.

Event categories (`event_type`): a mostly-unusable folksonomy — do not port as-is

Very different quality bar from venue categories. Pulled the full list (112 terms on HeyAustin/CrestedButte, 129 on LakeTravis) and most of it is noise: one-off auto-generated tags like "1138 studios", "gino barasa", "south congress bookstore", "murals in austin" sit alongside genuinely useful categories like "Festival" and "Entertainment", with no consistent curation. Spot-checking a sample event (LakeTravis's "Sunday Funday at Volente Beach") showed category: null — most events likely have no category assigned at all, and the small set that do are inconsistently tagged. A field_reference note in the hotel_vegas_scraper prior art mentioned specific term IDs "7201=Food & Drink, 7202=Live Music" — those IDs don't exist in the current 112-term list (max ID found: 3763), so either that note is stale, those IDs live in a different taxonomy than what the public API exposes, or the terms were since deleted. Unresolved, but doesn't change the recommendation either way.

Recommendation: don't port event_type as historical categorization signal — it isn't reliable enough to trust. Instead, design a small, clean, curated event-category set for the new platform (Live Music, Food & Drink, Nightlife, Arts & Culture, Family, Sports, Festivals) and use the AI ingestion pipeline itself to backfill categories on migrated historical events via LLM classification against that clean list — a good, concrete first use of the classification capability the pipeline already needs to build anyway, rather than trusting data that was never consistently curated in the first place.
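The backfill step needs one guard regardless of which model does the classifying: whatever label comes back must be snapped to the curated list or rejected, otherwise the new taxonomy drifts back into a folksonomy. A minimal sketch, assuming a hypothetical classify callable standing in for the LLM call (its signature is an assumption, not the pipeline's actual interface):

```python
CURATED_EVENT_CATEGORIES = (
    "Live Music", "Food & Drink", "Nightlife", "Arts & Culture",
    "Family", "Sports", "Festivals",
)


def backfill_category(event, classify):
    """Return a curated category for a migrated event, or None.

    classify(title, description, choices) is a hypothetical stand-in
    for the LLM classification call. Its answer is accepted only if it
    matches a curated category (case and whitespace are ignored).
    """
    raw = classify(
        event["title"], event.get("description", ""), CURATED_EVENT_CATEGORIES
    )
    by_lower = {c.lower(): c for c in CURATED_EVENT_CATEGORIES}
    return by_lower.get(str(raw).strip().lower())


# Fake classifiers for illustration: one sloppy but valid, one off-list.
event = {"title": "Sunday Funday at Volente Beach"}
accepted = backfill_category(event, lambda t, d, choices: " live music ")
rejected = backfill_category(event, lambda t, d, choices: "gino barasa")
print(accepted, rejected)
```

Events that come back as None stay uncategorized and can be queued for curator review rather than silently gaining an off-list tag.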

Data completeness: geolocation and website coverage

Lat/lng: 100% complete. Checked every venue on all 3 Listar sites (667 total: 233 HeyAustin + 257 LakeTravis + 177 CrestedButte) via the list endpoint — zero missing coordinates anywhere. Mapping/geolocation features have a fully clean foundation to build on; no data-repair work needed here.

Website URL: ~4-8% missing, sampled (25 venues per site via the detail endpoint, since website only appears there, not in the list view): HeyAustin 2/25, LakeTravis 2/25, CrestedButte 1/25 missing. Pooled, that's 5 of 75 sampled venues (about 6.7%); extrapolated across the 667 venues, roughly 35-55 venues have no website on file at all. Two things worth noting:

There's no separate calendar_url field in the legacy data — venues only have one general website field. The ingestion pipeline (both the old vee.py and the new design) has to discover the actual events/calendar sub-page starting from that homepage, same as vee.py already does ("discovering calendar on https://meanwhilebeer.com").
For the ~4-8% with no website at all, and with Facebook no longer a live ingestion source (see correction above), those venues have no automated ingestion path whatsoever: manual entry is the only option unless/until someone adds a website for them. Worth flagging to curators early rather than discovering it venue-by-venue during rollout.
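The extrapolation behind the missing-website estimate is simple enough to spell out. A short worked calculation using only the sample counts and venue totals reported in this audit (variable names are illustrative):

```python
# Sample results from the detail-endpoint spot check, plus venue totals.
samples = {
    "HeyAustin": {"venues": 233, "sampled": 25, "missing": 2},
    "LakeTravis": {"venues": 257, "sampled": 25, "missing": 2},
    "CrestedButte": {"venues": 177, "sampled": 25, "missing": 1},
}

total_venues = sum(s["venues"] for s in samples.values())
total_sampled = sum(s["sampled"] for s in samples.values())
total_missing = sum(s["missing"] for s in samples.values())

# Pooled: one rate across all three sites, applied to all venues.
pooled_rate = total_missing / total_sampled
pooled_estimate = pooled_rate * total_venues

# Stratified: each site's own rate applied to its own venue count.
stratified_estimate = sum(
    s["venues"] * s["missing"] / s["sampled"] for s in samples.values()
)

# Spread: lowest and highest per-site rate applied to all venues.
rates = [s["missing"] / s["sampled"] for s in samples.values()]
low_estimate = min(rates) * total_venues
high_estimate = max(rates) * total_venues

print(f"pooled rate {pooled_rate:.1%}, about {pooled_estimate:.0f} venues")
print(f"stratified estimate about {stratified_estimate:.0f} venues")
print(f"spread {low_estimate:.0f} to {high_estimate:.0f} venues")
```

The pooled and stratified figures come out at about 44 and 46 venues, with a per-site-rate spread of 27 to 53. With only 25 venues sampled per site these are rough planning numbers, not a measured count.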

HeyAustin / 6thStreet venue overlap: confirmed, and larger than expected

Compared all 233 HeyAustin venues against all 107 6thStreet venues by normalized name. 55 venues (51% of 6thStreet's entire directory) match a HeyAustin venue by name, and spot-checking confirms these are genuinely the same real-world places, not coincidental name collisions — e.g. both sites have "Friends Bar" at matching slugs (friends-bar on both), "Hotel Vegas," "Flamingo Cantina," "24 Diner," "Vulcan Gas Company," etc. This lines up with the region-overlap finding (6thStreet's 4 regions are literally a subset of HeyAustin's) — 6thStreet functions largely as a curated subset of HeyAustin's downtown venues, not a fully independent directory.
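A normalized-name comparison of this kind can be sketched as below. The exact normalization rules used for the audit are not recorded here, so the ones shown (lowercasing, stripping accents and punctuation, dropping "the", treating "&" as "and") are assumptions, as are the function names and the sample name variants:

```python
import re
import unicodedata


def normalize_name(name):
    """Reduce a venue name to a comparison key (assumed rules)."""
    text = unicodedata.normalize("NFKD", name)
    text = text.encode("ascii", "ignore").decode("ascii")  # drop accents
    text = text.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9]+", " ", text)  # punctuation to spaces
    words = [w for w in text.split() if w != "the"]
    return " ".join(words)


def match_by_name(names_a, names_b):
    """Return (name_from_a, name_from_b) pairs sharing a normalized key."""
    index = {normalize_name(n): n for n in names_a}
    pairs = []
    for name in names_b:
        key = normalize_name(name)
        if key in index:
            pairs.append((index[key], name))
    return pairs


# Illustrative lists: real venue names from the audit, made-up variants.
heyaustin = ["Friends Bar", "Hotel Vegas", "24 Diner", "Flamingo Cantina"]
sixth_street = ["friends bar", "The Hotel Vegas", "24 Diner", "Some Other Venue"]

matches = match_by_name(heyaustin, sixth_street)
print(matches)
```

Name matching alone can produce false positives, so pairing it with an address or phone comparison during ETL (as suggested in the region-overlap section) is the safer confirmation step.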

Important nuance: this is not copy-paste duplication. Checked "Friends Bar" content on both sites — the descriptions are independently written (different opening lines, different framing), not the same text posted twice. So today, curators are maintaining two separately-authored listings for the same physical venue, and (relevant to the ingestion pipeline) that venue's event calendar would get scraped and processed twice — once per site — under the current one-venue-belongs-to-one-site schema design.

This changes a schema assumption. The current venues table models one venue as belonging to exactly one site (site_id FK, unique(site_id, slug)). Given that ~55 real-world venues need to appear on two sites at once, it's worth deciding between:

Many-to-many: venues become site-independent entities, with a join table (venue_sites) controlling which site(s) a venue appears on, plus optional per-site override fields (description, category, featured image) since editorial voice does genuinely differ today. One venue, one set of core facts (address/phone/lat-lng), one ingestion target — shared across sites it's linked to.
Keep separate, but link: preserve today's independent-listing model (two rows, two independently-edited descriptions) but add a nullable self-referential linked_venue_id so the admin portal can surface "this venue also exists on HeyAustin" and, more importantly, the ingestion pipeline dedupes — scrape once, apply the resulting events to every linked venue row instead of scraping the same calendar twice.

Decided: option 2 (keep separate, but link). Implemented as a venue_groups table plus a nullable venue_group_id on venues (see schema/001_initial_schema.sql). Rows stay separate and independently edited per site, linked only so the admin portal can surface "also listed on X" and the ingestion pipeline scrapes each real-world venue once, not once per site.
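To make the "scrape once per real-world venue" behavior concrete, here is a small in-memory sketch using Python's sqlite3. It is not the contents of schema/001_initial_schema.sql: only venue_groups, venue_group_id, site_id, and unique(site_id, slug) come from these notes; every other column, the site IDs, and the example.com URLs are placeholders:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE venue_groups (id INTEGER PRIMARY KEY);
CREATE TABLE venues (
    id INTEGER PRIMARY KEY,
    site_id INTEGER NOT NULL,
    slug TEXT NOT NULL,
    website_url TEXT,
    venue_group_id INTEGER REFERENCES venue_groups(id),  -- nullable link
    UNIQUE (site_id, slug)
);
""")

conn.execute("INSERT INTO venue_groups (id) VALUES (1)")
conn.executemany(
    "INSERT INTO venues (id, site_id, slug, website_url, venue_group_id) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "friends-bar", "https://example.com/friends", 1),  # site 1
        (2, 4, "friends-bar", "https://example.com/friends", 1),  # site 4
        (3, 1, "solo-venue", "https://example.com/solo", None),   # ungrouped
    ],
)

# One scrape target per real-world venue: grouped rows collapse into one
# target, ungrouped rows stand alone. venue_ids lists every row that
# should receive the scraped events.
targets = conn.execute("""
SELECT COALESCE('group:' || venue_group_id, 'venue:' || id) AS target,
       MIN(website_url) AS url,
       GROUP_CONCAT(id) AS venue_ids
FROM venues
GROUP BY target
ORDER BY target
""").fetchall()

for target, url, venue_ids in targets:
    print(target, url, venue_ids)
```

Three venue rows yield two scrape targets. The two friends-bar rows keep their own descriptions but share one calendar fetch.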

Rentals: much smaller, and likely coincidental overlap — not the same problem

Ran the identical name-match check against rentals: 44 of 543 6thStreet rentals (8.1%) match a HeyAustin rental, versus 51% for venues. Checked matched pairs' content directly (not just names this time) and it's literally identical on both sides: e.g. "Deluxe Studio" and "Brand new hideaway in the heart of the city..." both have trivial, auto-generated-looking content (just the title repeated, empty meta fields) on both sites.

This is a different situation from venues, not the same pattern at smaller scale:

Venues: real editorial descriptions, independently written per site → genuine duplicate-maintenance problem worth solving with venue_groups
Rentals: thin/likely auto-imported content, identical on both sides → reads as a shared third-party syndication feed (consistent with the easy-property-listings plugin finding) that two independent per-site pulls happen to overlap on, not curators duplicating manual work

Recommendation: rentals don't need a venue_groups-style linking mechanism — there's no editorial-duplication cost to solve for. The only relevant concern is migration-time dedup (don't import the same physical property as two unrelated rows if both sites' data get pulled into one system), which is a one-time ETL concern, not an ongoing admin-portal feature the way it was for venues.

Events: essentially no overlap, despite the shared venues

6thStreet's events aren't REST-accessible (confirmed earlier), so compared via sitemap slugs instead: extracted 6,083 event slugs from 6thStreet's facebook_events sitemap and checked them against all 1,212 of HeyAustin's live events (via Listar API). Result: 2 matches (0.2% of HeyAustin's live events), and both are Austin City Limits Music Festival weekends. That is the kind of event that would appear on nearly any Austin site regardless of any real connection between the two brands, not evidence of duplication.
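The sitemap-versus-API comparison boils down to extracting the last path segment of each sitemap URL and intersecting with the API slugs. A minimal sketch using the standard sitemaps.org XML namespace; the function name, the example.com URLs, the URL path layout, and the sample slugs are all made up for illustration:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def slugs_from_sitemap(xml_text):
    """Return the set of final URL path segments found in <loc> entries."""
    root = ET.fromstring(xml_text)
    slugs = set()
    for loc in root.findall(".//sm:loc", SITEMAP_NS):
        path = urlparse((loc.text or "").strip()).path.rstrip("/")
        if path:
            slugs.add(path.rsplit("/", 1)[-1])
    return slugs


# Illustrative sitemap fragment (assumed URL structure).
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/facebook_events/festival-weekend-one/</loc></url>
  <url><loc>https://example.com/facebook_events/some-past-show/</loc></url>
</urlset>"""

# Illustrative slugs as they might come back from the other site's API.
api_slugs = {"festival-weekend-one", "things-to-do-this-weekend"}

sitemap_slugs = slugs_from_sitemap(sample_sitemap)
overlap = sitemap_slugs & api_slugs
print(sorted(overlap))
```

Slug equality is a stricter test than the normalized-name match used for venues, so it can undercount events whose slugs differ between sites (for example, a numeric suffix on one side).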

This is genuinely surprising given the venue finding: 51% of 6thStreet's venues are the same physical places as HeyAustin's, yet essentially none of the events at those shared venues overlap between the two sites. The likely explanation is that each site runs its own Facebook Events import independently and incompletely — same venue, but each site's importer pulled a different (and largely non-overlapping) slice of that venue's Facebook activity over time, rather than either site doing a complete pull.

Does this change the venue_groups ingestion design? No — and it's worth being explicit about why. The new pipeline scrapes each venue's own website_url/calendar_url, not Facebook, so the "scrape once per group" design is based on there being one real calendar per real-world venue, which is still true regardless of what the legacy Facebook imports happened to capture. This finding is more a sign that the legacy Facebook automation was inconsistent/incomplete (further reason not to trust it as a data source, on top of the earlier Graph API correction) than a signal about how the new venue-linking design should work.

6thStreet's blog: real content, but abandoned since 2020

Pulled all 14 posts with dates and word counts. The writing itself is genuine, not filler — decent length (142-986 words) and real, topically substantive local-guide pieces: "Guide to Austin's 6th Street," "Best Austin Food Pub Tours," "Best Downtown Austin Arcade Bars." This was a real editorial effort at some point, not auto-generated placeholder content.

But the dates tell the actual story: the most recent post is from January 2020 — over 6 years with no new content. For contrast, checked the other 3 sites' most recent posts: HeyAustin (2026-06-25), LakeTravis (2026-05-30), CrestedButte (2026-05-28) — all actively publishing on a recurring pattern ("Things To Do This Weekend," seasonal guides, live music calendars) within the last few weeks. 6thStreet is the only one of the four with a fully dormant blog.

Consistent with everything else found about 6thStreet: no dedicated mobile app, no Listar backend, a different plugin stack, and now a blog that's been abandoned for 6+ years while its siblings actively publish. This reads as the deprioritized brand of the four — worth deciding explicitly whether 6thStreet's blog gets revived as part of the new platform or is treated as legacy content to migrate as-is without ongoing investment.

Open items for a deeper pass (not yet done)

None remaining from this audit round — all flagged items have been checked.