Data Audit — All 4 Sites
Live recon against production WordPress sites, 2026-07-05. Goal: know the real shape and scale of the data before building migration ETL, not guess.
Mobile apps — confirmed
HeyAustin, LakeTravis, and CrestedButte all have live mobile apps,
running on the same Listar REST backend (listar/v1 + jwt-auth/v1).
6thStreet does not have an app and does not run this backend at all.
Content volume
| Site | Venues | Events (live, via API) | Blog Posts | Rentals | Listar backend |
|---|---|---|---|---|---|
| HeyAustin | 233 | 1,212 | 45 | 452 | yes |
| LakeTravis | 257 | 1,709 | 58 | 656 | yes |
| CrestedButte | 177 | 741 | 45 | 217 | yes |
| 6thStreet | 107 | not API-accessible (~6,087 URLs in sitemap, historical) | 14 | 543 | no |
Takeaways:
- 6thStreet's blog is a fraction of the others' (14 vs 45-58) — a much smaller content operation on that brand, or content lives elsewhere
- 6thStreet's rental volume (543) is second-highest despite having the fewest venues — makes sense for a downtown entertainment-district brand (short-term condos/hotel suites) vs. lake-house rentals on LakeTravis
- Rentals are the same kind of content everywhere (short-term lodging) but
a different property style per brand's geography, so the shared
rentals table in the schema still holds
6thStreet: a genuinely different build, not just a smaller one
- No listar/v1, jwt-auth/v1, or wpjm-internal/v1; instead has gravityforms/v2, which no other site has
- Venues use job-listings (hyphenated post type) vs. the other sites' job_listing (underscored), likely a different WP Job Manager version/config
- Extra taxonomies not seen elsewhere: tax_business_listing, tax_feature, resume_region. These are signs of a different plugin/theme lineage on this brand
- Its event post type has show_in_rest disabled: no REST route exists for events at all, despite the Facebook Events plugin clearly being active (fbevents/v1 namespace present, ~6,087 URLs in the facebook_events sitemap). Confirmed: this is not something the other 3 sites also have hidden; their event data is fully REST-accessible.
Migration consequence: HeyAustin/LakeTravis/CrestedButte event data migrates cleanly via REST API pulls. 6thStreet's event history can only be reached by scraping rendered HTML pages or getting direct database/export access from whoever hosts it — a heavier, separate migration path for that one site specifically. Worth deciding early whether 6thStreet's ~6,000 historical event pages are worth migrating at all, or whether only future-dated/recent ones matter (most are almost certainly expired, one-time past events with no ongoing value beyond incidental SEO).
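The backend split above can be checked mechanically. A WordPress site's REST index (the JSON document served at /wp-json/) includes a "namespaces" list, so a minimal sketch of the classification might look like the following. The function name, the result keys, and the sample inputs are illustrative assumptions, not part of the audit tooling.

```python
# Namespaces that mark a site as running the shared Listar backend (from the audit).
LISTAR_NAMESPACES = {"listar/v1", "jwt-auth/v1"}


def classify_backend(rest_index: dict) -> dict:
    """Summarize a parsed /wp-json/ index document for migration planning."""
    namespaces = set(rest_index.get("namespaces", []))
    return {
        "has_listar": LISTAR_NAMESPACES <= namespaces,
        "missing_listar": sorted(LISTAR_NAMESPACES - namespaces),
        "has_fbevents": "fbevents/v1" in namespaces,
    }


# Illustrative inputs only: real indexes list many more namespaces.
listar_site = {"namespaces": ["wp/v2", "listar/v1", "jwt-auth/v1"]}
other_site = {"namespaces": ["wp/v2", "gravityforms/v2", "fbevents/v1"]}

print(classify_backend(listar_site))
print(classify_backend(other_site))
```

Working from the parsed index rather than fetching inside the function keeps the check testable offline; the HTTP fetch can sit in whatever client the ETL already uses.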
Correction: Facebook Graph API is not usable for the new build
Earlier notes described the Facebook Events importer as pulling "structured
data via the Facebook Graph API" and treated it as a trustworthy,
near-auto-publish source. Confirmed wrong, and more importantly, moot: Facebook
Graph API access for pulling public event data is not an option for the new
platform, period — Meta locked this down years ago (~2018-2020), and no
new integration can be built to rely on it. The new ingestion pipeline plans
zero Facebook dependency (see ai-ingestion-pipeline.md) — whatever
Facebook-sourced events exist on the legacy sites migrate once as historical
data; they are not an ongoing source. The legacy plugin's actual mechanism
(scraping, a stale integration, or a paid proxy service) remains unconfirmed
and is no longer relevant to the new design either way.
Category & region taxonomy audit
Pulled real taxonomy term lists from all 4 sites (job-categories for
venues, job_listing_region for regions, event_type for events). Two very
different pictures emerged.
Venue categories: clean, hierarchical, and portable as-is
Real hierarchy example (LakeTravis): 38 categories under parents like "Restaurants" (38 venues), "Boat Rentals" (37), "Event Venue" (22), "Marinas" (21) — sensible, well-maintained, clearly curated by hand over time. Same quality bar on HeyAustin, CrestedButte, and 6thStreet.
Structural finding: HeyAustin, CrestedButte, and 6thStreet share the same parent category IDs (3500 = dining-ish, 3501 = entertainment-ish, 3513 = services, 3527 = lodging, 3533 = shopping, 3542 = arts) — they were clearly built from one shared starting template. LakeTravis has entirely different parent IDs (3774-3805) with its own top-level structure shaped around its lake-recreation focus (Boat Service, Marine Services, Yacht Charters as top-level groups that don't exist elsewhere). CrestedButte adds its own ski/outdoor-specific top level (parent 7194: Mountain Biking, Backcountry Skiing, Downhill Skiing, Nordic Skiing, Snowmobiling).
Recommendation: port venue categories directly, but design the new schema's category set as a shared canonical vocabulary (Restaurants, Bars, Live Entertainment, Lodging, Shopping, Services, Arts) plus brand-specific extensions (LakeTravis's boating categories, CrestedButte's winter-sports categories) rather than 4 fully independent flat lists — 3 of 4 sites already prove this shared structure works, and it's what lets the AI search and cross-brand admin views reason about categories consistently.
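One way to picture the "shared vocabulary plus extensions" shape is below. This is a sketch only: the site keys and helper names are hypothetical, and the category names are taken from the lists above rather than from a finalized schema.

```python
# Shared canonical vocabulary, as proposed in the recommendation above.
CANONICAL_CATEGORIES = (
    "Restaurants", "Bars", "Live Entertainment", "Lodging",
    "Shopping", "Services", "Arts",
)

# Site keys are hypothetical identifiers; extension names come from the audit.
BRAND_EXTENSIONS = {
    "laketravis": ("Boat Service", "Marine Services", "Yacht Charters"),
    "crestedbutte": (
        "Mountain Biking", "Backcountry Skiing", "Downhill Skiing",
        "Nordic Skiing", "Snowmobiling",
    ),
}


def categories_for(site_key: str) -> tuple:
    """Canonical categories first, then any brand-specific extensions."""
    return CANONICAL_CATEGORIES + BRAND_EXTENSIONS.get(site_key, ())


def is_shared(category: str) -> bool:
    """True when a category can be compared across all brands."""
    return category in CANONICAL_CATEGORIES


print(categories_for("laketravis"))
```

Cross-brand features (AI search, admin views) would filter on is_shared, while each site's own navigation uses the full categories_for list.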
Regions: clean, flat, and reveal real geographic overlap
All region taxonomies are flat (no parent hierarchy): real neighborhoods/towns per brand's geography (HeyAustin: Dirty 6th Street, East Austin, Lake Travis, South Congress, etc.; LakeTravis: Lakeway, Volente, Spicewood, etc.; CrestedButte: Town of Crested Butte, Gunnison, Mount Crested Butte, etc.).
Notable overlap: 6thStreet's 4 regions (Old/East/West 6th Street, Red River District) are a literal subset of HeyAustin's regions of the same name. Strong signal that some venues are likely listed on both HeyAustin and 6thStreet — worth checking for actual duplicate venues during ETL (same address/phone across both sites) rather than assuming all 4 brands' content is fully disjoint.
Event categories (event_type): a mostly-unusable folksonomy — do not port as-is
Very different quality bar from venue categories. Pulled the full list
(112 terms on HeyAustin/CrestedButte, 129 on LakeTravis) and most of it is
noise: one-off auto-generated tags like "1138 studios", "gino barasa",
"south congress bookstore", "murals in austin" sit alongside genuinely
useful categories like "Festival" and "Entertainment", with no consistent
curation. Spot-checking a sample event (LakeTravis's "Sunday Funday at
Volente Beach") showed category: null — most events likely have no
category assigned at all, and the small set that do are inconsistently
tagged. A field_reference note in the hotel_vegas_scraper prior art
mentioned specific term IDs "7201=Food & Drink, 7202=Live Music" — those IDs
don't exist in the current 112-term list (max ID found: 3763), so either
that note is stale, those IDs live in a different taxonomy than what the
public API exposes, or the terms were since deleted. Unresolved, but doesn't
change the recommendation either way.
Recommendation: don't port event_type as historical categorization
signal — it isn't reliable enough to trust. Instead, design a small, clean,
curated event-category set for the new platform (Live Music, Food & Drink,
Nightlife, Arts & Culture, Family, Sports, Festivals) and use the AI
ingestion pipeline itself to backfill categories on migrated historical
events via LLM classification against that clean list — a good, concrete
first use of the classification capability the pipeline already needs to
build anyway, rather than trusting data that was never consistently curated
in the first place.
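A minimal sketch of that backfill step, assuming the model is reached through some callable that takes a prompt and returns text. The ask_llm parameter, the prompt wording, and the event field names are all assumptions for illustration; the point is that the model's answer is accepted only if it is one of the curated categories.

```python
from typing import Callable, Optional

# The curated set proposed above.
EVENT_CATEGORIES = (
    "Live Music", "Food & Drink", "Nightlife", "Arts & Culture",
    "Family", "Sports", "Festivals",
)


def build_prompt(title: str, description: str) -> str:
    """Ask for exactly one category from the closed list."""
    options = ", ".join(EVENT_CATEGORIES)
    return (
        f"Pick exactly one category for this event from: {options}.\n"
        "Answer with the category name only.\n"
        f"Title: {title}\n"
        f"Description: {description}"
    )


def backfill_category(event: dict, ask_llm: Callable[[str], str]) -> Optional[str]:
    """Return a curated category, or None if the answer is off-list."""
    prompt = build_prompt(event.get("title", ""), event.get("description", ""))
    answer = ask_llm(prompt).strip().lower()
    by_lower = {name.lower(): name for name in EVENT_CATEGORIES}
    return by_lower.get(answer)


# Stand-in for a real model call, used only to exercise the validation path.
def fake_llm(prompt: str) -> str:
    return " live music\n"


print(backfill_category({"title": "Sunday Funday at Volente Beach"}, fake_llm))
```

Returning None for off-list answers leaves those events uncategorized for curator review instead of letting free-form tags back into the taxonomy.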
Data completeness: geolocation and website coverage
Lat/lng: 100% complete. Checked every venue on all 3 Listar sites (667 total: 233 HeyAustin + 257 LakeTravis + 177 CrestedButte) via the list endpoint — zero missing coordinates anywhere. Mapping/geolocation features have a fully clean foundation to build on; no data-repair work needed here.
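The completeness check itself is simple to re-run during ETL. The sketch below treats a venue as missing coordinates when either value is absent, empty, non-numeric, or out of range. The "latitude" and "longitude" key names are placeholders, since the list endpoint's actual field names are not recorded here.

```python
def missing_coordinates(venues, lat_key="latitude", lng_key="longitude"):
    """Return the venues whose coordinates are absent, empty, or invalid."""
    missing = []
    for venue in venues:
        try:
            lat = float(venue.get(lat_key))
            lng = float(venue.get(lng_key))
        except (TypeError, ValueError):
            # None raises TypeError; "" or other junk raises ValueError.
            missing.append(venue)
            continue
        if not (-90 <= lat <= 90 and -180 <= lng <= 180):
            missing.append(venue)
    return missing


# Illustrative rows only; not real audit data.
sample = [
    {"slug": "a", "latitude": "30.2672", "longitude": "-97.7431"},
    {"slug": "b", "latitude": "", "longitude": "-97.7431"},
    {"slug": "c", "latitude": None, "longitude": None},
    {"slug": "d", "latitude": 38.8697, "longitude": -106.9878},
]
print([v["slug"] for v in missing_coordinates(sample)])
```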
Website URL: ~4-8% missing per site, sampled (25 venues per site via the detail
endpoint, since website only appears there, not in the list view):
HeyAustin 2/25, LakeTravis 2/25, CrestedButte 1/25 missing. Pooled, that is
5/75 (~7%); extrapolated across ~667 venues, roughly 45 venues (about 27-53
across the sampled 4-8% range) have no website on file at all. Two things
worth noting:
- There's no separate calendar_url field in the legacy data; venues only have one general website field. The ingestion pipeline (both the old vee.py and the new design) has to discover the actual events/calendar sub-page starting from that homepage, same as vee.py already does ("discovering calendar on https://meanwhilebeer.com").
- For the ~4-8% with no website at all, and with Facebook no longer a live ingestion source (see correction above), those venues have no automated ingestion path whatsoever. Manual entry is the only option unless/until someone adds a website for them. Worth flagging to curators early rather than discovering it venue-by-venue during rollout.
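The extrapolation from the 25-per-site samples can be reproduced in a few lines. This sketch shows two reasonable ways to scale the sample up: pooling all 75 sampled venues, or weighting each site's sample rate by that site's venue count. With only 75 venues sampled, both numbers are rough estimates rather than counts.

```python
# (missing, sampled) per site, and total venues per site, from the audit.
SAMPLES = {"HeyAustin": (2, 25), "LakeTravis": (2, 25), "CrestedButte": (1, 25)}
VENUE_COUNTS = {"HeyAustin": 233, "LakeTravis": 257, "CrestedButte": 177}


def pooled_estimate(samples, counts):
    """Apply the overall sample rate to the overall venue count."""
    missing = sum(m for m, _ in samples.values())
    sampled = sum(n for _, n in samples.values())
    return missing / sampled * sum(counts.values())


def stratified_estimate(samples, counts):
    """Apply each site's own sample rate to that site's venue count."""
    return sum(counts[site] * m / n for site, (m, n) in samples.items())


print(round(pooled_estimate(SAMPLES, VENUE_COUNTS)))
print(round(stratified_estimate(SAMPLES, VENUE_COUNTS)))
```

The two methods land within a couple of venues of each other (mid-40s), which is enough precision for sizing the manual-entry workload.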
HeyAustin / 6thStreet venue overlap: confirmed, and larger than expected
Compared all 233 HeyAustin venues against all 107 6thStreet venues by
normalized name. 55 venues (51% of 6thStreet's entire directory) match a
HeyAustin venue by name, and spot-checking confirms these are genuinely
the same real-world places, not coincidental name collisions — e.g. both
sites have "Friends Bar" at matching slugs (friends-bar on both), "Hotel
Vegas," "Flamingo Cantina," "24 Diner," "Vulcan Gas Company," etc. This lines
up with the region-overlap finding (6thStreet's 4 regions are literally a
subset of HeyAustin's) — 6thStreet functions largely as a curated subset of
HeyAustin's downtown venues, not a fully independent directory.
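For reference, a name match of this kind can be sketched as below. The exact normalization used in the audit is not recorded, so the rules here (ASCII folding, lowercasing, "&" to "and", punctuation and whitespace collapsed) are an assumption, and the function names are illustrative.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Fold a venue name to a comparable key."""
    text = unicodedata.normalize("NFKD", name)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return " ".join(text.split())


def match_by_name(names_a, names_b):
    """Map each name in names_b to its counterpart in names_a, if any."""
    index_a = {normalize_name(n): n for n in names_a}
    matches = {}
    for name in names_b:
        key = normalize_name(name)
        if key in index_a:
            matches[name] = index_a[key]
    return matches


# Venue names from the audit plus one hypothetical non-match.
site_a = ["Friends Bar", "Hotel Vegas", "Flamingo Cantina", "24 Diner"]
site_b = ["friends bar ", "Hotel  Vegas!", "Example Rooftop Lounge"]
print(match_by_name(site_a, site_b))
```

Name equality only nominates candidates; as the spot-check above did, a second signal (slug, address, or phone) is still needed before treating two rows as the same place.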
Important nuance: this is not copy-paste duplication. Checked "Friends Bar" content on both sites — the descriptions are independently written (different opening lines, different framing), not the same text posted twice. So today, curators are maintaining two separately-authored listings for the same physical venue, and (relevant to the ingestion pipeline) that venue's event calendar would get scraped and processed twice — once per site — under the current one-venue-belongs-to-one-site schema design.
This changes a schema assumption. The current venues table models one
venue as belonging to exactly one site (site_id FK, unique(site_id, slug)). Given ~55 real-world venues need to appear on two sites at once,
worth deciding between:
- Many-to-many: venues become site-independent entities, with a join table (venue_sites) controlling which site(s) a venue appears on, plus optional per-site override fields (description, category, featured image) since editorial voice does genuinely differ today. One venue, one set of core facts (address/phone/lat-lng), one ingestion target, shared across the sites it's linked to.
- Keep separate, but link: preserve today's independent-listing model (two rows, two independently-edited descriptions) but add a nullable self-referential linked_venue_id so the admin portal can surface "this venue also exists on HeyAustin" and, more importantly, the ingestion pipeline dedupes: scrape once, apply the resulting events to every linked venue row instead of scraping the same calendar twice.
Decided: option 2 (keep separate, but link). Implemented as a venue_groups
table + nullable venue_group_id on venues (see schema/001_initial_schema.sql)
rather than a self-referential column: rows stay separate and independently
edited per site, linked only so the admin portal can surface "also listed on X"
and the ingestion pipeline scrapes each real-world venue once, not once per site.
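The "scrape once per group" behavior can be sketched as follows. Only venue_group_id, site_id, and slug are taken from the schema notes above; the Venue class, the other field names, and both helper functions are hypothetical and not the pipeline's actual code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Venue:
    id: int
    site_id: int
    slug: str
    website: Optional[str] = None
    venue_group_id: Optional[int] = None


def scrape_targets(venues):
    """Group venue rows so each real-world venue is scraped once."""
    targets = defaultdict(list)
    for venue in venues:
        if venue.venue_group_id is not None:
            key = ("group", venue.venue_group_id)
        else:
            key = ("venue", venue.id)  # ungrouped rows stand alone
        targets[key].append(venue)
    return dict(targets)


def fan_out(events, venues):
    """Attach every scraped event to every venue row in the group."""
    return [(venue.id, event) for venue in venues for event in events]


rows = [
    Venue(1, 1, "friends-bar", "https://example.com/friends", 10),
    Venue(2, 4, "friends-bar", "https://example.com/friends", 10),
    Venue(3, 1, "example-venue", "https://example.com/other"),
]
targets = scrape_targets(rows)
print(len(targets))
```

Here three venue rows produce two scrape targets; events found for the grouped target would be fanned out to both of its rows.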
Rentals: much smaller, and likely coincidental overlap — not the same problem
Ran the identical name-match check against rentals: 44 of 543 6thStreet rentals (8.1%) match a HeyAustin rental, versus 51% for venues. Checked matched pairs' content directly (not just names this time) and it's literally identical on both sides: e.g. "Deluxe Studio" and "Brand new hideaway in the heart of the city..." both have trivial, auto-generated-looking content (just the title repeated, empty meta fields) on both sites.
This is a different situation from venues, not the same pattern at smaller scale:
- Venues: real editorial descriptions, independently written per site → genuine duplicate-maintenance problem worth solving with venue_groups
- Rentals: thin/likely auto-imported content, identical on both sides → reads as a shared third-party syndication feed (consistent with the easy-property-listings plugin finding) that two independent per-site pulls happen to overlap on, not curators duplicating manual work
Recommendation: rentals don't need a venue_groups-style linking
mechanism — there's no editorial-duplication cost to solve for. The only
relevant concern is migration-time dedup (don't import the same physical
property as two unrelated rows if both sites' data get pulled into one
system), which is a one-time ETL concern, not an ongoing admin-portal
feature the way it was for venues.
Events: essentially no overlap, despite the shared venues
6thStreet's events aren't REST-accessible (confirmed earlier), so compared
via sitemap slugs instead — extracted 6,083 event slugs from 6thStreet's
facebook_events sitemap and checked them against all 1,212 of HeyAustin's
live events (via Listar API). Result: 2 matches (0.2%), and both are
Austin City Limits Music Festival weekends — the kind of event that would
appear on nearly any Austin site regardless of any real connection between
the two brands, not evidence of duplication.
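The slug comparison can be sketched with the standard library alone. Standard sitemap files use the sitemaps.org 0.9 XML namespace; the sample URLs and slugs below are made up for illustration, and taking the last path segment as the slug is an assumption about how these sites build event URLs.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def slugs_from_sitemap(xml_text: str) -> set:
    """Collect the last path segment of every <loc> URL in a sitemap."""
    root = ET.fromstring(xml_text)
    slugs = set()
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        path = urlparse((loc.text or "").strip()).path
        slug = path.rstrip("/").rsplit("/", 1)[-1]
        if slug:
            slugs.add(slug)
    return slugs


# Hypothetical sitemap fragment and live-event slugs.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/facebook_events/festival-weekend-one/</loc></url>
  <url><loc>https://example.com/facebook_events/trivia-night</loc></url>
</urlset>
"""
live_slugs = {"festival-weekend-one", "sunday-funday"}

overlap = slugs_from_sitemap(SAMPLE_SITEMAP) & live_slugs
print(sorted(overlap))
```

Exact slug equality is a strict test: the same event imported twice with different titles would not match, so a near-zero result shows little literal duplication without ruling out differently-named copies.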
This is genuinely surprising given the venue finding: 51% of 6thStreet's venues are the same physical places as HeyAustin's, yet essentially none of the events at those shared venues overlap between the two sites. The likely explanation is that each site runs its own Facebook Events import independently and incompletely — same venue, but each site's importer pulled a different (and largely non-overlapping) slice of that venue's Facebook activity over time, rather than either site doing a complete pull.
Does this change the venue_groups ingestion design? No — and it's
worth being explicit about why. The new pipeline scrapes each venue's own
website_url/calendar_url, not Facebook, so the "scrape once per group"
design is based on there being one real calendar per real-world venue,
which is still true regardless of what the legacy Facebook imports happened
to capture. This finding is more a sign that the legacy Facebook
automation was inconsistent/incomplete (further reason not to trust it as
a data source, on top of the earlier Graph API correction) than a signal
about how the new venue-linking design should work.
6thStreet's blog: real content, but abandoned since 2020
Pulled all 14 posts with dates and word counts. The writing itself is genuine, not filler — decent length (142-986 words) and real, topically substantive local-guide pieces: "Guide to Austin's 6th Street," "Best Austin Food Pub Tours," "Best Downtown Austin Arcade Bars." This was a real editorial effort at some point, not auto-generated placeholder content.
But the dates tell the actual story: the most recent post is from January 2020 — over 6 years with no new content. For contrast, checked the other 3 sites' most recent posts: HeyAustin (2026-06-25), LakeTravis (2026-05-30), CrestedButte (2026-05-28) — all actively publishing on a recurring pattern ("Things To Do This Weekend," seasonal guides, live music calendars) within the last few weeks. 6thStreet is the only one of the four with a fully dormant blog.
Consistent with everything else found about 6thStreet: no dedicated mobile app, no Listar backend, a different plugin stack, and now a blog that's been abandoned for 6+ years while its siblings actively publish. This reads as the deprioritized brand of the four — worth deciding explicitly whether 6thStreet's blog gets revived as part of the new platform or is treated as legacy content to migrate as-is without ongoing investment.
Open items for a deeper pass (not yet done)
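None remaining from this audit round: all flagged items have been checked. The two unknowns noted above (the legacy Facebook importer's actual mechanism, and the 7201/7202 term IDs from the hotel_vegas_scraper note) are left unresolved on purpose, since neither changes the new design.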