Persist and expose multi-battle station state #353

Open

Mygod wants to merge 15 commits into UnownHash:main from Mygod:multibattle

Conversation

Mygod (Contributor) commented Mar 31, 2026

Why

Max Battles can temporarily override a station's normal local battle. Golbat previously stored only one flat battle_* payload on station, so a temporary event battle would overwrite the local one and the original state would be lost for planning and downstream filtering.

This PR adds first-class multi-battle support for stations while keeping the existing flat station/webhook fields for compatibility.

Ground rule

Latest observed game data is treated as ground truth.

For a given station:

  • newly observed battles are persisted independently
  • if a new battle's battle_end is later than or equal to that of another known battle on the same station, the older/equal-ending row is treated as obsolete and removed
  • a shorter battle does not remove a longer known battle

This keeps temporary overrides and later event rotations aligned with the most recent game state instead of waiting for old rows to expire naturally.
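The merge rule above can be sketched as follows. This is a minimal illustration only: the struct, field, and function names are made up for the example, not Golbat's actual identifiers.

```go
package main

import "fmt"

// Battle is a stand-in for a persisted station battle row (illustrative only).
type Battle struct {
	Seed      int64
	BattleEnd int64 // unix seconds
}

// mergeBattle applies the ground rule: the newly observed battle is always
// kept, and any known battle on the same station whose battle_end is earlier
// than or equal to the new battle's end is treated as obsolete and dropped.
// A known battle that ends later than the new one is never removed.
func mergeBattle(known []Battle, observed Battle) []Battle {
	out := known[:0:0]
	for _, b := range known {
		if b.Seed == observed.Seed {
			continue // replaced by the fresh observation
		}
		if b.BattleEnd <= observed.BattleEnd {
			continue // obsolete: ends no later than the new battle
		}
		out = append(out, b)
	}
	return append(out, observed)
}

func main() {
	known := []Battle{{Seed: 1, BattleEnd: 100}, {Seed: 2, BattleEnd: 300}}
	merged := mergeBattle(known, Battle{Seed: 3, BattleEnd: 200})
	// Seed 1 is dropped (ends earlier), seed 2 survives (ends later),
	// and the observed seed 3 is added alongside it.
	fmt.Println(len(merged)) // → 2
}
```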

What this PR adds

  • a new station_battle table via migration 54_station_battle.up.sql
  • station battle persistence separate from the parent station row
  • a compatibility projection that still fills station.battle_*
  • a battles array in station API and max_battle webhook payloads
  • multi-battle-aware station filtering instead of single-battle-only filtering
  • hydration of station_battle rows in both preload and lazy station reads
  • expired station_battle cleanup support
  • canonical-battle-based metric dedup

Canonical battle behavior

A station can now know about multiple non-expired battles, but one battle is still chosen as the canonical representative for:

  • flat station.battle_*
  • top-level max_battle webhook fields
  • battle metric dedup

Selection rule:

  • prefer an active battle over an upcoming one
  • if none are active, use the best upcoming battle
  • later battle_end wins among surviving rows

The full battles array remains available for planning use cases.
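The selection rule can be sketched like this, assuming the input slice contains only non-expired battles. The struct and function names are illustrative, not the decoder's real code.

```go
package main

import "fmt"

// battle is illustrative; times are unix seconds. Callers are assumed to
// pass only non-expired battles (End > now).
type battle struct {
	Seed       int64
	Start, End int64
}

// canonical picks the representative battle: an active battle beats an
// upcoming one, and within the same tier the latest End wins.
func canonical(battles []battle, now int64) (battle, bool) {
	var best battle
	found, bestActive := false, false
	for _, b := range battles {
		active := b.Start <= now && now < b.End
		switch {
		case !found, // first candidate always taken
			active && !bestActive, // active beats upcoming
			active == bestActive && b.End > best.End: // later end wins
			best, bestActive, found = b, active, true
		}
	}
	return best, found
}

func main() {
	now := int64(1000)
	battles := []battle{
		{Seed: 1, Start: 1100, End: 1200}, // upcoming, later end
		{Seed: 2, Start: 900, End: 1050},  // active now
	}
	b, _ := canonical(battles, now)
	fmt.Println(b.Seed) // active battle wins despite the earlier end → 2
}
```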

Identity / persistence

bread_battle_seed is used as the persisted battle identity.

Live validation before implementation showed:

  • no observed seed reuse across multiple stations
  • seeds changed when battles rotated

That made seed the best available battle-instance identifier for persistence.

API / webhook impact

Backward compatibility is preserved:

  • existing top-level battle_* station fields remain
  • existing top-level max_battle webhook fields remain

New additive fields:

  • station API: battles
  • max_battle webhook: battles

This allows downstream consumers to keep working unchanged while giving planning-oriented consumers the full set.

Deployment notes

  • migration 54_station_battle.up.sql is included and runs automatically on startup
  • no manual backfill from legacy station.battle_* into station_battle is included
  • population of station_battle depends on normal rescans after deploy
  • optional cleanup support is available via cleanup.station_battles

Practical rollout expectation:

  • after deploy, rescans repopulate the new table quickly
  • during the current non-overlap period, most stations will still have exactly one battle row

Validation

Tested with:

  • go test ./decoder -run 'StationBattle|StationWebhooks|BuildStationResult|FortLookup|SyncStation|Preload'
  • go test ./... -run '^$'

Post-deploy live DB checks on one environment showed:

  • migration applied cleanly at version 54
  • no active legacy-only stations
  • no active projection mismatches between station and station_battle
  • no active station_battle rows missing boss data
  • currently one active battle row per station, which matches the present daily-rotation state

Notes

This PR is intended to preserve battles for planning, especially around Monday / Max Battle event overlaps, while keeping existing API and webhook consumers stable.

Fixes #352.

Mygod marked this pull request as draft on April 1, 2026 at 00:00
jfberry (Collaborator) commented Apr 1, 2026

I'll leave others to comment on whether this is desirable functionality; I don't really care much one way or the other. I do have a few comments on the way it's implemented, as it doesn't follow commonly held patterns within the codebase, or even the guidelines in CLAUDE.md.

An automated review highlights some quite significant variances and performance observations. I haven't bothered with a human review until others chime in about whether the capability is one we want to have or not.

Database Write Impact — Significant Change

Before: Station updates go through the write-behind queue (batched, coalesced, deferred).

After: Every station GMO update now also does a synchronous transactional write outside the write-behind queue:

BEGIN
INSERT INTO station_battle ... ON DUPLICATE KEY UPDATE (upsert)
DELETE FROM station_battle WHERE station_id=? AND seed<>? AND battle_end<=? (prune)
COMMIT

That's 4 DB round-trips per station per GMO, executed synchronously while holding the station lock. This blocks other goroutines waiting
on the same station for the duration of the DB transaction. During daily rotation windows when many stations update simultaneously, this
could become a bottleneck.

The write goes to GeneralDb (shared connection pool), so it competes with all other non-pokemon writes.

Performance Concerns

  1. Synchronous DB in hot path: syncStationBattlesFromProto is called from UpdateStationBatch (the GMO decode loop) while the station lock
    is held. Every other entity in this codebase defers DB writes through the write-behind queue — station battles are the exception.
  2. Repeated getKnownStationBattles calls: A single station save triggers this function ~4-5 times (snapshot, sync, rtree update, webhook,
    API build). Each call does a cache load + clone + sort. Not expensive per-call but multiplied across all stations per GMO.
  3. Side-effecting reads: getKnownStationBattles prunes expired battles from the cache as a side effect of reading. Other code paths in
    this codebase don't mutate cache state on read.
  4. String-based change detection: stationBattleSignatureFromSlice builds a signature string via fmt.Sprintf per battle. This is computed
    multiple times per update cycle. Minor, but a hash or struct comparison would be cheaper.
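As a sketch of point 4, one cheaper alternative to a fmt.Sprintf-built signature string is an FNV-1a hash over the relevant fields in a fixed order. The struct and field names below are assumptions for illustration, not the PR's actual code.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// battleRow is illustrative; the real struct lives in the decoder package.
type battleRow struct {
	Seed, End int64
	BossID    int32
}

// battleListHash folds the change-relevant fields of a (pre-sorted) battle
// list into a single uint64. Comparing two uint64s replaces comparing two
// fmt.Sprintf-built signature strings, with no per-call allocations beyond
// the hasher itself.
func battleListHash(battles []battleRow) uint64 {
	h := fnv.New64a()
	var buf [8]byte
	for _, b := range battles {
		binary.LittleEndian.PutUint64(buf[:], uint64(b.Seed))
		h.Write(buf[:])
		binary.LittleEndian.PutUint64(buf[:], uint64(b.End))
		h.Write(buf[:])
		binary.LittleEndian.PutUint32(buf[:4], uint32(b.BossID))
		h.Write(buf[:4])
	}
	return h.Sum64()
}

func main() {
	a := []battleRow{{Seed: 7, End: 100, BossID: 3}}
	b := []battleRow{{Seed: 7, End: 100, BossID: 3}}
	fmt.Println(battleListHash(a) == battleListHash(b)) // deterministic → true
}
```

The fixed field order matters: hashing the same slice twice must yield the same value, so any sorting has to happen before the hash is computed.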

Pattern Adherence

Follows patterns well:

  • xsync.MapOf for concurrent cache (matches fortLookupCache)
  • Phased preload with correct ordering (battles after stations, parallel with incidents)
  • Cleanup uses the same ticker pattern as incidents/pokemon/tappables
  • Backward-compatible API additions (flat fields remain, new battles array is additive)
  • Deadlock retry with exponential backoff (matches write-behind queue pattern)
  • Test mocking via function variable (upsertStationBattleRecordFunc)

Deviates from patterns:

  1. No write-behind queue — every other persisted entity uses TypedQueue[K, T] for batched, coalesced, rate-limited DB writes. Station
    battles do direct synchronous transactional writes. This is the most significant deviation.
  2. No TTL-based cache — uses plain xsync.MapOf instead of ttlcache.Cache or ShardedCache. Expiry is handled by read-time pruning + hourly
    cleanup. If a station is never re-read, its expired battles stay in memory until the cleanup routine runs.
  3. Ad-hoc control flow flags — forceSave and skipWebhook on Station are new patterns. Other entities use dirty tracking and old/new value
    comparison exclusively. skipWebhook in particular is a hidden side-channel that short-circuits the webhook path.
  4. Transaction-based writes — storeStationBattleRecord is the only entity using explicit BEGIN/COMMIT. Other entities use single INSERT
    ... ON DUPLICATE KEY UPDATE statements.
  5. hydrateStationBattlesForStation is called without station lock in GetStationRecordReadOnly (before GetOrSetFunc). If another goroutine
    is processing a GMO for the same station concurrently, both could race on stationBattleCache.Store. xsync.MapOf won't crash, but one
    write could overwrite the other.

jfberry (Collaborator) commented Apr 1, 2026

Looking at the CLAUDE.md instructions, let me check which PR patterns violate the documented rules.

Code Review — Pattern Compliance with Project Architecture

The project's CLAUDE.md documents specific architectural patterns that all entities are expected to follow. This PR
introduces StationBattleData as a new persisted entity but deviates from several of these documented patterns.

1. Write-behind queue for persistence

Each entity type has a TypedQueue[K, T] that batches and coalesces writes
— CLAUDE.md, "Write-Behind Queues" section

Every other persisted entity (pokemon, gym, pokestop, station, spawnpoint) uses the write-behind queue for DB persistence.
station_battle bypasses this entirely with direct synchronous transactional writes (BEGIN / INSERT ... ON DUPLICATE KEY UPDATE /
DELETE / COMMIT) executed in the GMO decode hot path while holding the station lock.

This has practical consequences: 4 DB round-trips per station per GMO, no coalescing of repeated updates, and the station lock is held
for the full duration of the DB transaction.

2. Entity struct pattern

type Pokestop struct {
    mu TrackedMutex[string] `db:"-"`  // Entity-level mutex
    PokestopData                       // Embedded — copied for queue snapshots
    dirty     bool     `db:"-"`        // Needs DB write
    newRecord bool     `db:"-"`        // INSERT vs UPDATE
    oldValues PokestopOldValues `db:"-"` // Snapshot for webhook comparison
}

— CLAUDE.md, "Struct Pattern" section

StationBattleData is a plain struct with none of these: no entity-level mutex, no embedded data struct for queue snapshots, no dirty
flag, no newRecord flag, no oldValues snapshot. This is the foundational pattern for all entities in the codebase.

3. Record access pattern

Four access patterns, from lightest to heaviest:

  1. Peek*Record — Cache-only lookup
  2. get*RecordReadOnly — Cache lookup with DB fallback
  3. get*RecordForUpdate — ReadOnly + snapshot
  4. getOrCreate*Record — Atomic create-if-absent

The caller MUST call the returned unlock function.
— CLAUDE.md, "Record Access Patterns" section

Station battles have no record access functions. The stationBattleCache is read and written directly (via
xsync.MapOf.Load/Store/Delete) without acquiring an entity lock. Notably, hydrateStationBattlesForStation in
GetStationRecordReadOnly mutates stationBattleCache before the station lock is acquired, which could race with
syncStationBattlesFromProto on another goroutine.

4. Setter methods with dirty tracking

Setter methods (SetName, SetLat, etc.) track dirty state and optionally log field changes when dbDebugEnabled is true.
— CLAUDE.md, "Struct Pattern" section

StationBattleData fields are assigned directly with no setters and no dirty tracking. To work around this, the PR adds a forceSave
flag on Station that bypasses the dirty check in saveStationRecord. A skipWebhook flag is also added as a side-channel to suppress
webhooks on DB failure. Neither pattern exists on any other entity.

Mygod marked this pull request as ready for review on April 6, 2026 at 05:42
Mygod (Contributor, Author) commented Apr 6, 2026

ReactMap PR is done. From my testing, everything seems ok. Feel free to complain now. :)

I'm also planning to do another review pass in two days.

jfberry (Collaborator) commented Apr 8, 2026

I haven't done a manual review of it yet, but here is the Claude first-pass review, which picks up some issues (mainly around the database write not being batched like all the other writers; without looking, it's likely to be a misunderstanding of the way sqlx can construct a statement from an array).

PR #353 Review — Persist multi-battle station state

The PR has clearly been revised since my earlier comments — forceSave/skipWebhook, the synchronous-in-hot-path writes, and the unprotected xsync.Store race in GetStationRecordReadOnly are all gone. It's much closer to project conventions now. A stationBattleQueue TypedQueue[string, StationBattleWrite] exists; hydration runs under the station lock; the rtree update is fed from the same snapshot used by save/webhook.

Record access semantics

  • Hydration in both branches of GetStationRecordReadOnly now happens after station.Lock(caller), so the previous race on
    stationBattleCache.Store is gone.
  • getOrCreateStationRecord only hydrates when loading a fresh record from DB. A station that was created via proto-only path (newRecord=false
    after the initial save) never has its battles hydrated until somebody calls GetStationRecordReadOnly, but hasHydratedStationBattles is set the
    first time a proto carries battles via syncStationBattlesFromProto, so this is fine in practice.
  • Cache-hit hydration error in GetStationRecordReadOnly is only Debug-logged and still returns the station (station_state.go ~L83). The DB-load
    branch returns the error. Inconsistent — and a transient SELECT failure leaves the station permanently un-hydrated for that session because
    nothing flips hasHydratedStationBattles. Recommend marking hydrated even on empty/failure or returning the error. Minor.

Deadlocks / lock ordering

  • No new cross-entity lock acquisitions. All battle state is keyed by stationId and accessed under the station mutex (or via lock-free
    xsync.MapOf for the read paths). Consistent with CLAUDE.md "Lock Ordering" rules.
  • hydrateStationBattlesForStation does a synchronous DB SELECT while holding the station lock. That's the same pattern loadStationFromDatabase
    uses, so it's not a regression, but it does mean a slow station_battle query stalls every other goroutine waiting on that station. Fine in
    steady state because hydration only runs once per station per process lifetime.

Write-behind correctness

  • stationBattleQueue is a real TypedQueue keyed by StationId, so repeated updates within a flush window coalesce — good.
  • However, flushStationBattleBatch just loops the snapshots and calls storeStationBattleSnapshot one-at-a-time. Each call does BEGIN →
    bulk-INSERT … ON DUPLICATE KEY UPDATE → DELETE … WHERE seed NOT IN (…) → COMMIT. So a flush of 50 stations = 50 transactions × 4 round-trips =
    ~200 round-trips against GeneralDb. The BatchSize knob doesn't actually batch across stations the way other entities' flushes do. Not broken,
    but the queue is providing coalescing only, not true batching.
  • The DELETE … NOT IN (…) strategy is correct for pruning stale seeds without losing newly-inserted rows in the same tx, and the per-snapshot
    tx scope means a flush failure for one station doesn't poison others. Good.
  • Coalescing risk: when a snapshot is enqueued, the Battles slice is cloned (stationBattleWriteFromSlice → cloneStationBattles), so the queue
    holds a stable snapshot even if the in-memory cache mutates afterwards. Good.

Database write volume

  • For active stations the write trigger is battleListChanged (signature mismatch). The signature is computed from the non-expired battle list,
    so the signature changes whenever:
    a. a new battle arrives (expected),
    b. a known battle expires across now boundary (NOT a real change in DB-relevant data, but still triggers an enqueue).
  • Effect: every station with an active battle will write at least once when that battle ends, even if no proto for it arrives. Bounded — one
    extra write per battle expiry. Acceptable.
  • The 15-min debounce is still bypassed by battleListChanged, so high-churn stations can write more often than before. Should be fine given the
    queue coalesces, but worth knowing.

Memory

  • Per-station: a []StationBattleData (typically 1–3 entries, ~200 B each) in stationBattleCache + an empty struct entry in
    stationBattleHydratedCache. For ~100k stations, well under 100 MB worst-case. No leak — clearStationBattleCaches is wired into the station
    cache eviction callback. Good.
  • The eviction callback is now installed unconditionally (not gated on FortInMemory), which is necessary for battle-cache cleanup and
    incidentally fixes a small consistency issue.

CLAUDE.md / pattern adherence

  • ✅ Write-behind queue used.
  • ✅ Cache eviction callback drops auxiliary state.
  • ✅ Hydration done under entity lock.
  • ✅ Webhook payload built from a single snapshot, not by re-reading mutable state.
  • ⚠️ StationBattleData still doesn't follow the standard "embedded Data + dirty + oldValues" entity pattern — but it's a child collection of an
    existing entity, not a standalone entity, so the deviation is justified. The BattleListSignature in StationOldValues is the dirty-tracking
    equivalent.
  • ⚠️ stationBattleSignatureFromSlice is awkward — fields are written in a strange order with Valid flags interleaved between value writes
    (station_battle.go ~L865). It works, but a bytes.Buffer/fnv hash or struct equality would be cheaper and less error-prone. Nit.
  • ⚠️ Webhook regression: createStationWebhooksWithSnapshot returns early if currentSignature == "". That suppresses a webhook in the "last
    battle just ended" transition. Previously a station whose BattleEnd rolled past now would still get a MaxBattle webhook on the EDIT. May or may
    not be intentional — worth confirming with the author.

Bugs / smells

  1. Cache-hit hydration error swallowed (above) — minor.
  2. fortRtreeUpdateStationOnGet no longer called inside GetOrSetFunc — moved outside. If two goroutines race, only the winner's loadedFromDb is
    true; the loser sees loadedFromDb=false and won't update the rtree. The winner does update it. Correct, though subtle.
  3. StartStationBattleExpiry cleanup interval is 1h13m (stats.go). The other expiry tickers use similarly arbitrary primes — consistent with
    existing convention.
  4. Signature-driven save bypasses 15-min debounce — see above. Worth a one-line comment in the code.
  5. stationBattleSignatureFromSlice ordering — see above. Nit, not a bug.

Recommendation

  1. Confirm that the "battle ended → no battles left" webhook suppression is intentional (or restore it).
  2. Either log the cache-hit hydration error at Warn/Error or treat it like the DB-load path.

Mygod (Contributor, Author) commented Apr 8, 2026

I identified some overengineering in this branch. Once I fix that I will take a closer look at these.

Mygod (Contributor, Author) commented Apr 8, 2026

Codex:

Addressed the batching concern on the current head.

station_battle now flushes like the other typed write-behind queues instead of only coalescing by station_id: flushStationBattleBatch does one bulk upsert across the flushed rows and one bulk prune across the flushed station ids in a single transaction. I intentionally kept the failure semantics aligned with the existing TypedQueue behavior used elsewhere in Golbat, so batch-level failure isolation is the same as the other batched writers rather than preserving a station_battle-specific special case.

The earlier notes about the old hydrated-cache/canonical helper paths are stale after the recent cleanup and no longer apply to the current code.

I did not take the separate seed-collision/schema concern in this branch.

I believe the battle seed is unique (it appears to be a random int64), so a defensive primary key scoped by station id is most likely unneeded.

jfberry (Collaborator) commented Apr 8, 2026

Claude is generally a fan (with some minor comments). I'll probably have some time to look myself in a few days.

PR #353 — Re-review of the latest revision (f8e40cd)

This revision is a substantial cleanup. Most of my prior concerns are resolved.

What got better

  1. Real batched flush. flushStationBattleBatch is now a single transaction per flush: one bulk INSERT … ON DUPLICATE KEY UPDATE across every
    station in the batch, one DELETE … LEFT JOIN keep_rows to prune obsolete seeds, one commit. The previous "loop and call
    storeStationBattleSnapshot per station" pattern is gone, and with it the concern that the queue was providing only coalescing and no batching.
    DB round-trip volume now scales with BatchSize like the other entities.
  2. Single cache instead of two. stationBattleCache is now xsync.MapOf[string, stationBattleState] carrying {Battles, Loaded}. The parallel
    stationBattleHydratedCache is gone, eliminating the small window where the two maps could disagree. storeStationBattles is the one entry point.
  3. No more "canonical vs projection" duality. stationBattleFromStationProjection, canonicalStationBattleFromSlice,
    stationBattleMatchesProjection, canonicalBattleSeed, restoreStationBattleProjectionFromOldValues — all deleted. The "top" battle is now just
    snapshot.Battles[0] after sortStationBattlesByEnd. StationOldValues shrinks to {EndTime, HasTopBattle, TopBattleSeed, BattleListSignature}.
    Much easier to reason about.
  4. Single source of truth for battle fields on Station. updateFromStationProto no longer pokes battle fields directly — they flow exclusively
    through syncStationBattlesFromProto → applyStationBattleProjection. The previous risk of the proto path and the cache path disagreeing is gone.
  5. Tests consolidated into a table-driven TestUpsertCachedStationBattleOrdering covering the merge edge cases (drops earlier end, replaces
    equal end, drops conflicting future battle, etc.).

Remaining items

  1. Cache-hit hydration error is still Debug-logged, not returned (station_state.go ~L81). I noted this last round. The author appears to have
    intentionally kept it: TestGetStationRecordReadOnlyRetriesHydrationOnCachedStation exists specifically to assert that a transient hydration
    failure is retried on the next read. So the behavior is deliberate — a transient SELECT failure doesn't poison the station, it just retries.
    I'd bump it to Warn so failures are visible in logs, but this is a nit not a blocker.
  2. Webhook suppression on "all battles just expired." createStationWebhooksWithSnapshot still returns early if currentSignature == "". Net
    effect: when the last active battle expires, no MaxBattle webhook fires for that transition. Likely intentional (the next station-update GMO
    will produce one once is_battle_available toggles), but I'd ask the author to confirm.
  3. Signature-driven save still bypasses the 15-min debounce. Same as last round, but now harmless because the queue coalesces writes, so a
    station whose battles tick across an expiry boundary at most produces one extra coalesced enqueue, not extra DB traffic.
  4. mergeStationBattles is subtle. The "observed is current but not yet active, keep it iff no active battle covers it" branch reads correctly
    against the test cases but is the trickiest function in the file. It's well-tested. Just flagging that future maintenance should be wary of
    edits here.
  5. getKnownStationBattles still does a write-on-read when all battles are expired (collapses to {Loaded:true, Battles:nil}). Bounded and tested
    (TestGetKnownStationBattlesDoesNotMutateCacheOnRead shows it leaves the cache alone when at least one battle is current). Fine.

Lock semantics, deadlocks, memory

  • All hydration is still under the station mutex, no cross-entity locks introduced.
  • Per-station memory cost unchanged (small slice of battles, typically 1–3 entries).
  • Cache eviction callback still wired correctly via clearStationBattleState.

Verdict

A small ask that the author confirm the "all battles expired" webhook suppression is intentional and consider bumping
the cache-hit hydration error to Warn. The architectural deviations from CLAUDE.md that I flagged previously are gone or are now defensible:
the entity uses the write-behind queue, batching is real, hydration is lock-protected, and the projection/canonical machinery has collapsed
into "first element of a sorted slice." The code is meaningfully simpler than the prior revision while doing more.

Mygod (Contributor, Author) commented Apr 8, 2026

I reworked the station_battle path again and the current head should address the earlier architectural concerns:

  • station_battle now flushes as a real write-behind batch again, not just coalescing by station_id
  • battle cache state is now a single {Battles, Loaded} map instead of separate cache + hydrated maps
  • the old canonical/projection split is gone; the top battle is just snapshot.Battles[0]
  • station battle projection is now owned only by syncStationBattlesFromProto

On the two remaining behavioral points:

  • The currentSignature == "" early return in createStationWebhooksWithSnapshot is intentional for now. That means we do not emit a max_battle webhook on the “last known battle expired” transition.
  • Cache-hit hydration failure in GetStationRecordReadOnly is also intentional for now: on a transient station_battle read failure we still serve the cached station and retry hydration on the next read rather than poisoning the station state. I can bump that log from Debug to Warn later if we want more visibility, but I left the behavior unchanged on purpose.
