Skip to content

fix: address review findings for leader election recovery#94

Draft
ssteele110 wants to merge 1 commit intossteele/leader-election-recoveryfrom
ssteele/leader-election-recovery-fixes
Draft

fix: address review findings for leader election recovery#94
ssteele110 wants to merge 1 commit intossteele/leader-election-recoveryfrom
ssteele/leader-election-recovery-fixes

Conversation

@ssteele110
Copy link
Contributor

Summary

Addresses findings from multi-perspective code review of #92.

Critical fixes:

  • current_generation made public — fixes BuildRecord respond_to? always returning false (requeue data was silently lost)
  • Fenced push() via Lua script — prevents stale master from overwriting active generation after lock expiry
  • Lock refresh during population — prevents TTL expiry on large test suites
  • wait_for_master detects immediate nil status (workers no longer spin 120s when master died before they started)

Code quality:

  • Removed duplicate generation_key/learn_generation/generation_stale? from Worker (inherits from Base)
  • Base generation_key raises instead of silent fallback to non-generation keys
  • Supervisor rescues MasterDied alongside LostMaster
  • generation_stale? check moved to idle path (reduces Redis GETs per poll iteration)
  • Default master_lock_ttl increased from 30s to 120s

Tests (8 new):

  • Election retry on master death
  • Fenced push rejects stale master
  • Generation staleness exits poll loop
  • learn_generation raises when key missing
  • max_election_attempts exhausted raises LostMaster
  • BuildRecord reads generation-scoped requeue key
  • Supervisor handles MasterDied
  • wait_for_master detects immediate nil status

Test plan

  • All 8 new tests pass
  • Full test suite passes
  • Manual verification of fenced push under simulated lock expiry

- Make current_generation public so BuildRecord can access it
- Remove duplicate generation_key/learn_generation from Worker (inherit from Base)
- Base generation_key now raises instead of silent fallback
- Fence push() with Lua script to prevent stale master overwrite
- Add lock refresh during population to prevent TTL expiry
- Supervisor rescues MasterDied alongside LostMaster
- wait_for_master detects immediate nil status (no prior 'setup' needed)
- Move generation_stale? check to idle path to reduce Redis GETs
- Increase default master_lock_ttl from 30s to 120s
- Add 8 tests covering election retry, fenced push, generation
  staleness, and error handling paths
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant