Skip to content

chore: remove global epoxy contention#4554

Merged
NathanFlurry merged 1 commit intomainfrom
04-04-chore_remove_global_epoxy_contention
Apr 5, 2026
Merged

chore: remove global epoxy contention#4554
NathanFlurry merged 1 commit intomainfrom
04-04-chore_remove_global_epoxy_contention

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app bot commented Apr 4, 2026

🚅 Deployed to the rivet-pr-4554 environment in rivet-frontend

Service Status Web Updated (UTC)
kitchen-sink ❌ Build Failed (View Logs) Web Apr 5, 2026 at 11:47 am
mcp-hub ✅ Success (View Logs) Web Apr 5, 2026 at 11:11 am
website ❌ Build Failed (View Logs) Web Apr 5, 2026 at 11:11 am
frontend-inspector ❌ Build Failed (View Logs) Web Apr 4, 2026 at 7:19 pm
frontend-cloud ❌ Build Failed (View Logs) Web Apr 4, 2026 at 7:19 pm
ladle ❌ Build Failed (View Logs) Web Apr 4, 2026 at 7:19 pm

@NathanFlurry NathanFlurry mentioned this pull request Apr 4, 2026
11 tasks
@claude
Copy link
Copy Markdown

claude bot commented Apr 4, 2026

PR Review: chore: remove global epoxy contention

This is a significant rewrite of the Epoxy distributed consensus engine — replacing EPaxos (global log with cross-key interference tracking) with a simplified per-key single-decree Paxos variant. The core motivation is sound: EPaxos's global log caused all proposals to serialize, defeating concurrent key operations.


Architecture Assessment

The per-key Paxos design is a strong fit for the primary use case (immutable actor key reservation), and the simplification is well-executed. The fast/slow path split, per-key ballot namespace, and changelog-based learner catch-up are all clean improvements over the prior model.


Issues Found

Performance — send_prepare_round eager collection (medium)

In the prepare round, futures are collected via .collect::<FuturesUnordered<_>>().await before any processing, which blocks until all replicas respond. The send_accept_round function correctly uses StreamExt::next to allow early exit once a quorum is reached. The prepare round should do the same — processing responses as they arrive and short-circuiting once quorum (or failure) is determined. As-is, prepare latency is always bounded by the slowest replica.

Bug — legacy dual-read overhead has no termination signal (low)

ballot_selection always performs 4 parallel reads per proposal (v2 value, legacy committed, legacy v2, ballot). The legacy reads add overhead for as long as legacy data exists. The epoxy_backfill_v2 workflow migrates old entries, but there is no mechanism (config flag, completion signal, or similar) to disable dual-reads once backfill is complete. Every proposal pays this cost indefinitely.

Metrics fragility in tests (low)

proposal.rs uses global Prometheus counters and a TEST_LOCK mutex to serialize test execution. If any other test in the binary also increments these counters (e.g., via a concurrent integration test), the assertions on counter deltas will fail spuriously. Consider using snapshot-based counter diffing or resettable metrics in the test harness.

Documentation gap — ConsensusFailed retry semantics

PREPARE_RETRY_MAX_ATTEMPTS = 10 with exponential backoff bounds retries at ~10 seconds worst case. What the caller should do when ProposalResult::ConsensusFailed is returned is not documented at the call site or in the module-level doc. For a distributed system under high contention this is an observable failure mode that operators need to understand.

Documentation gap — notify_all_replicas infinite retry

#[max_retries = usize::MAX] on notify_all_replicas means a transient network issue can stall the coordinator indefinitely. A comment explaining the intended failure semantics (and any circuit-breaking condition) would help.


Test Coverage Gaps

  • No test for the changelog::apply_entry conflict error case (conflicting immutable value during learner catch-up). This is a non-recoverable error in the current implementation — it deserves a test and operator recovery documentation.
  • No test for PREPARE_RETRY_MAX_ATTEMPTS exhaustion behavior.
  • The reconfigure.rs changes (broadcast config before catch-up, new SendBeginLearningToAll) are not independently tested.

Minor / Style

  • legacy_v2_value_key and value_key are both KvValueKey::new(key.clone()) — they differ only in which subspace they are packed into. Consider naming the packed bindings more explicitly (e.g., packed_v2_value_key vs packed_legacy_v2_value_key) to reduce confusion at the pack call sites.
  • The asymmetry in legacy fallback handling — legacy-v1 entries get version: 0, mutable: false while legacy-v2 entries retain their stored fields — is correct but worth a comment given how easy it is to conflate the two paths.

Summary

The architectural direction is right and the implementation is generally clean. The main concerns are:

  1. send_prepare_round should allow early quorum exit (same pattern as send_accept_round)
  2. Dual-read overhead needs a termination path once backfill completes
  3. Document ConsensusFailed semantics and the infinite-retry coordinator for operator awareness

The new test suite (backfill, migration, snapshot, proposal) substantially improves coverage for the happy paths. The gaps are in failure/exhaustion cases which are worth adding before this lands in production.

@NathanFlurry NathanFlurry force-pushed the 04-04-chore_remove_global_epoxy_contention branch from e9b0049 to 13343f0 Compare April 4, 2026 20:16
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4554 April 4, 2026 20:16 Destroyed
@NathanFlurry NathanFlurry force-pushed the 04-04-chore_remove_global_epoxy_contention branch from 13343f0 to 5d7575d Compare April 4, 2026 20:21
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4554 April 4, 2026 20:21 Destroyed
@NathanFlurry NathanFlurry force-pushed the 04-04-chore_remove_global_epoxy_contention branch from 5d7575d to f27d74e Compare April 4, 2026 23:20
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4554 April 4, 2026 23:20 Destroyed
@NathanFlurry NathanFlurry force-pushed the 04-04-chore_remove_global_epoxy_contention branch from f27d74e to 97679bc Compare April 5, 2026 00:54
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4554 April 5, 2026 00:54 Destroyed
@NathanFlurry NathanFlurry force-pushed the 04-04-chore_remove_global_epoxy_contention branch from 97679bc to 0fd6f54 Compare April 5, 2026 05:24
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4554 April 5, 2026 05:24 Destroyed
@NathanFlurry NathanFlurry force-pushed the 04-04-chore_remove_global_epoxy_contention branch from 0fd6f54 to 473caff Compare April 5, 2026 07:14
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4554 April 5, 2026 07:14 Destroyed
@NathanFlurry NathanFlurry force-pushed the 04-04-chore_remove_global_epoxy_contention branch from 473caff to 72869d5 Compare April 5, 2026 07:54
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4554 April 5, 2026 07:54 Destroyed
@NathanFlurry NathanFlurry force-pushed the 04-04-chore_remove_global_epoxy_contention branch from 72869d5 to 7b57d36 Compare April 5, 2026 08:56
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4554 April 5, 2026 08:56 Destroyed
@NathanFlurry NathanFlurry force-pushed the 04-04-chore_remove_global_epoxy_contention branch from 7b57d36 to 47a80f6 Compare April 5, 2026 09:40
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4554 April 5, 2026 09:40 Destroyed
@NathanFlurry NathanFlurry changed the base branch from native-sqlite-kv-channel to graphite-base/4554 April 5, 2026 10:19
@NathanFlurry NathanFlurry marked this pull request as ready for review April 5, 2026 10:59
@NathanFlurry NathanFlurry force-pushed the 04-04-chore_remove_global_epoxy_contention branch from 47a80f6 to 39e0d69 Compare April 5, 2026 11:10
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4554 April 5, 2026 11:10 Destroyed
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4554 to native-sqlite-kv-channel April 5, 2026 11:10
Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 5, 2026

Merge activity

  • Apr 5, 11:11 AM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Apr 5, 11:47 AM UTC: Graphite rebased this pull request as part of a merge.
  • Apr 5, 11:47 AM UTC: @NathanFlurry merged this pull request with Graphite.

@NathanFlurry NathanFlurry changed the base branch from native-sqlite-kv-channel to graphite-base/4554 April 5, 2026 11:44
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4554 to main April 5, 2026 11:46
@NathanFlurry NathanFlurry force-pushed the 04-04-chore_remove_global_epoxy_contention branch from 39e0d69 to c28aab4 Compare April 5, 2026 11:46
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4554 April 5, 2026 11:46 Destroyed
@NathanFlurry NathanFlurry merged commit 7a19ce1 into main Apr 5, 2026
9 of 18 checks passed
@NathanFlurry NathanFlurry deleted the 04-04-chore_remove_global_epoxy_contention branch April 5, 2026 11:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants