
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Oct 27, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project              Deployment  Preview  Comments  Updated (UTC)
rivetkit-serverless  Error       Error              Oct 31, 2025 1:15am

3 Skipped Deployments

Project          Deployment  Preview  Comments  Updated (UTC)
rivet-cloud      Ignored     Ignored            Oct 31, 2025 1:15am
rivet-inspector  Ignored     Ignored  Preview   Oct 31, 2025 1:15am
rivet-site       Ignored     Ignored  Preview   Oct 31, 2025 1:15am

Contributor Author

MasterPtato commented Oct 27, 2025


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@pkg-pr-new

pkg-pr-new bot commented Oct 28, 2025

More templates

@rivetkit/actor

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/actor@3275

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3275

@rivetkit/core

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/core@3275

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3275

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3275

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3275

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3275

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3275

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3275

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3275

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3275

commit: 3d340c9

@claude

claude bot commented Oct 28, 2025

PR Review: WebSocket Message Acknowledgment Feature

Summary

Overall Assessment: Approve with Minor Suggestions

This PR adds WebSocket message acknowledgment and hibernation support to the runner protocol. The feature is well implemented, with solid error handling that follows codebase conventions.


Positive Aspects

  • Strong protocol versioning with v2.bare schema and wrap_latest() for backward compatibility
  • Robust Rust state management with hibernation, wrapping arithmetic for u16 indices, and GC
  • Good error handling with WebSocketServiceRetry and WebSocketServiceTimeout types
  • Memory safety with 1 MiB size limits preventing unbounded growth
  • Comprehensive TypeScript state tracking with proper lifecycle management
  • Batched acknowledgments reduce message overhead
  • Event buffering handles race conditions

Critical Issues

1. GC Logic Inverted (shared_state.rs:420-421, 428-429)

The retention logic is backwards - keeps timed-out connections and removes valid ones.

Current: keep = now.duration_since(send_instant) > TIMEOUT
Should be: keep = now.duration_since(send_instant) <= TIMEOUT

Impact: Critical - causes premature cleanup of valid connections.
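
A minimal sketch of the corrected predicate, using the names quoted in this review (illustrative, not the actual patch):

use std::time::{Duration, Instant};

const MESSAGE_ACK_TIMEOUT: Duration = Duration::from_secs(30);

struct PendingMessage {
    send_instant: Instant,
}

// Keep a request only while its oldest unacked message is within the window.
fn should_keep(pending_msgs: &[PendingMessage], now: Instant) -> bool {
    match pending_msgs.first() {
        Some(earliest) => now.duration_since(earliest.send_instant) <= MESSAGE_ACK_TIMEOUT,
        None => true, // nothing pending, nothing to time out
    }
}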

2. Race Condition in Entry::Occupied (shared_state.rs:103-108)

Clearing pending_msgs while they may be in flight could cause message loss when old msg_tx channel is dropped.

Suggestion: Drain pending messages before clearing or document expected behavior.

3. Missing Index Validation (tunnel.ts:713)

_handleMessage passes msg.index without validation before reaching __ackWebsocketMessage bounds check.


Medium Priority Issues

  • Integer conversions use try_into()? but the failure path is unreachable due to the line 173 check; use expect() to document the invariant (see the sketch after this list)
  • Debug logging with placeholders like "------------ ACK" should be removed (lines 374, 120-124, 355-360)
  • Hardcoded canHibernate: true needs configuration option (tunnel.ts:641)
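
A fragment sketch of the expect() suggestion (names follow the snippets quoted in this review):

// The length was already checked against u16::MAX at line 173, so a
// conversion failure here is unreachable; expect() documents the invariant
// instead of silently propagating an impossible error.
let len: u16 = hs
    .pending_ws_msgs
    .len()
    .try_into()
    .expect("pending_ws_msgs length checked against u16::MAX above");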

Low Priority

  • Git dependencies for vbare (0.0.2 -> 0.0.3) and rivet-term repo change - intentional?
  • Index 0 skipping documented as TODO but not resolved (line 137)
  • Commented-out wrapping_lt function should be removed (lines 456-458)

Security & Performance

Security: Good practices - size limits prevent exhaustion, timeouts prevent leaks, no unvalidated input. Minor message replay concern.

Performance: Batched acks and efficient storage. Consider VecDeque for O(1) retention operations.
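
A minimal sketch of the VecDeque idea, assuming a simplified queue that stores raw payloads (field and type names are illustrative):

use std::collections::VecDeque;

// Acked messages are always the oldest, so acking pops from the front in
// O(1) per message instead of scanning the whole Vec with retain.
struct PendingQueue {
    pending_ws_msgs: VecDeque<Vec<u8>>, // payloads, oldest first
    total_size: u64,
}

impl PendingQueue {
    fn ack_oldest(&mut self, count: usize) {
        for _ in 0..count {
            if let Some(payload) = self.pending_ws_msgs.pop_front() {
                self.total_size -= payload.len() as u64;
            }
        }
    }
}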


Test Coverage Gaps

Missing tests for:

  • Wrapping arithmetic edge cases (u16 wraparound)
  • Message redelivery scenarios
  • Hibernation state persistence across reconnections
  • GC timeout boundaries

Recommendations

Must Fix Before Merge:

  1. Fix GC logic bug (lines 420-421, 428-429 in shared_state.rs)
  2. Address Entry::Occupied race condition
  3. Add index validation in TypeScript _handleMessage

Should Fix:

  1. Remove debug logging statements
  2. Document or fix index 0 skipping behavior

Nice to Have:

  • Configuration for canHibernate
  • VecDeque optimization for message queues
  • Integration tests for hibernation flow

Conclusion

Solid implementation of a complex distributed systems feature. The acknowledgment protocol is well-designed and the hibernation mechanism is creative. Critical bugs found are fixable with minimal changes. Will significantly improve WebSocket reliability once issues are addressed.

Recommendation: Approve with required fixes for GC logic and race condition.


Generated with Claude Code

@MasterPtato force-pushed the 10-27-feat_add_ws_msg_ack_to_runner_protocol branch from 3d340c9 to 7edb49a on October 28, 2025 02:00
@claude

claude bot commented Oct 28, 2025

Pull Request Review: WebSocket Message Acknowledgment Protocol

Overview

This PR adds message acknowledgment support to the runner protocol for WebSocket connections, enabling reliable message delivery with hibernation/reconnection support. This is a substantial enhancement to the WebSocket tunneling infrastructure.

Code Quality & Architecture

✅ Strengths

  1. Protocol Design: The message acknowledgment system using wrapping u16 indices is well-designed, handling index wraparound gracefully via the wrapping_gt() comparison helper (see the sketch after this list).

  2. Error Handling: Good separation of retry vs. terminal errors with new WebSocketServiceRetry and WebSocketServiceTimeout errors. The distinction between retryable failures and actual errors improves resilience.

  3. Message Buffering: Smart hibernation state management with size limits (MAX_PENDING_MSGS_SIZE_PER_REQ) prevents unbounded memory growth.

  4. TypeScript SDK: The WebSocketTunnelAdapter has good event buffering to handle race conditions where events fire before listeners are attached.
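
For context, a self-contained sketch of how such a wrapping comparison behaves (the actual helper from shared_state.rs is quoted later in this thread):

// u16 indices compare "greater" within a half-range window, so ordering is
// preserved across the 65535 -> 0 wraparound.
fn wrapping_gt(a: u16, b: u16) -> bool {
    a != b && a.wrapping_sub(b) < u16::MAX / 2
}

fn main() {
    assert!(wrapping_gt(5, 3)); // ordinary case
    assert!(wrapping_gt(2, u16::MAX - 1)); // 2 comes "after" 65534 across the wrap
    assert!(!wrapping_gt(3, 5));
}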

🔍 Potential Issues

Critical

  1. Integer Overflow Risk (shared_state.rs:170-173)
hs.total_pending_ws_msgs_size += message_serialized.len() as u64;

if hs.total_pending_ws_msgs_size > MAX_PENDING_MSGS_SIZE_PER_REQ
	|| hs.pending_ws_msgs.len() >= u16::MAX as usize

The size tracking never decreases when messages are acknowledged. This will cause false positives and connection drops after sending 1MB total data, even if most was already acked.

Fix: Decrement total_pending_ws_msgs_size in ack_pending_websocket_messages():

pub async fn ack_pending_websocket_messages(...) -> Result<()> {
    // ... existing code ...
    // Accumulate into a local to avoid a second mutable borrow of `hs`
    // inside the retain closure.
    let mut removed_size = 0u64;
    hs.pending_ws_msgs.retain(|msg| {
        // msg_index is derived per-iteration as in the existing retain loop
        let keep = wrapping_gt(msg_index, ack_index);
        if !keep {
            removed_size += msg.payload.len() as u64;
        }
        keep
    });
    hs.total_pending_ws_msgs_size -= removed_size;
}
  2. Unused Deserialization (shared_state.rs:336-338)
let de = protocol::versioned::ToClient::deserialize_with_embedded_version(
    &pending_msg.payload,
);

This deserialization result de is never used. Either remove it or add error handling.

  3. Logic Inversion Bug (shared_state.rs:424-426)
keep = now.duration_since(earliest_pending_msg.send_instant)
    > MESSAGE_ACK_TIMEOUT;

This logic is inverted - it will keep requests that have timed out and drop ones that haven't. Should be < not >.

High Priority

  1. TODO Comment (shared_state.rs:137)
// TODO: This ends up skipping 0 as an index when initiated but whatever

Skipping index 0 could cause off-by-one issues in client tracking. This should be properly fixed or documented why it's acceptable.

  2. Missing Size Update (shared_state.rs:352-363)
    When retaining messages based on ack, total_pending_ws_msgs_size is not updated. Same issue as Critical #1 above, but in a different code path.

  3. Race Condition (lib.rs:103-108)

Entry::Occupied(mut entry) => {
    entry.receiver_subject = receiver_subject;
    entry.msg_tx = msg_tx;
    entry.opened = false;
    entry.pending_msgs.clear();
}

When reusing an existing request entry, hibernation_state is not cleared. This could cause stale hibernation data to persist across reconnections.

Medium Priority

  1. Comment Accuracy (lib.rs:318)
// NOTE: This gets set in shared_state.ts

The file is actually shared_state.rs, not .ts. Minor typo but could confuse developers.

  2. Unused Function (shared_state.rs:461-463)
// fn wrapping_lt(a: u16, b: u16) -> bool {
//     b.wrapping_sub(a) < u16::MAX / 2
// }

Either remove commented code or add a comment explaining why it's kept for future use.

  3. Error Construction (lib.rs:175)
return Err(WebsocketPendingLimitReached {}.build());

The struct definition was changed to have no fields, but this could be more informative by including actual limit values for debugging.

  4. Typo in Comment (proxy_service.rs:1834-1835)
// Denotes that the connection did not fail, but needs to be retried to
// resole a new target

"resole" should be "resolve".

🔒 Security Concerns

  1. Memory Exhaustion: While there's a 1MB limit per request, with many concurrent WebSocket connections, total memory could still grow large. Consider adding global limits or connection count limits.

  2. Message Replay: The protocol allows resending messages based on last_msg_index, but there's no validation that the client isn't maliciously requesting replays of very old messages.

⚡ Performance Considerations

  1. Cloning Serialized Messages (shared_state.rs:181)
payload: message_serialized.clone(),

Every WebSocket message is cloned for hibernation storage. For high-throughput connections, this adds significant overhead. Consider using Arc<Vec<u8>> for the payload (see the sketch after this list).

  2. Linear Search in GC (shared_state.rs:416-444)
    The garbage collection scans all in-flight requests every 15 seconds. With many connections, this could become expensive. Consider using a priority queue or heap for timeout tracking.

  3. Blocking Async Operations: Multiple .get_async() calls that immediately mutate could cause contention. Consider using entry_async() pattern more consistently.
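
A minimal sketch of the Arc-based payload suggested in item 1 (struct and function names are illustrative assumptions):

use std::sync::Arc;

// The serialized bytes are allocated once and shared between the send path
// and the hibernation buffer.
struct PendingWsMsg {
    payload: Arc<Vec<u8>>,
}

fn store_for_hibernation(message_serialized: &Arc<Vec<u8>>) -> PendingWsMsg {
    // Arc::clone copies a reference count, not the message bytes.
    PendingWsMsg {
        payload: Arc::clone(message_serialized),
    }
}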

🧪 Test Coverage

Missing Tests:

  • No unit tests for wrapping_gt() edge cases (wraparound scenarios)
  • No tests for hibernation state management
  • No tests for message replay logic
  • No tests for size limit enforcement
  • No integration tests for reconnection scenarios

Recommendation: Add tests for:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_wrapping_gt_wraparound() {
        assert!(wrapping_gt(1, u16::MAX));
        // u16::MAX / 2 is the boundary; one below it still compares greater
        assert!(wrapping_gt(u16::MAX / 2 - 1, 0));
        assert!(!wrapping_gt(u16::MAX, 1));
    }

    // Add tests for ack logic, hibernation, etc.
}

📋 Additional Notes

  1. Breaking Changes: The protocol version bump and schema changes are appropriately versioned.

  2. Dependency Update: The vbare dependency is now pulled from git instead of crates.io. Ensure this doesn't break builds for contributors without git access to that repo.

  3. Missing Newline: guard.websocket_pending_limit_reached.json and new error files are missing trailing newlines (minor).

Summary

This is a well-architected feature that significantly improves WebSocket reliability. However, there are several critical bugs that need to be fixed before merging:

Must Fix:

  • The critical issues above: size accounting never decremented on ack, the unused deserialization result, and the inverted GC retention logic

Should Fix:

  • The high-priority items above: the index-0 TODO, the missing size update in the resend path, and hibernation state not cleared on entry reuse

Consider:

  • Add test coverage
  • Performance optimizations for high-throughput scenarios
  • Document the protocol changes and migration guide

Overall: Approve with required changes ✅ (pending fixes for critical issues)

Great work on this complex feature! The acknowledgment protocol design is solid, just needs the implementation bugs ironed out.

@MasterPtato force-pushed the 10-27-feat_add_ws_msg_ack_to_runner_protocol branch from 7edb49a to e662b92 on October 29, 2025 01:53
@MasterPtato force-pushed the 10-24-fix_improve_reconnection_with_sleeping_and_tunnel branch from 049fe1f to dafcccf on October 29, 2025 01:53
@claude

claude bot commented Oct 29, 2025

Code Review: WebSocket Message Acknowledgment System

I've completed a thorough review of PR #3275 which adds WebSocket message acknowledgment to the runner protocol. Here's my comprehensive feedback:

Overview

This PR implements a sophisticated message acknowledgment system for WebSocket communications between the gateway and runners, enabling reliable message delivery with hibernation support. The implementation appears solid overall with good error handling patterns.


🟢 Strengths

1. Protocol Design

  • Clean protocol extension with v2 schema that maintains backward compatibility
  • Well-structured message types: TunnelAck, ToServerWebSocketMessageAck
  • Smart use of message indices (u16) with wrapping arithmetic for hibernation scenarios

2. Hibernation State Management

  • Excellent HibernationState tracking with pending message queues
  • Proper bounds checking on pending messages (1 MiB limit, u16::MAX count)
  • Message resend logic correctly handles wrapping indices with wrapping_gt() helper

3. Error Handling

  • Good use of custom error types following the codebase patterns:
    • WebSocketServiceRetry - signals retry without counting as attempt
    • WebSocketServiceTimeout - explicit timeout handling
    • WebsocketPendingLimitReached - backpressure protection
  • Proper error propagation through the system

4. Timeout Management

  • MESSAGE_ACK_TIMEOUT (30s) and GC_INTERVAL (15s) are reasonable defaults
  • Garbage collection properly cleans up stale in-flight requests
  • Timeout handling in TunnelMessageData::Timeout variant

🟡 Issues & Concerns

High Priority

1. Logic Bug in GC Retention (engine/packages/pegboard-gateway/src/shared_state.rs:420-422)

let mut keep = true;

if let Some(earliest_pending_msg) = req.pending_msgs.first() {
    keep = now.duration_since(earliest_pending_msg.send_instant)
        > MESSAGE_ACK_TIMEOUT;  // ❌ inverted: should be < MESSAGE_ACK_TIMEOUT (or <=)
}

Problem: The condition is inverted. Currently it keeps requests that have exceeded the timeout and removes those that haven't, which is backwards.

Expected: Keep requests where the oldest pending message is within the timeout window.

Fix:

keep = now.duration_since(earliest_pending_msg.send_instant) < MESSAGE_ACK_TIMEOUT;

The same issue exists at line 429 for hibernation state.

2. Potential Index Skip Issue (engine/packages/pegboard-gateway/src/shared_state.rs:137)

// TODO: This ends up skipping 0 as an index when initiated but whatever
msg.index = hs.last_ws_msg_index.wrapping_add(1);

The TODO suggests index 0 is skipped on init. This could cause issues:

  • If client expects indices starting at 0, there will be a mismatch
  • Could complicate hibernation recovery if first message is lost
  • The comment "but whatever" suggests this wasn't fully considered

Recommendation: Either properly initialize to start at 0, or document why skipping 0 is acceptable.

Medium Priority

3. Commented Debug Code (multiple files)

// fn wrapping_lt(a: u16, b: u16) -> bool {
//     b.wrapping_sub(a) < u16::MAX / 2
// }

Dead code should be removed rather than commented out.

4. Ambiguous Warning on Unmatched Ack (engine/packages/pegboard-gateway/src/shared_state.rs:230-235)

The ack handling has this warning:

if prev_len == in_flight.pending_msgs.len() {
    tracing::warn!("pending message does not exist or ack received after message body")
}

This suggests a potential race condition. Consider:

  • Is it valid to receive an ack for a message we don't track?
  • Should this be an error rather than a warning?
  • Document the expected flow to clarify when this is benign vs problematic

5. Missing Index Validation

When acking messages via ack_pending_websocket_messages, there's no validation that ack_index is reasonable. Consider adding bounds checking to detect protocol errors early.

Low Priority

6. Hardcoded Limits (engine/packages/pegboard-gateway/src/shared_state.rs:18)

const MAX_PENDING_MSGS_SIZE_PER_REQ: u64 = util::size::mebibytes(1);

Consider making this configurable per-actor or per-runner for different use cases (e.g., high-throughput actors might need more buffer).

7. Tracing Improvements

Several trace points could benefit from structured fields:

// Current
tracing::debug!("gateway waiting for response from tunnel");

// Better
tracing::debug!(request_id=?Uuid::from_bytes(request_id), "gateway waiting for response from tunnel");

This would help with debugging production issues.


📋 Code Quality

Style & Conventions ✅

  • Follows Rust formatting with hard tabs (per rustfmt.toml)
  • Imports properly organized at file top (not inline)
  • Uses anyhow::* for error handling as per guidelines
  • Proper use of workspace dependencies

Logging Patterns ✅

  • Generally good use of structured logging with tracing
  • Most log messages are lowercase as per conventions
  • Uses ?field notation for debug formatting

Error Patterns ✅

  • Proper #[derive(RivetError)] usage
  • Error groups ("guard", "ws") follow domain organization
  • Includes Serialize, Deserialize for errors with metadata

🔒 Security Considerations

✅ Good

  • Proper bounds checking on message queue sizes prevents memory exhaustion
  • Timeout mechanisms prevent infinite waits
  • Rate limiting preserved from existing proxy implementation

⚠️ Consider

  • DoS potential: A malicious runner could send many messages without acks to fill buffers
    • Mitigation: The 1 MiB limit helps, but consider per-runner global limits
  • Audit trail: Consider logging when hibernation limits are hit for monitoring

🧪 Testing Recommendations

  1. Unit Tests Needed:

    • Wrapping arithmetic in wrapping_gt() with edge cases (0, u16::MAX, wrap-around)
    • Message ack retention logic with various index sequences
    • GC behavior with timeout edge cases
  2. Integration Tests:

    • Hibernation cycle: send messages → hibernate → wake → verify resend
    • Network partition simulation: delay acks beyond timeout
    • High message volume to test backpressure limits
  3. Consider adding:

    • Proptest/quickcheck for wrapping arithmetic invariants (sketch below)
    • Load tests for concurrent websocket connections
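
A sketch of the proptest idea, assuming the proptest crate and inlining the wrapping_gt helper quoted elsewhere in this thread:

use proptest::prelude::*;

fn wrapping_gt(a: u16, b: u16) -> bool {
    a != b && a.wrapping_sub(b) < u16::MAX / 2
}

proptest! {
    #[test]
    fn wrapping_gt_never_both(a: u16, b: u16) {
        // Two indices can never both compare greater; note that both may be
        // false when the indices are close to u16::MAX / 2 apart, which is
        // exactly the boundary worth testing explicitly.
        prop_assert!(!(wrapping_gt(a, b) && wrapping_gt(b, a)));
    }
}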

📝 Minor Nitpicks

  1. Cargo.toml version changes: The version bump from 25.8.2 to 2.0.22-rc.1 looks like a major version regression. Verify this is intentional (possibly a rebranding/renumbering?).

  2. vbare dependency: Changed from crates.io to git with specific rev. Ensure this is necessary and the dependency is stable.

  3. Grafana dashboard changes: The SQL query updates use IN () instead of $__conditionalAll(). Verify this maintains the same filtering behavior.


✅ Recommendations Summary

Must Fix:

  1. Fix GC retention logic inversion (lines 420-422, 429-430 in shared_state.rs)
  2. Resolve or properly document the index-0 skipping issue

Should Fix:
3. Remove commented dead code
4. Add validation for ack indices
5. Document/improve the "ack after message" warning case

Consider:
6. Make buffer limits configurable
7. Add comprehensive test coverage
8. Enhance structured logging
9. Document hibernation protocol flow


Overall Assessment

This is a well-designed feature that adds important reliability to WebSocket communications. The core architecture is sound, but there's a critical logic bug in the GC code that needs fixing before merge. The hibernation mechanism is clever and the error handling is comprehensive.

With the fixes above, this PR should provide robust message delivery with minimal performance overhead.

Recommended action: Request changes for the GC retention logic bug, then approve after fix.


Generated with Claude Code

@claude

claude bot commented Oct 29, 2025

Pull Request Review: WebSocket Message Acknowledgment Protocol

This PR adds a critical reliability feature to the runner protocol - WebSocket message acknowledgments for guaranteed message delivery during actor hibernation. The implementation is generally solid with good attention to reliability patterns. Here are my findings:

✅ Strengths

Protocol Design

  • Well-structured protocol extension: The addition of TunnelAck, ToServerWebSocketMessageAck, and message indexing (index: u16) to ToClientWebSocketMessage is clean and follows the existing pattern
  • Backward compatibility: Changes to the BARE schema (v2.bare) are additive and don't break existing implementations
  • Proper message sequencing: Using wrapping arithmetic (wrapping_add, wrapping_sub) for u16 indices correctly handles overflow at 65,535 messages

Implementation Quality

  • Comprehensive error handling: New error types (WebSocketServiceRetry, WebSocketServiceTimeout) provide clear feedback
  • Memory safety: The pending message limit (MAX_PENDING_MSGS_SIZE_PER_REQ = 1 MiB) prevents unbounded memory growth
  • Timeout management: MESSAGE_ACK_TIMEOUT = 30s and TUNNEL_ACK_TIMEOUT = 2s provide reasonable bounds
  • Good separation of concerns: SharedState manages message lifecycle independently from gateway logic

Reliability Features

  • Message persistence: Pending messages are stored and replayed after hibernation (engine/packages/pegboard-gateway/src/shared_state.rs:312-369)
  • Garbage collection: Automatic cleanup of stale requests via gc() task (shared_state.rs:403-442)
  • Proper retry logic: WebSocket errors trigger route resolution retry with backoff (lib.rs:296, 300)

⚠️ Issues & Recommendations

🔴 Critical: Wrapping Comparison Logic

Location: engine/packages/pegboard-gateway/src/shared_state.rs:453-455

fn wrapping_gt(a: u16, b: u16) -> bool {
    a != b && a.wrapping_sub(b) < u16::MAX / 2
}

Issue: The wrapping comparison function has a potential edge case. When exactly at the midpoint (32,767 messages apart), the comparison becomes ambiguous.

Recommendation: Document this limitation clearly or consider using a more robust wrapping comparison:

fn wrapping_gt(a: u16, b: u16) -> bool {
    a != b && a.wrapping_sub(b) < (u16::MAX / 2) + 1
}

This narrows the ambiguous window at the midpoint, where neither index compares greater, from three index distances to a single one (indices exactly 2^15 apart).

🟡 High: GC Logic Inverted

Location: engine/packages/pegboard-gateway/src/shared_state.rs:420-422

keep = now.duration_since(earliest_pending_msg.send_instant) > MESSAGE_ACK_TIMEOUT;

Issue: The garbage collection logic appears inverted. Currently it keeps requests when they've exceeded the timeout, but should remove them. This will cause timed-out requests to never be cleaned up.

Fix: Should be:

keep = now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT;

Same issue on line 429-430 for hibernated WebSocket messages.

🟡 Medium: Potential Message Loss on Reconnect

Location: engine/packages/pegboard-gateway/src/shared_state.rs:324-341

When resending pending WebSocket messages after hibernation, the code filters based on last_msg_index:

if last_msg_index < 0 || wrapping_gt(msg_index, last_msg_index.try_into()?) {
    self.ups.publish(&receiver_subject, &pending_msg.payload, PublishOpts::one()).await?;
}

Concern: If a message send fails here (network error, etc.), the error propagates but the pending messages aren't marked for retry. Consider wrapping this in a retry loop or at minimum logging which messages failed to resend.
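
A fragment sketch of the logging fallback (names follow the snippet quoted above; a retry loop would wrap the same call):

// Log and continue rather than aborting the whole resend loop on one
// failed publish; the message stays in pending_ws_msgs for a later retry.
if let Err(err) = self
    .ups
    .publish(&receiver_subject, &pending_msg.payload, PublishOpts::one())
    .await
{
    tracing::warn!(?err, "failed to resend pending websocket message");
}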

🟡 Medium: Index Skipping in Hibernation

Location: engine/packages/pegboard-gateway/src/shared_state.rs:137-138

// TODO: This ends up skipping 0 as an index when initiated but whatever
msg.index = hs.last_ws_msg_index.wrapping_add(1);

Issue: This TODO indicates a known issue where index 0 is never used. While not breaking, this should be fixed for completeness:

// On initialization
req.hibernation_state = Some(HibernationState {
    total_pending_ws_msgs_size: 0,
    last_ws_msg_index: u16::MAX,  // Wraps to 0 on first message
    pending_ws_msgs: Vec::new(),
});

🟠 Low: TypeScript Error Handling

Location: engine/sdks/typescript/runner/src/websocket-tunnel-adapter.ts:431-434

The _handleMessage function returns a boolean to indicate if the message was consumed, but this return value doesn't appear to be used anywhere. Consider:

  1. Documenting what the return value means
  2. Using it in the caller, or
  3. Removing it if unnecessary

🟠 Low: Missing Bounds Check

Location: engine/packages/pegboard-gateway/src/shared_state.rs:173

if hs.total_pending_ws_msgs_size > MAX_PENDING_MSGS_SIZE_PER_REQ
    || hs.pending_ws_msgs.len() >= u16::MAX as usize
{
    return Err(WebsocketPendingLimitReached {}.build());
}

Suggestion: The check should be > u16::MAX as usize instead of >= since u16::MAX (65,535) is a valid count.

📋 Code Quality Observations

Good Practices ✅

  • Consistent error handling using the custom RivetError system
  • Proper use of structured logging with tracing (includes request_id, message_id)
  • Timeout guards on all async operations
  • Clear separation between gateway and runner concerns

Minor Improvements 🔧

  1. Naming: WebsocketPendingLimitReached should be WebSocketPendingLimitReached (capital S) for consistency with other error names
  2. Documentation: The protocol changes would benefit from a comment block explaining the ack flow
  3. Testing: While I don't see test changes in this diff, ensure integration tests cover:
    • Message replay after hibernation
    • Proper ack handling under load
    • Timeout and retry scenarios

🔒 Security Considerations

Memory Limits: The 1 MiB limit per request prevents memory exhaustion attacks
Timeout Guards: All operations have reasonable timeouts
No Credential Exposure: The protocol changes don't introduce credential leakage risks
⚠️ DoS Consideration: A malicious client could potentially create many in-flight requests with pending messages. Consider adding a global limit on total pending message memory across all requests.

📊 Performance Notes

  • Memory overhead: Each pending message stores full serialized payload (~150-300 bytes typical). With 1 MiB limit, this is ~3,000-6,000 messages max per connection
  • GC frequency: 15-second intervals are reasonable for cleanup
  • Wrapping arithmetic: Efficient for index management, no performance concerns

🎯 Recommendations Summary

Before Merge:

  1. 🔴 Fix the GC logic inversion (lines 420-422, 429-430 in shared_state.rs)
  2. 🟡 Review and fix wrapping comparison edge case
  3. 🟡 Add error handling for message resend failures

Follow-up:

  1. Fix index 0 skipping in hibernation initialization
  2. Add integration tests for hibernation scenarios
  3. Consider global memory limits across all connections
  4. Document the ack protocol flow in code comments

Overall Assessment

This is a well-designed feature that significantly improves WebSocket reliability during actor hibernation. The core architecture is sound, but the GC logic bug needs fixing before merge. The wrapping arithmetic and message management are implemented correctly overall.

Recommendation: Request changes for the GC logic fix, then approve pending successful testing.


Reviewed with attention to CLAUDE.md conventions - structured logging, error patterns, and dependency management all follow project standards.

@claude

claude bot commented Oct 29, 2025

Pull Request Review: WebSocket Message Acknowledgment Protocol

Overview

This PR introduces a WebSocket message acknowledgment system to the runner protocol (v2), enabling reliable message delivery with hibernation support. The implementation adds message indexing, acknowledgment tracking, and automatic retry/resend capabilities.


Positive Aspects

Well-Designed Protocol

  • Clean separation of v1 and v2 protocols allows backward compatibility
  • Wrapping arithmetic (u16 index with wrapping_gt) correctly handles index overflow
  • Bounded message queue (1 MiB, 65535 messages) prevents unbounded memory growth

Proper Error Handling

  • Three new error types properly categorized by retriability:
    • WebSocketServiceUnavailable: Transient failures (retryable with backoff)
    • WebSocketServiceRetry: Server-initiated retries (immediate, no backoff)
    • WebSocketServiceTimeout: Ack timeout failures (non-retryable)
  • Error messages follow the project's RivetError conventions

Comprehensive Implementation

  • Gateway-side tracking with HibernationState for pending messages
  • Client SDK integration with TypeScript runner
  • Retry logic with exponential backoff (3 attempts for transient failures)
  • Message replay on reconnection after hibernation

Issues & Concerns

🔴 Critical: Timeout Mismatch

Location: packages/pegboard-gateway/src/shared_state.rs:17 vs engine/sdks/typescript/runner/src/tunnel.ts:10

Gateway timeout: 30 seconds

const MESSAGE_ACK_TIMEOUT: Duration = Duration::from_secs(30);

Runner timeout: 5 seconds

const MESSAGE_ACK_TIMEOUT = 5000; // 5 seconds

Impact: The runner may close connections or retry 6x earlier than the gateway expects. This could lead to:

  • Unnecessary retries and wasted resources
  • False positive timeout errors on the runner side
  • Inconsistent behavior under load

Recommendation: Synchronize these timeouts. Consider making the runner timeout slightly shorter (e.g., 28s) to give it time to initiate cleanup before gateway timeout.


⚠️ High Priority: Missing Test Coverage

No tests found for:

  1. Message acknowledgment flow (send → ack → removal from pending queue)
  2. Index wrapping behavior (when index wraps from 65535 → 0)
  3. Hibernation and message replay scenarios
  4. Timeout and retry logic
  5. Concurrent message sending/acking edge cases

Existing test: guard-core/tests/simple_websocket_test2.rs is a basic echo test and doesn't exercise the new ack mechanism.

Recommendation: Add integration tests covering:

#[tokio::test]
async fn test_websocket_message_ack_basic() { /* ... */ }

#[tokio::test]
async fn test_websocket_hibernation_replay() { /* ... */ }

#[tokio::test]
async fn test_websocket_index_wrapping() { /* ... */ }

#[tokio::test]
async fn test_websocket_ack_timeout() { /* ... */ }

#[tokio::test]
async fn test_pending_message_limit() { /* ... */ }

⚠️ Medium: Potential Data Loss on Acknowledgment

Location: packages/pegboard-gateway/src/shared_state.rs:387-398

The ack_pending_websocket_messages function removes messages from the pending queue without updating total_pending_ws_msgs_size:

hs.pending_ws_msgs.retain(|_| {
    let msg_index = hs.last_ws_msg_index
        .wrapping_sub(len)
        .wrapping_add(1)
        .wrapping_add(iter_index);
    let keep = wrapping_gt(msg_index, ack_index);
    iter_index += 1;
    keep
});

Issue: total_pending_ws_msgs_size is incremented when messages are added (line 170) but never decremented when acknowledged.

Impact:

  • total_pending_ws_msgs_size will grow indefinitely
  • Eventually hits MAX_PENDING_MSGS_SIZE_PER_REQ limit even if all messages are acknowledged
  • Causes WebsocketPendingLimitReached errors for active connections

Recommendation: Track size of removed messages and update the counter:

let mut removed_size = 0u64;
hs.pending_ws_msgs.retain(|msg| {
    let msg_index = /* calculation */;
    let keep = wrapping_gt(msg_index, ack_index);
    if !keep {
        removed_size += msg.payload.len() as u64;
    }
    keep
});
hs.total_pending_ws_msgs_size = hs.total_pending_ws_msgs_size.saturating_sub(removed_size);

⚠️ Medium: Race Condition in Message Indexing

Location: packages/pegboard-gateway/src/shared_state.rs:134-143

msg.index = hs.last_ws_msg_index.wrapping_add(1);
// ... later ...
hs.last_ws_msg_index = ws_msg_index; // line 178

Issue: If multiple messages are sent concurrently (async context), last_ws_msg_index could be read by two tasks before either writes back, leading to duplicate indices.

Likelihood: Low (depends on concurrent access patterns), but possible with high-throughput WebSocket connections.

Recommendation:

  • Document that send_message should not be called concurrently for the same request
  • OR use atomic operations/locks if concurrent sends are expected (see the sketch after this list)
  • OR return the index from the send operation to make the read-modify-write atomic
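
A minimal sketch of the atomic option (type and method names are illustrative assumptions):

use std::sync::atomic::{AtomicU16, Ordering};

// fetch_add makes the read-modify-write on the index a single atomic step;
// AtomicU16 wraps on overflow, matching the wrapping_add semantics.
struct WsMsgIndex(AtomicU16);

impl WsMsgIndex {
    fn next(&self) -> u16 {
        // fetch_add returns the previous value; adding 1 yields the new index
        self.0.fetch_add(1, Ordering::Relaxed).wrapping_add(1)
    }
}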

🟡 Low: Index 0 Skipped on Initialization

Location: packages/pegboard-gateway/src/shared_state.rs:137

// TODO: This ends up skipping 0 as an index when initiated but whatever
msg.index = hs.last_ws_msg_index.wrapping_add(1);

Issue: The first message will have index 1, not 0 (since last_ws_msg_index starts at 0).

Impact: Low - wastes one index value but doesn't affect correctness due to wrapping arithmetic.

Recommendation: Either:

  1. Initialize last_ws_msg_index to u16::MAX so first message gets index 0
  2. Remove the TODO and document this is intentional
  3. Keep as-is if the skip is acceptable

🟡 Low: Incomplete Size Accounting in Resend

Location: packages/pegboard-gateway/src/shared_state.rs:343-364

Similar to the ack function, resend_pending_websocket_messages retains messages (line 348) but doesn't update total_pending_ws_msgs_size.

Same fix needed as for ack_pending_websocket_messages.


Code Quality

Good Practices

  • Follows CLAUDE.md conventions (error macros, imports, logging)
  • Proper use of structured logging with tracing
  • Uses workspace dependencies correctly
  • Added tokio-tungstenite to guard package appropriately

⚠️ Areas for Improvement

  • Missing inline documentation for complex logic (wrapping arithmetic, retention algorithms)
  • No unit tests for wrapping_gt helper function
  • Some magic numbers (30s timeout, 1 MiB limit) could be configurable

Performance Considerations

Efficient Design

  • O(1) message sending with index increment
  • Bounded memory usage per connection
  • Efficient retain operation for cleanup

⚠️ Potential Concerns

  1. O(n) retain operations on every ack could impact high-throughput connections
    • Consider batch acking or deferred cleanup
  2. Message serialization duplication: Messages are serialized and stored even if client may not need replay
    • Could optimize for non-hibernation case
  3. GC interval (15s) might be too aggressive for checking timeouts
    • Consider tuning based on load testing

Security Concerns

Good Security Practices

  • Bounded message queue prevents memory exhaustion attacks
  • Timeout mechanisms prevent indefinite resource holding
  • No obvious injection vulnerabilities

🟢 Minor Suggestions

  • Consider logging/metrics for pending message size approaching limits (DoS detection)
  • Document maximum memory usage per connection for capacity planning

Dependency Changes

⚠️ Git Dependency Concern

Changes:

[workspace.dependencies.vbare]
-version = "0.0.2"
+git = "https://github.com/rivet-dev/vbare"
+rev = "3ae474a0234801bb96d70bec4eddd4f2d640971e"

Issue: Switching from crates.io version to git dependency.

Questions:

  1. Why is this change needed for the ack feature?
  2. Are there breaking changes in vbare that haven't been published?
  3. Should this be a separate PR since it's unrelated to ack functionality?

Recommendation: Either:

  • Explain the necessity in PR description
  • Split into separate PR if unrelated
  • Publish vbare 0.0.3 to crates.io if stable

Similar concern for the rivet-term repo change (rivet-gg → rivet-dev).


Miscellaneous

📝 Non-functional Changes

  • Version bumps (25.8.2 → 2.0.22-rc.1) across all packages - appears to be release preparation
  • Grafana dashboard query improvements (better ClickHouse query syntax)
  • Error artifact JSON formatting (minor whitespace)

These are fine but make the PR large. Consider separating version bumps and tooling improvements from feature PRs in the future for easier review.


Summary & Recommendations

Before Merging:

  1. MUST FIX: Synchronize timeout values between gateway (30s) and runner (5s)
  2. MUST FIX: Update total_pending_ws_msgs_size when messages are acknowledged/removed
  3. SHOULD ADD: Integration tests for ack flow, hibernation, and edge cases
  4. SHOULD REVIEW: Concurrent access safety for message indexing
  5. SHOULD CLARIFY: vbare and rivet-term dependency changes

Post-Merge:

  • Monitor memory usage and ack timing in production
  • Consider performance optimization for high-throughput scenarios
  • Add metrics for pending message queue sizes and ack latency

Overall Assessment

Score: 7/10

This is a well-architected feature with good error handling and retry logic. The protocol design is sound, and the implementation follows project conventions. However, the timeout mismatch and missing size accounting are critical bugs that must be fixed before merge. Test coverage is insufficient for a reliability-critical feature like message acknowledgment.

Recommendation: REQUEST CHANGES

  • Fix critical bugs (timeouts, size accounting)
  • Add integration tests
  • Clarify dependency changes

Once addressed, this will be a solid addition to enable reliable WebSocket communication with hibernation support.


Great work on tackling a complex distributed systems problem! The foundation is strong - just needs some polish before production. 🚀

@claude

claude bot commented Oct 29, 2025

Pull Request Review: WebSocket Message Acknowledgment

Overview

This PR adds WebSocket message acknowledgment to the runner protocol (v2), enabling reliable message delivery with hibernation support for actors. The changes are substantial but well-structured.

📋 Summary of Changes

  • Runner Protocol v2: New schema with message acknowledgment primitives
  • Pegboard Gateway: Comprehensive message tracking and resend logic
  • Guard Core: New error types for WebSocket retry scenarios
  • Epoxy: New purger workflow for KV cleanup
  • TypeScript SDK: Updated to handle new protocol features

✅ Strengths

1. Well-Designed Protocol

The new v2 protocol schema is clean and comprehensive:

  • TunnelAck primitive for bidirectional acknowledgments
  • ToServerWebSocketMessageAck for explicit ack messages
  • lastMsgIndex tracking for hibernation recovery
  • Proper separation of concerns between message types

2. Robust Message Tracking

The SharedState implementation in pegboard-gateway/shared_state.rs is well thought out:

  • Pending message tracking with timeouts (30s)
  • Hibernation state with message buffering
  • Wrapping index arithmetic for u16 overflow handling
  • Size limits to prevent memory exhaustion (MAX_PENDING_MSGS_SIZE_PER_REQ)

3. Error Handling

Good addition of specific error types:

  • WebSocketServiceRetry - for hibernation reconnects
  • WebSocketServiceTimeout - for timeout scenarios
  • WebsocketPendingLimitReached - for backpressure

4. Graceful Degradation

The retry logic with hibernation support allows actors to:

  • Reconnect after temporary disconnections
  • Resume from last acknowledged message
  • Handle network instability gracefully

⚠️ Issues & Concerns

1. Potential Race Condition in GC Logic (Minor)

Location: pegboard-gateway/src/shared_state.rs:420-423

if let Some(earliest_pending_msg) = req.pending_msgs.first() {
    keep = now.duration_since(earliest_pending_msg.send_instant)
        > MESSAGE_ACK_TIMEOUT;
}

Issue: The keep logic appears inverted. If the duration exceeds the timeout, we're setting keep = true, which means we'll KEEP the request, but we should be removing it.

Suggestion: Should be:

keep = now.duration_since(earliest_pending_msg.send_instant)
    <= MESSAGE_ACK_TIMEOUT;

Or send a timeout message and remove it:

if now.duration_since(earliest_pending_msg.send_instant) > MESSAGE_ACK_TIMEOUT {
    keep = false;
}

2. Wrapping Index Arithmetic Needs Validation (Medium)

Location: pegboard-gateway/src/shared_state.rs:453-455

fn wrapping_gt(a: u16, b: u16) -> bool {
    a != b && a.wrapping_sub(b) < u16::MAX / 2
}

Issue: The wrapping comparison logic is clever but needs careful review:

  • This assumes sequence numbers wrap within the window of u16::MAX / 2
  • If messages are delayed by more than ~32K messages, this could fail
  • Consider documenting the assumptions and constraints

Suggestion: Add documentation explaining:

  • Maximum window size
  • What happens if messages arrive out of order beyond the window
  • Test cases for wraparound scenarios

3. Missing Backpressure on Receiver (Medium)

Location: pegboard-gateway/src/shared_state.rs:199-287

The receiver task processes messages but doesn't have explicit backpressure if the channel fills up. The channel is sized at 128:

let (msg_tx, msg_rx) = mpsc::channel(128);

Issue: If a consumer can't keep up, messages could be dropped silently.

Suggestion: Consider:

  • Monitoring channel capacity metrics
  • Logging when close to capacity (see the sketch after this list)
  • Implementing explicit backpressure signals
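
A minimal sketch of the capacity logging idea, assuming a tokio mpsc sender like the 128-slot channel quoted above (the threshold is an illustrative assumption):

use tokio::sync::mpsc;

// tokio's Sender::capacity() reports the currently available slots.
async fn send_with_capacity_warning(msg_tx: &mpsc::Sender<Vec<u8>>, msg: Vec<u8>) {
    if msg_tx.capacity() < 16 {
        tracing::warn!(
            available = msg_tx.capacity(),
            "tunnel message channel near capacity"
        );
    }
    if msg_tx.send(msg).await.is_err() {
        tracing::debug!("receiver dropped; message discarded");
    }
}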

4. Hard-Coded Timeout Values (Minor)

Locations: Multiple files

const TUNNEL_ACK_TIMEOUT: Duration = Duration::from_secs(2);
const MESSAGE_ACK_TIMEOUT: Duration = Duration::from_secs(30);

Suggestion: Consider making these configurable per-actor or environment, especially for testing.

5. Incomplete Error Context (Minor)

Location: pegboard-gateway/src/shared_state.rs:123-125

let mut req = self
    .in_flight_requests
    .get_async(&request_id)
    .await
    .context("request not in flight")?;

Suggestion: Include the request_id in the error context for debugging:

.with_context(|| format!("request not in flight: {}", Uuid::from_bytes(request_id)))?;

6. Potential Memory Leak in Hibernation (Medium)

Location: pegboard-gateway/src/shared_state.rs:169-186

If a WebSocket connection never reconnects, pending messages will accumulate until:

  1. The size limit is hit (MAX_PENDING_MSGS_SIZE_PER_REQ)
  2. The GC timeout fires

Issue: The GC timeout for pending WS messages is checked (line 425-431), but if new messages keep arriving within the timeout window, they'll never be cleaned up.

Suggestion: Add:

  • Maximum age for hibernated connections (see the sketch after this list)
  • Metrics for hibernation duration
  • Explicit cleanup after X minutes of hibernation
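
A minimal sketch of the maximum-age check; the constant name and the 15-minute value are illustrative assumptions, not values from the PR:

use std::time::{Duration, Instant};

const MAX_HIBERNATION_AGE: Duration = Duration::from_secs(15 * 60);

// Evict hibernated connections that have been asleep too long, regardless
// of whether new messages keep arriving within the ack timeout window.
fn hibernation_expired(hibernated_at: Instant, now: Instant) -> bool {
    now.duration_since(hibernated_at) > MAX_HIBERNATION_AGE
}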

7. Missing Test Coverage (High)

Observation: No new test files added for:

  • WebSocket acknowledgment flow
  • Hibernation message replay
  • Wrapping index arithmetic
  • Timeout scenarios
  • GC behavior

Suggestion: Add integration tests covering:

  • Normal ack flow
  • Hibernation and reconnect
  • Message replay with gaps
  • Index wraparound
  • Timeout and cleanup

🔒 Security Considerations

1. DoS via Message Accumulation

The size limit (MAX_PENDING_MSGS_SIZE_PER_REQ = 1 MiB) provides good protection, but consider:

  • Per-actor rate limiting on message count
  • Metrics for actors hitting limits frequently

2. No Authentication on Ack Messages

Ack messages don't appear to be authenticated beyond the request_id/message_id pairing. This is probably fine given the architecture but worth documenting.

🎯 Performance Considerations

1. Lock Contention

The HashMap from scc is good for concurrency, but operations like get_async still acquire locks. With high message rates, this could become a bottleneck.

Suggestion: Consider performance testing under load with:

  • Many concurrent WebSocket connections
  • High message rates
  • Frequent hibernation/reconnection

2. Memory Overhead

Each in-flight request maintains:

  • Pending message list
  • Hibernation state with buffered messages
  • Channel for incoming messages

Suggestion: Add metrics for:

  • Total in-flight requests
  • Average pending message count
  • Memory usage per request

📝 Code Quality

1. Good Logging

The logging throughout is excellent with structured fields and appropriate levels.

2. Consistent Error Handling

Good use of the custom error system with RivetError.

3. Documentation

Some complex logic (like wrapping arithmetic) needs more inline documentation.

🚀 Recommendations

Priority: High

  1. Fix the GC keep logic (potential bug)
  2. Add comprehensive test coverage
  3. Address hibernation memory leak scenario

Priority: Medium

  1. Document wrapping index assumptions
  2. Add performance metrics
  3. Improve error context with IDs
  4. Test under load

Priority: Low

  1. Make timeouts configurable
  2. Add backpressure monitoring
  3. Document security assumptions

✨ Overall Assessment

This is a well-designed feature that adds important reliability to the WebSocket communication layer. The implementation is generally solid with good separation of concerns and error handling.

The main concerns are:

  • Potential GC bug that needs immediate attention
  • Missing test coverage for complex message replay logic
  • Some edge cases around long-lived hibernation

Recommendation: Request fixes for the high-priority items before merging. The implementation quality is good, but the lack of tests for such critical functionality is a concern.


Review generated with Claude Code

Base automatically changed from 10-24-fix_improve_reconnection_with_sleeping_and_tunnel to main on October 31, 2025 01:14
@NathanFlurry force-pushed the 10-27-feat_add_ws_msg_ack_to_runner_protocol branch from a7d0289 to a3e2e41 on October 31, 2025 01:14
@NathanFlurry merged commit 79bee4d into main on October 31, 2025
8 of 12 checks passed
@NathanFlurry deleted the 10-27-feat_add_ws_msg_ack_to_runner_protocol branch on October 31, 2025 01:15
@claude

claude bot commented Oct 31, 2025

PR Review: WebSocket Message Acknowledgment Implementation

Overview

This PR adds a message acknowledgment mechanism to the runner protocol for WebSocket communications. The changes enable reliable message delivery tracking between the gateway and runners, which is critical for production reliability.


Positive Aspects ✅

1. Well-Structured Protocol Changes

  • The v2.bare schema additions are clean and minimal:
    • ToServerWebSocketMessageAck with index tracking (line 260-262)
    • TunnelAck void type for bidirectional acknowledgments (line 199)
  • Proper versioning maintained with new protocol schema

2. Comprehensive Error Handling

  • Added three new error types in guard-core/src/errors.rs:
    • WebSocketServiceRetry - for retryable WebSocket failures
    • WebSocketServiceTimeout - for timeout scenarios
    • WebSocketServiceUnavailable - for service availability issues
  • Error artifacts properly generated with JSON definitions

3. Robust Acknowledgment Tracking

  • Gateway side (pegboard-gateway/shared_state.rs):
    • Tracks pending messages with timestamps
    • Implements timeout handling (30s MESSAGE_ACK_TIMEOUT)
    • Garbage collection for stale messages (15s GC_INTERVAL)
    • Per-request pending message limits (1 MiB MAX_PENDING_MSGS_SIZE_PER_REQ)
  • Client side (TypeScript SDK):
    • 5s timeout for message acknowledgments
    • Proper cleanup on timeout with request rejection

4. TypeScript SDK Improvements

  • WebSocketTunnelAdapter properly handles message indices
  • Added __closeWithRetry method for graceful retry handling
  • Event buffering mechanism prevents race conditions with listener attachment

Issues & Concerns ⚠️

1. Potential Memory Leak Risk 🔴 HIGH PRIORITY

Location: engine/packages/pegboard-gateway/src/shared_state.rs:156-186

The send_message function adds messages to pending_msgs but there's no code path shown that removes acknowledged messages from this vector. The GC loop only removes timed-out messages.

Issue:

req.pending_msgs.push(PendingMessage {
    message_id,
    send_instant: now,
});

Recommendation:
Add an acknowledgment handler that removes messages from pending_msgs when acks are received:

// In the receiver function, when TunnelAck is received:
if let Some(req) = self.in_flight_requests.get_async(&request_id).await {
    req.pending_msgs.retain(|msg| msg.message_id != message_id);
}

2. Missing Null Safety Check 🟡 MEDIUM PRIORITY

Location: engine/packages/guard-core/src/proxy_service.rs:722

let status = match &res {
    Ok(resp) => resp.status().as_u16().to_string(),
    Err(_) => "error".to_string(),  // Could provide more detail
};

Recommendation: Consider logging the actual error for debugging while maintaining the metric label.
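
A fragment sketch of that suggestion, keeping the metric label stable while logging the underlying error:

let status = match &res {
    Ok(resp) => resp.status().as_u16().to_string(),
    Err(err) => {
        // keep the coarse metric label, but surface the error for debugging
        tracing::debug!(?err, "proxied request failed");
        "error".to_string()
    }
};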

3. Inconsistent Timeout Values

  • Gateway expects acks within 30 seconds (MESSAGE_ACK_TIMEOUT = Duration::from_secs(30))
  • TypeScript client times out after 5 seconds (MESSAGE_ACK_TIMEOUT = 5000)

Concern: This mismatch could lead to client-side timeouts while the gateway still considers messages pending.

Recommendation: Align these values or add comments explaining the rationale for different timeouts.

4. Hibernation State Management Complexity

Location: engine/packages/pegboard-gateway/src/shared_state.rs:134-186

The hibernation state logic interleaves with acknowledgment tracking, making the code path complex:

  • Message indices are conditionally updated
  • Pending WS messages are tracked separately
  • Size limits are enforced

Recommendation:

  • Add more inline documentation explaining the hibernation flow
  • Consider extracting hibernation logic into a separate method for clarity

5. Error Propagation in WebSocket Close

Location: engine/sdks/typescript/runner/src/websocket-tunnel-adapter.ts:196

The closeInner function doesn't handle potential errors from #closeCallback:

closeInner(code?: number, reason?: string, retry: boolean = false): void {
    // ...
    this.#closeCallback(code, reason, retry);  // No error handling
    this.#readyState = 3; // CLOSED
    // ...
}

Recommendation: Wrap in try-catch or document that callbacks should not throw.


Code Quality & Best Practices ✅

Positive:

  1. Follows CLAUDE.md conventions:

    • Uses structured logging with tracing macros
    • Proper timestamp naming (send_instant, not sendInstant)
    • Lowercase log messages
    • Uses anyhow::* for error handling
  2. Dependency Management:

    • Properly uses workspace dependencies
    • Updated vbare to git dependency (rev: 3ae474a)
  3. Testing Considerations:

    • Changes are backward compatible (new v2 protocol schema)
    • Existing tests should continue to pass

Areas for Improvement:

  1. Missing Unit Tests: No new test files added for acknowledgment logic
  2. Documentation: Limited inline documentation for complex state management
  3. Metrics: No new metrics added for tracking acknowledgment failures/timeouts

Security Considerations 🔒

Positive:

  • Message size limits prevent memory exhaustion (MAX_PENDING_MSGS_SIZE_PER_REQ)
  • Timeout mechanisms prevent infinite waiting
  • No sensitive data exposure in error messages

Concern:

  • No rate limiting on acknowledgment messages (could be abused for DoS)

Performance Considerations ⚡

Positive:

  • Efficient hash map lookups for request tracking
  • Garbage collection runs at reasonable intervals (15s)

Concerns:

  1. Linear Search in GC: The GC loops through all pending messages every 15 seconds
    • For high-traffic scenarios, consider using a priority queue or time-indexed structure
  2. Lock Contention: HashMap access in shared_state.rs could become a bottleneck under high load
    • The scc::HashMap should handle this, but worth profiling

Recommendations Summary

Must Fix Before Merge:

  1. ⚠️ Implement acknowledgment handling to remove messages from pending_msgs
  2. ⚠️ Align timeout values or document the reasoning

Should Consider:

  1. Add unit tests for acknowledgment flow
  2. Add metrics for ack timeouts and failures
  3. Improve inline documentation for hibernation state
  4. Consider performance profiling for high-traffic scenarios

Nice to Have:

  1. Add rate limiting for tunnel messages
  2. Extract hibernation logic into separate methods
  3. Add error handling in TypeScript close callback

Conclusion

This is a solid implementation of WebSocket message acknowledgments with good error handling and protocol design. The main concern is the potential memory leak from unremoved acknowledged messages. Once that's addressed, this should be good to merge.

Overall Assessment: Approve with required changes

The architectural decisions are sound, and the implementation follows the codebase conventions well. Great work on maintaining backward compatibility with versioned protocols!


Review generated by Claude Code following repository standards from CLAUDE.md
