
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Oct 27, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project              Deployment  Preview  Comments  Updated (UTC)
rivetkit-serverless  Error       Error              Oct 31, 2025 1:15am

3 Skipped Deployments

Project          Deployment  Preview  Comments  Updated (UTC)
rivet-cloud      Ignored     Ignored            Oct 31, 2025 1:15am
rivet-inspector  Ignored     Ignored  Preview   Oct 31, 2025 1:15am
rivet-site       Ignored     Ignored  Preview   Oct 31, 2025 1:15am

Contributor Author

MasterPtato commented Oct 27, 2025


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@pkg-pr-new

pkg-pr-new bot commented Oct 28, 2025

More templates

@rivetkit/actor

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/actor@3275

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3275

@rivetkit/core

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/core@3275

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3275

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3275

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3275

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3275

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3275

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3275

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3275

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3275

commit: 3d340c9

@claude

claude bot commented Oct 28, 2025

PR Review: WebSocket Message Acknowledgment Feature

Summary

Overall Assessment: Approve with Minor Suggestions

This PR adds WebSocket message acknowledgment and hibernation support to the runner protocol. The feature is well implemented, with solid error handling that follows codebase conventions.


Positive Aspects

  • Strong protocol versioning with v2.bare schema and wrap_latest() for backward compatibility
  • Robust Rust state management with hibernation, wrapping arithmetic for u16 indices, and GC
  • Good error handling with WebSocketServiceRetry and WebSocketServiceTimeout types
  • Memory safety with 1 MiB size limits preventing unbounded growth
  • Comprehensive TypeScript state tracking with proper lifecycle management
  • Batched acknowledgments reduce message overhead
  • Event buffering handles race conditions

Critical Issues

1. GC Logic Inverted (shared_state.rs:420-421, 428-429)

The retention logic is backwards - keeps timed-out connections and removes valid ones.

Current: keep = now.duration_since(send_instant) > TIMEOUT
Should be: keep = now.duration_since(send_instant) <= TIMEOUT

Impact: Critical - causes premature cleanup of valid connections.
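
A minimal sketch of the corrected predicate, using the names quoted in this review (illustrative, not the actual patch):

use std::time::{Duration, Instant};

const MESSAGE_ACK_TIMEOUT: Duration = Duration::from_secs(30);

struct PendingMessage {
    send_instant: Instant,
}

// Keep a request only while its oldest unacked message is within the window.
fn should_keep(pending_msgs: &[PendingMessage], now: Instant) -> bool {
    match pending_msgs.first() {
        Some(earliest) => now.duration_since(earliest.send_instant) <= MESSAGE_ACK_TIMEOUT,
        None => true, // nothing pending, nothing to time out
    }
}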

2. Race Condition in Entry::Occupied (shared_state.rs:103-108)

Clearing pending_msgs while they may be in flight could cause message loss when old msg_tx channel is dropped.

Suggestion: Drain pending messages before clearing or document expected behavior.

3. Missing Index Validation (tunnel.ts:713)

_handleMessage passes msg.index without validation before reaching __ackWebsocketMessage bounds check.


Medium Priority Issues

  • Integer conversions use try_into()? but the failure path is unreachable due to the line 173 check; use expect() to document the invariant (see the sketch after this list)
  • Debug logging with placeholders like "------------ ACK" should be removed (lines 374, 120-124, 355-360)
  • Hardcoded canHibernate: true needs configuration option (tunnel.ts:641)
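
A fragment sketch of the expect() suggestion (names follow the snippets quoted in this review):

// The length was already checked against u16::MAX at line 173, so a
// conversion failure here is unreachable; expect() documents the invariant
// instead of silently propagating an impossible error.
let len: u16 = hs
    .pending_ws_msgs
    .len()
    .try_into()
    .expect("pending_ws_msgs length checked against u16::MAX above");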

Low Priority

  • Git dependencies for vbare (0.0.2 -> 0.0.3) and rivet-term repo change - intentional?
  • Index 0 skipping documented as TODO but not resolved (line 137)
  • Commented-out wrapping_lt function should be removed (lines 456-458)

Security & Performance

Security: Good practices - size limits prevent exhaustion, timeouts prevent leaks, no unvalidated input. Minor message replay concern.

Performance: Batched acks and efficient storage. Consider VecDeque for O(1) retention operations.
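
A minimal sketch of the VecDeque idea, assuming a simplified queue that stores raw payloads (field and type names are illustrative):

use std::collections::VecDeque;

// Acked messages are always the oldest, so acking pops from the front in
// O(1) per message instead of scanning the whole Vec with retain.
struct PendingQueue {
    pending_ws_msgs: VecDeque<Vec<u8>>, // payloads, oldest first
    total_size: u64,
}

impl PendingQueue {
    fn ack_oldest(&mut self, count: usize) {
        for _ in 0..count {
            if let Some(payload) = self.pending_ws_msgs.pop_front() {
                self.total_size -= payload.len() as u64;
            }
        }
    }
}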


Test Coverage Gaps

Missing tests for:

  • Wrapping arithmetic edge cases (u16 wraparound)
  • Message redelivery scenarios
  • Hibernation state persistence across reconnections
  • GC timeout boundaries

Recommendations

Must Fix Before Merge:

  1. Fix GC logic bug (lines 420-421, 428-429 in shared_state.rs)
  2. Address Entry::Occupied race condition
  3. Add index validation in TypeScript _handleMessage

Should Fix:

  1. Remove debug logging statements
  2. Document or fix index 0 skipping behavior

Nice to Have:

  • Configuration for canHibernate
  • VecDeque optimization for message queues
  • Integration tests for hibernation flow

Conclusion

Solid implementation of a complex distributed systems feature. The acknowledgment protocol is well-designed and the hibernation mechanism is creative. Critical bugs found are fixable with minimal changes. Will significantly improve WebSocket reliability once issues are addressed.

Recommendation: Approve with required fixes for GC logic and race condition.


Generated with Claude Code

@MasterPtato force-pushed the 10-27-feat_add_ws_msg_ack_to_runner_protocol branch from 3d340c9 to 7edb49a on October 28, 2025 02:00
@claude

claude bot commented Oct 28, 2025

Pull Request Review: WebSocket Message Acknowledgment Protocol

Overview

This PR adds message acknowledgment support to the runner protocol for WebSocket connections, enabling reliable message delivery with hibernation/reconnection support. This is a substantial enhancement to the WebSocket tunneling infrastructure.

Code Quality & Architecture

✅ Strengths

  1. Protocol Design: The message acknowledgment system using wrapping u16 indices is well-designed, handling index wraparound gracefully via the wrapping_gt() comparison helper (see the sketch after this list).

  2. Error Handling: Good separation of retry vs. terminal errors with new WebSocketServiceRetry and WebSocketServiceTimeout errors. The distinction between retryable failures and actual errors improves resilience.

  3. Message Buffering: Smart hibernation state management with size limits (MAX_PENDING_MSGS_SIZE_PER_REQ) prevents unbounded memory growth.

  4. TypeScript SDK: The WebSocketTunnelAdapter has good event buffering to handle race conditions where events fire before listeners are attached.
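
For context, a self-contained sketch of how such a wrapping comparison behaves (the actual helper from shared_state.rs is quoted later in this thread):

// u16 indices compare "greater" within a half-range window, so ordering is
// preserved across the 65535 -> 0 wraparound.
fn wrapping_gt(a: u16, b: u16) -> bool {
    a != b && a.wrapping_sub(b) < u16::MAX / 2
}

fn main() {
    assert!(wrapping_gt(5, 3)); // ordinary case
    assert!(wrapping_gt(2, u16::MAX - 1)); // 2 comes "after" 65534 across the wrap
    assert!(!wrapping_gt(3, 5));
}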

🔍 Potential Issues

Critical

  1. Integer Overflow Risk (shared_state.rs:170-173)
hs.total_pending_ws_msgs_size += message_serialized.len() as u64;

if hs.total_pending_ws_msgs_size > MAX_PENDING_MSGS_SIZE_PER_REQ
	|| hs.pending_ws_msgs.len() >= u16::MAX as usize

The size tracking never decreases when messages are acknowledged. This will cause false positives and connection drops after sending 1MB total data, even if most was already acked.

Fix: Decrement total_pending_ws_msgs_size in ack_pending_websocket_messages():

pub async fn ack_pending_websocket_messages(...) -> Result<()> {
    // ... existing code ...
    // Accumulate into a local to avoid a second mutable borrow of `hs`
    // inside the retain closure.
    let mut removed_size = 0u64;
    hs.pending_ws_msgs.retain(|msg| {
        // msg_index is derived per-iteration as in the existing retain loop
        let keep = wrapping_gt(msg_index, ack_index);
        if !keep {
            removed_size += msg.payload.len() as u64;
        }
        keep
    });
    hs.total_pending_ws_msgs_size -= removed_size;
}
  2. Unused Deserialization (shared_state.rs:336-338)
let de = protocol::versioned::ToClient::deserialize_with_embedded_version(
    &pending_msg.payload,
);

This deserialization result de is never used. Either remove it or add error handling.

  3. Logic Inversion Bug (shared_state.rs:424-426)
keep = now.duration_since(earliest_pending_msg.send_instant)
    > MESSAGE_ACK_TIMEOUT;

This logic is inverted - it will keep requests that have timed out and drop ones that haven't. Should be < not >.

High Priority

  1. TODO Comment (shared_state.rs:137)
// TODO: This ends up skipping 0 as an index when initiated but whatever

Skipping index 0 could cause off-by-one issues in client tracking. This should be properly fixed or documented why it's acceptable.

  2. Missing Size Update (shared_state.rs:352-363)
    When retaining messages based on ack, total_pending_ws_msgs_size is not updated. Same issue as Critical #1 above, but in a different code path.

  3. Race Condition (lib.rs:103-108)

Entry::Occupied(mut entry) => {
    entry.receiver_subject = receiver_subject;
    entry.msg_tx = msg_tx;
    entry.opened = false;
    entry.pending_msgs.clear();
}

When reusing an existing request entry, hibernation_state is not cleared. This could cause stale hibernation data to persist across reconnections.

Medium Priority

  1. Comment Accuracy (lib.rs:318)
// NOTE: This gets set in shared_state.ts

The file is actually shared_state.rs, not .ts. Minor typo but could confuse developers.

  2. Unused Function (shared_state.rs:461-463)
// fn wrapping_lt(a: u16, b: u16) -> bool {
//     b.wrapping_sub(a) < u16::MAX / 2
// }

Either remove commented code or add a comment explaining why it's kept for future use.

  3. Error Construction (lib.rs:175)
return Err(WebsocketPendingLimitReached {}.build());

The struct definition was changed to have no fields, but this could be more informative by including actual limit values for debugging.

  4. Typo in Comment (proxy_service.rs:1834-1835)
// Denotes that the connection did not fail, but needs to be retried to
// resole a new target

"resole" should be "resolve".

🔒 Security Concerns

  1. Memory Exhaustion: While there's a 1MB limit per request, with many concurrent WebSocket connections, total memory could still grow large. Consider adding global limits or connection count limits.

  2. Message Replay: The protocol allows resending messages based on last_msg_index, but there's no validation that the client isn't maliciously requesting replays of very old messages.

⚡ Performance Considerations

  1. Cloning Serialized Messages (shared_state.rs:181)
payload: message_serialized.clone(),

Every WebSocket message is cloned for hibernation storage. For high-throughput connections, this adds significant overhead. Consider using Arc<Vec<u8>> for the payload (see the sketch after this list).

  2. Linear Search in GC (shared_state.rs:416-444)
    The garbage collection scans all in-flight requests every 15 seconds. With many connections, this could become expensive. Consider using a priority queue or heap for timeout tracking.

  3. Blocking Async Operations: Multiple .get_async() calls that immediately mutate could cause contention. Consider using entry_async() pattern more consistently.
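
A minimal sketch of the Arc-based payload suggested in item 1 (struct and function names are illustrative assumptions):

use std::sync::Arc;

// The serialized bytes are allocated once and shared between the send path
// and the hibernation buffer.
struct PendingWsMsg {
    payload: Arc<Vec<u8>>,
}

fn store_for_hibernation(message_serialized: &Arc<Vec<u8>>) -> PendingWsMsg {
    // Arc::clone copies a reference count, not the message bytes.
    PendingWsMsg {
        payload: Arc::clone(message_serialized),
    }
}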

🧪 Test Coverage

Missing Tests:

  • No unit tests for wrapping_gt() edge cases (wraparound scenarios)
  • No tests for hibernation state management
  • No tests for message replay logic
  • No tests for size limit enforcement
  • No integration tests for reconnection scenarios

Recommendation: Add tests for:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_wrapping_gt_wraparound() {
        assert!(wrapping_gt(1, u16::MAX));
        // u16::MAX / 2 is the boundary; one below it still compares greater
        assert!(wrapping_gt(u16::MAX / 2 - 1, 0));
        assert!(!wrapping_gt(u16::MAX, 1));
    }

    // Add tests for ack logic, hibernation, etc.
}

📋 Additional Notes

  1. Breaking Changes: The protocol version bump and schema changes are appropriately versioned.

  2. Dependency Update: The vbare dependency is now pulled from git instead of crates.io. Ensure this doesn't break builds for contributors without git access to that repo.

  3. Missing Newline: guard.websocket_pending_limit_reached.json and new error files are missing trailing newlines (minor).

Summary

This is a well-architected feature that significantly improves WebSocket reliability. However, there are several critical bugs that need to be fixed before merging:

Must Fix:

  • The critical issues above: size accounting never decremented on ack, the unused deserialization result, and the inverted GC retention logic

Should Fix:

  • The high-priority items above: the index-0 TODO, the missing size update in the resend path, and hibernation state not cleared on entry reuse

Consider:

  • Add test coverage
  • Performance optimizations for high-throughput scenarios
  • Document the protocol changes and migration guide

Overall: Approve with required changes ✅ (pending fixes for critical issues)

Great work on this complex feature! The acknowledgment protocol design is solid, just needs the implementation bugs ironed out.

@MasterPtato force-pushed the 10-27-feat_add_ws_msg_ack_to_runner_protocol branch from 7edb49a to e662b92 on October 29, 2025 01:53
@MasterPtato force-pushed the 10-24-fix_improve_reconnection_with_sleeping_and_tunnel branch from 049fe1f to dafcccf on October 29, 2025 01:53
@claude

claude bot commented Oct 29, 2025

Code Review: WebSocket Message Acknowledgment System

I've completed a thorough review of PR #3275 which adds WebSocket message acknowledgment to the runner protocol. Here's my comprehensive feedback:

Overview

This PR implements a sophisticated message acknowledgment system for WebSocket communications between the gateway and runners, enabling reliable message delivery with hibernation support. The implementation appears solid overall with good error handling patterns.


🟢 Strengths

1. Protocol Design

  • Clean protocol extension with v2 schema that maintains backward compatibility
  • Well-structured message types: TunnelAck, ToServerWebSocketMessageAck
  • Smart use of message indices (u16) with wrapping arithmetic for hibernation scenarios

2. Hibernation State Management

  • Excellent HibernationState tracking with pending message queues
  • Proper bounds checking on pending messages (1 MiB limit, u16::MAX count)
  • Message resend logic correctly handles wrapping indices with wrapping_gt() helper

3. Error Handling

  • Good use of custom error types following the codebase patterns:
    • WebSocketServiceRetry - signals retry without counting as attempt
    • WebSocketServiceTimeout - explicit timeout handling
    • WebsocketPendingLimitReached - backpressure protection
  • Proper error propagation through the system

4. Timeout Management

  • MESSAGE_ACK_TIMEOUT (30s) and GC_INTERVAL (15s) are reasonable defaults
  • Garbage collection properly cleans up stale in-flight requests
  • Timeout handling in TunnelMessageData::Timeout variant

🟡 Issues & Concerns

High Priority

1. Logic Bug in GC Retention (engine/packages/pegboard-gateway/src/shared_state.rs:420-422)

let mut keep = true;

if let Some(earliest_pending_msg) = req.pending_msgs.first() {
    keep = now.duration_since(earliest_pending_msg.send_instant)
        > MESSAGE_ACK_TIMEOUT;  // ❌ inverted: should be < MESSAGE_ACK_TIMEOUT (or <=)
}

Problem: The condition is inverted. Currently it keeps requests that have exceeded the timeout and removes those that haven't, which is backwards.

Expected: Keep requests where the oldest pending message is within the timeout window.

Fix:

keep = now.duration_since(earliest_pending_msg.send_instant) < MESSAGE_ACK_TIMEOUT;

The same issue exists at line 429 for hibernation state.

2. Potential Index Skip Issue (engine/packages/pegboard-gateway/src/shared_state.rs:137)

// TODO: This ends up skipping 0 as an index when initiated but whatever
msg.index = hs.last_ws_msg_index.wrapping_add(1);

The TODO suggests index 0 is skipped on init. This could cause issues:

  • If client expects indices starting at 0, there will be a mismatch
  • Could complicate hibernation recovery if first message is lost
  • The comment "but whatever" suggests this wasn't fully considered

Recommendation: Either properly initialize to start at 0, or document why skipping 0 is acceptable.

Medium Priority

3. Commented Debug Code (multiple files)

// fn wrapping_lt(a: u16, b: u16) -> bool {
//     b.wrapping_sub(a) < u16::MAX / 2
// }

Dead code should be removed rather than commented out.

4. Ambiguous Warning on Unmatched Ack (engine/packages/pegboard-gateway/src/shared_state.rs:230-235)

The ack handling has this warning:

if prev_len == in_flight.pending_msgs.len() {
    tracing::warn!("pending message does not exist or ack received after message body")
}

This suggests a potential race condition. Consider:

  • Is it valid to receive an ack for a message we don't track?
  • Should this be an error rather than a warning?
  • Document the expected flow to clarify when this is benign vs problematic

5. Missing Index Validation

When acking messages via ack_pending_websocket_messages, there's no validation that ack_index is reasonable. Consider adding bounds checking to detect protocol errors early.

Low Priority

6. Hardcoded Limits (engine/packages/pegboard-gateway/src/shared_state.rs:18)

const MAX_PENDING_MSGS_SIZE_PER_REQ: u64 = util::size::mebibytes(1);

Consider making this configurable per-actor or per-runner for different use cases (e.g., high-throughput actors might need more buffer).

7. Tracing Improvements

Several trace points could benefit from structured fields:

// Current
tracing::debug!("gateway waiting for response from tunnel");

// Better
tracing::debug!(request_id=?Uuid::from_bytes(request_id), "gateway waiting for response from tunnel");

This would help with debugging production issues.


📋 Code Quality

Style & Conventions ✅

  • Follows Rust formatting with hard tabs (per rustfmt.toml)
  • Imports properly organized at file top (not inline)
  • Uses anyhow::* for error handling as per guidelines
  • Proper use of workspace dependencies

Logging Patterns ✅

  • Generally good use of structured logging with tracing
  • Most log messages are lowercase as per conventions
  • Uses ?field notation for debug formatting

Error Patterns ✅

  • Proper #[derive(RivetError)] usage
  • Error groups ("guard", "ws") follow domain organization
  • Includes Serialize, Deserialize for errors with metadata

🔒 Security Considerations

✅ Good

  • Proper bounds checking on message queue sizes prevents memory exhaustion
  • Timeout mechanisms prevent infinite waits
  • Rate limiting preserved from existing proxy implementation

⚠️ Consider

  • DoS potential: A malicious runner could send many messages without acks to fill buffers
    • Mitigation: The 1 MiB limit helps, but consider per-runner global limits
  • Audit trail: Consider logging when hibernation limits are hit for monitoring

🧪 Testing Recommendations

  1. Unit Tests Needed:

    • Wrapping arithmetic in wrapping_gt() with edge cases (0, u16::MAX, wrap-around)
    • Message ack retention logic with various index sequences
    • GC behavior with timeout edge cases
  2. Integration Tests:

    • Hibernation cycle: send messages → hibernate → wake → verify resend
    • Network partition simulation: delay acks beyond timeout
    • High message volume to test backpressure limits
  3. Consider adding:

    • Proptest/quickcheck for wrapping arithmetic invariants (sketch below)
    • Load tests for concurrent websocket connections
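
A sketch of the proptest idea, assuming the proptest crate and inlining the wrapping_gt helper quoted elsewhere in this thread:

use proptest::prelude::*;

fn wrapping_gt(a: u16, b: u16) -> bool {
    a != b && a.wrapping_sub(b) < u16::MAX / 2
}

proptest! {
    #[test]
    fn wrapping_gt_never_both(a: u16, b: u16) {
        // Two indices can never both compare greater; note that both may be
        // false when the indices are close to u16::MAX / 2 apart, which is
        // exactly the boundary worth testing explicitly.
        prop_assert!(!(wrapping_gt(a, b) && wrapping_gt(b, a)));
    }
}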

📝 Minor Nitpicks

  1. Cargo.toml version changes: The version bump from 25.8.2 to 2.0.22-rc.1 looks like a major version regression. Verify this is intentional (possibly a rebranding/renumbering?).

  2. vbare dependency: Changed from crates.io to git with specific rev. Ensure this is necessary and the dependency is stable.

  3. Grafana dashboard changes: The SQL query updates use IN () instead of $__conditionalAll(). Verify this maintains the same filtering behavior.


✅ Recommendations Summary

Must Fix:

  1. Fix GC retention logic inversion (lines 420-422, 429-430 in shared_state.rs)
  2. Resolve or properly document the index-0 skipping issue

Should Fix:
3. Remove commented dead code
4. Add validation for ack indices
5. Document/improve the "ack after message" warning case

Consider:
6. Make buffer limits configurable
7. Add comprehensive test coverage
8. Enhance structured logging
9. Document hibernation protocol flow


Overall Assessment

This is a well-designed feature that adds important reliability to WebSocket communications. The core architecture is sound, but there's a critical logic bug in the GC code that needs fixing before merge. The hibernation mechanism is clever and the error handling is comprehensive.

With the fixes above, this PR should provide robust message delivery with minimal performance overhead.

Recommended action: Request changes for the GC retention logic bug, then approve after fix.


Generated with Claude Code

@claude

claude bot commented Oct 29, 2025

Pull Request Review: WebSocket Message Acknowledgment Protocol

This PR adds a critical reliability feature to the runner protocol - WebSocket message acknowledgments for guaranteed message delivery during actor hibernation. The implementation is generally solid with good attention to reliability patterns. Here are my findings:

✅ Strengths

Protocol Design

  • Well-structured protocol extension: The addition of TunnelAck, ToServerWebSocketMessageAck, and message indexing (index: u16) to ToClientWebSocketMessage is clean and follows the existing pattern
  • Backward compatibility: Changes to the BARE schema (v2.bare) are additive and don't break existing implementations
  • Proper message sequencing: Using wrapping arithmetic (wrapping_add, wrapping_sub) for u16 indices correctly handles overflow at 65,535 messages

Implementation Quality

  • Comprehensive error handling: New error types (WebSocketServiceRetry, WebSocketServiceTimeout) provide clear feedback
  • Memory safety: The pending message limit (MAX_PENDING_MSGS_SIZE_PER_REQ = 1 MiB) prevents unbounded memory growth
  • Timeout management: MESSAGE_ACK_TIMEOUT = 30s and TUNNEL_ACK_TIMEOUT = 2s provide reasonable bounds
  • Good separation of concerns: SharedState manages message lifecycle independently from gateway logic

Reliability Features

  • Message persistence: Pending messages are stored and replayed after hibernation (engine/packages/pegboard-gateway/src/shared_state.rs:312-369)
  • Garbage collection: Automatic cleanup of stale requests via gc() task (shared_state.rs:403-442)
  • Proper retry logic: WebSocket errors trigger route resolution retry with backoff (lib.rs:296, 300)

⚠️ Issues & Recommendations

🔴 Critical: Wrapping Comparison Logic

Location: engine/packages/pegboard-gateway/src/shared_state.rs:453-455

fn wrapping_gt(a: u16, b: u16) -> bool {
    a != b && a.wrapping_sub(b) < u16::MAX / 2
}

Issue: The wrapping comparison function has a potential edge case. When exactly at the midpoint (32,767 messages apart), the comparison becomes ambiguous.

Recommendation: Document this limitation clearly or consider using a more robust wrapping comparison:

fn wrapping_gt(a: u16, b: u16) -> bool {
    a != b && a.wrapping_sub(b) < (u16::MAX / 2) + 1
}

This narrows the ambiguous window at the midpoint, where neither index compares greater, from three index distances to a single one (indices exactly 2^15 apart).

🟡 High: GC Logic Inverted

Location: engine/packages/pegboard-gateway/src/shared_state.rs:420-422

keep = now.duration_since(earliest_pending_msg.send_instant) > MESSAGE_ACK_TIMEOUT;

Issue: The garbage collection logic appears inverted. Currently it keeps requests when they've exceeded the timeout, but should remove them. This will cause timed-out requests to never be cleaned up.

Fix: Should be:

keep = now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT;

Same issue on line 429-430 for hibernated WebSocket messages.

🟡 Medium: Potential Message Loss on Reconnect

Location: engine/packages/pegboard-gateway/src/shared_state.rs:324-341

When resending pending WebSocket messages after hibernation, the code filters based on last_msg_index:

if last_msg_index < 0 || wrapping_gt(msg_index, last_msg_index.try_into()?) {
    self.ups.publish(&receiver_subject, &pending_msg.payload, PublishOpts::one()).await?;
}

Concern: If a message send fails here (network error, etc.), the error propagates but the pending messages aren't marked for retry. Consider wrapping this in a retry loop or at minimum logging which messages failed to resend.
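
A fragment sketch of the logging fallback (names follow the snippet quoted above; a retry loop would wrap the same call):

// Log and continue rather than aborting the whole resend loop on one
// failed publish; the message stays in pending_ws_msgs for a later retry.
if let Err(err) = self
    .ups
    .publish(&receiver_subject, &pending_msg.payload, PublishOpts::one())
    .await
{
    tracing::warn!(?err, "failed to resend pending websocket message");
}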

🟡 Medium: Index Skipping in Hibernation

Location: engine/packages/pegboard-gateway/src/shared_state.rs:137-138

// TODO: This ends up skipping 0 as an index when initiated but whatever
msg.index = hs.last_ws_msg_index.wrapping_add(1);

Issue: This TODO indicates a known issue where index 0 is never used. While not breaking, this should be fixed for completeness:

// On initialization
req.hibernation_state = Some(HibernationState {
    total_pending_ws_msgs_size: 0,
    last_ws_msg_index: u16::MAX,  // Wraps to 0 on first message
    pending_ws_msgs: Vec::new(),
});

🟠 Low: TypeScript Error Handling

Location: engine/sdks/typescript/runner/src/websocket-tunnel-adapter.ts:431-434

The _handleMessage function returns a boolean to indicate if the message was consumed, but this return value doesn't appear to be used anywhere. Consider:

  1. Documenting what the return value means
  2. Using it in the caller, or
  3. Removing it if unnecessary

🟠 Low: Missing Bounds Check

Location: engine/packages/pegboard-gateway/src/shared_state.rs:173

if hs.total_pending_ws_msgs_size > MAX_PENDING_MSGS_SIZE_PER_REQ
    || hs.pending_ws_msgs.len() >= u16::MAX as usize
{
    return Err(WebsocketPendingLimitReached {}.build());
}

Suggestion: The check should be > u16::MAX as usize instead of >= since u16::MAX (65,535) is a valid count.

📋 Code Quality Observations

Good Practices ✅

  • Consistent error handling using the custom RivetError system
  • Proper use of structured logging with tracing (includes request_id, message_id)
  • Timeout guards on all async operations
  • Clear separation between gateway and runner concerns

Minor Improvements 🔧

  1. Naming: WebsocketPendingLimitReached should be WebSocketPendingLimitReached (capital S) for consistency with other error names
  2. Documentation: The protocol changes would benefit from a comment block explaining the ack flow
  3. Testing: While I don't see test changes in this diff, ensure integration tests cover:
    • Message replay after hibernation
    • Proper ack handling under load
    • Timeout and retry scenarios

🔒 Security Considerations

Memory Limits: The 1 MiB limit per request prevents memory exhaustion attacks
Timeout Guards: All operations have reasonable timeouts
No Credential Exposure: The protocol changes don't introduce credential leakage risks
⚠️ DoS Consideration: A malicious client could potentially create many in-flight requests with pending messages. Consider adding a global limit on total pending message memory across all requests.

📊 Performance Notes

  • Memory overhead: Each pending message stores full serialized payload (~150-300 bytes typical). With 1 MiB limit, this is ~3,000-6,000 messages max per connection
  • GC frequency: 15-second intervals are reasonable for cleanup
  • Wrapping arithmetic: Efficient for index management, no performance concerns

🎯 Recommendations Summary

Before Merge:

  1. 🔴 Fix the GC logic inversion (lines 420-422, 429-430 in shared_state.rs)
  2. 🟡 Review and fix wrapping comparison edge case
  3. 🟡 Add error handling for message resend failures

Follow-up:

  1. Fix index 0 skipping in hibernation initialization
  2. Add integration tests for hibernation scenarios
  3. Consider global memory limits across all connections
  4. Document the ack protocol flow in code comments

Overall Assessment

This is a well-designed feature that significantly improves WebSocket reliability during actor hibernation. The core architecture is sound, but the GC logic bug needs fixing before merge. The wrapping arithmetic and message management are implemented correctly overall.

Recommendation: Request changes for the GC logic fix, then approve pending successful testing.


Reviewed with attention to CLAUDE.md conventions - structured logging, error patterns, and dependency management all follow project standards.

@claude

claude bot commented Oct 29, 2025

Pull Request Review: WebSocket Message Acknowledgment Protocol

Overview

This PR introduces a WebSocket message acknowledgment system to the runner protocol (v2), enabling reliable message delivery with hibernation support. The implementation adds message indexing, acknowledgment tracking, and automatic retry/resend capabilities.


Positive Aspects

Well-Designed Protocol

  • Clean separation of v1 and v2 protocols allows backward compatibility
  • Wrapping arithmetic (u16 index with wrapping_gt) correctly handles index overflow
  • Bounded message queue (1 MiB, 65535 messages) prevents unbounded memory growth

Proper Error Handling

  • Three new error types properly categorized by retriability:
    • WebSocketServiceUnavailable: Transient failures (retryable with backoff)
    • WebSocketServiceRetry: Server-initiated retries (immediate, no backoff)
    • WebSocketServiceTimeout: Ack timeout failures (non-retryable)
  • Error messages follow the project's RivetError conventions

Comprehensive Implementation

  • Gateway-side tracking with HibernationState for pending messages
  • Client SDK integration with TypeScript runner
  • Retry logic with exponential backoff (3 attempts for transient failures)
  • Message replay on reconnection after hibernation

Issues & Concerns

🔴 Critical: Timeout Mismatch

Location: packages/pegboard-gateway/src/shared_state.rs:17 vs engine/sdks/typescript/runner/src/tunnel.ts:10

Gateway timeout: 30 seconds

const MESSAGE_ACK_TIMEOUT: Duration = Duration::from_secs(30);

Runner timeout: 5 seconds

const MESSAGE_ACK_TIMEOUT = 5000; // 5 seconds

Impact: The runner may close connections or retry 6x earlier than the gateway expects. This could lead to:

  • Unnecessary retries and wasted resources
  • False positive timeout errors on the runner side
  • Inconsistent behavior under load

Recommendation: Synchronize these timeouts. Consider making the runner timeout slightly shorter (e.g., 28s) to give it time to initiate cleanup before gateway timeout.


⚠️ High Priority: Missing Test Coverage

No tests found for:

  1. Message acknowledgment flow (send → ack → removal from pending queue)
  2. Index wrapping behavior (when index wraps from 65535 → 0)
  3. Hibernation and message replay scenarios
  4. Timeout and retry logic
  5. Concurrent message sending/acking edge cases

Existing test: guard-core/tests/simple_websocket_test2.rs is a basic echo test and doesn't exercise the new ack mechanism.

Recommendation: Add integration tests covering:

#[tokio::test]
async fn test_websocket_message_ack_basic() { /* ... */ }

#[tokio::test]
async fn test_websocket_hibernation_replay() { /* ... */ }

#[tokio::test]
async fn test_websocket_index_wrapping() { /* ... */ }

#[tokio::test]
async fn test_websocket_ack_timeout() { /* ... */ }

#[tokio::test]
async fn test_pending_message_limit() { /* ... */ }

⚠️ Medium: Potential Data Loss on Acknowledgment

Location: packages/pegboard-gateway/src/shared_state.rs:387-398

The ack_pending_websocket_messages function removes messages from the pending queue without updating total_pending_ws_msgs_size:

hs.pending_ws_msgs.retain(|_| {
    let msg_index = hs.last_ws_msg_index
        .wrapping_sub(len)
        .wrapping_add(1)
        .wrapping_add(iter_index);
    let keep = wrapping_gt(msg_index, ack_index);
    iter_index += 1;
    keep
});

Issue: total_pending_ws_msgs_size is incremented when messages are added (line 170) but never decremented when acknowledged.

Impact:

  • total_pending_ws_msgs_size will grow indefinitely
  • Eventually hits MAX_PENDING_MSGS_SIZE_PER_REQ limit even if all messages are acknowledged
  • Causes WebsocketPendingLimitReached errors for active connections

Recommendation: Track size of removed messages and update the counter:

let mut removed_size = 0u64;
hs.pending_ws_msgs.retain(|msg| {
    let msg_index = /* calculation */;
    let keep = wrapping_gt(msg_index, ack_index);
    if !keep {
        removed_size += msg.payload.len() as u64;
    }
    keep
});
hs.total_pending_ws_msgs_size = hs.total_pending_ws_msgs_size.saturating_sub(removed_size);

⚠️ Medium: Race Condition in Message Indexing

Location: packages/pegboard-gateway/src/shared_state.rs:134-143

msg.index = hs.last_ws_msg_index.wrapping_add(1);
// ... later ...
hs.last_ws_msg_index = ws_msg_index; // line 178

Issue: If multiple messages are sent concurrently (async context), last_ws_msg_index could be read by two tasks before either writes back, leading to duplicate indices.

Likelihood: Low (depends on concurrent access patterns), but possible with high-throughput WebSocket connections.

Recommendation:

  • Document that send_message should not be called concurrently for the same request
  • OR use atomic operations/locks if concurrent sends are expected (see the sketch after this list)
  • OR return the index from the send operation to make the read-modify-write atomic
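
A minimal sketch of the atomic option (type and method names are illustrative assumptions):

use std::sync::atomic::{AtomicU16, Ordering};

// fetch_add makes the read-modify-write on the index a single atomic step;
// AtomicU16 wraps on overflow, matching the wrapping_add semantics.
struct WsMsgIndex(AtomicU16);

impl WsMsgIndex {
    fn next(&self) -> u16 {
        // fetch_add returns the previous value; adding 1 yields the new index
        self.0.fetch_add(1, Ordering::Relaxed).wrapping_add(1)
    }
}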

🟡 Low: Index 0 Skipped on Initialization

Location: packages/pegboard-gateway/src/shared_state.rs:137

// TODO: This ends up skipping 0 as an index when initiated but whatever
msg.index = hs.last_ws_msg_index.wrapping_add(1);

Issue: The first message will have index 1, not 0 (since last_ws_msg_index starts at 0).

Impact: Low - wastes one index value but doesn't affect correctness due to wrapping arithmetic.

Recommendation: Either:

  1. Initialize last_ws_msg_index to u16::MAX so first message gets index 0
  2. Remove the TODO and document this is intentional
  3. Keep as-is if the skip is acceptable

🟡 Low: Incomplete Size Accounting in Resend

Location: packages/pegboard-gateway/src/shared_state.rs:343-364

Similar to the ack function, resend_pending_websocket_messages retains messages (line 348) but doesn't update total_pending_ws_msgs_size.

Same fix needed as for ack_pending_websocket_messages.


Code Quality

Good Practices

  • Follows CLAUDE.md conventions (error macros, imports, logging)
  • Proper use of structured logging with tracing
  • Uses workspace dependencies correctly
  • Added tokio-tungstenite to guard package appropriately

⚠️ Areas for Improvement

  • Missing inline documentation for complex logic (wrapping arithmetic, retention algorithms)
  • No unit tests for wrapping_gt helper function
  • Some magic numbers (30s timeout, 1 MiB limit) could be configurable

Performance Considerations

Efficient Design

  • O(1) message sending with index increment
  • Bounded memory usage per connection
  • Efficient retain operation for cleanup

⚠️ Potential Concerns

  1. O(n) retain operations on every ack could impact high-throughput connections
    • Consider batch acking or deferred cleanup
  2. Message serialization duplication: Messages are serialized and stored even if client may not need replay
    • Could optimize for non-hibernation case
  3. GC interval (15s) might be too aggressive for checking timeouts
    • Consider tuning based on load testing

Security Concerns

Good Security Practices

  • Bounded message queue prevents memory exhaustion attacks
  • Timeout mechanisms prevent indefinite resource holding
  • No obvious injection vulnerabilities

🟢 Minor Suggestions

  • Consider logging/metrics for pending message size approaching limits (DoS detection)
  • Document maximum memory usage per connection for capacity planning

Dependency Changes

⚠️ Git Dependency Concern

Changes:

[workspace.dependencies.vbare]
-version = "0.0.2"
+git = "https://github.com/rivet-dev/vbare"
+rev = "3ae474a0234801bb96d70bec4eddd4f2d640971e"

Issue: Switching from crates.io version to git dependency.

Questions:

  1. Why is this change needed for the ack feature?
  2. Are there breaking changes in vbare that haven't been published?
  3. Should this be a separate PR since it's unrelated to ack functionality?

Recommendation: Either:

  • Explain the necessity in PR description
  • Split into separate PR if unrelated
  • Publish vbare 0.0.3 to crates.io if stable

Similar concern for the rivet-term repo change (rivet-gg → rivet-dev).


Miscellaneous

📝 Non-functional Changes

  • Version bumps (25.8.2 → 2.0.22-rc.1) across all packages - appears to be release preparation
  • Grafana dashboard query improvements (better ClickHouse query syntax)
  • Error artifact JSON formatting (minor whitespace)

These are fine but make the PR large. Consider separating version bumps and tooling improvements from feature PRs in the future for easier review.


Summary & Recommendations

Before Merging:

  1. MUST FIX: Synchronize timeout values between gateway (30s) and runner (5s)
  2. MUST FIX: Update total_pending_ws_msgs_size when messages are acknowledged/removed
  3. SHOULD ADD: Integration tests for ack flow, hibernation, and edge cases
  4. SHOULD REVIEW: Concurrent access safety for message indexing
  5. SHOULD CLARIFY: vbare and rivet-term dependency changes

Post-Merge:

  • Monitor memory usage and ack timing in production
  • Consider performance optimization for high-throughput scenarios
  • Add metrics for pending message queue sizes and ack latency

Overall Assessment

Score: 7/10

This is a well-architected feature with good error handling and retry logic. The protocol design is sound, and the implementation follows project conventions. However, the timeout mismatch and missing size accounting are critical bugs that must be fixed before merge. Test coverage is insufficient for a reliability-critical feature like message acknowledgment.

Recommendation: REQUEST CHANGES

  • Fix critical bugs (timeouts, size accounting)
  • Add integration tests
  • Clarify dependency changes

Once addressed, this will be a solid addition to enable reliable WebSocket communication with hibernation support.


Great work on tackling a complex distributed systems problem! The foundation is strong - just needs some polish before production. 🚀

@claude

claude bot commented Oct 29, 2025

Pull Request Review: WebSocket Message Acknowledgment

Overview

This PR adds WebSocket message acknowledgment to the runner protocol (v2), enabling reliable message delivery with hibernation support for actors. The changes are substantial but well-structured.

📋 Summary of Changes

  • Runner Protocol v2: New schema with message acknowledgment primitives
  • Pegboard Gateway: Comprehensive message tracking and resend logic
  • Guard Core: New error types for WebSocket retry scenarios
  • Epoxy: New purger workflow for KV cleanup
  • TypeScript SDK: Updated to handle new protocol features

✅ Strengths

1. Well-Designed Protocol

The new v2 protocol schema is clean and comprehensive:

  • TunnelAck primitive for bidirectional acknowledgments
  • ToServerWebSocketMessageAck for explicit ack messages
  • lastMsgIndex tracking for hibernation recovery
  • Proper separation of concerns between message types

2. Robust Message Tracking

The SharedState implementation in pegboard-gateway/shared_state.rs is well thought out:

  • Pending message tracking with timeouts (30s)
  • Hibernation state with message buffering
  • Wrapping index arithmetic for u16 overflow handling
  • Size limits to prevent memory exhaustion (MAX_PENDING_MSGS_SIZE_PER_REQ)

3. Error Handling

Good addition of specific error types:

  • WebSocketServiceRetry - for hibernation reconnects
  • WebSocketServiceTimeout - for timeout scenarios
  • WebsocketPendingLimitReached - for backpressure

4. Graceful Degradation

The retry logic with hibernation support allows actors to:

  • Reconnect after temporary disconnections
  • Resume from last acknowledged message
  • Handle network instability gracefully

⚠️ Issues & Concerns

1. Potential Race Condition in GC Logic (Minor)

Location: pegboard-gateway/src/shared_state.rs:420-423

if let Some(earliest_pending_msg) = req.pending_msgs.first() {
    keep = now.duration_since(earliest_pending_msg.send_instant)
        > MESSAGE_ACK_TIMEOUT;
}

Issue: The keep logic appears inverted. If the duration exceeds the timeout, we're setting keep = true, which means we'll KEEP the request, but we should be removing it.

Suggestion: Should be:

keep = now.duration_since(earliest_pending_msg.send_instant)
    <= MESSAGE_ACK_TIMEOUT;

Or send a timeout message and remove it:

if now.duration_since(earliest_pending_msg.send_instant) > MESSAGE_ACK_TIMEOUT {
    keep = false;
}

2. Wrapping Index Arithmetic Needs Validation (Medium)

Location: pegboard-gateway/src/shared_state.rs:453-455

fn wrapping_gt(a: u16, b: u16) -> bool {
    a != b && a.wrapping_sub(b) < u16::MAX / 2
}

Issue: The wrapping comparison logic is clever but needs careful review:

  • This assumes sequence numbers wrap within the window of u16::MAX / 2
  • If messages are delayed by more than ~32K messages, this could fail
  • Consider documenting the assumptions and constraints

Suggestion: Add documentation explaining:

  • Maximum window size
  • What happens if messages arrive out of order beyond the window
  • Test cases for wraparound scenarios

3. Missing Backpressure on Receiver (Medium)

Location: pegboard-gateway/src/shared_state.rs:199-287

The receiver task processes messages but doesn't have explicit backpressure if the channel fills up. The channel is sized at 128:

let (msg_tx, msg_rx) = mpsc::channel(128);

Issue: If a consumer can't keep up, messages could be dropped silently.

Suggestion: Consider:

  • Monitoring channel capacity metrics
  • Logging when close to capacity (see the sketch after this list)
  • Implementing explicit backpressure signals
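
A minimal sketch of the capacity logging idea, assuming a tokio mpsc sender like the 128-slot channel quoted above (the threshold is an illustrative assumption):

use tokio::sync::mpsc;

// tokio's Sender::capacity() reports the currently available slots.
async fn send_with_capacity_warning(msg_tx: &mpsc::Sender<Vec<u8>>, msg: Vec<u8>) {
    if msg_tx.capacity() < 16 {
        tracing::warn!(
            available = msg_tx.capacity(),
            "tunnel message channel near capacity"
        );
    }
    if msg_tx.send(msg).await.is_err() {
        tracing::debug!("receiver dropped; message discarded");
    }
}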

4. Hard-Coded Timeout Values (Minor)

Locations: Multiple files

const TUNNEL_ACK_TIMEOUT: Duration = Duration::from_secs(2);
const MESSAGE_ACK_TIMEOUT: Duration = Duration::from_secs(30);

Suggestion: Consider making these configurable per-actor or environment, especially for testing.

5. Incomplete Error Context (Minor)

Location: pegboard-gateway/src/shared_state.rs:123-125

let mut req = self
    .in_flight_requests
    .get_async(&request_id)
    .await
    .context("request not in flight")?;

Suggestion: Include the request_id in the error context for debugging:

.with_context(|| format!("request not in flight: {}", Uuid::from_bytes(request_id)))?;

6. Potential Memory Leak in Hibernation (Medium)

Location: pegboard-gateway/src/shared_state.rs:169-186

If a WebSocket connection never reconnects, pending messages will accumulate until:

  1. The size limit is hit (MAX_PENDING_MSGS_SIZE_PER_REQ)
  2. The GC timeout fires

Issue: The GC timeout for pending WS messages is checked (line 425-431), but if new messages keep arriving within the timeout window, they'll never be cleaned up.

Suggestion: Add:

  • Maximum age for hibernated connections (see the sketch after this list)
  • Metrics for hibernation duration
  • Explicit cleanup after X minutes of hibernation
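
A minimal sketch of the maximum-age check; the constant name and the 15-minute value are illustrative assumptions, not values from the PR:

use std::time::{Duration, Instant};

const MAX_HIBERNATION_AGE: Duration = Duration::from_secs(15 * 60);

// Evict hibernated connections that have been asleep too long, regardless
// of whether new messages keep arriving within the ack timeout window.
fn hibernation_expired(hibernated_at: Instant, now: Instant) -> bool {
    now.duration_since(hibernated_at) > MAX_HIBERNATION_AGE
}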

7. Missing Test Coverage (High)

Observation: No new test files added for:

  • WebSocket acknowledgment flow
  • Hibernation message replay
  • Wrapping index arithmetic
  • Timeout scenarios
  • GC behavior

Suggestion: Add integration tests covering:

  • Normal ack flow
  • Hibernation and reconnect
  • Message replay with gaps
  • Index wraparound
  • Timeout and cleanup

🔒 Security Considerations

1. DoS via Message Accumulation

The size limit (MAX_PENDING_MSGS_SIZE_PER_REQ = 1 MiB) provides good protection, but consider:

  • Per-actor rate limiting on message count
  • Metrics for actors hitting limits frequently

2. No Authentication on Ack Messages

Ack messages don't appear to be authenticated beyond the request_id/message_id pairing. This is probably fine given the architecture but worth documenting.

🎯 Performance Considerations

1. Lock Contention

The HashMap from scc is good for concurrency, but operations like get_async still acquire locks. With high message rates, this could become a bottleneck.

Suggestion: Consider performance testing under load with:

  • Many concurrent WebSocket connections
  • High message rates
  • Frequent hibernation/reconnection

2. Memory Overhead

Each in-flight request maintains:

  • Pending message list
  • Hibernation state with buffered messages
  • Channel for incoming messages

Suggestion: Add metrics for:

  • Total in-flight requests
  • Average pending message count
  • Memory usage per request

📝 Code Quality

1. Good Logging

The logging throughout is excellent with structured fields and appropriate levels.

2. Consistent Error Handling

Good use of the custom error system with RivetError.

3. Documentation

Some complex logic (like wrapping arithmetic) needs more inline documentation.

🚀 Recommendations

Priority: High

  1. Fix the GC keep logic (potential bug)
  2. Add comprehensive test coverage
  3. Address hibernation memory leak scenario

Priority: Medium

  1. Document wrapping index assumptions
  2. Add performance metrics
  3. Improve error context with IDs
  4. Test under load

Priority: Low

  1. Make timeouts configurable
  2. Add backpressure monitoring
  3. Document security assumptions

✨ Overall Assessment

This is a well-designed feature that adds important reliability to the WebSocket communication layer. The implementation is generally solid with good separation of concerns and error handling.

The main concerns are:

  • Potential GC bug that needs immediate attention
  • Missing test coverage for complex message replay logic
  • Some edge cases around long-lived hibernation

Recommendation: Request fixes for the high-priority items before merging. The implementation quality is good, but the lack of tests for such critical functionality is a concern.


Review generated with Claude Code

Base automatically changed from 10-24-fix_improve_reconnection_with_sleeping_and_tunnel to main on October 31, 2025 01:14
@NathanFlurry force-pushed the 10-27-feat_add_ws_msg_ack_to_runner_protocol branch from a7d0289 to a3e2e41 on October 31, 2025 01:14
@NathanFlurry merged commit 79bee4d into main on October 31, 2025
8 of 12 checks passed
@NathanFlurry deleted the 10-27-feat_add_ws_msg_ack_to_runner_protocol branch on October 31, 2025 01:15
@claude

claude bot commented Oct 31, 2025

PR Review: WebSocket Message Acknowledgment Implementation

Overview

This PR adds a message acknowledgment mechanism to the runner protocol for WebSocket communications. The changes enable reliable message delivery tracking between the gateway and runners, which is critical for production reliability.


Positive Aspects ✅

1. Well-Structured Protocol Changes

  • The v2.bare schema additions are clean and minimal:
    • ToServerWebSocketMessageAck with index tracking (line 260-262)
    • TunnelAck void type for bidirectional acknowledgments (line 199)
  • Proper versioning maintained with new protocol schema

2. Comprehensive Error Handling

  • Added three new error types in guard-core/src/errors.rs:
    • WebSocketServiceRetry - for retryable WebSocket failures
    • WebSocketServiceTimeout - for timeout scenarios
    • WebSocketServiceUnavailable - for service availability issues
  • Error artifacts properly generated with JSON definitions

3. Robust Acknowledgment Tracking

  • Gateway side (pegboard-gateway/shared_state.rs):
    • Tracks pending messages with timestamps
    • Implements timeout handling (30s MESSAGE_ACK_TIMEOUT)
    • Garbage collection for stale messages (15s GC_INTERVAL)
    • Per-request pending message limits (1 MiB MAX_PENDING_MSGS_SIZE_PER_REQ)
  • Client side (TypeScript SDK):
    • 5s timeout for message acknowledgments
    • Proper cleanup on timeout with request rejection

4. TypeScript SDK Improvements

  • WebSocketTunnelAdapter properly handles message indices
  • Added __closeWithRetry method for graceful retry handling
  • Event buffering mechanism prevents race conditions with listener attachment

Issues & Concerns ⚠️

1. Potential Memory Leak Risk 🔴 HIGH PRIORITY

Location: engine/packages/pegboard-gateway/src/shared_state.rs:156-186

The send_message function adds messages to pending_msgs but there's no code path shown that removes acknowledged messages from this vector. The GC loop only removes timed-out messages.

Issue:

req.pending_msgs.push(PendingMessage {
    message_id,
    send_instant: now,
});

Recommendation:
Add an acknowledgment handler that removes messages from pending_msgs when acks are received:

// In the receiver function, when TunnelAck is received:
if let Some(req) = self.in_flight_requests.get_async(&request_id).await {
    req.pending_msgs.retain(|msg| msg.message_id != message_id);
}

2. Missing Null Safety Check 🟡 MEDIUM PRIORITY

Location: engine/packages/guard-core/src/proxy_service.rs:722

let status = match &res {
    Ok(resp) => resp.status().as_u16().to_string(),
    Err(_) => "error".to_string(),  // Could provide more detail
};

Recommendation: Consider logging the actual error for debugging while maintaining the metric label.
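
A fragment sketch of that suggestion, keeping the metric label stable while logging the underlying error:

let status = match &res {
    Ok(resp) => resp.status().as_u16().to_string(),
    Err(err) => {
        // keep the coarse metric label, but surface the error for debugging
        tracing::debug!(?err, "proxied request failed");
        "error".to_string()
    }
};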

3. Inconsistent Timeout Values

  • Gateway expects acks within 30 seconds (MESSAGE_ACK_TIMEOUT = Duration::from_secs(30))
  • TypeScript client times out after 5 seconds (MESSAGE_ACK_TIMEOUT = 5000)

Concern: This mismatch could lead to client-side timeouts while the gateway still considers messages pending.

Recommendation: Align these values or add comments explaining the rationale for different timeouts.

4. Hibernation State Management Complexity

Location: engine/packages/pegboard-gateway/src/shared_state.rs:134-186

The hibernation state logic interleaves with acknowledgment tracking, making the code path complex:

  • Message indices are conditionally updated
  • Pending WS messages are tracked separately
  • Size limits are enforced

Recommendation:

  • Add more inline documentation explaining the hibernation flow
  • Consider extracting hibernation logic into a separate method for clarity

5. Error Propagation in WebSocket Close

Location: engine/sdks/typescript/runner/src/websocket-tunnel-adapter.ts:196

The closeInner function doesn't handle potential errors from #closeCallback:

closeInner(code?: number, reason?: string, retry: boolean = false): void {
    // ...
    this.#closeCallback(code, reason, retry);  // No error handling
    this.#readyState = 3; // CLOSED
    // ...
}

Recommendation: Wrap in try-catch or document that callbacks should not throw.


Code Quality & Best Practices ✅

Positive:

  1. Follows CLAUDE.md conventions:

    • Uses structured logging with tracing macros
    • Proper timestamp naming (send_instant, not sendInstant)
    • Lowercase log messages
    • Uses anyhow::* for error handling
  2. Dependency Management:

    • Properly uses workspace dependencies
    • Updated vbare to git dependency (rev: 3ae474a)
  3. Testing Considerations:

    • Changes are backward compatible (new v2 protocol schema)
    • Existing tests should continue to pass

Areas for Improvement:

  1. Missing Unit Tests: No new test files added for acknowledgment logic
  2. Documentation: Limited inline documentation for complex state management
  3. Metrics: No new metrics added for tracking acknowledgment failures/timeouts

Security Considerations 🔒

Positive:

  • Message size limits prevent memory exhaustion (MAX_PENDING_MSGS_SIZE_PER_REQ)
  • Timeout mechanisms prevent infinite waiting
  • No sensitive data exposure in error messages

Concern:

  • No rate limiting on acknowledgment messages (could be abused for DoS)

Performance Considerations ⚡

Positive:

  • Efficient hash map lookups for request tracking
  • Garbage collection runs at reasonable intervals (15s)

Concerns:

  1. Linear Search in GC: The GC loops through all pending messages every 15 seconds
    • For high-traffic scenarios, consider using a priority queue or time-indexed structure
  2. Lock Contention: HashMap access in shared_state.rs could become a bottleneck under high load
    • The scc::HashMap should handle this, but worth profiling

Recommendations Summary

Must Fix Before Merge:

  1. ⚠️ Implement acknowledgment handling to remove messages from pending_msgs
  2. ⚠️ Align timeout values or document the reasoning

Should Consider:

  1. Add unit tests for acknowledgment flow
  2. Add metrics for ack timeouts and failures
  3. Improve inline documentation for hibernation state
  4. Consider performance profiling for high-traffic scenarios

Nice to Have:

  1. Add rate limiting for tunnel messages
  2. Extract hibernation logic into separate methods
  3. Add error handling in TypeScript close callback

Conclusion

This is a solid implementation of WebSocket message acknowledgments with good error handling and protocol design. The main concern is the potential memory leak from unremoved acknowledged messages. Once that's addressed, this should be good to merge.

Overall Assessment: Approve with required changes

The architectural decisions are sound, and the implementation follows the codebase conventions well. Great work on maintaining backward compatibility with versioned protocols!


Review generated by Claude Code following repository standards from CLAUDE.md
