
feat(ups): implement queue subs #4486

Merged
NathanFlurry merged 1 commit into main from 03-23-fix_ups_implement_queue_subs
Apr 5, 2026

@MasterPtato
Contributor

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app

railway-app bot commented Mar 24, 2026

🚅 Deployed to the rivet-pr-4486 environment in rivet-frontend

| Service | Status | Web | Updated (UTC) |
| --- | --- | --- | --- |
| kitchen-sink | ❌ Build Failed (View Logs) | Web | Apr 5, 2026 at 11:23 am |
| website | 😴 Sleeping (View Logs) | Web | Apr 5, 2026 at 12:09 am |
| frontend-cloud | 😴 Sleeping (View Logs) | Web | Apr 2, 2026 at 4:26 am |
| frontend-inspector | 😴 Sleeping (View Logs) | Web | Mar 24, 2026 at 7:28 am |
| mcp-hub | ✅ Success (View Logs) | Web | Mar 24, 2026 at 12:31 am |
| ladle | ❌ Build Failed (View Logs) | Web | Mar 24, 2026 at 12:31 am |

@MasterPtato MasterPtato changed the title from "fix(ups): implement queue subs" to "feat(ups): add queue subscription support" Mar 24, 2026
@MasterPtato MasterPtato changed the title from "feat(ups): add queue subscription support" to "feat(ups): add" Mar 24, 2026
@MasterPtato MasterPtato changed the title from "feat(ups): add" to "feat(ups): implement queue subs" Mar 24, 2026
@pkg-pr-new

pkg-pr-new bot commented Mar 24, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@4486

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@4486

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@4486

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@4486

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@4486

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@4486

@rivetkit/sqlite-vfs

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sqlite-vfs@4486

@rivetkit/traces

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/traces@4486

@rivetkit/workflow-engine

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/workflow-engine@4486

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@4486

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@4486

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@4486

commit: ec32bdf

@MasterPtato MasterPtato force-pushed the 03-23-fix_ups_implement_queue_subs branch from ec32bdf to ae0f886 Compare March 24, 2026 00:36
@MasterPtato MasterPtato force-pushed the 03-19-feat_cache_add_in_flight_deduping branch from 6570bf2 to b46226c Compare March 24, 2026 00:36
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4486 March 24, 2026 00:36 Destroyed
@claude

claude bot commented Mar 24, 2026

PR Review: feat(ups): implement queue subs

The core design is solid. Using FOR UPDATE SKIP LOCKED for Postgres-backed competing consumers is the right approach, and the heartbeat/TTL pattern for publisher-side discovery is appropriate given Postgres NOTIFY's lack of built-in queue semantics. A few issues need attention before merging.

Critical

DefaultHasher is not stable across processes or Rust versions
(engine/packages/universalpubsub/src/driver/postgres/mod.rs)

The algorithm behind DefaultHasher is unspecified, and the standard library documentation says its hashes should not be relied upon across releases. Using it to compute subject_hash and queue_hash stored in ups_queue_subs / ups_queue_messages means two processes built with different Rust versions, or the same service after a toolchain upgrade, may compute different hashes for the same string, silently breaking queue routing. The existing hash_subject path already has the same bug, but extending it to persisted queue keys makes this a data-integrity issue.

Fix: Replace with a stable, deterministic hash (e.g., FNV-1a or truncated SHA-256 formatted as hex).
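As a minimal sketch of the suggested fix, here is a 64-bit FNV-1a hash written against the standard library only. The function names and the 16-character hex encoding are assumptions made for illustration and are not taken from the PR; the constants are the published FNV-1a parameters, so the output does not depend on the process or the compiler version.

```rust
/// 64-bit FNV-1a parameters as published in the FNV specification.
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Hash a byte slice with 64-bit FNV-1a: XOR each byte in, then multiply.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hypothetical helper: hex-encode the hash so it can be stored as a key.
/// The fixed 16-character width is an assumption of this sketch.
pub fn stable_hash_hex(input: &str) -> String {
    format!("{:016x}", fnv1a_64(input.as_bytes()))
}
```

Calling `stable_hash_hex` on the same subject string yields the same key in every process, which is the property the persisted queue tables need. FNV-1a is not collision resistant, so a truncated SHA-256 may be the better choice if subjects can be attacker-controlled.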

Medium

No metrics update in queue subscription cleanup task (spawn_queue_subscription_cleanup_task)

spawn_subscription_cleanup_task updates metrics::POSTGRES_SUBSCRIPTION_COUNT when removing entries. The queue analog does not update any counter after removing from queue_subscriptions. This is an observability gap.

Unresolved TODO: retain_sync inside retain_async (engine/packages/universalpubsub/src/driver/memory/mod.rs)

The TODO comment is left in the code. retain_sync inside the retain_async closure blocks the async executor thread for the duration of the inner lock. For a 60-second GC interval this is acceptable (no deadlock risk since the nested maps are distinct objects), but the TODO should be replaced with a comment explaining why it is safe.

No queue depth limit (engine/packages/universalpubsub/src/driver/postgres/mod.rs)

QUEUE_MESSAGE_MAX_AGE_SECS is 3600s and GC runs every 5 minutes. If a consumer is down, messages accumulate unbounded for up to an hour. High-throughput queues with dead consumers could grow the table very large. Worth noting as a design-level concern even if not a blocker.

Low

Idiomatic Rust: return Ok(...) vs expression form (engine/packages/universalpubsub/src/pubsub.rs:136)

return Ok(Subscriber::new(...)) should be the bare expression Ok(Subscriber::new(...)) to match the style of the surrounding subscribe method.

SQL injection pattern is fragile (engine/packages/universalpubsub/src/driver/postgres/mod.rs)

NOTIFY "{}" with channel built from format! is safe today only because the hash output is hex. The same pre-existing pattern is now copied into two additional call sites. A helper that validates the channel name before interpolating would prevent this from becoming a vulnerability if the input ever changes.
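One way such a helper could look is sketched below. The name `notify_statement`, the hex-only rule, and the `String` error type are assumptions for illustration; the PR may need a different allow-list. The 63-byte check reflects the default Postgres identifier length limit.

```rust
/// Hypothetical helper: build a NOTIFY statement only when the channel name
/// is non-empty ASCII hex, so it cannot terminate the quoted identifier.
pub fn notify_statement(channel: &str) -> Result<String, String> {
    let is_hex = !channel.is_empty() && channel.bytes().all(|b| b.is_ascii_hexdigit());
    if !is_hex {
        return Err(format!("invalid channel name: {channel:?}"));
    }
    // Postgres truncates identifiers longer than 63 bytes by default,
    // which would silently map distinct channels onto the same name.
    if channel.len() > 63 {
        return Err(format!("channel name too long: {} bytes", channel.len()));
    }
    Ok(format!("NOTIFY \"{channel}\""))
}
```

Routing every call site through one function like this keeps the safety argument in a single place instead of relying on each caller to pass hash output.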

Unused anyhow! import (engine/packages/universalpubsub/src/driver/postgres/mod.rs:1)

If all usages of anyhow! were replaced by .context() per the project convention, remove anyhow from the import line.

Informational

  • Memory metric counts topics, not subscribers. MEMORY_SUBSCRIBER_COUNT uses self.subscribers.len() + self.queue_subscribers.len() which counts distinct topic entries, not individual sender handles. Pre-existing issue, but the surface is now expanded.
  • Test timing sensitivity. The load-balance test uses a 500ms collection window per subscriber (vs. 5s in test_queue_subscribe_single). This may be flaky on slow CI/Postgres environments with 10 messages across database round-trips.
  • NATS driver is minimal and correct - clean delegation to the NATS client.

Summary

| Severity | Issue |
| --- | --- |
| Critical | DefaultHasher used for persistent DB keys - not stable across processes/versions |
| Medium | No metrics update in queue subscription cleanup task |
| Medium | retain_sync inside async retain - TODO unresolved |
| Medium | No queue depth limit; unbounded growth if consumers are down |
| Low | return Ok(...) vs idiomatic expression form |
| Low | SQL NOTIFY injection pattern fragile (safe today, not future-proof) |
| Low | Unused anyhow! import if all usages replaced |
| Info | Memory metric counts topics not subscribers (expanded surface) |
| Info | Load-balance test 500ms window may be flaky on slow CI |

The DefaultHasher stability issue is the most important fix - it will silently corrupt queue routing in any multi-process deployment.

@claude

claude bot commented Mar 24, 2026

Code Review summary for feat(ups): implement queue subs

The implementation is well-structured and the test coverage is solid.

Two medium issues to address before merge:

  • retain_sync on scc::HashMap inside the memory driver GC task runs synchronously inside an async closure and can block the executor.
  • The SELECT and INSERT+NOTIFY in publish_to_queues are not wrapped in a transaction, creating a race where a message can be orphaned for up to 1 hour if the subscriber dies between the two operations.

Minor items:

  • One remaining anyhow! macro call in publish() should use .context() instead.
  • The queue parameter is missing from the tracing span in queue_subscribe (fields(%subject) should also include queue).
  • The Cargo.lock version downgrade from 2.1.7 to 2.1.6-rc.1 appears to be a stacked-PR artifact.

On test coverage:

  • test_queue_subscribe_load_balance only checks total message count, not per-subscriber distribution. This is the right choice to avoid flakiness but worth noting in a comment.
  • A test for Postgres reconnection behavior is also missing (pending ups_queue_messages are not drained on reconnect).

@claude

claude bot commented Mar 24, 2026

PR Review: feat(ups): implement queue subs (#4486)

Good implementation of queue subscription (queue group) semantics across all three drivers. The overall design is solid — NATS delegates to the native implementation, memory uses random selection, and Postgres uses a durable table-backed approach with heartbeats. The three integration tests cover the key correctness properties.
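To make the queue-group semantics concrete, the following is a small standard-library model, not the driver's code: every group receives each message once, and within a group only one member gets it. All type and method names are hypothetical, and round-robin is swapped in for the random selection the memory driver is described as using, purely so the behavior is deterministic.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory model: queue group name -> one inbox per member.
#[derive(Default)]
pub struct QueueGroups {
    groups: HashMap<String, Vec<Vec<String>>>,
    next: usize,
}

impl QueueGroups {
    /// Add a member to `group` and return its index within that group.
    pub fn join(&mut self, group: &str) -> usize {
        let members = self.groups.entry(group.to_string()).or_default();
        members.push(Vec::new());
        members.len() - 1
    }

    /// Deliver `msg` to exactly one member of every group.
    /// Round-robin stands in for a random pick (an assumption of this sketch).
    pub fn publish(&mut self, msg: &str) {
        for members in self.groups.values_mut() {
            if members.is_empty() {
                continue;
            }
            let idx = self.next % members.len();
            members[idx].push(msg.to_string());
        }
        self.next = self.next.wrapping_add(1);
    }

    /// Messages delivered so far to one member of a group.
    pub fn inbox(&self, group: &str, member: usize) -> &[String] {
        &self.groups[group][member]
    }
}
```

With two members in one group and a single member in another, publishing four messages leaves four messages in total across the first group and four in the second, which is the property the integration tests check by total count.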

A few issues worth noting:


Bug: Heartbeat task leaks on subscriber drop

engine/packages/universalpubsub/src/driver/postgres/mod.rs

let heartbeat_token = tokio_util::sync::CancellationToken::new();
let heartbeat_token_child = heartbeat_token.clone();
// ...
_heartbeat_token: heartbeat_token,

Dropping a CancellationToken does not cancel it — only calling .cancel() does. When PostgresQueueSubscriber is dropped, the _heartbeat_token field is dropped but the heartbeat task holds heartbeat_token_child and keeps running indefinitely. The Drop impl deletes the DB row, so the heartbeat's UPDATE silently no-ops, but the task and pool connections are wasted every 10 seconds.

Fix: use drop_guard() so cancel fires on drop:

let heartbeat_token = tokio_util::sync::CancellationToken::new();
let heartbeat_token_child = heartbeat_token.clone();
let _heartbeat_guard = heartbeat_token.drop_guard(); // cancel on drop
// store _heartbeat_guard on the struct instead of heartbeat_token
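The difference between dropping a token and dropping a guard can be illustrated with a standard-library analog. This is a sketch of the pattern only, not how tokio_util implements CancellationToken; `CancelFlag` and `CancelOnDrop` are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Shared cancellation flag. Cloning or dropping a handle does not cancel.
#[derive(Clone, Default)]
pub struct CancelFlag(Arc<AtomicBool>);

impl CancelFlag {
    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }

    /// Convert this handle into a guard that cancels when it is dropped.
    pub fn drop_guard(self) -> CancelOnDrop {
        CancelOnDrop(self)
    }
}

/// Guard whose Drop impl fires the cancellation.
pub struct CancelOnDrop(CancelFlag);

impl Drop for CancelOnDrop {
    fn drop(&mut self) {
        self.0.cancel();
    }
}
```

A background task holding a clone of the flag keeps running if the owner merely drops its handle, but stops observing `is_cancelled() == false` as soon as the owner's guard is dropped. Storing the guard on the subscriber struct ties the task's lifetime to the subscriber.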

Test flakiness risk in test_queue_subscribe_load_balance

tokio::time::timeout(Duration::from_millis(500), sub1.next())

Using a 500ms timeout to collect messages is inherently flaky under slow CI or Postgres startup. A more robust approach is to collect exactly N messages with a longer per-message timeout:

for _ in 0..message_count {
    match tokio::time::timeout(Duration::from_secs(5), /* ... */).await { ... }
}
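The same idea, collecting a fixed number of messages with a generous per-message timeout, can be sketched with a standard-library channel. This uses std::sync::mpsc rather than the async subscriber from the PR, and `collect_exact` is a hypothetical helper name; it is meant to show the shape of the loop, not the test itself.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Collect exactly `expected` messages, waiting up to `per_message` for each.
/// On a timeout or a closed channel, return what was received so far as Err.
pub fn collect_exact<T>(
    rx: &Receiver<T>,
    expected: usize,
    per_message: Duration,
) -> Result<Vec<T>, Vec<T>> {
    let mut received = Vec::with_capacity(expected);
    for _ in 0..expected {
        match rx.recv_timeout(per_message) {
            Ok(msg) => received.push(msg),
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => {
                return Err(received);
            }
        }
    }
    Ok(received)
}
```

Because the loop returns as soon as the expected count is reached, a long timeout costs nothing on a fast machine and only matters when a message is genuinely late or missing.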

GC task not cancellable

The GC task spawned in new() has no shutdown mechanism and will run forever even after all references to the driver are dropped. This is a minor resource leak. Consider tying it to a CancellationToken stored in the driver (or using tokio_util::task::TaskTracker).


Memory driver metrics count topics, not subscribers

metrics::MEMORY_SUBSCRIBER_COUNT
    .set((inner.subscribers.len() + inner.queue_subscribers.len()) as i64);

Both .len() calls count distinct topics (outer map entries), not the total number of individual subscriber channels. This was already the case for regular subs, but worth tracking. The queue variant additionally undercounts since multiple queue groups per topic are flattened.
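The gap between the two counts is easy to see on a plain nested map. The sketch below uses std::collections::HashMap as a stand-in for the driver's concurrent map, and the `Topics` alias and function names are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the driver state: topic -> subscriber handles.
pub type Topics = HashMap<String, Vec<u32>>;

/// What an outer `.len()` reports: the number of distinct topics.
pub fn topic_count(topics: &Topics) -> usize {
    topics.len()
}

/// The number of individual subscribers, summed over every topic.
pub fn subscriber_count(topics: &Topics) -> usize {
    topics.values().map(Vec::len).sum()
}
```

With three subscribers on one topic and one on another, `topic_count` is 2 while `subscriber_count` is 4, so a gauge fed by the outer length under-reports as soon as any topic has more than one subscriber.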


Minor: unnecessary return in pubsub.rs

return Ok(Subscriber::new(...));

The explicit return is not idiomatic Rust; just Ok(Subscriber::new(...)) is cleaner.


Noted — by design

  • Sequential insert+notify per queue group in publish_to_queues: Each group gets its own INSERT + NOTIFY in a loop, not a transaction. A partial failure leaves some groups notified and others not. Given the heartbeat-based fallback and GC, this is acceptable but worth a comment.
  • FOR UPDATE SKIP LOCKED CTE: Clean approach for at-most-once delivery. The Lagged broadcast error handling correctly loops back to claim, so no messages are silently dropped.
  • Reconnection support: Queue channels are correctly re-LISTENed alongside regular channels after reconnect.

Overall solid work. The heartbeat leak is the main issue worth a follow-up fix.

@MasterPtato MasterPtato force-pushed the 03-19-feat_cache_add_in_flight_deduping branch from b46226c to 66ec30f Compare March 25, 2026 00:05
@MasterPtato MasterPtato force-pushed the 03-23-fix_ups_implement_queue_subs branch from ae0f886 to 4e8c22e Compare March 25, 2026 00:05
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4486 March 25, 2026 00:05 Destroyed
@MasterPtato MasterPtato force-pushed the 03-23-fix_ups_implement_queue_subs branch 2 times, most recently from e5b3f53 to 6af4511 Compare March 26, 2026 20:50
@MasterPtato MasterPtato force-pushed the 03-19-feat_cache_add_in_flight_deduping branch from 97b9cfd to 3fc4f7f Compare March 26, 2026 20:50
@MasterPtato MasterPtato force-pushed the 03-23-fix_ups_implement_queue_subs branch from 6af4511 to f975d35 Compare March 28, 2026 00:20
@MasterPtato MasterPtato force-pushed the 03-19-feat_cache_add_in_flight_deduping branch from 3fc4f7f to 662fee6 Compare March 28, 2026 00:20
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4486 March 28, 2026 00:20 Destroyed
@NathanFlurry NathanFlurry mentioned this pull request Mar 28, 2026
11 tasks
@MasterPtato MasterPtato mentioned this pull request Mar 31, 2026
11 tasks
@MasterPtato MasterPtato force-pushed the 03-23-fix_ups_implement_queue_subs branch from 5113d59 to b0cc12a Compare March 31, 2026 22:24
@MasterPtato MasterPtato force-pushed the 03-19-feat_cache_add_in_flight_deduping branch from 715bec8 to 31da82c Compare March 31, 2026 22:24
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4486 March 31, 2026 22:24 Destroyed
@MasterPtato MasterPtato mentioned this pull request Mar 31, 2026
11 tasks
@MasterPtato MasterPtato force-pushed the 03-19-feat_cache_add_in_flight_deduping branch from 31da82c to 85f7553 Compare April 1, 2026 02:11
@MasterPtato MasterPtato force-pushed the 03-23-fix_ups_implement_queue_subs branch from b0cc12a to 3d99705 Compare April 1, 2026 02:11
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4486 April 1, 2026 02:11 Destroyed
@MasterPtato MasterPtato force-pushed the 03-23-fix_ups_implement_queue_subs branch from 3d99705 to 3fbc6dc Compare April 2, 2026 02:47
@MasterPtato MasterPtato force-pushed the 03-19-feat_cache_add_in_flight_deduping branch from 85f7553 to 626728f Compare April 2, 2026 02:47
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4486 April 2, 2026 02:47 Destroyed
@MasterPtato MasterPtato force-pushed the 03-19-feat_cache_add_in_flight_deduping branch from 626728f to 83ebe90 Compare April 3, 2026 01:24
@MasterPtato MasterPtato force-pushed the 03-23-fix_ups_implement_queue_subs branch from 3fbc6dc to 0126cf8 Compare April 3, 2026 01:24
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4486 April 3, 2026 01:24 Destroyed
@NathanFlurry NathanFlurry mentioned this pull request Apr 4, 2026
11 tasks
@NathanFlurry NathanFlurry marked this pull request as ready for review April 5, 2026 10:57
Member

NathanFlurry commented Apr 5, 2026

Merge activity

  • Apr 5, 11:11 AM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Apr 5, 11:23 AM UTC: Graphite rebased this pull request as part of a merge.
  • Apr 5, 11:23 AM UTC: @NathanFlurry merged this pull request with Graphite.

@NathanFlurry NathanFlurry changed the base branch from 03-19-feat_cache_add_in_flight_deduping to graphite-base/4486 April 5, 2026 11:20
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4486 to main April 5, 2026 11:21
@NathanFlurry NathanFlurry force-pushed the 03-23-fix_ups_implement_queue_subs branch from 0126cf8 to c732915 Compare April 5, 2026 11:22
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4486 April 5, 2026 11:22 Destroyed
@NathanFlurry NathanFlurry merged commit 83df281 into main Apr 5, 2026
10 of 13 checks passed
@NathanFlurry NathanFlurry deleted the 03-23-fix_ups_implement_queue_subs branch April 5, 2026 11:23
2 participants