
Conversation

@chrisstaite (Contributor) commented Oct 14, 2025

Description

There are a number of different semaphores in the system, for example the file open semaphore and the GCS connections semaphore. When the fast-slow store interacts with both of these, it can cause deadlocks.
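
A minimal sketch of this kind of cross-semaphore deadlock, using tokio semaphores as stand-ins for the file-open and GCS-connection limits (the names, permit counts, and sleeps here are illustrative, not code from this PR):

use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;
use tokio::time::sleep;

#[tokio::main]
async fn main() {
    // Hypothetical stand-ins for the file-open and GCS-connection semaphores,
    // each exhausted down to a single remaining permit for illustration.
    let file_open = Arc::new(Semaphore::new(1));
    let gcs_conns = Arc::new(Semaphore::new(1));

    // Task A: holds a file-open permit, then waits for a GCS connection.
    let a = {
        let (f, g) = (file_open.clone(), gcs_conns.clone());
        tokio::spawn(async move {
            let _file = f.acquire().await.unwrap();
            sleep(Duration::from_millis(50)).await;
            let _conn = g.acquire().await.unwrap(); // blocks: task B holds this permit
        })
    };

    // Task B: holds a GCS-connection permit, then waits for a file-open slot.
    let b = {
        let (f, g) = (file_open.clone(), gcs_conns.clone());
        tokio::spawn(async move {
            let _conn = g.acquire().await.unwrap();
            sleep(Duration::from_millis(50)).await;
            let _file = f.acquire().await.unwrap(); // blocks: task A holds this permit
        })
    };

    // Neither task can make progress; the join never completes.
    let _ = tokio::join!(a, b);
}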

Synchronising the collection of semaphores throughout the system is incredibly hard with the interfaces that are in place. Therefore we don't even try to.

Instead, add a check that the reader and writer have started on both sides of the fast-slow store, and time out if they haven't. This should catch deadlocks and kick the system back to life, acting as a watchdog timer. It's not the best solution, but it's something for now.
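
A rough sketch of the watchdog idea, assuming tokio and a hypothetical "transfer started" notification; the names (wait_for_start, IDLE_TIMEOUT) are illustrative rather than the PR's actual identifiers:

use std::time::Duration;
use tokio::sync::oneshot;
use tokio::time::timeout;

// Illustrative timeout; the real value would presumably be tuned or made configurable.
const IDLE_TIMEOUT: Duration = Duration::from_secs(30);

/// Waits for a "transfer started" signal from one side of the fast-slow copy.
/// If nothing arrives within the timeout we assume a deadlock and return an
/// error, letting the surrounding retry logic kick the transfer back to life.
async fn wait_for_start(started: oneshot::Receiver<()>) -> Result<(), String> {
    match timeout(IDLE_TIMEOUT, started).await {
        Ok(Ok(())) => Ok(()),
        Ok(Err(_)) => Err("transfer task dropped before starting".to_string()),
        Err(_) => Err("timed out waiting for reader/writer to start; possible deadlock".to_string()),
    }
}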

Type of change


  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Tests are still passing; the code has not been tried in production yet.

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend (see docs)


@palfrey (Member) commented Oct 15, 2025

I'm trying to understand what the improvements here are. I'm seeing a lot of replacement of with_connection in the GCS client with something that does a very similar Semaphore call, but replaces .acquire() with .clone().acquire_owned(), which doesn't feel like an improvement?
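
For reference, the difference between the two calls in tokio: acquire() yields a permit that borrows the semaphore, while acquire_owned() consumes an Arc<Semaphore> clone and yields an OwnedSemaphorePermit that can be stored in a struct or moved across tasks. A minimal sketch, not code from this PR:

use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

async fn borrowed_vs_owned(sem: Arc<Semaphore>) {
    // Borrowed permit: tied to the semaphore reference, released at end of scope.
    let permit = sem.acquire().await.unwrap();
    drop(permit);

    // Owned permit: the Arc clone is consumed, so the permit has no borrow and
    // can be stashed inside a long-lived struct such as UploadRef.
    let owned: OwnedSemaphorePermit = sem.clone().acquire_owned().await.unwrap();
    drop(owned);
}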

There's then what feels like a separate chunk of changes in fast_slow_store.rs, which seems to be the core of the changes here but is lacking any tests around it. It looks OK at a first glance, but I'd love to see some tests demoing the sorts of problems it's meant to solve.

There's also the GCS store changes, which look like a fairly sensible set of "let's do the once-per-loop stuff once before the loop" changes.

This feels like at least two PRs, maybe three? Thoughts?

@chrisstaite (Contributor, Author)

The issue with the GCS store (as well as nested semaphore use) is that it doesn't hold a semaphore for the whole operation. That makes detecting deadlocks externally pretty much impossible, since they can't be distinguished from slow IO.

Once there's a single semaphore used we can add the watchdog timer to the fast-slow initial start to detect any deadlocks.

@chrisstaite (Contributor, Author)

Quite right that a test was needed. It was a little tricky to get working due to the triple buffering, but the test breaks on main and passes on this branch.

@chrisstaite-menlo (Collaborator)

I'm seeing this trigger a lot in my builds, so I'm not really happy with it as is. Perhaps we need to be more clever.

@chrisstaite-menlo (Collaborator)

Hmm... without it I'm seeing a lot of deadlocked workers which start up, get issued a few actions and then sit there idling with no logs at debug level. I think there might be an issue with the reqwest client retry logic used by gcloud-storage interacting with the retry logic in our layer on top.

@MarcusSorealheis (Collaborator) left a comment

This was a complex one for me to review. I had only a few minor thoughts on it. I think we could potentially make this configurable in the future. It's the never-ending challenge of trying to have a generalizable piece of infrastructure when the use cases and scales are so different.

#[derive(Debug)]
pub struct UploadRef {
    pub upload_ref: String,
    pub(crate) _permit: OwnedSemaphorePermit,
}
Collaborator comment on this snippet:
This is key because it means that connections are only held during active uploads, I think.


let (mut fast_tx, fast_rx) = make_buf_channel_pair();
let (slow_tx, mut slow_rx) = make_buf_channel_pair();
// There's a strong possibility of a deadlock here as we're working with multiple
Collaborator comment on this snippet:
Important fix.

@MarcusSorealheis (Collaborator)

I think there is still more work needed, actually.

@MarcusSorealheis (Collaborator) commented Oct 17, 2025

The key win is lifetime scoping: UploadRef bundles the permit directly with the upload session (the String URL), so it's dropped precisely when the resumable upload completes or aborts, not at the end of a broader with_connection block. This prevents holding GCS connections longer than needed during chunked uploads, which exacerbates semaphore contention in high-throughput scenarios (e.g., deadlocked workers). acquire_owned() on a clone ensures the permit is "owned" by the ref without blocking the shared semaphore pool prematurely.
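
A sketch of that scoping pattern, assuming tokio's semaphore; the constructor shown (new_resumable_upload) and its parameters are hypothetical, only the UploadRef struct itself comes from the diff:

use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

#[derive(Debug)]
pub struct UploadRef {
    pub upload_ref: String,
    pub(crate) _permit: OwnedSemaphorePermit,
}

// Hypothetical constructor: the connection permit is acquired once and bundled
// with the resumable-upload URL, so dropping the UploadRef releases the slot
// exactly when the upload session ends.
async fn new_resumable_upload(connections: Arc<Semaphore>, upload_url: String) -> UploadRef {
    let permit = connections
        .clone()
        .acquire_owned()
        .await
        .expect("semaphore closed");
    UploadRef {
        upload_ref: upload_url,
        _permit: permit,
    }
}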

There could definitely be a performance impact. We are seeing the idle problem with workers for some customers running hundreds of workers. GCS is not in the equation there, though.

