
Conversation

@chrisstaite (Contributor) commented Oct 14, 2025

Description

There are a number of different semaphores in the system, for example the file open semaphore and the GCS connections semaphore. When the fast-slow store interacts with both of these, it can cause deadlocks.
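
A minimal sketch of this kind of cross-semaphore deadlock, using tokio semaphores as stand-ins for the file-open and GCS-connection limits (the names, permit counts, and sleeps here are illustrative, not code from this PR):

use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;
use tokio::time::sleep;

#[tokio::main]
async fn main() {
    // Hypothetical stand-ins for the file-open and GCS-connection semaphores,
    // each exhausted down to a single remaining permit for illustration.
    let file_open = Arc::new(Semaphore::new(1));
    let gcs_conns = Arc::new(Semaphore::new(1));

    // Task A: holds a file-open permit, then waits for a GCS connection.
    let a = {
        let (f, g) = (file_open.clone(), gcs_conns.clone());
        tokio::spawn(async move {
            let _file = f.acquire().await.unwrap();
            sleep(Duration::from_millis(50)).await;
            let _conn = g.acquire().await.unwrap(); // blocks: task B holds this permit
        })
    };

    // Task B: holds a GCS-connection permit, then waits for a file-open slot.
    let b = {
        let (f, g) = (file_open.clone(), gcs_conns.clone());
        tokio::spawn(async move {
            let _conn = g.acquire().await.unwrap();
            sleep(Duration::from_millis(50)).await;
            let _file = f.acquire().await.unwrap(); // blocks: task A holds this permit
        })
    };

    // Neither task can make progress; the join never completes.
    let _ = tokio::join!(a, b);
}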

Synchronising the collection of semaphores throughout the system is incredibly hard with the interfaces that are in place. Therefore we don't even try to.

Instead, add a check that the reader and writer have started on both sides of the fast-slow store, and time out if they haven't. This should catch deadlocks and kick the system back to life, acting as a watchdog timer. It's not the best solution, but it's something for now.
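
A rough sketch of the watchdog idea, assuming tokio and a hypothetical "transfer started" notification; the names (wait_for_start, IDLE_TIMEOUT) are illustrative rather than the PR's actual identifiers:

use std::time::Duration;
use tokio::sync::oneshot;
use tokio::time::timeout;

// Illustrative timeout; the real value would presumably be tuned or made configurable.
const IDLE_TIMEOUT: Duration = Duration::from_secs(30);

/// Waits for a "transfer started" signal from one side of the fast-slow copy.
/// If nothing arrives within the timeout we assume a deadlock and return an
/// error, letting the surrounding retry logic kick the transfer back to life.
async fn wait_for_start(started: oneshot::Receiver<()>) -> Result<(), String> {
    match timeout(IDLE_TIMEOUT, started).await {
        Ok(Ok(())) => Ok(()),
        Ok(Err(_)) => Err("transfer task dropped before starting".to_string()),
        Err(_) => Err("timed out waiting for reader/writer to start; possible deadlock".to_string()),
    }
}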

Type of change


  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Tests are still passing; the code has not been tried in production yet.

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend (see docs)


@palfrey (Member) commented Oct 15, 2025

I'm trying to understand what the improvements here are. I'm seeing a lot of replacement of with_connection in the GCS client with something that does a very similar Semaphore call, but replaces .acquire() with .clone().acquire_owned(), which doesn't feel like an improvement?
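
For reference, the difference between the two calls in tokio: acquire() yields a permit that borrows the semaphore, while acquire_owned() consumes an Arc<Semaphore> clone and yields an OwnedSemaphorePermit that can be stored in a struct or moved across tasks. A minimal sketch, not code from this PR:

use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

async fn borrowed_vs_owned(sem: Arc<Semaphore>) {
    // Borrowed permit: tied to the semaphore reference, released at end of scope.
    let permit = sem.acquire().await.unwrap();
    drop(permit);

    // Owned permit: the Arc clone is consumed, so the permit has no borrow and
    // can be stashed inside a long-lived struct such as UploadRef.
    let owned: OwnedSemaphorePermit = sem.clone().acquire_owned().await.unwrap();
    drop(owned);
}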

There's then what feels like a separate chunk of changes in fast_slow_store.rs, which seems to be the core of the changes here but is lacking any tests around it. It looks OK at a first glance, but I'd love to see some tests demoing the sorts of problems it's meant to solve.

There's also the GCS store changes, which look like a fairly sensible set of "let's do the once-per-loop stuff once before the loop" changes.

This feels like at least two PRs, maybe three? Thoughts?

@chrisstaite (Contributor, Author)

The issue with the GCS store (as well as nested semaphore use) is that it doesn't hold a semaphore for the whole operation. That makes detecting deadlocks externally pretty much impossible, since they can't be distinguished from slow IO.

Once there's a single semaphore used we can add the watchdog timer to the fast-slow initial start to detect any deadlocks.

@chrisstaite (Contributor, Author)

Quite right that a test was needed. It was a little tricky to get working due to the triple buffering, but the test breaks on main and passes on this branch.

@chrisstaite-menlo (Collaborator)

I'm seeing this trigger a lot in my builds, so I'm not really happy with it as is. Perhaps we need to be more clever.

@chrisstaite-menlo (Collaborator)

Hmm... without it I'm seeing a lot of deadlocked workers which start up, get issued a few actions and then sit there idling with no logs at debug level. I think there might be an issue with the reqwest client retry logic used by gcloud-storage interacting with the retry logic in our layer on top.

@MarcusSorealheis (Collaborator) left a comment

This was a complex one for me to review. I had only a few minor thoughts on it. I think we could potentially make this configurable in the future. It's the never-ending challenge of trying to have a generalizable piece of infrastructure when the use cases and scales are so different.

#[derive(Debug)]
pub struct UploadRef {
    pub upload_ref: String,
    pub(crate) _permit: OwnedSemaphorePermit,
}
Collaborator comment on this snippet:
This is key because it means that connections are only held during active uploads, I think.


let (mut fast_tx, fast_rx) = make_buf_channel_pair();
let (slow_tx, mut slow_rx) = make_buf_channel_pair();
// There's a strong possibility of a deadlock here as we're working with multiple
Collaborator comment on this snippet:
Important fix.

@MarcusSorealheis (Collaborator)

I think there is still more work needed, actually.

@MarcusSorealheis (Collaborator) commented Oct 17, 2025

The key win is lifetime scoping: UploadRef bundles the permit directly with the upload session (the String URL), so it's dropped precisely when the resumable upload completes or aborts, not at the end of a broader with_connection block. This prevents holding GCS connections longer than needed during chunked uploads, which exacerbates semaphore contention in high-throughput scenarios (e.g., deadlocked workers). acquire_owned() on a clone ensures the permit is "owned" by the ref without blocking the shared semaphore pool prematurely.
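
A sketch of that scoping pattern, assuming tokio's semaphore; the constructor shown (new_resumable_upload) and its parameters are hypothetical, only the UploadRef struct itself comes from the diff:

use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

#[derive(Debug)]
pub struct UploadRef {
    pub upload_ref: String,
    pub(crate) _permit: OwnedSemaphorePermit,
}

// Hypothetical constructor: the connection permit is acquired once and bundled
// with the resumable-upload URL, so dropping the UploadRef releases the slot
// exactly when the upload session ends.
async fn new_resumable_upload(connections: Arc<Semaphore>, upload_url: String) -> UploadRef {
    let permit = connections
        .clone()
        .acquire_owned()
        .await
        .expect("semaphore closed");
    UploadRef {
        upload_ref: upload_url,
        _permit: permit,
    }
}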

There could definitely be a performance impact. We are seeing the idle problem with workers for some customers running hundreds of workers. GCS is not in the equation there, though.

