Fast-slow deadlock #1978
Conversation
The branch was force-pushed from ee33345 to 07a075f, then from 07a075f to 7be818e.
I'm trying to understand what the improvements here are. I'm seeing a lot of replacement of […]. There's then what feels like a separate chunk of changes in […]. There's also the GCS store changes, which look like a fairly sensible set of "let's do the once-per-loop stuff once before the loop" changes. This feels like at least two PRs, maybe three? Thoughts?
The issue with the GCS store (as well as its nested semaphore use) is that it doesn't hold a semaphore for the whole operation. Therefore detecting deadlocks externally (as opposed to just slow IO) is pretty much impossible. Once a single semaphore is used, we can add the watchdog timer to the fast-slow initial start to detect any deadlocks.
The branch was force-pushed from 7be818e to 776e950.
Quite right that a test was needed. It was a little tricky to get working due to the triple buffering, but the test fails on main and passes on this branch.
I'm seeing this trigger a lot in my builds, so I'm not really happy with it as-is. Perhaps we need to be more clever.
Hmm... without it I'm seeing a lot of deadlocked workers which start up, get issued a few actions, and then sit there idling with no logs at debug level. I think there might be an issue with the reqwest client retry logic used by gcloud-storage interacting with the retry logic in our layer on top.
MarcusSorealheis left a comment:
This was a complex one for me to review. I had only a few minor thoughts on it. I think we could potentially make this configurable in the future. It's the never-ending challenge of trying to have a generalizable piece of infrastructure when the use cases and scales are so different.
```rust
#[derive(Debug)]
pub struct UploadRef {
    pub upload_ref: String,
    pub(crate) _permit: OwnedSemaphorePermit,
}
```
This is key because it means that connections are only held during active uploads, I think.
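A minimal sketch of that pattern, assuming the connection pool is a tokio `Semaphore`; the `new_upload` helper and the `pool` parameter are made up for illustration, and only the `UploadRef` shape comes from the diff above:

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

#[derive(Debug)]
pub struct UploadRef {
    pub upload_ref: String,
    // The permit is never read; holding it is the point. It is released
    // automatically when the UploadRef is dropped, so the connection slot is
    // only occupied for as long as the upload object is alive.
    pub(crate) _permit: OwnedSemaphorePermit,
}

// Hypothetical constructor: take a slot from the pool, then hand out the ref.
pub async fn new_upload(pool: Arc<Semaphore>, upload_ref: String) -> UploadRef {
    let permit = pool.acquire_owned().await.expect("semaphore closed");
    UploadRef { upload_ref, _permit: permit }
}
```

Because the release happens on drop, no code path has to remember to give the connection back, which is what keeps the permit's lifetime matched to the active upload.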
```rust
let (mut fast_tx, fast_rx) = make_buf_channel_pair();
let (slow_tx, mut slow_rx) = make_buf_channel_pair();
// There's a strong possibility of a deadlock here as we're working with multiple …
```
Important fix.
I think there is still more work needed, actually.
The key win is lifetime scoping. There could definitely be a performance impact. We are seeing the idle-worker problem for some customers running hundreds of workers, though GCS is not in the equation there.
Description
There are a number of different semaphores in the system, for example the file-open semaphore and the GCS connections semaphore. When the fast-slow store interacts with both of these, it can deadlock.
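For illustration only (none of this code is from the PR), this is the shape of that deadlock with two tokio semaphores acquired in opposite orders; `file_open` and `gcs_conns` are just stand-in names, and the sleeps merely force the unlucky interleaving:

```rust
use std::{sync::Arc, time::Duration};
use tokio::{sync::Semaphore, time::{sleep, timeout}};

#[tokio::main]
async fn main() {
    let file_open = Arc::new(Semaphore::new(1));
    let gcs_conns = Arc::new(Semaphore::new(1));

    let (f, c) = (file_open.clone(), gcs_conns.clone());
    let a = tokio::spawn(async move {
        let _file = f.acquire().await.unwrap();
        sleep(Duration::from_millis(50)).await; // let task B grab its first permit
        let _conn = c.acquire().await.unwrap(); // waits forever on task B
    });
    let b = tokio::spawn(async move {
        let _conn = gcs_conns.acquire().await.unwrap();
        sleep(Duration::from_millis(50)).await; // let task A grab its first permit
        let _file = file_open.acquire().await.unwrap(); // waits forever on task A
    });

    // Neither task can make progress, so joining them never completes.
    let deadlocked = timeout(Duration::from_secs(1), async {
        let _ = tokio::join!(a, b);
    })
    .await
    .is_err();
    println!("deadlocked: {deadlocked}"); // prints "deadlocked: true"
}
```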
Synchronising the collection of semaphores throughout the system is incredibly hard with the interfaces that are in place. Therefore we don't even try to.
Instead, add a check that the reader and writer have started on both sides of the fast-slow store and time out if they haven't. This should catch deadlocks and kick the system back to life, acting as a watchdog timer. It's not the best solution, but it's something for now.
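A rough sketch of that watchdog idea, not the implementation in this PR: the function name, the oneshot signalling, and the error type are all assumptions, and only the "time out if the readers/writers never start" behaviour comes from the description above.

```rust
use std::time::Duration;
use tokio::{sync::oneshot, time::timeout};

// Assumed shape: each side of the fast-slow transfer fires a oneshot once its
// reader/writer has actually started. If that never happens before the
// deadline, report a likely deadlock instead of hanging forever.
async fn wait_for_transfer_start(
    fast_started: oneshot::Receiver<()>,
    slow_started: oneshot::Receiver<()>,
    deadline: Duration,
) -> Result<(), &'static str> {
    timeout(deadline, async {
        // Ignore send-side errors in this sketch; the deadline alone decides
        // whether we give up waiting.
        let _ = fast_started.await;
        let _ = slow_started.await;
    })
    .await
    .map_err(|_| "fast-slow store reader/writer never started; likely deadlock")
}
```

On timeout the caller can abort the pending transfer and return an error, which is the "kick the system back to life" part.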
Type of change
How Has This Been Tested?
Tests are still passing; the code has not been tried out in production yet.
Checklist
- `bazel test //...` passes locally
- PR is contained in a single commit, using `git amend` (see some docs)