watch: Fix the deadlock between in-process client-server due to cancellation storm #21065
Conversation
…d sendLoop, similar to the server side Signed-off-by: Zhijun <dszhijun@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: zhijun42. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing …
Hi @zhijun42. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
Haven't dug deep yet, but let me list the things I would look into when reviewing this code:
Codecov Report
❌ Patch coverage is
Additional details and impacted files
... and 19 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #21065      +/-   ##
==========================================
+ Coverage   68.44%   68.66%   +0.22%
==========================================
  Files         429      429
  Lines       35203    35349     +146
==========================================
+ Hits        24093    24271     +178
+ Misses       9709     9687      -22
+ Partials     1401     1391      -10

Continue to review the full report in Codecov by Sentry.
/retest
Signed-off-by: Zhijun <dszhijun@gmail.com>
…comments Signed-off-by: Zhijun <dszhijun@gmail.com>
Signed-off-by: Zhijun <dszhijun@gmail.com>
Signed-off-by: Zhijun <dszhijun@gmail.com>
/retest
@serathius Thanks for the comprehensive list!
In the previous progress issue, there was a race between watch events and control responses, and wrong ordering of revisions could happen. That is fundamentally different from this issue, where a client sending watch requests much faster than the server's processing rate can cause a deadlock. For ease of reasoning, you can ignore the data responses channel in the server and just focus on how the control channel fills up and causes a deadlock in the client-server data exchange.
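To make that failure mode concrete, here is a minimal toy model (my own sketch, not etcd code) of a single client run loop that both sends requests and drains responses over small in-process channels; once both channels fill up, client and server block on each other:

```go
package main

// Toy model of the deadlock described above; this is an illustrative sketch,
// not etcd code. Running it aborts with "all goroutines are asleep - deadlock!".
func main() {
	requests := make(chan int, 1)  // client -> server watch requests
	responses := make(chan int, 1) // server -> client control responses

	// Server loop: emit one control response per request before reading the next.
	go func() {
		for req := range requests {
			responses <- req // blocks once the client stops draining responses
		}
	}()

	// Client run loop: sends a storm of requests from the same goroutine that
	// is also responsible for draining responses, so nothing is drained while
	// it is blocked here; client and server end up waiting on each other forever.
	for i := 0; i < 100; i++ {
		requests <- i
	}
	for range responses {
	}
}
```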
I assume you meant to say "watch requests" (I'm not dropping responses). You're right; I just realized that dropping create requests is incorrect and would leave the system in a broken state where it can't serve create requests anymore. I refactored my changes to use two separate send channels and to always keep the watch create requests. I also added an integration test.
I'm not sure what kind of robustness test would bring additional value beyond the integration test I just added.
Good point! I did have to spend many days reading through the codebase to understand the watch implementation. Unfortunately, I couldn't find an effective way to refactor it. That being said, I did do my best to:
Downsides:
Alternatives:
Is this the only solution?
@zhijun42: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
And I see the test case …
Solves issues #18879 and #20716
This change refactors the client-side watch stream implementation to make sending watch requests (create / cancel / progress) fully asynchronous and safe from deadlock under the in-process client-server transport.
Instead of sending directly from the main `watchGRPCStream.run` loop, watch requests are now enqueued (in a non-blocking way) into bounded buffered channels `sendCreatec` and `sendBestEffortc`, and handled by a dedicated `sendLoop` goroutine. I renamed the `serveWatchClient` function to `recvLoop`, and this recvLoop/sendLoop design now mirrors the server-side implementation.

The reason for having two channels for sending client requests is the following: we have three types of watch requests: create / cancel / progress. A create request has to be delivered to and processed by the server, and the user client will block until it receives the created response. However, the cancel / progress requests are best-effort. This distinction matters because, when the watch system is under high pressure, we need to keep the create requests in channel `sendCreatec` while dropping the best-effort ones from channel `sendBestEffortc`.
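A rough sketch of that enqueue policy, under simplifying assumptions of my own: the channel and type names follow the PR description, but the stand-in `watchRequest` struct, the `donec` field, and the exact full-channel handling are illustrative rather than the actual clientv3 code.

```go
package watchsketch

// Hypothetical stand-ins for the client-side watch stream; the real types
// live in clientv3 and differ in detail.
type watchRequest struct {
	isCreate bool // true for create requests; cancel/progress are best-effort
}

type watchGRPCStream struct {
	sendCreatec     chan *watchRequest // create requests: must not be dropped
	sendBestEffortc chan *watchRequest // cancel/progress requests: best-effort
	donec           chan struct{}      // closed when the stream is torn down
}

// enqueue illustrates the policy: create requests are always kept (the caller
// blocks until a created response arrives, so the request must reach the
// server), while cancel/progress requests are dropped instead of blocking the
// run loop when their buffered channel is full.
func (w *watchGRPCStream) enqueue(req *watchRequest) {
	if req.isCreate {
		select {
		case w.sendCreatec <- req:
		case <-w.donec:
		}
		return
	}
	select {
	case w.sendBestEffortc <- req:
	default: // channel full under pressure: drop the best-effort request
	}
}
```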
At any moment, there will be only one `sendLoop` goroutine running. When the connection is disrupted, `recvLoop` exits with an error and the main `watchGRPCStream.run` loop will close the `sendc` queue channel, wait for `sendLoop` to exit, and then spin up a new recvLoop/sendLoop pair, which is the same as in the server-side implementation.

I added an integration test `TestV3WatchCancellationStorm` to show that the system works fine even when there are a lot of watch cancel and create requests.

Once the previous PR #21064 that reproduces the deadlock issue is reviewed and merged, I'll rebase this one and update that reproduction test as well. I separated the PRs to make code review easier.
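For readers skimming the description, the send goroutine roughly has the shape below. This is an assumed sketch built on the stand-in types from the earlier snippet (the `send` callback is hypothetical), not the exact code in this PR.

```go
// sendLoop drains both request queues and hands each request to the transport
// via the hypothetical send callback. It returns when the run loop closes the
// create queue or the stream is torn down, so the run loop can wait for it
// and then spin up a fresh recvLoop/sendLoop pair after reconnecting.
func (w *watchGRPCStream) sendLoop(send func(*watchRequest) error) {
	for {
		select {
		case req, ok := <-w.sendCreatec:
			if !ok {
				return // queue closed by the run loop during teardown
			}
			if err := send(req); err != nil {
				return // the run loop will notice the broken stream and restart us
			}
		case req := <-w.sendBestEffortc:
			if err := send(req); err != nil {
				return
			}
		case <-w.donec:
			return
		}
	}
}
```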