
Conversation

@zhijun42 commented Dec 31, 2025

Reproduces issues #18879 and #20716

The follow-up PR #21065 properly fixes these issues.


These two issues share the same root cause, and it only shows up on the in-process client–server path, not over a regular gRPC connection. Over the network, HTTP/2 flow control and socket teardown naturally apply backpressure and reset state, but the in-process transport uses the chanStream in server/proxy/grpcproxy/adapter/chan_stream.go, which simply relies on Go channels.

When there are a lot of watch cancel requests, we can easily run into a deadlock. See #20716 for a nice diagram analyzing the deadlock.
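To make the mechanism concrete, here is a minimal, self-contained sketch. It is not the actual chanStream code (the real adapter may buffer its channels differently); it only illustrates how two peers exchanging messages over plain Go channels can wedge each other: once the server blocks sending a watch response and the client blocks sending cancel requests, neither side ever reaches its receive, and the Go runtime reports a deadlock.

```go
package main

import "fmt"

// chanStream is a toy stand-in for a channel-backed stream:
// Send blocks until the peer calls Recv, with no flow control in between.
type chanStream struct {
	sendc chan string // messages this side sends
	recvc chan string // messages this side receives
}

func (s *chanStream) Send(msg string) { s.sendc <- msg }
func (s *chanStream) Recv() string    { return <-s.recvc }

func main() {
	c2s := make(chan string) // client -> server, unbuffered
	s2c := make(chan string) // server -> client, unbuffered

	client := &chanStream{sendc: c2s, recvc: s2c}
	server := &chanStream{sendc: s2c, recvc: c2s}

	// Server loop: push a watch response before reading the next request.
	go func() {
		for {
			server.Send("watch response") // blocks: the client is not receiving
			fmt.Println("server got:", server.Recv())
		}
	}()

	// Client: fire a storm of cancel requests without draining responses.
	for i := 0; i < 10; i++ {
		client.Send(fmt.Sprintf("cancel watcher %d", i)) // blocks: the server is stuck in Send
	}
	fmt.Println("client got:", client.Recv()) // never reached
}
```

A real gRPC connection would not get stuck this way: HTTP/2 flow control bounds how much each side can push, and tearing down the socket unblocks both peers, which is why the problem only appears on the in-process path.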

Conceptually, each watch stream should be owned by a single client: if we rerun the program, we should expect to start from a clean slate. Unfortunately, with the in-process design there are effectively two layers of clients:

  1. the user's client program
  2. the long-lived in-process transport client that talks directly to the server, essentially a proxy that sits between the user clients and the etcd server.

Rerunning the user program doesn't reset the in-process client, so the connection between this in-process client and the server stays stuck in deadlock, and every subsequent rerun of the user program gets stuck as well (as long as it relies on this in-process client; etcdctl or any other program that talks directly to the etcd server can still use the watch feature without problems).

This PR adds an integration test to reproduce the issue. It creates a large number of watchers, verifies they work fine at first, and then cancels most of them. The remaining watchers then stop receiving watch events, and the etcd_debugging_mvcc_slow_watcher_total and etcd_debugging_mvcc_pending_events_total metrics become non-zero.
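For orientation, the shape of such a reproduction looks roughly like the sketch below, written against the public clientv3 API. The endpoint, key names, and watcher counts are illustrative, not the values used in the test; the actual test in this PR runs against etcd's in-process integration framework (where the deadlock manifests) and additionally asserts on the two metrics above.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Illustrative endpoint; the real test uses the in-process client from
	// the integration framework instead of a network connection.
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	const total = 1000 // create a lot of watchers
	cancels := make([]context.CancelFunc, 0, total)
	chans := make([]clientv3.WatchChan, 0, total)
	for i := 0; i < total; i++ {
		ctx, cancel := context.WithCancel(context.Background())
		cancels = append(cancels, cancel)
		chans = append(chans, cli.Watch(ctx, fmt.Sprintf("key-%d", i)))
	}

	// 1) Sanity check: a write to a watched key produces an event.
	if _, err := cli.Put(context.Background(), "key-0", "v"); err != nil {
		panic(err)
	}
	<-chans[0] // works fine at first

	// 2) Cancel most of the watchers at once to trigger the cancellation storm.
	for i := 0; i < total-10; i++ {
		cancels[i]()
	}

	// 3) The remaining watchers should still see new events; with the
	//    deadlock, this receive times out instead.
	if _, err := cli.Put(context.Background(), fmt.Sprintf("key-%d", total-1), "v"); err != nil {
		panic(err)
	}
	select {
	case <-chans[total-1]:
		fmt.Println("remaining watcher still receives events")
	case <-time.After(5 * time.Second):
		fmt.Println("remaining watchers are stuck (deadlock reproduced)")
	}
}
```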

Signed-off-by: Zhijun <dszhijun@gmail.com>
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zhijun42
Once this PR has been reviewed and has the lgtm label, please assign siyuanfoundation for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @zhijun42. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@zhijun42 zhijun42 changed the title Yeah a fully working version of reproduction test watch: Reproduce the deadlock between in-process client-server due to cancellation storm Dec 31, 2025