[FLINK-38600] Fix a race condition in consumer creation by adding a retry with delay #112
+23
−2
Purpose of the change
Fix a race condition that occurs occasionally (about 1-2% probability per partition). The connector creates a dummy consumer to seek to the right cursor position, closes it, and immediately afterwards creates the real consumer. If the previous consumer has not been fully released on the broker side, the broker responds with `Exclusive consumer is already connected`, which causes the job to restart. In our case we were subscribing to thousands of topics, so the job would restart continuously for hours until it reached an attempt in which none of the topics hit this race condition.

I believe this may be a regression from #59. The reason we have to create a separate consumer to seek is described in PIP-194. Basically, it looks like there isn't a way to create a consumer with the cursor already set: if we create it and then call `seek`, some messages may still leak through in between. Maybe StreamNative knows of another way, but it seems PIP-194 has not been adopted/implemented, so we have to seek before creating the real consumer.

Brief change log
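The retry with delay named in the PR title can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `ConsumerRetrySketch`, `createWithRetry`, and `ConsumerBusyException` are hypothetical names, the exception is a stand-in for the client error behind the broker's "Exclusive consumer is already connected" response, and the attempt count and delay are arbitrary assumptions.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

public class ConsumerRetrySketch {

    /** Hypothetical stand-in for the "Exclusive consumer is already connected" failure. */
    static class ConsumerBusyException extends Exception {
        ConsumerBusyException(String message) {
            super(message);
        }
    }

    /**
     * Calls createConsumer until it succeeds or maxAttempts is reached.
     * Only the "consumer busy" failure is retried; any other exception propagates immediately.
     */
    static <T> T createWithRetry(Callable<T> createConsumer, int maxAttempts, Duration delay)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return createConsumer.call();
            } catch (ConsumerBusyException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up: the previous consumer was never released
                }
                // Give the broker time to release the closed dummy consumer.
                Thread.sleep(delay.toMillis());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated broker: rejects the first two attempts, accepts the third.
        String consumer = createWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new ConsumerBusyException("Exclusive consumer is already connected");
            }
            return "consumer";
        }, 5, Duration.ofMillis(10));
        System.out.println(consumer + " created after " + calls[0] + " attempts");
    }
}
```

Retrying only the specific "consumer busy" failure keeps unrelated errors (authentication, missing topic) failing fast, while a bounded attempt count still lets the job restart if the broker never releases the old consumer.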
Verifying this change
Please make sure both new and modified tests in this PR follow the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing
Significant changes
(Please check any boxes [x] if the answer is "yes". You can first publish the PR and check them afterwards, for
convenience.)