[fix][test] Fix flaky PulsarFunctionsK8STest #25108
Merged
+9
−5
Motivation
Test procedure:
There are two reasons for the test to fail:
1. The test message (step 3) may be published to the function input topic before the function has subscribed to it
In step 2, we wait for the function to be created / running. However, if the function status indicates that there is a running instance, this does not necessarily mean that the function has already subscribed to the input topic. A function is marked as running once the function thread is alive, not once the function has subscribed to the input topic. If it takes some time for the function to start consuming, the message produced in step 3 may never be consumed by the function, and step 3 fails.
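One way to close this gap is to poll for an explicit readiness condition before publishing, instead of relying on the running-instance count alone. The following is a minimal sketch, not the code from this PR: `ReadinessPoller` and `waitUntil` are hypothetical names, and the condition itself (for example, "the input topic reports a subscription") is assumed to be supplied by the test.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: polls a readiness condition until it holds or a timeout expires. */
final class ReadinessPoller {

    private ReadinessPoller() {
    }

    /**
     * Evaluates the condition repeatedly, sleeping pollInterval between checks.
     * Returns true as soon as the condition holds, false if the timeout expires
     * or the calling thread is interrupted.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadlineNanos - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            // Never sleep past the deadline, and sleep at least 1 ms to avoid busy spinning.
            long remainingMillis = Math.max(1, remainingNanos / 1_000_000);
            long sleepMillis = Math.min(Math.max(1, pollInterval.toMillis()), remainingMillis);
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In the test, step 3 would then only publish once a call such as `ReadinessPoller.waitUntil(subscriptionExists, Duration.ofSeconds(30), Duration.ofMillis(500))` returns true, where `subscriptionExists` is an assumed check against the input topic.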
2. Waiting 30s for the function to start again in step 7 is not enough
In both step 2 and step 7 we use the Pulsar admin to get the function status. Then, we wait for the number of running instances to become 1. We wait at most 30 seconds. In step 2, this takes ~10 seconds. In step 7 (without the additional Thread.sleep(2000)), we run into the 30 second timeout before there is 1 running instance reported.
The flow is as follows: Admin client in test sends a request to the Pulsar broker running in k3s. The KubernetesRuntime requests the function status from the function pod using the Kubernetes internal domain: pf-public-default-test-function-0.pf-public-default-test-function.default.svc.cluster.local
To analyze why the test fails, enable trace logs for "io.grpc", add the "jul-to-slf4j" dependency and the following code (for example, in start() of PulsarStandalone):
In the broker logs, it can be seen that while waiting for the function pod to be created in step 2, in the beginning, the host pf-public...svc.cluster.local cannot be resolved. Once the pod is started, it resolves to some IP like 10.42.0.6.
If we extend the time to wait in step 7 to 40s, we can see the following:
The function pod domain still resolves to the IP of the old pod (10.42.0.6). 30s later, the function status request fails with "connection timed out after 30000 ms". In the next attempt, the function pod domain resolves to the IP of the new pod, and the function status can be requested successfully.
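The timing argument can be made concrete with a small model. This is an illustrative sketch only: `StatusPollTimeline` is a hypothetical class, the 30000 ms first attempt is the connection timeout quoted above, and the 1000 ms duration of the second, successful request is an assumed value chosen for illustration.

```java
/** Hypothetical model of sequential status requests against a fixed wait budget. */
final class StatusPollTimeline {

    private StatusPollTimeline() {
    }

    /**
     * Returns the elapsed time in ms at which the first successful attempt completes,
     * or -1 if no attempt succeeds within budgetMs. Attempts run one after another;
     * a new attempt only starts while budget remains, and a success only counts if
     * it completes within the budget.
     */
    static long firstSuccessAtMs(long budgetMs, long[] attemptMs, boolean[] succeeds) {
        long elapsedMs = 0;
        for (int i = 0; i < attemptMs.length; i++) {
            if (elapsedMs >= budgetMs) {
                return -1; // budget exhausted before this attempt could start
            }
            elapsedMs += attemptMs[i];
            if (succeeds[i] && elapsedMs <= budgetMs) {
                return elapsedMs;
            }
        }
        return -1;
    }
}
```

With attempts `{30000, 1000}` and outcomes `{false, true}`, a 30000 ms budget yields -1 because the first request against the stale IP consumes the whole budget, while a 40000 ms budget yields 31000. Under this model, any wait shorter than the connection timeout plus one successful request cannot pass while the name still resolves to the old pod.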
Modifications
Thread.sleep(2000) can be removed.
Verifying this change
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
Documentation
doc
doc-required
doc-not-needed
doc-complete
Matching PR in forked repository
PR in forked repository: pdolif#20