[fix][test] Fix flaky PulsarFunctionsK8STest #25108

pdolif · 2025-12-23T16:09:44Z

Motivation

Test procedure:

1. Create function (ExclamationFunction)
2. Wait for function to be created (number of running instances equals 1)
3. Test function:
   - Create consumer and subscribe to function output topic
   - Send message to function input topic
   - Expect consumer to receive message transformed by function
4. Stop function
5. Wait for function to be stopped (number of running instances equals 0)
6. Start function again
7. Wait for function to be created (number of running instances equals 1)
8. Delete function

There are two reasons for the test to fail:

1. The test message (step 3) may be published to the function input topic before the function has subscribed to it

In step 2, we wait for the function to be created / running. However, if the function status indicates that there is a running instance, this does not necessarily mean that the function has already subscribed to the input topic. A function is marked as running once the function thread is alive, not once the function has subscribed to the input topic. If it takes some time for the function to start consuming, the message produced in step 3 may never be consumed by the function, and step 3 fails.

2. Waiting 30s for the function to start again in step 7 is not enough

In both step 2 and step 7 we use the Pulsar admin to get the function status. Then, we wait for the number of running instances to become 1. We wait at most 30 seconds. In step 2, this takes ~10 seconds. In step 7 (without the additional Thread.sleep(2000)), we run into the 30 second timeout before there is 1 running instance reported.

The flow is as follows: the admin client in the test sends a request to the Pulsar broker running in k3s. The KubernetesRuntime then requests the function status from the function pod using the Kubernetes-internal domain name: pf-public-default-test-function-0.pf-public-default-test-function.default.svc.cluster.local
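
For orientation, the host name above follows the usual Kubernetes pattern for a StatefulSet pod behind a headless service: pod name, service name, namespace, then svc.cluster.local. The sketch below rebuilds it under the assumption (taken only from the example host name, not from the KubernetesRuntime source) that the service carries the same name as the StatefulSet. The class and method names are hypothetical.

```java
/** Illustrative only: rebuilds the pod host name seen in the broker logs. */
final class FunctionPodDns {

    /**
     * Builds "<statefulSet>-<ordinal>.<service>.<namespace>.svc.cluster.local",
     * assuming the headless service is named like the StatefulSet.
     */
    static String podHost(String statefulSetName, int ordinal, String k8sNamespace) {
        String podName = statefulSetName + "-" + ordinal;
        return podName + "." + statefulSetName + "." + k8sNamespace + ".svc.cluster.local";
    }

    public static void main(String[] args) {
        // Values copied from the host name quoted in this description.
        System.out.println(podHost("pf-public-default-test-function", 0, "default"));
    }
}
```

Because the name stays the same across a stop/start cycle while the pod IP changes, a cached or stale DNS answer can point at the old pod.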

To analyze why the test fails, enable trace logs for "io.grpc", add the "jul-to-slf4j" dependency and the following code (for example, in start() of PulsarStandalone):

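// Needs java.util.logging.{Level, LogManager, Logger} and org.slf4j.bridge.SLF4JBridgeHandler.
// Drop the default JUL handlers and forward java.util.logging records (used by io.grpc) to SLF4J.
LogManager.getLogManager().reset();
SLF4JBridgeHandler.install();

// Let the root JUL logger pass every level; the SLF4J backend then decides what is printed.
Logger root = LogManager.getLogManager().getLogger("");
root.setLevel(Level.ALL);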

In the broker logs, it can be seen that while waiting for the function pod to be created in step 2, in the beginning, the host pf-public...svc.cluster.local cannot be resolved. Once the pod is started, it resolves to some IP like 10.42.0.6.

If we extend the time to wait in step 7 to 40s, we can see the following:
The function pod domain still resolves to the IP of the old pod (10.42.0.6). 30s later, the function status request fails with "connection timed out after 30000 ms". In the next attempt, the function pod domain resolves to the IP of the new pod, and the function status can be requested successfully.
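
The timing can be illustrated with a toy model. This is a sketch under simplifying assumptions, not code from the test: a status request sent while DNS still returns the old pod IP blocks for the full connection timeout and then fails, and a request sent after DNS has been refreshed succeeds immediately. The 10-second stale-DNS window is an assumed value; the observation above only shows that the first attempt hits the old IP and the next one hits the new IP. All names are hypothetical.

```java
/** Toy timeline model of the restart wait; not part of the actual test. */
final class RestartTimeline {

    /**
     * Returns the second at which the first status request succeeds,
     * or -1 if the wait budget is used up before any request succeeds.
     */
    static long firstSuccessAt(long waitBudgetSec, long staleDnsSec, long connectTimeoutSec) {
        long now = 0;
        while (now < waitBudgetSec) {
            if (now >= staleDnsSec) {
                return now; // DNS already points at the new pod
            }
            now += connectTimeoutSec; // request hangs on the old IP until it times out
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("budget 30s: " + firstSuccessAt(30, 10, 30));
        System.out.println("budget 40s: " + firstSuccessAt(40, 10, 30));
    }
}
```

In this model a 30-second budget is consumed entirely by the one hanging request, while a 40-second budget leaves room for the second attempt.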

Modifications

To fix the first issue: After step 2, we can wait for the function to subscribe to the input topic. Then, we know the function is actually running and can be tested.
To fix the second issue: Increase the wait time in step 7 from 30s to 40s. The Thread.sleep(2000) can be removed.
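
The first fix amounts to polling a condition with a deadline. Below is a minimal, dependency-free sketch of such a wait loop. It is not the code from this PR: `waitUntil` and the class name are hypothetical, and in the real test the condition would be a check that the function input topic has a subscription (for example, through the admin client) rather than the counter used here.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper; stands in for whatever wait utility the test uses. */
final class WaitForSubscription {

    /**
     * Polls the condition until it returns true or the timeout elapses.
     * Returns true if the condition was met in time, false otherwise.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag set
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for "the input topic reports a subscription": true on the third poll.
        int[] polls = {0};
        boolean subscribed = waitUntil(() -> ++polls[0] >= 3,
                Duration.ofSeconds(5), Duration.ofMillis(10));
        System.out.println("subscribed=" + subscribed + ", polls=" + polls[0]);
    }
}
```

Only once such a wait returns true is the test message sent, so the function cannot miss it.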

Verifying this change

Make sure that the change passes the CI checks.

Matching PR in forked repository

PR in forked repository: pdolif#20

lhotari

LGTM, thanks for fixing, @pdolif

codecov-commenter · 2025-12-23T17:14:21Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.48%. Comparing base (4495525) to head (e028e3d).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files

@@              Coverage Diff              @@
##             master   #25108       +/-   ##
=============================================
+ Coverage     30.71%   74.48%   +43.76%     
- Complexity       51    34043    +33992     
=============================================
  Files          1840     1899       +59     
  Lines        145468   149655     +4187     
  Branches      16907    17393      +486     
=============================================
+ Hits          44684   111470    +66786     
+ Misses        93810    29304    -64506     
- Partials       6974     8881     +1907

Flag	Coverage Δ
inttests	`26.49% <ø> (+0.26%)`	⬆️
systests	`23.01% <ø> (+0.03%)`	⬆️
unittests	`74.00% <ø> (?)`

see 1494 files with indirect coverage changes

pdolif added 2 commits December 22, 2025 18:20

Wait for function to subscribe to input topic

e6361b9

Increase wait time for function to start a second time

e028e3d

github-actions bot added the doc-not-needed Your PR changes do not impact docs label Dec 23, 2025

lhotari approved these changes Dec 23, 2025

lhotari merged commit ff0d0eb into apache:master Dec 28, 2025
54 checks passed

lhotari added this to the 4.2.0 milestone Dec 28, 2025