@pdolif pdolif commented Dec 23, 2025

Motivation

Test procedure (a minimal code sketch of step 3 follows the list):

  1. Create function (ExclamationFunction)
  2. Wait for function to be created (number of running instances equals 1)
  3. Test function:
    • Create consumer and subscribe to function output topic
    • Send message to function input topic
    • Expect consumer to receive message transformed by function
  4. Stop function
  5. Wait for function to be stopped (number of running instances equals 0)
  6. Start function again
  7. Wait for function to be created (number of running instances equals 1)
  8. Delete function
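
For illustration, here is a minimal sketch of step 3, assuming a PulsarClient against the test cluster. The topic names, subscription name, service URL, and timeout below are placeholders, not the actual test code:

// Hypothetical sketch of step 3; all names below are placeholder assumptions.
// Uses org.apache.pulsar.client.api.* and java.util.concurrent.TimeUnit.
PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://localhost:6650")
        .build();

// Subscribe to the function output topic before producing the input message.
Consumer<String> consumer = client.newConsumer(Schema.STRING)
        .topic("persistent://public/default/test-function-output")
        .subscriptionName("test-sub")
        .subscribe();

// Send a message to the function input topic.
Producer<String> producer = client.newProducer(Schema.STRING)
        .topic("persistent://public/default/test-function-input")
        .create();
producer.send("hello");

// Expect the message transformed by the ExclamationFunction (e.g. "hello!").
Message<String> msg = consumer.receive(30, TimeUnit.SECONDS);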

There are two reasons why the test can fail:

1. The test message (step 3) may be published to the function input topic before the function has subscribed to it

In step 2, we wait for the function to be created and running. However, a function status that reports a running instance does not necessarily mean that the function has already subscribed to the input topic: a function is marked as running once the function thread is alive, not once it has subscribed. If the function takes some time to start consuming, the message produced in step 3 may never be consumed by the function, and step 3 fails.

2. Waiting 30s for the function to start again in step 7 is not enough

In both step 2 and step 7, we use the Pulsar admin client to get the function status and wait for the number of running instances to become 1, with a timeout of 30 seconds. In step 2, this takes ~10 seconds. In step 7 (without the additional Thread.sleep(2000)), we hit the 30-second timeout before 1 running instance is reported.
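
For illustration, the wait in steps 2 and 7 can be sketched like this, assuming Awaitility and a PulsarAdmin client; the tenant, namespace, function name, and URL are placeholders:

// Hypothetical sketch; uses org.apache.pulsar.client.admin.PulsarAdmin,
// org.awaitility.Awaitility, and java.time.Duration.
PulsarAdmin admin = PulsarAdmin.builder()
        .serviceHttpUrl("http://localhost:8080")
        .build();

// Poll the function status until one instance is reported as running,
// giving up after the timeout under discussion.
Awaitility.await()
        .atMost(Duration.ofSeconds(30))
        .until(() -> admin.functions()
                .getFunctionStatus("public", "default", "test-function")
                .getNumRunning() == 1);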

The flow is as follows: the admin client in the test sends a request to the Pulsar broker running in k3s. The KubernetesRuntime then requests the function status from the function pod using the Kubernetes-internal domain pf-public-default-test-function-0.pf-public-default-test-function.default.svc.cluster.local.

To analyze why the test fails, enable trace logs for "io.grpc", add the "jul-to-slf4j" dependency (org.slf4j:jul-to-slf4j), and add the following code (for example, in start() of PulsarStandalone):

// Requires imports: java.util.logging.Level, java.util.logging.LogManager,
// java.util.logging.Logger, org.slf4j.bridge.SLF4JBridgeHandler (from jul-to-slf4j).

// Remove the default JUL handlers and bridge java.util.logging to SLF4J.
LogManager.getLogManager().reset();
SLF4JBridgeHandler.install();

// Lower the JUL root logger level so all messages reach the bridge.
Logger root = LogManager.getLogManager().getLogger("");
root.setLevel(Level.ALL);

In the broker logs, it can be seen that, while waiting for the function pod to be created in step 2, the host pf-public...svc.cluster.local initially cannot be resolved. Once the pod has started, it resolves to an IP like 10.42.0.6.

If we extend the wait time in step 7 to 40s, we can observe the following:
The function pod domain still resolves to the IP of the old pod (10.42.0.6). 30s later, the function status request fails with "connection timed out after 30000 ms". On the next attempt, the function pod domain resolves to the IP of the new pod, and the function status can be requested successfully.
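
To illustrate the stale resolution, the pod domain can be resolved directly (a sketch; only meaningful from inside the cluster, where the Kubernetes DNS is reachable):

// Hypothetical sketch; uses java.net.InetAddress.
InetAddress addr = InetAddress.getByName(
        "pf-public-default-test-function-0.pf-public-default-test-function"
                + ".default.svc.cluster.local");
System.out.println(addr.getHostAddress()); // e.g. 10.42.0.6 for the old pod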

Modifications

  • To fix the first issue: after step 2, wait for the function to subscribe to the input topic (see the sketch after this list). Then we know the function is actually running and can be tested.
  • To fix the second issue: increase the wait time in step 7 from 30s to 40s. The Thread.sleep(2000) can then be removed.
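
A sketch of such a check, assuming the same PulsarAdmin client as above; the input topic name is a placeholder, and the actual change may implement the wait differently:

// Hypothetical sketch: wait until the function has created a subscription
// on its input topic before producing the test message.
String inputTopic = "persistent://public/default/test-function-input";
Awaitility.await()
        .atMost(Duration.ofSeconds(30))
        .until(() -> !admin.topics()
                .getStats(inputTopic)
                .getSubscriptions()
                .isEmpty());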

Verifying this change

  • Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: pdolif#20

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Dec 23, 2025
@lhotari lhotari (Member) left a comment

LGTM, thanks for fixing, @pdolif

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.48%. Comparing base (4495525) to head (e028e3d).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files


@@              Coverage Diff              @@
##             master   #25108       +/-   ##
=============================================
+ Coverage     30.71%   74.48%   +43.76%     
- Complexity       51    34043    +33992     
=============================================
  Files          1840     1899       +59     
  Lines        145468   149655     +4187     
  Branches      16907    17393      +486     
=============================================
+ Hits          44684   111470    +66786     
+ Misses        93810    29304    -64506     
- Partials       6974     8881     +1907     
Flag       Coverage Δ
inttests   26.49% <ø> (+0.26%) ⬆️
systests   23.01% <ø> (+0.03%) ⬆️
unittests  74.00% <ø> (?)

Flags with carried forward coverage won't be shown.
see 1494 files with indirect coverage changes


@lhotari lhotari merged commit ff0d0eb into apache:master Dec 28, 2025
54 checks passed
@lhotari lhotari added this to the 4.2.0 milestone Dec 28, 2025