@TakaHiR07
Contributor

Motivation

We hit an NPE in delayed message delivery and traced it to delayedMessagesCount. InMemoryDelayedDeliveryTracker#addMessage() does not check whether the entryId already exists in the Roaring64Bitmap, so a duplicate add increments delayedMessagesCount without adding a new entry, and the count no longer matches the actual size of the map.

Modifications

  1. Add a test that covers the duplicate-entry case.
  2. Check whether the Roaring64Bitmap already contains the entryId before adding it.
  3. Log an error for the "n < 0" case in getScheduledMessages(), since this case should not occur.
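
The duplicate check in item 2 can be illustrated with a minimal sketch. It is not the tracker's actual code: a `HashSet<Long>` stands in for `Roaring64Bitmap` so the example runs with the JDK alone, and the class name `DedupCountingSketch` and the helper `countStoredEntries()` are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Simplified stand-in for the tracker (assumption): HashSet<Long> replaces
// Roaring64Bitmap, and plain TreeMaps replace the fastutil sorted maps.
final class DedupCountingSketch {
    // timestamp -> ledgerId -> entryIds
    private final TreeMap<Long, Map<Long, Set<Long>>> delayed = new TreeMap<>();
    private long delayedMessagesCount = 0;

    /** Returns true only if the entry was not already tracked under this timestamp and ledger. */
    boolean addMessage(long timestamp, long ledgerId, long entryId) {
        Set<Long> entryIds = delayed
                .computeIfAbsent(timestamp, k -> new TreeMap<>())
                .computeIfAbsent(ledgerId, k -> new HashSet<>());
        if (entryIds.contains(entryId)) {
            return false; // duplicate: leave the counter untouched
        }
        entryIds.add(entryId);
        delayedMessagesCount++;
        return true;
    }

    long getNumberOfDelayedMessages() {
        return delayedMessagesCount;
    }

    /** Recounts the stored entries so the counter can be checked against the map. */
    long countStoredEntries() {
        long total = 0;
        for (Map<Long, Set<Long>> byLedger : delayed.values()) {
            for (Set<Long> ids : byLedger.values()) {
                total += ids.size();
            }
        }
        return total;
    }
}
```

With the check in place, adding the same (timestamp, ledgerId, entryId) twice leaves `getNumberOfDelayedMessages()` equal to `countStoredEntries()`. Without it, the counter would run ahead of the stored entries.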

Verifying this change

  • Make sure that the change passes the CI checks.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

        .computeIfAbsent(timestamp, k -> new Long2ObjectRBTreeMap<>())
        .computeIfAbsent(ledgerId, k -> new Roaring64Bitmap());
if (!roaring64Bitmap.contains(entryId)) {
    roaring64Bitmap.add(entryId);
Member

It looks like .addLong should be used in the Roaring64Bitmap API.

Suggested change
- roaring64Bitmap.add(entryId);
+ roaring64Bitmap.addLong(entryId);

The .add method works too, but the method signature takes a long array (long...). Perhaps the compiler is able to optimize that, so it might not make a difference.

It's unfortunate that Roaring64Bitmap doesn't have the checkedAdd method as there is in RoaringBitmap. That would eliminate the need for the .contains check.

Contributor Author

@TakaHiR07 TakaHiR07 Dec 16, 2025

Yes, unfortunately.

"add" and "addLong" are probably the same after compilation, but sure, it may be better to use "addLong" directly.

Besides, I don't think we need to consider concurrency in InMemoryDelayedDeliveryTracker. I don't see any code path suggesting that concurrent access would occur.

Contributor

Threading question: addMessage() and getScheduledMessages() are invoked under synchronized (this) in the dispatcher (e.g. PersistentDispatcherMultipleConsumers#trackDelayedDelivery), but clearDelayedMessages() doesn’t seem synchronized and InMemoryDelayedDeliveryTracker#clear() isn’t synchronized either.

Is clear() guaranteed to be called under the same lock, or should we align with BucketDelayedDeliveryTracker#clear() (synchronized) to avoid concurrent access to delayedMessageMap/bitmaps?
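
As a rough sketch of the second option (making clear() take the same lock as the add path), assuming a simplified tracker where a `HashSet<Long>` replaces the bitmaps; the class `SynchronizedClearSketch` and its `runStress` helper are hypothetical and only illustrate the locking pattern:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified tracker: every method that touches the shared
// state locks on `this`, so clear() cannot interleave with add().
final class SynchronizedClearSketch {
    private final Set<Long> entryIds = new HashSet<>();
    private long count = 0;

    synchronized boolean add(long entryId) {
        if (entryIds.add(entryId)) {
            count++;
            return true;
        }
        return false;
    }

    synchronized void clear() {
        entryIds.clear();
        count = 0;
    }

    synchronized boolean isConsistent() {
        return count == entryIds.size();
    }

    /** Runs one adder thread and one clearer thread, then checks that the counter matches the set. */
    static boolean runStress(int iterations) {
        SynchronizedClearSketch tracker = new SynchronizedClearSketch();
        Thread adder = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                tracker.add(i % 64);
            }
        });
        Thread clearer = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                tracker.clear();
            }
        });
        adder.start();
        clearer.start();
        try {
            adder.join();
            clearer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return tracker.isConsistent();
    }
}
```

If clear() were called under the dispatcher's lock in every code path, the extra `synchronized` would be redundant but harmless. If it were not, this is the kind of alignment the question points at.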

@lhotari
Member

lhotari commented Dec 16, 2025

btw. this code location was discussed in the review: https://github.com/apache/pulsar/pull/24430/changes#r2156278377

Contributor

@Denovo1998 Denovo1998 left a comment

Left some comments.

Comment on lines 137 to 139
updateTimer();

checkAndUpdateHighest(deliverAt);
Contributor

One thought: should updateTimer() and checkAndUpdateHighest(deliverAt) run only when we actually insert a new entryId?

With the current structure, duplicate addMessage() calls still update highestDeliveryTimeTracked / messagesHaveFixedDelay, which could disable the fixed-delay optimization even though the tracker state didn’t change.

Contributor Author

Great catch. I checked the earliest fixed-delay implementation in #16609, and the issue already exists there. On a duplicate addMessage(), highestDeliveryTimeTracked stays correct, but messagesHaveFixedDelay would be incorrectly set to false.

I will look into why duplicate addMessage() calls happen later. I would prefer that we open another PR to fix this additional issue.
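
A minimal sketch of the behavior discussed in this thread, under stated assumptions: the class `FixedDelayGuardSketch`, the `guardDuplicates` switch, and the flag rule (a deliverAt older than the highest tracked time minus one tick clears the flag) are a simplified reading for illustration, not the tracker's actual code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the fixed-delay detection; field names follow the
// discussion, the update rule is an assumption made for this sketch.
final class FixedDelayGuardSketch {
    private final long tickTimeMillis;
    private final boolean guardDuplicates;
    private final Set<Long> entryIds = new HashSet<>();
    private long highestDeliveryTimeTracked = 0;
    private boolean messagesHaveFixedDelay = true;

    FixedDelayGuardSketch(long tickTimeMillis, boolean guardDuplicates) {
        this.tickTimeMillis = tickTimeMillis;
        this.guardDuplicates = guardDuplicates;
    }

    void addMessage(long entryId, long deliverAt) {
        boolean isNew = entryIds.add(entryId);
        if (guardDuplicates && !isNew) {
            return; // duplicate: tracker state did not change, so skip the bookkeeping
        }
        checkAndUpdateHighest(deliverAt);
    }

    private void checkAndUpdateHighest(long deliverAt) {
        if (deliverAt < highestDeliveryTimeTracked - tickTimeMillis) {
            messagesHaveFixedDelay = false;
        }
        highestDeliveryTimeTracked = Math.max(highestDeliveryTimeTracked, deliverAt);
    }

    boolean hasFixedDelay() {
        return messagesHaveFixedDelay;
    }

    long getHighestDeliveryTimeTracked() {
        return highestDeliveryTimeTracked;
    }
}
```

In this model, adding entries 1 and 2 with increasing deliverAt and then re-adding entry 1 clears the flag when `guardDuplicates` is false and leaves it set when it is true. The highest tracked time is the same either way.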

Comment on lines +210 to +211
log.error("[{}] Delayed message tracker getScheduledMessages should not < 0, number is: {}",
        dispatcher.getName(), n);
Contributor

About the new n < 0 branch: this should be unreachable in normal flow. One potential way to hit it is int overflow from int cardinality = (int) entryIds.getLongCardinality().

Would it be better to keep cardinality as long (and compare cardinality <= (long) n) to eliminate overflow, instead of only logging when n < 0?

Contributor Author

You are right. I think it is a separate issue, and using long for both values is better. Should we fix it in this PR, or would you push another PR for it?
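
To make the overflow concrete, here is a small self-contained sketch. The class `CardinalityOverflowSketch` and its two method names are hypothetical; only the narrowing cast and the long comparison come from the review comment.

```java
// Hypothetical helper contrasting the narrowing cast with a long comparison.
final class CardinalityOverflowSketch {
    /** Mirrors `int cardinality = (int) entryIds.getLongCardinality()` followed by `cardinality <= n`. */
    static boolean fitsWithIntCast(long longCardinality, int n) {
        int cardinality = (int) longCardinality; // wraps around above Integer.MAX_VALUE
        return cardinality <= n;
    }

    /** Keeps the cardinality as long and widens n instead, so no wrap-around can occur. */
    static boolean fitsWithLong(long longCardinality, int n) {
        return longCardinality <= (long) n;
    }
}
```

For a cardinality of 3_000_000_000L and n = 10, the cast yields -1294967296, so `fitsWithIntCast` returns true even though the bitmap holds far more than 10 entries, while `fitsWithLong` returns false.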




// case2: addMessage() with duplicate entryId,
Contributor

In case2 the comment says it enters cardinality > n, but with getScheduledMessages(10) and 4 unique entryIds it should hit the cardinality <= n branch. Could we adjust the comment to match the scenario (case3 seems to be the one exercising cardinality > n)?

Contributor Author

Yes, I have updated the comment.

@Denovo1998
Contributor

btw. this code location was discussed in the review: https://github.com/apache/pulsar/pull/24430/changes#r2156278377

@thetumbled Do you have a chance to review the current PR?

@thetumbled
Member

thetumbled commented Dec 17, 2025

btw. this code location was discussed in the review: https://github.com/apache/pulsar/pull/24430/changes#r2156278377

Maybe we should figure out the reason why duplicate entry IDs are added multiple times if this class does not intentionally allow that behavior.

Copilot AI left a comment

Pull request overview

This pull request fixes a bug in the InMemoryDelayedDeliveryTracker where duplicate message entries could cause incorrect delayed message counts, potentially leading to NPE issues. The fix adds a duplicate check before incrementing the counter and improves error handling for edge cases.

Key changes:

  • Add duplicate entry check in addMessage() using Roaring64Bitmap.contains() before adding entries
  • Improve error handling in getScheduledMessages() to explicitly handle and log the n < 0 case
  • Add comprehensive test coverage for duplicate entry scenarios across multiple test cases

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pulsar-broker/src/main/java/org/apache/pulsar/broker/delayed/InMemoryDelayedDeliveryTracker.java | Implements duplicate entry check in addMessage() and enhances error handling in getScheduledMessages() |
| pulsar-broker/src/test/java/org/apache/pulsar/broker/delayed/InMemoryDeliveryTrackerTest.java | Adds comprehensive test method testDelayedMessagesCountWithDuplicateEntryId() covering three scenarios: multiple timestamps with duplicates, single timestamp with duplicates, and partial retrieval with duplicates |

@TakaHiR07 TakaHiR07 force-pushed the fix_delayedMessagesCount_error branch 2 times, most recently from d12846f to 9b562dc Compare December 18, 2025 03:59

5 participants