JAVA-5950 Update Transactions Convenient API with exponential backoff on retries #1899

Open
nhachicha wants to merge 61 commits into mongodb:backpressure from nhachicha:nh/backpressure_convenient_api

Collaborator

@nhachicha nhachicha commented Feb 25, 2026

The original PR, #1852, was accidentally closed. It has outstanding review comments for @stIncMale to go over when re-reviewing.

Relevant specification changes:

AI review

Review generated by Claude Opus 4.6 as of commit `4a3d1ae1` on 2026-04-01.

Findings Table — Diff-Only Review vs. PR Context

| # | Finding | Verdict | Rationale |
|---|---------|---------|-----------|
| 1 | Non-volatile testJitterSupplier static field: data race risk across threads | Valid, tracked | Multiple reviewers flagged. Test-only, deferred to JAVA-6079. Should be volatile at minimum. |
| 2 | Thread.sleep(backoffMs) may overshoot remaining timeout | False positive | shortenBy(...).onExpired(...) checks before sleeping; next iteration fails fast if expired. Overshoot bounded by 500ms cap. |
| 3 | clearTransactionContextOnError(e) call removed | False positive | @stIncMale confirmed: commitTransaction calls it internally; the outer call was redundant. |
| 4 | Error wrapping obscures original exception type | Valid, intentional | Spec-mandated (DRIVERS-3391). Labels are now copied from cause to wrapper. Tests verify. |
| 5 | Visibility widening of CommandOperationHelper + constants | False positive | Everything under com.mongodb.internal is internal by definition. Enables cross-package dedup. |
| 6 | @VisibleForTesting on timeoutOrAlternative | Resolved | @stIncMale requested removal; addressed in cherry-picked refactor (60acf51d). |
| 7 | TRANSACTION_MAX_MS package-private; hardcoded expected values in test | Minor nit | Iteration bounds now based on EXPECTED_BACKOFFS_MAX_VALUES.length (adopted). Hardcoded array is an independent oracle, which is acceptable. |
| 8 | TODO comments in production code | Valid, accepted | Will be resolved before backpressure is merged into main. Blocked on docs PR. |
| 9 | testRetryBackoffIsEnforced wall-clock timing sensitivity | Valid, mitigated | Jitter controlled via setTestJitterSupplier. 500ms tolerance. Spec-mandated prose test. |
| 10 | Deleted copyTimeoutContext(): verify no other callers | False positive | Explicitly requested by @stIncMale. Private constructor also removed as dead code. |

Summary: 4 false positives, 1 resolved, 1 valid and mitigated, 1 valid and intentional, 1 tracked, 1 accepted, 1 minor nit.
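Finding 1 turns on memory visibility of a mutable static field. The sketch below is not the PR's code; it only illustrates the "volatile at minimum" suggestion. The class name `JitterHolderSketch` and the method `clearTestJitterSupplier` are hypothetical, and the fallback to `ThreadLocalRandom` is an assumption about how a default jitter might be produced.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.DoubleSupplier;

// Hypothetical holder, assumed for illustration only.
final class JitterHolderSketch {
    // volatile: a write made by the test thread is guaranteed to be
    // visible to any thread that reads the field afterwards.
    private static volatile DoubleSupplier testJitterSupplier;

    private JitterHolderSketch() {
    }

    public static void setTestJitterSupplier(final DoubleSupplier supplier) {
        testJitterSupplier = supplier;
    }

    public static void clearTestJitterSupplier() {
        testJitterSupplier = null;
    }

    public static double nextJitter() {
        // Read the field once into a local so the null check and the call
        // use the same reference even if another thread clears the field.
        DoubleSupplier override = testJitterSupplier;
        return override != null
                ? override.getAsDouble()
                : ThreadLocalRandom.current().nextDouble(); // value in [0, 1)
    }
}
```

Without `volatile`, the Java memory model gives no guarantee that a thread calling `nextJitter()` ever observes a supplier installed by another thread, unless some other happens-before edge exists.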


Additional Findings from PR Reviewers (not caught in diff-only review)

| # | Finding | Source | Status |
|---|---------|--------|--------|
| A | calculateTransactionBackoffMs Javadoc said 0-based but implementation is 1-based | Copilot, @stIncMale | Fixed |
| B | Error labels not copied from wrapped cause to MongoTimeoutException | @stIncMale | Fixed (d4bc4c70) |
| C | applyMajorityWriteConcernToTransactionOptions called on outer retry | @stIncMale | Fixed (60acf51d) |
| D | withTransaction code structure doesn't follow spec algorithm ordering | @stIncMale | Fixed (60acf51d) |
| E | Spec submodule not updated to latest spec tests | @stIncMale | Deferred: main updated separately |
| F | AI-generated comments cluttering test code | @stIncMale | Fixed |
| G | MongoTimeoutException message inconsistency (missing period) | Copilot | Fixed |
| H | timeoutOrAlternative removal + timeoutMsConfigured renaming | @stIncMale | Fixed (60acf51d) |

Remaining Issues to Address

| # | Issue | Severity | File(s) | Status | Tracking |
|---|-------|----------|---------|--------|----------|
| 1 | testJitterSupplier should be volatile: mutable static read/written across threads without memory visibility guarantee | Low (test-only) | ExponentialBackoff.java | Open, deferred | JAVA-6079 |
| 2 | Spec "Note 1" about error propagation needs rework: spec language around propagation/raising is inconsistent | Medium | Spec-side | Open, spec change needed | DRIVERS-3436 |
| 3 | TODO-BACKPRESSURE comments in production code: must be resolved before merging backpressure into main | Low | MongoException.java, ExponentialBackoff.java | Open, blocked on docs PR (10gen/docs-mongodb-internal#17281) | Implicit |
| 4 | Spec submodule not pointing to latest spec tests: new transactions-convenient-api JSON tests not included | Medium | testing/resources/specifications (submodule) | Open, main updated (55e1861) but not merged into backpressure | TODO-BACKPRESSURE |
| 5 | Verify that the spec test "withTransaction surfaces a timeout after exhausting transient transaction retries" is run | Medium | Test runner config | Open, reminder in PR comments | - |
| 6 | Verify prose tests assert all error labels are copied to wrapping exception | Medium | WithTransactionProseTest.java | Partially addressed: labels copied in code, test coverage completeness not confirmed | - |
| 7 | OTel tracing terminology inconsistency (finalize vs finish vs stop) | Low (orthogonal) | ClientSessionImpl.java tracing code | Open, no ticket yet | - |

Priority Summary

  • Must resolve before merge to backpressure: Items 5, 6
  • Must resolve before merge to main: Items 3 (TODOs), 4 (submodule)
  • Tracked for future work: Items 1 (JAVA-6079), 2 (DRIVERS-3436), 7

What Looks Good

  • Exponential backoff formula (base=5ms, growth=1.5x, cap=500ms) with jitter is well-designed for transaction retries
  • Replacing ClientSessionClock singleton with SystemNanoTime + Mockito is a significant test infrastructure improvement
  • ExponentialBackoffTest has good coverage of boundary conditions (jitter=0, jitter=1, cap enforcement)
  • backpressure: true handshake flag is a clean protocol extension
  • Error label propagation through timeout wrapping preserves diagnostic information
  • abortIfInTransaction() extraction reduces duplication and improves clarity
  • @stIncMale's refactor of withTransaction (60acf51d) now aligns the code with the spec algorithm, making correctness easier to verify
  • New prose tests (testRetryBackoffIsEnforced, testExponentialBackoffOnTransientError) provide functional validation of backoff behavior
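The backoff parameters named in the list above (base 5 ms, growth 1.5x, cap 500 ms, jitter) can be illustrated with a small sketch. This is not the driver's `ExponentialBackoff` class: the class name `BackoffSketch`, the method `backoffMs`, and the exact formula `jitter * min(base * growth^(attempt - 1), cap)` are assumptions, chosen to be consistent with the 1-based attempt numbering mentioned in finding A.

```java
// Hypothetical sketch of capped exponential backoff with jitter.
final class BackoffSketch {
    private static final double BASE_MS = 5.0;   // backoff for the first retry attempt
    private static final double GROWTH = 1.5;    // multiplier applied per attempt
    private static final double MAX_MS = 500.0;  // upper bound on any single backoff

    private BackoffSketch() {
    }

    /**
     * @param attempt 1-based retry attempt number
     * @param jitter  factor in [0, 1] that scales the capped backoff
     * @return backoff in milliseconds
     */
    public static double backoffMs(final int attempt, final double jitter) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        if (jitter < 0.0 || jitter > 1.0) {
            throw new IllegalArgumentException("jitter must be in [0, 1]");
        }
        double uncapped = BASE_MS * Math.pow(GROWTH, attempt - 1);
        return jitter * Math.min(uncapped, MAX_MS);
    }

    public static void main(final String[] args) {
        // Print the maximum possible backoff (jitter = 1) for the first attempts.
        for (int attempt = 1; attempt <= 14; attempt++) {
            System.out.printf("attempt %d: up to %.2f ms%n", attempt, backoffMs(attempt, 1.0));
        }
    }
}
```

Under these assumptions the uncapped value first exceeds 500 ms at attempt 13, so every later attempt waits at most 500 ms. Passing the jitter in as a parameter, rather than drawing it inside the method, is what makes boundary cases such as jitter=0 and jitter=1 directly testable.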

strogiyotec and others added 30 commits February 25, 2026 13:00
…Impl.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…s exceeded (ex operationContext.getTimeoutContext().getReadTimeoutMS())
…tionProseTest.java

Co-authored-by: Valentin Kovalenko <valentin.male.kovalenko@gmail.com>
@nhachicha nhachicha requested a review from stIncMale March 26, 2026 19:26
stIncMale added a commit to stIncMale/mongo-java-driver that referenced this pull request Mar 26, 2026
I manually copied it from mongodb#1899.
Member

@stIncMale stIncMale left a comment

The last reviewed commit is 43dab53.

Most, if not all, of the outstanding comments have reactions/replies suggesting that they were agreed with and addressed, but I did not find the corresponding changes. I suspect the changes were not pushed.

@@ -249,15 +257,26 @@ public <T> T withTransaction(final TransactionBody<T> transactionBody) {
@Override
public <T> T withTransaction(final TransactionBody<T> transactionBody, final TransactionOptions options) {
Member

@stIncMale stIncMale Mar 30, 2026

> this is orthogonal to this PR

It is unrelated, indeed, but I noticed all that while trying to make our withTransaction method look like it follows the spec. The OpenTelemetry implementation we have is recent, as far as I know, and its state does not seem good. That is surprising, given that it's not some old code that went out of shape as a result of having been modified many times without ever having been refactored to make sense again.

1.

> Our tracing layer uses Micrometer as the OTel reference implementation, and the APIs use different terminology, e.g. Micrometer uses stop whereas OTel uses end

So we have
a) The two APIs mentioned use the terms "stop" and "end"
b) The drivers specification uses "finish", the Java driver implementation uses "finalize".

I fail to see how b) reasonably follows from a).

> It's worth aligning with the spec

How did we end up unaligned, when we authored both?

2, 3

The OpenTelemetry specification "defines requirements for drivers' OpenTelemetry integration and behavior". I am guessing that is to ensure that different drivers emit the same telemetry in the same way. However, how will other drivers know how to emit it for withTransaction, when none of the behavior you described above is in the specification? (I still don't really know what the behavior is supposed to be.)

Testing

Many open telemetry specification tests were skipped in the Java driver, with a reference to https://jira.mongodb.org/browse/JAVA-5991, and then the ticket was closed. But nothing in the ticket explains why they are skipped and whether that is supposed to change (@rozza reopened the ticket as a result). This is especially surprising given that we were the authors of the open telemetry specification.

Member

Assuming 👍 was meant to express that you agree with the comment and it is addressed, I can't find the corresponding requested comment in #1918. Could you please add it?

Comment on lines +408 to +412
private static MongoException timeoutException(final boolean hasTimeoutMS, final Throwable cause) {
    return hasTimeoutMS
            ? createMongoTimeoutException(cause) // CSOT timeout exception
            : new MongoTimeoutException("Operation exceeded the timeout limit", cause); // Legacy timeout exception
}
Member

@stIncMale stIncMale Mar 30, 2026

I don't think that the change in d4bc4c7 is the proper way of addressing this. Most of what I wrote previously in this thread was not addressed.

The following thoughts won't add anything new to what I expressed above, but they will be clearer than before, because now there is less uncertainty/questions:

  1. The spec changes made in DRIVERS-3391 need more work/fixes (such a new change requires a new DRIVERS ticket):
    • The Note 1 should be changed such that it instructs to add all the error labels from the wrapped error, regardless of what the wrapped error is [1].
    • The spec currently says "report a timeout error wrapping the last error", but then refers to the wrapped error as "underlying error". The spec should say "wrapped error" instead of introducing another word that is supposed to have the same meaning.
    • The "Note 1" in the Retry Timeout is Enforced prose tests should be updated to instruct the drivers to assert that the timeout error has the same labels as the error it wraps.
  2. It seems that constructors of MongoException should be responsible for copying labels. However, MongoException(@Nullable final String msg, @Nullable final Throwable t) does not do that, while MongoException(final int code, final String msg, final Throwable t) does. We should figure out whether this was clearly intentional, in which case copying labels in the former constructor would be a bug, or whether the current situation is a bug and that constructor should have been copying labels all along.

[1] Strictly speaking, we need to copy only the labels the driver exposes for applications to use (those are exposed via constants in MongoException). However, given that no driver, including ours, hides other labels, there is no reason to complicate the logic or the specification here.
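The idea in point 2, that a wrapping constructor copies every label from its cause, can be sketched as follows. `LabeledException` is a hypothetical stand-in and not the driver's `MongoException`; whether the real two-argument constructor should behave this way is exactly the open question raised above.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not the driver's exception hierarchy.
final class LabelCopySketch {
    private LabelCopySketch() {
    }

    /** Stand-in for an exception type that carries error labels. */
    static class LabeledException extends RuntimeException {
        private final Set<String> labels = new HashSet<>();

        LabeledException(final String msg, final Throwable cause) {
            super(msg, cause);
            // Copy all labels from the wrapped error, regardless of which
            // labels they are, so callers can keep making retry decisions.
            if (cause instanceof LabeledException) {
                labels.addAll(((LabeledException) cause).labels);
            }
        }

        void addLabel(final String label) {
            labels.add(label);
        }

        Set<String> getErrorLabels() {
            return Collections.unmodifiableSet(labels);
        }
    }

    public static void main(final String[] args) {
        LabeledException cause = new LabeledException("commit failed", null);
        cause.addLabel("UnknownTransactionCommitResult");
        LabeledException timeout =
                new LabeledException("Operation exceeded the timeout limit", cause);
        System.out.println(timeout.getErrorLabels());
    }
}
```

Putting the copy in the constructor means every wrapping site gets the behavior for free, whereas copying at each call site (as a helper like timeoutException would have to) leaves room for a wrapper that silently drops labels.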

nhachicha and others added 2 commits March 31, 2026 13:27
…lBackoffTest.java

Co-authored-by: Valentin Kovalenko <valentin.male.kovalenko@gmail.com>
- Add SYSTEM_OVERLOADED_ERROR_LABEL and RETRYABLE_ERROR_LABEL constants to MongoException
- Add backpressure:true to hello command in InternalStreamConnectionInitializer
- Make CommandOperationHelper and its error label constants public
- Replace hardcoded error label strings with constants in tests and examples
- Refactor ExponentialBackoff: make TRANSACTION_BASE_MS and TRANSACTION_GROWTH private,
  split testCustomJitter into two tests, minor Javadoc/assertion message fixes
- Remove redundant private constructor from TimeoutContext
- Convert block comments to Javadoc in WithTransactionProseTest, refactor testRetryBackoffIsEnforced
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.

…Impl.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated no new comments.

@nhachicha nhachicha requested review from katcharov and stIncMale April 1, 2026 17:20
UnknownTransactionCommitResult is retriable in the commit loop if we don't exceed the timeout, so it makes sense to wrap it into a Timeout error if we exceed the timeout and want to throw and return (as described in section 10.1.1)