Skip to content

[telemetry] Add retry logic for partial telemetry push failures#1224

Open
akgitrepos wants to merge 3 commits intodatabricks:mainfrom
akgitrepos:telemetry-retry-fix
Open

[telemetry] Add retry logic for partial telemetry push failures#1224
akgitrepos wants to merge 3 commits intodatabricks:mainfrom
akgitrepos:telemetry-retry-fix

Conversation

@akgitrepos
Copy link

Description

Implements retry logic for partial telemetry push failures in TelemetryPushClient. When the telemetry service returns a partial success response (numProtoSuccess < totalEvents), the client now automatically retries up to 3 times with exponential backoff (1s → 2s → 4s, max 10s).

This addresses the TODO comment in TelemetryPushClient.java.

Changes Made

  • Added configurable maxRetries parameter (default: 3)
  • Implemented exponential backoff strategy (aligned with DatabricksHttpRetryHandler)
  • Added unit tests for retry behavior

Testing

  • Added 2 new unit tests:
    • pushEvent_noRetryOnFullSuccess - Verifies no retry when all events succeed
    • pushEvent_retriesOnPartialSuccess - Verifies retry is triggered on partial failure
  • All 14 TelemetryPushClient tests pass

Additional Notes

  • Backward compatible - existing constructor still works with default retry count
  • Idempotent - retries all events (safe for duplicates)

@akgitrepos
Copy link
Author

@gopalldb @vikrantpuppala Hi, I've implemented retry logic for partial telemetry push failures which was one of the TODO item. Would appreciate a review. Thanks!

Copy link
Collaborator

@samikshya-db samikshya-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great addition here, @akgitrepos ! 🥳

Added a few comments.


@Override
public void pushEvent(TelemetryRequest request) throws Exception {
pushEventWithRetry(request, maxRetries);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit : can you move the contents of pushEventWithRetry in this function itself?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored the recursive pushEventWithRetry into pushEvent with an iterative approach.

private static final int DEFAULT_MAX_RETRIES = 3;
private static final long INITIAL_BACKOFF_MS =
1000; // 1 second - matches DatabricksHttpRetryHandler
private static final long MAX_BACKOFF_MS = 10000; // 10 seconds - matches codebase standard
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best way would be to extract calculateExponentialBackoff and doSleepForDelay from DatabricksHttpRetryHandler as an util, and reuse it here.

The reason I was saying this is because the backoff logic and thread interruption logic is more mature there. (Also to honor DRY)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created RetryUtil shared utility class. Both DatabricksHttpRetryHandler and TelemetryPushClient now reuse the same backoff calculation and thread-sleep logic.

Implements retry logic for partial telemetry push failures. When the
telemetry service returns a partial success response, the client now
automatically retries up to 3 times with exponential backoff.

This addresses the TODO comment in TelemetryPushClient.java.

Signed-off-by: Akshey Sigdel <sigdelakshey@gmail.com>

# Conflicts:
#	src/test/java/com/databricks/jdbc/telemetry/TelemetryPushClientTest.java
Signed-off-by: Akshey Sigdel <sigdelakshey@gmail.com>
@akgitrepos
Copy link
Author

@samikshya-db Any additional feedback on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants