Skip to content

Conversation

@lucliu1108
Copy link
Contributor

@lucliu1108 lucliu1108 commented Dec 21, 2025

Ticket: https://issues.apache.org/jira/browse/KAFKA-18369

Summary

This PR revises 4 DefaultStateUpdate metrics for Kafka Streams:

  • idle-ratio
  • active-restore-ratio
  • standby-update-ratio
  • checkpoint-ratio
    These metrics should represent the ratio of time spent on a certain
    action over a windowed duration, rather than an average of time spent
    ratio over iterations. This is implemented using a windowedSum
    aggregation for each of the 4 metrics, similar to
    KAFKA-18615: StreamThread *-ratio metrics suffer from sampling bias #21160

Each rolling window is controlled by 2 configs:

  1. metrics.sample.window.ms: The window of time a metrics sample is
    computed over, default as 30 seconds.
  2. metrics.num.samples: The number of samples maintained to compute
    metrics, default as 2.

Reviewers: @mjsax

@github-actions github-actions bot added triage PRs from the community streams labels Dec 21, 2025
Copy link

@Pankraz76 Pankraz76 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kind of dry.

public static final String LATENCY_DESCRIPTION_SUFFIX = " in milliseconds";
public static final String RATE_DESCRIPTION_PREFIX = "The average number of ";
public static final String RATE_DESCRIPTION_SUFFIX = " per second";
public static final String WINDOWED_RATIO_DESCRIPTION_PREFIX = "The ratio, over a rolling measurement window, ";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static final String WINDOWED_RATIO_DESCRIPTION_PREFIX = "The ratio, over a rolling measurement window, ";
public static final String WINDOWED_RATIO_DESCRIPTION_PREFIX = "The ratio, over a rolling measurement window, of the time this thread spent ";

is it only used like this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now this prefix consts is only used for thread. But it could be used for describing against something other units in the future.
The name of this const doesn't carry meaning about the entity (e.g.time, thread) described, so I added another const here to address that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good, it’s your call, but I’d keep it simple and let whoever comes next make further adjustments if they’re ever needed.

private static final String PUNCTUATE_MAX_LATENCY_DESCRIPTION = "The maximum punctuate latency";
private static final String PROCESS_RATIO_DESCRIPTION =
"The ratio, over a rolling measurement window, of the time this thread spent " +
WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spent " +

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spent " +
WINDOWED_RATIO_DESCRIPTION_PREFIX +

"processing active tasks to the total elapsed time in that window.";
private static final String PUNCTUATE_RATIO_DESCRIPTION =
"The ratio, over a rolling measurement window, of the time this thread spent " +
WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spent " +

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spent " +
WINDOWED_RATIO_DESCRIPTION_PREFIX +

dry

"punctuating active tasks to the total elapsed time in that window.";
private static final String POLL_RATIO_DESCRIPTION =
"The ratio, over a rolling measurement window, of the time this thread spent " +
WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spent " +

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spent " +
WINDOWED_RATIO_DESCRIPTION_PREFIX +

dry

"polling records from the consumer to the total elapsed time in that window.";
private static final String COMMIT_RATIO_DESCRIPTION =
"The ratio, over a rolling measurement window, of the time this thread spent " +
WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spent " +

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spent " +
WINDOWED_RATIO_DESCRIPTION_PREFIX +

dry.

private static final String UPDATE_RATIO_DESCRIPTION = RATIO_DESCRIPTION + "updating standby tasks";
private static final String CHECKPOINT_RATIO_DESCRIPTION = RATIO_DESCRIPTION + "checkpointing tasks restored progress";
private static final String IDLE_RATIO_DESCRIPTION = WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spend being idle";
private static final String RESTORE_RATIO_DESCRIPTION = WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spend restoring active tasks";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private static final String RESTORE_RATIO_DESCRIPTION = WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spend restoring active tasks";
private static final String RESTORE_RATIO_DESCRIPTION = WINDOWED_RATIO_DESCRIPTION_PREFIX + " restoring active tasks";

not sure about this one

spent

Image

spend

Image

idk if this is any good, all I know this string occurs 22 times making it kind of dry.

"of the time this thread spen"

Image

@github-actions github-actions bot removed the triage PRs from the community label Dec 23, 2025
private static final String RESTORE_RATIO_DESCRIPTION = RATIO_DESCRIPTION + "restoring active tasks";
private static final String UPDATE_RATIO_DESCRIPTION = RATIO_DESCRIPTION + "updating standby tasks";
private static final String CHECKPOINT_RATIO_DESCRIPTION = RATIO_DESCRIPTION + "checkpointing tasks restored progress";
private static final String IDLE_RATIO_DESCRIPTION = WINDOWED_RATIO_DESCRIPTION_PREFIX + THREAD_TIME_UNIT_DESCRIPTION + "being idle";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason that we split a sentence into two vars (WINDOWED_RATIO_DESCRIPTION_PREFIX and THREAD_TIME_UNIT_DESCRIPTION)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just syntax consideration. The RATIO_DESCRIPTION, like other PREFIX does not cover any actual unit (e.g. per xxx). But it's just my personal preference.

public static final String LATENCY_DESCRIPTION_SUFFIX = " in milliseconds";
public static final String RATE_DESCRIPTION_PREFIX = "The average number of ";
public static final String RATE_DESCRIPTION_SUFFIX = " per second";
public static final String WINDOWED_RATIO_DESCRIPTION_PREFIX = "The ratio, over a rolling measurement window, ";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good, it’s your call, but I’d keep it simple and let whoever comes next make further adjustments if they’re ever needed.

Copy link
Contributor

@aliehsaeedii aliehsaeedii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lucliu1108, I assume that we may need to announce the changes in docs/streams/upgrade-guide.html (same as #21160)

final double activeRestoreTime = activeRestoreTimeWindowedSum.measure(metricsConfig, now);
final double standbyRestoreTime = standbyRestoreTimeWindowedSum.measure(metricsConfig, now);

final double totalTime = idleTime + checkpointTime + activeRestoreTime + standbyRestoreTime;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does "total time" include the time the State Updater thread spends blocked on the internal queue or waiting for tasks? Is this time covered by idleTime?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we use the total time as totalTime = Math.max(0L, endTimeMs - totalStartTimeMs); as defined in runOnce()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idleTime should cover the time blocked on the taskAndActions queue.

Good point, i will replace totalTime with the definition in runOnce().

@aliehsaeedii
Copy link
Contributor

Thanks @lucliu1108, I assume that we may need to announce the changes in docs/streams/upgrade-guide.html (same as #21160)

what about this one @lucliu1108


## Streams API changes in 4.3.0

The streams thread metrics `commit-ratio`, `process-ratio`, `punctuate-ratio`, and `poll-ratio` , along with streams state updater metrics `active-restore-ratio`, `standby-restore-ratio`, `idle-ratio`, and `checkpoint-ratio` have been updated. Each metric now reports, over a rolling measurement window, the ratio of time this thread spends performing the given action (`{action}`) to the total elapsed time in that window. The effective window duration is determined by the metrics configuration: `metrics.sample.window.ms` (per-sample window length)and `metrics.num.samples` (number of rolling windows).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and poll-ratio ,

please remove the extra white space.

Copy link
Member

@mjsax mjsax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Few nits.

updaterMetrics.checkpointRatioSensor.record((double) totalCheckpointLatency / totalLatency, now);
recordWindowedSum(
now,
(double) totalWaitLatency,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to cast explicitly? The parameter of recordWindowedSum(...) is defined as double, so it seems the explicit cast is unnecessary? Same for the rows below?

lucliu1108 and others added 2 commits January 15, 2026 16:27
@lucliu1108 lucliu1108 requested a review from mjsax January 15, 2026 22:37
Copy link
Member

@mjsax mjsax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we not track "totalLatency" as a WindowedSum, too? It was done this way in #21160 as runOnceLatencyWindowedSum -- why do we not need the same approach in this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants