-
Notifications
You must be signed in to change notification settings - Fork 14.9k
KAFKA-18369: State updater's *-ratio metrics are incorrect #21201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Conversation
Pankraz76
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kind of dry.
| public static final String LATENCY_DESCRIPTION_SUFFIX = " in milliseconds"; | ||
| public static final String RATE_DESCRIPTION_PREFIX = "The average number of "; | ||
| public static final String RATE_DESCRIPTION_SUFFIX = " per second"; | ||
| public static final String WINDOWED_RATIO_DESCRIPTION_PREFIX = "The ratio, over a rolling measurement window, "; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| public static final String WINDOWED_RATIO_DESCRIPTION_PREFIX = "The ratio, over a rolling measurement window, "; | |
| public static final String WINDOWED_RATIO_DESCRIPTION_PREFIX = "The ratio, over a rolling measurement window, of the time this thread spent "; |
is it only used like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now this prefix consts is only used for thread. But it could be used for describing against something other units in the future.
The name of this const doesn't carry meaning about the entity (e.g.time, thread) described, so I added another const here to address that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good, it’s your call, but I’d keep it simple and let whoever comes next make further adjustments if they’re ever needed.
| private static final String PUNCTUATE_MAX_LATENCY_DESCRIPTION = "The maximum punctuate latency"; | ||
| private static final String PROCESS_RATIO_DESCRIPTION = | ||
| "The ratio, over a rolling measurement window, of the time this thread spent " + | ||
| WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spent " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spent " + | |
| WINDOWED_RATIO_DESCRIPTION_PREFIX + |
| "processing active tasks to the total elapsed time in that window."; | ||
| private static final String PUNCTUATE_RATIO_DESCRIPTION = | ||
| "The ratio, over a rolling measurement window, of the time this thread spent " + | ||
| WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spent " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spent " + | |
| WINDOWED_RATIO_DESCRIPTION_PREFIX + |
dry
| "punctuating active tasks to the total elapsed time in that window."; | ||
| private static final String POLL_RATIO_DESCRIPTION = | ||
| "The ratio, over a rolling measurement window, of the time this thread spent " + | ||
| WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spent " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spent " + | |
| WINDOWED_RATIO_DESCRIPTION_PREFIX + |
dry
| "polling records from the consumer to the total elapsed time in that window."; | ||
| private static final String COMMIT_RATIO_DESCRIPTION = | ||
| "The ratio, over a rolling measurement window, of the time this thread spent " + | ||
| WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spent " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spent " + | |
| WINDOWED_RATIO_DESCRIPTION_PREFIX + |
dry.
| private static final String UPDATE_RATIO_DESCRIPTION = RATIO_DESCRIPTION + "updating standby tasks"; | ||
| private static final String CHECKPOINT_RATIO_DESCRIPTION = RATIO_DESCRIPTION + "checkpointing tasks restored progress"; | ||
| private static final String IDLE_RATIO_DESCRIPTION = WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spend being idle"; | ||
| private static final String RESTORE_RATIO_DESCRIPTION = WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spend restoring active tasks"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| private static final String RESTORE_RATIO_DESCRIPTION = WINDOWED_RATIO_DESCRIPTION_PREFIX + "of the time this thread spend restoring active tasks"; | |
| private static final String RESTORE_RATIO_DESCRIPTION = WINDOWED_RATIO_DESCRIPTION_PREFIX + " restoring active tasks"; |
not sure about this one
spent
spend
idk if this is any good, all I know this string occurs 22 times making it kind of dry.
"of the time this thread spen"
| private static final String RESTORE_RATIO_DESCRIPTION = RATIO_DESCRIPTION + "restoring active tasks"; | ||
| private static final String UPDATE_RATIO_DESCRIPTION = RATIO_DESCRIPTION + "updating standby tasks"; | ||
| private static final String CHECKPOINT_RATIO_DESCRIPTION = RATIO_DESCRIPTION + "checkpointing tasks restored progress"; | ||
| private static final String IDLE_RATIO_DESCRIPTION = WINDOWED_RATIO_DESCRIPTION_PREFIX + THREAD_TIME_UNIT_DESCRIPTION + "being idle"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason that we split a sentence into two vars (WINDOWED_RATIO_DESCRIPTION_PREFIX and THREAD_TIME_UNIT_DESCRIPTION)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just syntax consideration. The RATIO_DESCRIPTION, like other PREFIX does not cover any actual unit (e.g. per xxx). But it's just my personal preference.
| public static final String LATENCY_DESCRIPTION_SUFFIX = " in milliseconds"; | ||
| public static final String RATE_DESCRIPTION_PREFIX = "The average number of "; | ||
| public static final String RATE_DESCRIPTION_SUFFIX = " per second"; | ||
| public static final String WINDOWED_RATIO_DESCRIPTION_PREFIX = "The ratio, over a rolling measurement window, "; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good, it’s your call, but I’d keep it simple and let whoever comes next make further adjustments if they’re ever needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lucliu1108, I assume that we may need to announce the changes in docs/streams/upgrade-guide.html (same as #21160)
| final double activeRestoreTime = activeRestoreTimeWindowedSum.measure(metricsConfig, now); | ||
| final double standbyRestoreTime = standbyRestoreTimeWindowedSum.measure(metricsConfig, now); | ||
|
|
||
| final double totalTime = idleTime + checkpointTime + activeRestoreTime + standbyRestoreTime; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does "total time" include the time the State Updater thread spends blocked on the internal queue or waiting for tasks? Is this time covered by idleTime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we use the total time as totalTime = Math.max(0L, endTimeMs - totalStartTimeMs); as defined in runOnce()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idleTime should cover the time blocked on the taskAndActions queue.
Good point, i will replace totalTime with the definition in runOnce().
what about this one @lucliu1108 |
docs/streams/upgrade-guide.md
Outdated
|
|
||
| ## Streams API changes in 4.3.0 | ||
|
|
||
| The streams thread metrics `commit-ratio`, `process-ratio`, `punctuate-ratio`, and `poll-ratio` , along with streams state updater metrics `active-restore-ratio`, `standby-restore-ratio`, `idle-ratio`, and `checkpoint-ratio` have been updated. Each metric now reports, over a rolling measurement window, the ratio of time this thread spends performing the given action (`{action}`) to the total elapsed time in that window. The effective window duration is determined by the metrics configuration: `metrics.sample.window.ms` (per-sample window length)and `metrics.num.samples` (number of rolling windows). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and
poll-ratio,
please remove the extra white space.
mjsax
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. Few nits.
| updaterMetrics.checkpointRatioSensor.record((double) totalCheckpointLatency / totalLatency, now); | ||
| recordWindowedSum( | ||
| now, | ||
| (double) totalWaitLatency, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to cast explicitly? The parameter of recordWindowedSum(...) is defined as double, so it seems the explicit cast is unnecessary? Same for the rows below?
streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java
Outdated
Show resolved
Hide resolved
Co-authored-by: Matthias J. Sax <mjsax@apache.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we not track "totalLatency" as a WindowedSum, too? It was done this way in #21160 as runOnceLatencyWindowedSum -- why do we not need the same approach in this PR?
Ticket: https://issues.apache.org/jira/browse/KAFKA-18369
Summary
This PR revises 4 DefaultStateUpdate metrics for Kafka Streams:
idle-ratioactive-restore-ratiostandby-update-ratiocheckpoint-ratioThese metrics should represent the ratio of time spent on a certain
action over a windowed duration, rather than an average of time spent
ratio over iterations. This is implemented using a windowedSum
aggregation for each of the 4 metrics, similar to
KAFKA-18615: StreamThread *-ratio metrics suffer from sampling bias #21160
Each rolling window is controlled by 2 configs:
metrics.sample.window.ms: The window of time a metrics sample iscomputed over, default as 30 seconds.
metrics.num.samples: The number of samples maintained to computemetrics, default as 2.
Reviewers: @mjsax