-
Notifications
You must be signed in to change notification settings - Fork 14.9k
KAFKA-18615: StreamThread *-ratio metrics suffer from sampling bias #21160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamThread.java
Outdated
Show resolved
Hide resolved
mjsax
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also update the description of each metric, saying it's windowed (eg in ThreadMetrics.java)
bbejeck
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @aliehsaeedii - overall LGTM - can we add a description to docs/upgrade.html section for 4.2?
mjsax
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. Few more minor things.
streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamThread.java
Outdated
Show resolved
Hide resolved
streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamThread.java
Outdated
Show resolved
Hide resolved
streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamThread.java
Outdated
Show resolved
Hide resolved
| final double latencyWindow = | ||
| windowedSum.measure(metricsConfig, now); | ||
| ratioSensor.record(latencyWindow / runOnceLatencyWindow); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| final double latencyWindow = | |
| windowedSum.measure(metricsConfig, now); | |
| ratioSensor.record(latencyWindow / runOnceLatencyWindow); | |
| ratioSensor.record(windowedSum.measure(metricsConfig, now) / runOnceLatencyWindow); |
mjsax
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
bbejeck
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @aliehsaeedii LGTM, with one minor comment
docs/streams/upgrade-guide.html
Outdated
| More details can be found in <a href="https://cwiki.apache.org/confluence/x/jQobFw">KIP-1221</a>. | ||
| </p> | ||
|
|
||
| <p>The streams thread metrics <code>commit-ratio</code>, <code>process-ratio</code>, <code>punctuate-ratio</code>, and <code>poll-ratio</code> have been updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fix wont' get into AK 4.2, but only AK 4.3. We are passed code-freeze and only blockers can get merged now.
We need to start a new section for 4.3 and move this change there.
|
Thanks for the fix. Merged to |
Just for my information: how are we planning to merge this, given the mandatory code freeze around this time of year? This is only for development and will not be released anytime soon, correct? However, if a hotfix were required, it would no longer be possible due to the merged feature. I may have understood the code freeze policy differently, so I would appreciate some clarification. Thanks. |
|
Code freeze only applies to |
|
Thanks for the nifty update. So we could also consider other fixes, like: |
This PR implements KAFKA-18615 by adding a windowSum aggregation that
computes the sum of values over a time window, so that
commit-ratio,poll-ratio,process-ratio, andpunctuate-ratiorepresent the ratioof the
{action}over a window duration rather than a single iteration.The effective window duration is whatever you configure for metrics:
metrics.sample.window.ms(per-sample window length)times metrics.num.samples(number of rolling windows)With the default Kafka metrics config, that is typically:
metrics.sample.window.ms = 30000 msmetrics.num.samples = 2→ ~60seconds total rolling window.
Reviewers: Matthias J. Sax matthias@confluent.io, Bill Bejeck
bill@confluent.io, Vincent Potuček (@Pankraz76)