Track external accumulators in tracer instead of using SparkInfo values#10553
charlesmyu wants to merge 4 commits into master
Benchmarks
Startup
Summary: Found 0 performance improvements and 0 performance regressions! Performance is the same for 68 metrics, 3 unstable metrics.
Startup time reports for insecure-bank
gantt
title insecure-bank - global startup overhead: candidate=1.61.0-SNAPSHOT~7e4b7dec99, baseline=1.61.0-SNAPSHOT~70410da0e2
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.057 s) : 0, 1057449
Total [baseline] (8.827 s) : 0, 8827399
Agent [candidate] (1.06 s) : 0, 1059530
Total [candidate] (8.843 s) : 0, 8843339
section iast
Agent [baseline] (1.225 s) : 0, 1224554
Total [baseline] (9.587 s) : 0, 9586709
Agent [candidate] (1.228 s) : 0, 1227830
Total [candidate] (9.588 s) : 0, 9587986
gantt
title insecure-bank - break down per module: candidate=1.61.0-SNAPSHOT~7e4b7dec99, baseline=1.61.0-SNAPSHOT~70410da0e2
dateFormat X
axisFormat %s
section tracing
crashtracking [baseline] (1.2 ms) : 0, 1200
crashtracking [candidate] (1.207 ms) : 0, 1207
BytebuddyAgent [baseline] (627.634 ms) : 0, 627634
BytebuddyAgent [candidate] (629.037 ms) : 0, 629037
AgentMeter [baseline] (29.045 ms) : 0, 29045
AgentMeter [candidate] (29.109 ms) : 0, 29109
GlobalTracer [baseline] (256.393 ms) : 0, 256393
GlobalTracer [candidate] (257.641 ms) : 0, 257641
AppSec [baseline] (31.44 ms) : 0, 31440
AppSec [candidate] (31.436 ms) : 0, 31436
Debugger [baseline] (58.472 ms) : 0, 58472
Debugger [candidate] (58.329 ms) : 0, 58329
Remote Config [baseline] (603.742 µs) : 0, 604
Remote Config [candidate] (589.84 µs) : 0, 590
Telemetry [baseline] (8.624 ms) : 0, 8624
Telemetry [candidate] (8.703 ms) : 0, 8703
Flare Poller [baseline] (7.944 ms) : 0, 7944
Flare Poller [candidate] (7.4 ms) : 0, 7400
section iast
crashtracking [baseline] (1.195 ms) : 0, 1195
crashtracking [candidate] (1.215 ms) : 0, 1215
BytebuddyAgent [baseline] (794.232 ms) : 0, 794232
BytebuddyAgent [candidate] (797.742 ms) : 0, 797742
AgentMeter [baseline] (11.316 ms) : 0, 11316
AgentMeter [candidate] (11.344 ms) : 0, 11344
GlobalTracer [baseline] (247.028 ms) : 0, 247028
GlobalTracer [candidate] (246.861 ms) : 0, 246861
IAST [baseline] (25.158 ms) : 0, 25158
IAST [candidate] (25.092 ms) : 0, 25092
AppSec [baseline] (26.313 ms) : 0, 26313
AppSec [candidate] (26.277 ms) : 0, 26277
Debugger [baseline] (62.596 ms) : 0, 62596
Debugger [candidate] (62.886 ms) : 0, 62886
Remote Config [baseline] (522.741 µs) : 0, 523
Remote Config [candidate] (531.68 µs) : 0, 532
Telemetry [baseline] (14.864 ms) : 0, 14864
Telemetry [candidate] (14.844 ms) : 0, 14844
Flare Poller [baseline] (5.185 ms) : 0, 5185
Flare Poller [candidate] (4.892 ms) : 0, 4892
Startup time reports for petclinic
gantt
title petclinic - global startup overhead: candidate=1.61.0-SNAPSHOT~7e4b7dec99, baseline=1.61.0-SNAPSHOT~70410da0e2
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.065 s) : 0, 1064769
Total [baseline] (11.089 s) : 0, 11088576
Agent [candidate] (1.061 s) : 0, 1060971
Total [candidate] (11.041 s) : 0, 11040965
section appsec
Agent [baseline] (1.246 s) : 0, 1246079
Total [baseline] (11.207 s) : 0, 11206686
Agent [candidate] (1.245 s) : 0, 1244613
Total [candidate] (11.083 s) : 0, 11082766
section iast
Agent [baseline] (1.229 s) : 0, 1229372
Total [baseline] (11.387 s) : 0, 11387054
Agent [candidate] (1.233 s) : 0, 1232798
Total [candidate] (11.36 s) : 0, 11360308
section profiling
Agent [baseline] (1.182 s) : 0, 1181885
Total [baseline] (11.133 s) : 0, 11132711
Agent [candidate] (1.189 s) : 0, 1188624
Total [candidate] (11.068 s) : 0, 11068497
gantt
title petclinic - break down per module: candidate=1.61.0-SNAPSHOT~7e4b7dec99, baseline=1.61.0-SNAPSHOT~70410da0e2
dateFormat X
axisFormat %s
section tracing
crashtracking [baseline] (1.208 ms) : 0, 1208
crashtracking [candidate] (1.19 ms) : 0, 1190
BytebuddyAgent [baseline] (631.588 ms) : 0, 631588
BytebuddyAgent [candidate] (629.063 ms) : 0, 629063
AgentMeter [baseline] (29.32 ms) : 0, 29320
AgentMeter [candidate] (29.077 ms) : 0, 29077
GlobalTracer [baseline] (258.122 ms) : 0, 258122
GlobalTracer [candidate] (257.641 ms) : 0, 257641
AppSec [baseline] (31.575 ms) : 0, 31575
AppSec [candidate] (31.435 ms) : 0, 31435
Debugger [baseline] (59.617 ms) : 0, 59617
Debugger [candidate] (59.429 ms) : 0, 59429
Remote Config [baseline] (586.017 µs) : 0, 586
Remote Config [candidate] (587.179 µs) : 0, 587
Telemetry [baseline] (8.652 ms) : 0, 8652
Telemetry [candidate] (8.636 ms) : 0, 8636
Flare Poller [baseline] (7.974 ms) : 0, 7974
Flare Poller [candidate] (7.849 ms) : 0, 7849
section appsec
crashtracking [baseline] (1.186 ms) : 0, 1186
crashtracking [candidate] (1.184 ms) : 0, 1184
BytebuddyAgent [baseline] (658.141 ms) : 0, 658141
BytebuddyAgent [candidate] (656.856 ms) : 0, 656856
AgentMeter [baseline] (12.002 ms) : 0, 12002
AgentMeter [candidate] (12.05 ms) : 0, 12050
GlobalTracer [baseline] (258.169 ms) : 0, 258169
GlobalTracer [candidate] (257.86 ms) : 0, 257860
IAST [baseline] (23.932 ms) : 0, 23932
IAST [candidate] (23.899 ms) : 0, 23899
AppSec [baseline] (177.978 ms) : 0, 177978
AppSec [candidate] (178.099 ms) : 0, 178099
Debugger [baseline] (65.328 ms) : 0, 65328
Debugger [candidate] (65.496 ms) : 0, 65496
Remote Config [baseline] (569.732 µs) : 0, 570
Remote Config [candidate] (566.084 µs) : 0, 566
Telemetry [baseline] (8.913 ms) : 0, 8913
Telemetry [candidate] (8.814 ms) : 0, 8814
Flare Poller [baseline] (3.615 ms) : 0, 3615
Flare Poller [candidate] (3.549 ms) : 0, 3549
section iast
crashtracking [baseline] (1.189 ms) : 0, 1189
crashtracking [candidate] (1.203 ms) : 0, 1203
BytebuddyAgent [baseline] (796.554 ms) : 0, 796554
BytebuddyAgent [candidate] (800.887 ms) : 0, 800887
AgentMeter [baseline] (11.349 ms) : 0, 11349
AgentMeter [candidate] (11.503 ms) : 0, 11503
GlobalTracer [baseline] (248.323 ms) : 0, 248323
GlobalTracer [candidate] (247.748 ms) : 0, 247748
IAST [baseline] (25.135 ms) : 0, 25135
IAST [candidate] (25.31 ms) : 0, 25310
AppSec [baseline] (26.358 ms) : 0, 26358
AppSec [candidate] (26.419 ms) : 0, 26419
Debugger [baseline] (63.853 ms) : 0, 63853
Debugger [candidate] (63.299 ms) : 0, 63299
Remote Config [baseline] (532.419 µs) : 0, 532
Remote Config [candidate] (525.386 µs) : 0, 525
Telemetry [baseline] (14.943 ms) : 0, 14943
Telemetry [candidate] (14.852 ms) : 0, 14852
Flare Poller [baseline] (4.987 ms) : 0, 4987
Flare Poller [candidate] (4.892 ms) : 0, 4892
section profiling
crashtracking [baseline] (1.166 ms) : 0, 1166
crashtracking [candidate] (1.181 ms) : 0, 1181
BytebuddyAgent [baseline] (682.435 ms) : 0, 682435
BytebuddyAgent [candidate] (687.335 ms) : 0, 687335
AgentMeter [baseline] (8.614 ms) : 0, 8614
AgentMeter [candidate] (8.668 ms) : 0, 8668
GlobalTracer [baseline] (215.44 ms) : 0, 215440
GlobalTracer [candidate] (216.345 ms) : 0, 216345
AppSec [baseline] (31.827 ms) : 0, 31827
AppSec [candidate] (32.059 ms) : 0, 32059
Debugger [baseline] (65.401 ms) : 0, 65401
Debugger [candidate] (65.394 ms) : 0, 65394
Remote Config [baseline] (578.813 µs) : 0, 579
Remote Config [candidate] (578.657 µs) : 0, 579
Telemetry [baseline] (8.186 ms) : 0, 8186
Telemetry [candidate] (8.192 ms) : 0, 8192
Flare Poller [baseline] (3.495 ms) : 0, 3495
Flare Poller [candidate] (3.503 ms) : 0, 3503
ProfilingAgent [baseline] (93.805 ms) : 0, 93805
ProfilingAgent [candidate] (94.251 ms) : 0, 94251
Profiling [baseline] (94.372 ms) : 0, 94372
Profiling [candidate] (94.817 ms) : 0, 94817
Load
Summary: Found 1 performance improvements and 2 performance regressions! Performance is the same for 18 metrics, 15 unstable metrics.
Request duration reports for petclinic
gantt
title petclinic - request duration [CI 0.99] : candidate=1.61.0-SNAPSHOT~7e4b7dec99, baseline=1.61.0-SNAPSHOT~70410da0e2
dateFormat X
axisFormat %s
section baseline
no_agent (18.94 ms) : 18748, 19131
. : milestone, 18940,
appsec (18.363 ms) : 18180, 18546
. : milestone, 18363,
code_origins (17.903 ms) : 17725, 18081
. : milestone, 17903,
iast (17.878 ms) : 17696, 18060
. : milestone, 17878,
profiling (19.718 ms) : 19520, 19916
. : milestone, 19718,
tracing (17.55 ms) : 17379, 17722
. : milestone, 17550,
section candidate
no_agent (19.207 ms) : 19010, 19404
. : milestone, 19207,
appsec (18.585 ms) : 18398, 18772
. : milestone, 18585,
code_origins (17.64 ms) : 17462, 17818
. : milestone, 17640,
iast (17.912 ms) : 17734, 18090
. : milestone, 17912,
profiling (18.953 ms) : 18762, 19144
. : milestone, 18953,
tracing (17.498 ms) : 17324, 17672
. : milestone, 17498,
Request duration reports for insecure-bank
gantt
title insecure-bank - request duration [CI 0.99] : candidate=1.61.0-SNAPSHOT~7e4b7dec99, baseline=1.61.0-SNAPSHOT~70410da0e2
dateFormat X
axisFormat %s
section baseline
no_agent (1.203 ms) : 1192, 1215
. : milestone, 1203,
iast (3.109 ms) : 3068, 3149
. : milestone, 3109,
iast_FULL (5.72 ms) : 5664, 5777
. : milestone, 5720,
iast_GLOBAL (3.477 ms) : 3419, 3534
. : milestone, 3477,
profiling (2.046 ms) : 2029, 2064
. : milestone, 2046,
tracing (1.862 ms) : 1846, 1877
. : milestone, 1862,
section candidate
no_agent (1.182 ms) : 1170, 1193
. : milestone, 1182,
iast (3.364 ms) : 3324, 3404
. : milestone, 3364,
iast_FULL (5.752 ms) : 5695, 5810
. : milestone, 5752,
iast_GLOBAL (3.463 ms) : 3411, 3516
. : milestone, 3463,
profiling (2.084 ms) : 2065, 2102
. : milestone, 2084,
tracing (1.796 ms) : 1781, 1811
. : milestone, 1796,
Dacapo
Summary: Found 0 performance improvements and 0 performance regressions! Performance is the same for 11 metrics, 1 unstable metrics.
Execution time for biojava
gantt
title biojava - execution time [CI 0.99] : candidate=1.61.0-SNAPSHOT~7e4b7dec99, baseline=1.61.0-SNAPSHOT~70410da0e2
dateFormat X
axisFormat %s
section baseline
no_agent (15.4 s) : 15400000, 15400000
. : milestone, 15400000,
appsec (15.006 s) : 15006000, 15006000
. : milestone, 15006000,
iast (17.741 s) : 17741000, 17741000
. : milestone, 17741000,
iast_GLOBAL (17.357 s) : 17357000, 17357000
. : milestone, 17357000,
profiling (14.796 s) : 14796000, 14796000
. : milestone, 14796000,
tracing (15.117 s) : 15117000, 15117000
. : milestone, 15117000,
section candidate
no_agent (14.97 s) : 14970000, 14970000
. : milestone, 14970000,
appsec (15.066 s) : 15066000, 15066000
. : milestone, 15066000,
iast (17.927 s) : 17927000, 17927000
. : milestone, 17927000,
iast_GLOBAL (17.643 s) : 17643000, 17643000
. : milestone, 17643000,
profiling (14.817 s) : 14817000, 14817000
. : milestone, 14817000,
tracing (15.147 s) : 15147000, 15147000
. : milestone, 15147000,
Execution time for tomcat
gantt
title tomcat - execution time [CI 0.99] : candidate=1.61.0-SNAPSHOT~7e4b7dec99, baseline=1.61.0-SNAPSHOT~70410da0e2
dateFormat X
axisFormat %s
section baseline
no_agent (1.481 ms) : 1470, 1493
. : milestone, 1481,
appsec (3.848 ms) : 3624, 4072
. : milestone, 3848,
iast (2.287 ms) : 2216, 2357
. : milestone, 2287,
iast_GLOBAL (2.319 ms) : 2248, 2389
. : milestone, 2319,
profiling (2.134 ms) : 2076, 2192
. : milestone, 2134,
tracing (2.086 ms) : 2032, 2141
. : milestone, 2086,
section candidate
no_agent (1.484 ms) : 1472, 1495
. : milestone, 1484,
appsec (3.85 ms) : 3625, 4075
. : milestone, 3850,
iast (2.278 ms) : 2208, 2348
. : milestone, 2278,
iast_GLOBAL (2.327 ms) : 2256, 2397
. : milestone, 2327,
profiling (2.125 ms) : 2068, 2183
. : milestone, 2125,
tracing (2.092 ms) : 2037, 2146
. : milestone, 2092,
pawel-big-lebowski left a comment:
Nice, elegant implementation tackling a complex problem. I only left a small comment.
private static final MethodHandles methodLoader =
    new MethodHandles(ClassLoader.getSystemClassLoader());
private static final MethodHandle externalAccums =
    methodLoader.method(TaskMetrics.class, "externalAccums");
Could you provide some documentation on why we need reflection, and which Spark versions support externalAccums/withExternalAccums?
I can't find any good public-facing docs for this (probably because it's an internal API), but it seems like the relevant commit is here: apache/spark@b33a3ee
Somewhere in Spark v3.5.2, there was a change to move from directly accessing externalAccums to using the withExternalAccums pattern. Unfortunately, it seems the change was made to remediate a performance regression, so it came with no backwards compatibility. As a result, we need reflection to figure out which method to use when pulling the accumulators.
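For readers following the thread, the fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: OldTaskMetrics, NewTaskMetrics and ExternalAccumsReader are hypothetical stand-ins, the accumulators are modelled as plain strings, and the sketch uses java.lang.invoke directly rather than the tracer's own MethodHandles helper. Real Spark passes a Scala function to withExternalAccums; java.util.function.Function is used here only to keep the example self-contained.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for an older TaskMetrics: accumulators are exposed directly.
class OldTaskMetrics {
  public List<String> externalAccums() {
    return Arrays.asList("acc-1", "acc-2");
  }
}

// Hypothetical stand-in for a newer TaskMetrics: accumulators are only
// reachable through a callback-style accessor.
class NewTaskMetrics {
  public <T> T withExternalAccums(Function<List<String>, T> op) {
    return op.apply(Arrays.asList("acc-3"));
  }
}

final class ExternalAccumsReader {
  private ExternalAccumsReader() {}

  // Returns a handle for the named method, or null if the class does not have it.
  static MethodHandle lookup(Class<?> owner, String name, MethodType type) {
    try {
      return MethodHandles.lookup().findVirtual(owner, name, type);
    } catch (NoSuchMethodException | IllegalAccessException e) {
      return null;
    }
  }

  // Tries the direct accessor first, then the callback accessor.
  // A real implementation would resolve the handles once and cache them.
  @SuppressWarnings("unchecked")
  static List<String> read(Object metrics) {
    Class<?> owner = metrics.getClass();
    try {
      MethodHandle direct =
          lookup(owner, "externalAccums", MethodType.methodType(List.class));
      if (direct != null) {
        return (List<String>) direct.invoke(metrics);
      }
      // The generic method erases to Object withExternalAccums(Function).
      MethodHandle callback =
          lookup(owner, "withExternalAccums",
              MethodType.methodType(Object.class, Function.class));
      if (callback != null) {
        Function<List<String>, List<String>> identity = accums -> accums;
        return (List<String>) callback.invoke(metrics, identity);
      }
    } catch (Throwable t) {
      // Reflection failed at call time; treat it as "no accumulators available".
    }
    return Collections.emptyList();
  }
}
```

Calling ExternalAccumsReader.read with either stand-in returns its accumulators, and an object with neither method yields an empty list, so callers do not need to know which Spark version is on the classpath.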

What Does This Do
Updates the metrics in the _dd.spark.sql_plan meta field to use distributions calculated from individual task metrics, rather than the naively summed metrics provided by the StageInfo objects from Spark. This is because StageInfo naively sums all accumulators, even though that may not make sense for certain Spark SQL metrics (e.g. avg hash probes per key for aggr operations). Instead, we should accumulate those ourselves into distribution metrics and emit them accordingly.
Currently in the UI, this is only used in one place (in the Spark SQL metrics in the DJM product), so we're not too worried about changing the format here. UI update to follow.
If any issues arise with sending traces with a larger number of histograms, we can disable it using the DD_SPARK_TASK_HISTOGRAM_ENABLED feature flag.
Motivation
We'd like accurate metrics for Spark SQL operations that can reflect task-level characteristics as a distribution. This brings us more in line with what is shown in the Spark UI.
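To illustrate why a sum can mislead for a per-task metric such as avg hash probes per key, here is a small sketch of keeping per-task values as a distribution. TaskMetricDistribution is a hypothetical class for this example only; it stores raw values and uses a nearest-rank percentile, which is an assumption made for clarity and not necessarily how the histograms in this PR are built.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical holder for the per-task values of one Spark SQL metric.
final class TaskMetricDistribution {
  private final List<Long> values = new ArrayList<>();

  // Records the value one task reported for this metric.
  void record(long taskValue) {
    values.add(taskValue);
  }

  // The naive aggregate: adds every task's value together.
  long sum() {
    long total = 0;
    for (long v : values) {
      total += v;
    }
    return total;
  }

  // min, max and percentile assume at least one value has been recorded.
  long min() {
    return Collections.min(values);
  }

  long max() {
    return Collections.max(values);
  }

  // Nearest-rank percentile for p in (0, 100].
  long percentile(double p) {
    List<Long> sorted = new ArrayList<>(values);
    Collections.sort(sorted);
    int rank = (int) Math.ceil(p / 100.0 * sorted.size());
    return sorted.get(Math.max(rank, 1) - 1);
  }
}
```

With four tasks reporting 1, 1, 2 and 4 probes per key, the sum is 8, a number no single task ever observed, while the distribution shows a minimum of 1, a median of 1 and a maximum of 4.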

Additional Notes
We can't get rid of the original map that tracks accumulators to stages, as we still use it to associate Spark SQL operations with stages. However, we no longer need to store the entire accumulator, and can instead store a simple map of accumulator ID to stage ID. This will be done in a follow-up PR: #10645
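As a rough picture of the lighter-weight structure mentioned above, the sketch below keeps only the ID-to-ID association. AccumulatorStageIndex and its method names are hypothetical and are not taken from this PR or the follow-up; the use of ConcurrentHashMap is an assumption that listener callbacks may arrive on different threads.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical index that remembers which stage an accumulator belongs to
// without retaining the accumulator object itself.
final class AccumulatorStageIndex {
  private final Map<Long, Integer> stageByAccumulatorId = new ConcurrentHashMap<>();

  // Associates an accumulator ID with the stage that reported it.
  void register(long accumulatorId, int stageId) {
    stageByAccumulatorId.put(accumulatorId, stageId);
  }

  // Returns the stage ID for the accumulator, or null if it was never registered.
  Integer stageFor(long accumulatorId) {
    return stageByAccumulatorId.get(accumulatorId);
  }
}
```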
Contributor Checklist
type: and (comp: or inst:) labels in addition to any other useful labels
close, fix, or any linking keywords when referencing an issue
Use solves instead, and assign the PR milestone to the issue
Jira ticket: [PROJ-IDENT]
Note: Once your PR is ready to merge, add it to the merge queue by commenting /merge. /merge -c cancels the queue request. /merge -f --reason "reason" skips all merge queue checks; please use this judiciously, as some checks do not run at the PR-level. For more information, see this doc.