Reduce garbage generation in prometheus_text_format
#196
Conversation
This looks amazing! I see it is still marked as draft so I guess no rush, and I'm away from the computer for a few days so I'm only having a look at this on my phone now. Nevertheless, I'd love to see this ready and merged, thank you so much for the work 😃
Force-pushed from ceb4c9f to ce32cfe
Yep no real rush on this! Looks like I have some work to do to make the CI happy anyways.
I've approved the workflow run.
|
NelsonVides left a comment
I'm still away from a proper computer so still quickly looking at this from my phone. Thanks a lot for having it pass CI!
I have a question about keeping the code good-looking: the process dictionary is a bit ugly. Is there a way to refactor those usages to pass something more functional, or does it have a performance impact that way?
Also for the compiled regex, perhaps storing it in a persistent term that is created at startup could help that perform even faster? It would literally be compiled only once through the entire VM lifetime and never require any GC.
If you don't see an easy way to improve that then it's probably fine to merge this way and later when I'm back to a full computer I'll try to refactor and tag you in a potential PR 🤔
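A minimal sketch of the persistent term idea from this comment; the key name and the escape pattern here are illustrative assumptions, not the library's actual code:

```erlang
%% Compile the regex once at application startup and cache it for the
%% lifetime of the VM; persistent_term reads are cheap and never GCed.
%% Key and pattern are hypothetical, for illustration only.
-define(ESCAPE_RE_KEY, {prometheus_text_format, escape_re}).

init_escape_re() ->
    {ok, RE} = re:compile(<<"[\\\\\n\"]">>),
    persistent_term:put(?ESCAPE_RE_KEY, RE).

escape(Bin) when is_binary(Bin) ->
    RE = persistent_term:get(?ESCAPE_RE_KEY),
    %% Prefix each special character with a backslash (simplified rule).
    re:replace(Bin, RE, <<"\\\\&">>, [global, {return, binary}]).
```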
Yeah the process dictionary part here is really icky. I thought in my earlier testing that I saw … For the …
850a174 to
f7dc0e1
Compare
NelsonVides
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are so many checks against whether something is a binary, and if it is not then `iolist_to_binary`, that I really wish everything had been just binaries to begin with. But that's a breaking change for another day 😄
Anyway, I have a couple of comments regarding some crazier performance optimisations and making tracing more readable. It's not really important, and if it is too annoying I could just shuffle this code myself in a couple of weeks. But maybe you like the idea, or you didn't know those tricks and want to do it yourself, so just sharing. WDYT? 🙂
`prometheus_text_format:has_special_char/1` is called very often when a registry contains many metrics with label pairs. We can use `binary:match/2` to search within a label binary for the special characters (newline, backslash and double-quote) without allocation. The old code, using bit-syntax matching, creates a match context every time the function is called (recursive calls reuse the match context, but each fresh call allocates a new one). A match context allocates 5 words on the process heap when it is created, so when matching many binaries this adds up to a noticeable amount of short-lived garbage. In comparison, `binary:match/2` with a precompiled match pattern does not allocate, and its BIF is very well optimized, using `memchr` since OTP 22.
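A sketch of that approach, assuming the pattern would in practice be compiled once and cached (e.g. in `persistent_term`, as suggested earlier) rather than rebuilt per call:

```erlang
%% binary:match/2 scans the subject without allocating a match context
%% on the process heap; the compiled pattern is reusable across calls.
has_special_char(Bin) when is_binary(Bin) ->
    %% Built here for a self-contained example; real code would cache this.
    Pattern = binary:compile_pattern([<<"\n">>, <<"\\">>, <<"\"">>]),
    binary:match(Bin, Pattern) =/= nomatch.
```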
Force-pushed from f7dc0e1 to 5662cb0
All of that sounds good to me - I applied all of the suggestions. For the …
NelsonVides left a comment
Loving it. Got one more question :D
NelsonVides left a comment
Just dirty stuff - this was there before your code changes anyway, it just shows up in the diff now.
Force-pushed from 06af086 to 715663d
The formatting callback for a registry can build each metrics family as
a single binary in order to reduce garbage. This mainly involves passing
the accumulator binary through all functions that append to it.
It's more efficient to append to the resulting binary than to allocate
smaller binaries and then append them. For example:
```erlang
<<Blob/binary, Name/binary, "_", Suffix/binary>>.
%% versus
Combined = <<Name/binary, "_", Suffix/binary>>,
<<Blob/binary, Combined/binary>>.
```
The first expression generates less garbage than the second. A good
example of this was the `add_brackets/1` function, which was inlined.
According to the compiler, inlining unfortunately does not turn the
second expression (above) into the first, so we paid the cost of
creating a binary with brackets and then copying that into the larger
blob, rather than copying the parts in directly. This change manually
inlines `add_brackets/1` into its caller `render_series/4`.
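An illustrative before/after of that manual inlining; the function shapes are assumptions for the sake of the example, not the library's exact code:

```erlang
%% Before (sketch): the helper allocates an intermediate bracketed
%% binary, which is then copied a second time into the accumulator.
add_brackets(LString) ->
    <<"{", LString/binary, "}">>.

render_series_before(Acc, Name, LString) ->
    <<Acc/binary, Name/binary, (add_brackets(LString))/binary>>.

%% After (sketch): append directly into the accumulator, so the bytes
%% are copied only once and no intermediate binary is created.
render_series_after(Acc, Name, LString) ->
    <<Acc/binary, Name/binary, "{", LString/binary, "}">>.
```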
This change also changes some list strings into binaries. Especially for
ASCII, binaries are _far_ more compact than list strings. Lists need
two words per ASCII character - one for the character and one for the
tail pointer. So it's like UTF-32 but worse, basically UTF-128 on a
64-bit machine. ASCII or UTF-8 text in a binary takes one byte per
character in the binary's array, plus a word or two of metadata. E.g.
`<<"hello">>` allocates three words while `"hello"` allocates ten.
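Those word counts can be checked in the shell with `erts_debug:size/1`, which reports a term's heap size in words (64-bit output shown):

```erlang
1> erts_debug:size("hello").
10
2> erts_debug:size(<<"hello">>).
3
```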
Building on the work in the parent commit, now that the data being passed to the `ram_file` is a binary, we can instead build the entire output gradually within the process. With the `ram_file` approach we pay I/O overhead from writing to and then reading from the `ram_file`: `ram_file` is a port, so all data is passed between the VM and the port driver. The memory consumed by a port driver is also invisible to the VM's allocator, so large port driver resource usage should be avoided where possible.

Instead, this change refactors the `registry_collect_callback` to fold over collectors and build an accumulator. The `create_mf` callback's return of `ok` forces us to store this rather than pass and return it, so it's a little less hygienic, but it is more efficient than passing data in and out of a port.

This also introduces a function `format_into/3` which can use this folding function directly. This can be used to avoid collecting the entire response in one binary; instead the response can be streamed, with `cowboy_req:stream_body/3` for example.
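For example, a sketch of a Cowboy handler streaming the scrape via `format_into/3`. The argument order of `format_into/3` and the shape of the formatting fun are assumptions here (see the follow-up comment about argument order below); only `cowboy_req:stream_reply/3` and `cowboy_req:stream_body/3` are real Cowboy calls:

```erlang
%% Hypothetical handler: stream each formatted chunk to the client
%% instead of accumulating the whole response in one binary.
init(Req0, State) ->
    Req = cowboy_req:stream_reply(
            200,
            #{<<"content-type">> => <<"text/plain; version=0.0.4">>},
            Req0),
    %% Assumed signature: format_into(FormatFun, InitialState, Registry).
    %% The fun is fold-like: accumulated state first, new data second.
    _ = prometheus_text_format:format_into(
          fun(Acc, Data) ->
                  ok = cowboy_req:stream_body(Data, nofin, Req),
                  Acc
          end, ok, default),
    ok = cowboy_req:stream_body(<<>>, fin, Req),
    {ok, Req, State}.
```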
Force-pushed from 7102c10 to 665ce75
NelsonVides left a comment
A blast of a change! Thank you so much for all the effort and also, for keeping such a tidy git history :)
@the-mikedavis gonna get a release to hex done over the weekend 👍🏽
Sweet, thanks @NelsonVides!
Published https://hex.pm/packages/prometheus/6.1.0 🎉
```erlang
        "\n"
    >>,
    Bin = render_metrics(Prologue, Name, Metrics),
    put(?MODULE, Fmt(Bin, erase(?MODULE)))
```
Bah, I made a mistake here with the order of the arguments. The function should take the state as the first argument and the new data as the second. It doesn't end up making a difference for `format/1` because it just changes the order in which the metrics families are formatted - it's concatenating the wrong way around. But using `format_into/3` with a custom formatting function doesn't work properly. I'll send a follow-up PR (edit: #197)
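In other words, a sketch of the fix, reusing the names from the quoted diff above: the fold-like call would pass the accumulated state first and the newly rendered data second.

```erlang
%% State (accumulated output) first, new data second:
put(?MODULE, Fmt(erase(?MODULE), Bin))
```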
This is a somewhat small set of patches to `prometheus_text_format` which aim to reduce garbage creation during registry formatting. Reducing garbage creation drives down the cost to the VM of scraping large registries - both in terms of peak memory allocation and also the work that the garbage collector must do.

With these changes I see a reduction in allocation reported by `tprof` in a stress test of one of RabbitMQ's most expensive registries. In a test against single-instance RabbitMQ brokers on EC2 instances this saves a noticeable amount of peak memory and reduces CPU utilization significantly.

tprof testing instructions:

1. Clone https://github.com/rabbitmq/rabbitmq-server
2. `cd rabbitmq-server`
3. `make deps`
4. `make run-broker`
5. In another terminal in the `rabbitmq-server` repo, run `sbin/rabbitmqctl import_definitions path/to/100k-classic-queues.json` pointing to this definitions file.
6. In the `make run-broker` terminal, start `tprof` tracing for new processes: `tprof:start(#{type => call_memory}), tprof:enable_trace(new), tprof:set_pattern('_', '_', '_').`
7. `curl -v localhost:15692/metrics/per-object --output /dev/null`
8. `tprof:format(tprof:inspect(tprof:collect())).`

To test this change, Ctrl-C twice out of `make run-broker`, `cd deps/prometheus` and check out this branch. Then `rm -rf ebin` in that directory, `cd ../../` and repeat steps 4, 6, 7 and 8 again (skipping the definitions import).

Registry collection tprof measurement before this change...
Registry collection tprof measurement after this change...
So with this change, the Cowboy request process in charge of this endpoint allocates 147_597_866 words instead of 193_184_463, a reduction of 45_586_597 words or 23.6%.

Stress-testing on EC2...
On EC2 I have two `m7g.xlarge` instances running RabbitMQ: `galactica`, which carries this change, and `kestrel`, which uses `prometheus` at v5.1.1 (the latest version RabbitMQ has adopted). A third instance `curl`s these instances at an interval of two seconds with a script that asynchronously fires off a scrape request every two seconds for twenty minutes. The third node runs this script against both `galactica` and `kestrel` at the same time. The third node also scrapes these nodes' `node_exporter` metrics and the RabbitMQ prometheus endpoint for Erlang allocator metrics.

`kestrel` (baseline): [graphs of instance-wide memory usage, instance-wide CPU usage and Erlang allocators]

`galactica` (this branch): [graphs of instance-wide memory usage, instance-wide CPU usage and Erlang allocators]
We can see `kestrel` (baseline) pinned at around 95% CPU usage consistently, hovering at around 9-10 GB instance-wide memory usage, with the VM aware of 3.5-4.5 GB of usage; and `galactica` (this branch) sitting at 50% CPU usage, around 7.5-8.5 GB instance-wide memory, with the VM tracking around 2-3 GB of memory.

While the peak memory usage is reduced nicely, the main benefit is that the CPU is loaded much less than before - I assume from performing less garbage collection.