Conversation

@the-mikedavis
Contributor

This is a somewhat small set of patches to prometheus_text_format which aim to reduce garbage creation during registry formatting. Reducing garbage creation drives down the cost to the VM of scraping large registries, both in terms of peak memory allocation and the work the garbage collector must do.

With these changes I see a reduction in allocation reported by tprof in a stress test of one of RabbitMQ's most expensive registries. In a test against single-instance RabbitMQ brokers on EC2 instances, this saves a noticeable amount of peak memory and significantly reduces CPU utilization.

tprof testing instructions
  1. Clone https://github.com/rabbitmq/rabbitmq-server
  2. cd rabbitmq-server
  3. make deps
  4. make run-broker
  5. In another terminal in the rabbitmq-server repo, run sbin/rabbitmqctl import_definitions path/to/100k-classic-queues.json, pointing it at this definitions file.
  6. In the shell from the make run-broker terminal, start tprof tracing for new processes: tprof:start(#{type => call_memory}), tprof:enable_trace(new), tprof:set_pattern('_', '_', '_').
  7. In another terminal scrape the expensive endpoint: curl -v localhost:15692/metrics/per-object --output /dev/null
  8. When that's done, collect and format the sample: tprof:format(tprof:inspect(tprof:collect())).

To test this change, press Ctrl+C twice to exit make run-broker, then cd deps/prometheus and check out this branch. Then rm -rf ebin in that directory, cd ../../ and repeat steps 4, 6, 7 and 8 (skipping the definitions import).


Registry collection tprof measurement before this change...
****** Process <0.301089.0>  --  100.00% of total *** 
FUNCTION                                                                                   CALLS      WORDS    PER CALL  [    %]
... removed everything less than 1% ...
prometheus_text_format:render_labels/1                                                   2308195    1944642        0.84  [ 1.01]
erlang:atom_to_binary/2                                                                   651584    2375647        3.65  [ 1.23]
prometheus_rabbitmq_core_metrics_collector:'-emit_queue_info/3-fun-0-'/3                  100000    2500000       25.00  [ 1.29]
prometheus_model_helpers:counter_metric/2                                                 301325    3615900       12.00  [ 1.87]
prometheus_text_format:'-render_labels/1-fun-0-'/2                                        321434    4178642       13.00  [ 2.16]
prometheus_rabbitmq_core_metrics_collector:'-collect_metrics/2-lc$^1/1-0-'/2             2300145    4400076        1.91  [ 2.28]
prometheus_model_helpers:'-metrics_from_tuples/2-lc$^0/1-0-'/2                           2308456    4616300        2.00  [ 2.39]
lists:'-filter/2-lc$^0/1-0-'/2                                                           2408461    4816304        2.00  [ 2.49]
erlang:integer_to_binary/1                                                               2206892    6620701        3.00  [ 3.43]
prometheus_rabbitmq_core_metrics_collector:label/1                                       2200038   11000022        5.00  [ 5.69]
prometheus_rabbitmq_core_metrics_collector:'-collect_metrics/2-lc$^0/1-1-'/2             2300145   11500190        5.00  [ 5.95]
prometheus_text_format:'-emit_mf_metrics/2-fun-0-'/3                                     2308150   11541419        5.00  [ 5.97]
prometheus_model_helpers:gauge_metric/2                                                  2006812   24081744       12.00  [12.47]
prometheus_text_format:has_special_char/1                                               23475329   24147190        1.03  [12.50]
prometheus_text_format:render_series/3                                                   2308200   32511401       14.09  [16.83]
ets:match_object/2                                                                            19   38406095  2021373.42  [19.88]
                                                                                                  193184463              [100.0]

Registry collection tprof measurement after this change...
****** Process <0.401000.0>  --  99.99% of total *** 
FUNCTION                                                                                  CALLS      WORDS    PER CALL  [    %]
... removed everything less than 1% ...
prometheus_model_helpers:label_pair/1                                                    429393    1717572        4.00  [ 1.16]
prometheus_text_format:render_labels/1                                                  2308195    1944642        0.84  [ 1.32]
erlang:atom_to_binary/2                                                                  651584    2375647        3.65  [ 1.61]
prometheus_rabbitmq_core_metrics_collector:'-emit_queue_info/3-fun-0-'/3                 100000    2500000       25.00  [ 1.69]
prometheus_model_helpers:counter_metric/2                                                301325    3615900       12.00  [ 2.45]
prometheus_text_format:'-render_labels/1-fun-0-'/2                                       321434    4178642       13.00  [ 2.83]
prometheus_rabbitmq_core_metrics_collector:'-collect_metrics/2-lc$^1/1-0-'/2            2300145    4400076        1.91  [ 2.98]
prometheus_model_helpers:'-metrics_from_tuples/2-lc$^0/1-0-'/2                          2308456    4616300        2.00  [ 3.13]
lists:'-filter/2-lc$^0/1-0-'/2                                                          2408461    4816304        2.00  [ 3.26]
erlang:integer_to_binary/1                                                              2206892    6620705        3.00  [ 4.49]
prometheus_rabbitmq_core_metrics_collector:label/1                                      2200038   11000022        5.00  [ 7.45]
prometheus_rabbitmq_core_metrics_collector:'-collect_metrics/2-lc$^0/1-1-'/2            2300145   11500190        5.00  [ 7.79]
prometheus_text_format:render_series/4                                                  2308200   11541000        5.00  [ 7.82]
prometheus_text_format:render_value/2                                                   2308200   11543618        5.00  [ 7.82]
prometheus_model_helpers:gauge_metric/2                                                 2006812   24081744       12.00  [16.32]
ets:match_object/2                                                                           19   38406095  2021373.42  [26.02]
                                                                                                 147597866              [100.0]

So with this change, the Cowboy request process in charge of this endpoint allocates 147_597_866 words instead of 193_184_463, a reduction of 45_586_597 words or 23.6%.

Stress-testing on EC2...

On EC2 I have two m7g.xlarge instances running RabbitMQ: galactica, which carries this change, and kestrel, which uses prometheus at v5.1.1 (the latest version RabbitMQ has adopted). A third instance curls these instances at an interval of two seconds with this script:

#! /usr/bin/env bash

N=600
SLEEP=2
for i in $(seq 1 $N)
do
  echo "Sleeping ${SLEEP}s... ($i / $N)"
  sleep $SLEEP
  echo "Ask for metrics from $1... ($i / $N)"
  curl -s "http://$1:15692/metrics/per-object" --output /dev/null &
done

wait

This asynchronously fires off a scrape request every two seconds for twenty minutes. The third node runs this script against both galactica and kestrel at the same time; it also scrapes these nodes' node_exporter metrics and RabbitMQ Prometheus endpoints for Erlang allocator metrics.

kestrel (baseline)

Instance-wide memory usage
grafana-kestrel-mem
Instance-wide CPU usage
grafana-kestrel-cpu
Erlang allocators
grafana-kestrel-erlang-alloc

galactica (this branch)

Instance-wide memory usage
grafana-galactica-mem
Instance-wide CPU usage
grafana-galactica-cpu
Erlang allocators
grafana-galactica-erlang-alloc

We can see kestrel (baseline) pinned consistently at around 95% CPU usage, hovering at around 9-10 GB of instance-wide memory with the VM aware of 3.5-4.5 GB of usage, while galactica (this branch) sits at 50% CPU usage, around 7.5-8.5 GB of instance-wide memory, with the VM tracking around 2-3 GB.

While the peak memory usage is reduced nicely, the main benefit is that the CPU is loaded much less than before - I assume from performing less garbage collection.

@NelsonVides
Member

This looks amazing! I see it is still marked as draft so I guess no rush, and I'm away from the computer for a few days so only having a look at this on my phone now. Nevertheless, I'd love to see this ready and merged, thank you so much for the work 😃

@the-mikedavis
Contributor Author

Yep no real rush on this! Looks like I have some work to do to make the CI happy anyways

@the-mikedavis the-mikedavis marked this pull request as ready for review November 14, 2025 03:36
@lhoguin

lhoguin commented Nov 14, 2025

I've approved the workflow run.

@codecov

codecov bot commented Nov 14, 2025

Codecov Report

❌ Patch coverage is 97.72727% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/formats/prometheus_text_format.erl 97.56% 1 Missing ⚠️
Files with missing lines Coverage Δ
src/prometheus_sup.erl 79.41% <100.00%> (+1.99%) ⬆️
src/formats/prometheus_text_format.erl 94.52% <97.56%> (-1.69%) ⬇️

Member

@NelsonVides NelsonVides left a comment


I'm still away from a proper computer so still quickly looking at this from my phone. Thanks a lot for having it pass CI!

I have a question: in terms of good-looking code, the process dictionary is a bit ugly. Is there a way to refactor those usages into something more functional, or would that have a performance impact?

Also for the compiled regex, perhaps storing it in a persistent term that is created at startup could help that perform even faster? It would literally be compiled only once through the entire VM lifetime and never require any GC.

If you don't see an easy way to improve that then it's probably fine to merge this way and later when I'm back to a full computer I'll try to refactor and tag you in a potential PR 🤔

@the-mikedavis
Contributor Author

Yeah, the process dictionary part here is really icky. I thought in my earlier testing that I saw persistent_term:get/2 allocating but, trying again, it looks like I was wrong. So we should definitely move the binary:match/2 pattern into a persistent_term.

For the erase/1+put/2 dance in format_into/3 we can't be more functional because the Collector:collect_mf/2 callback returns ok, so we don't have a nice way to accumulate values. To fix that we'd need a really big breaking change :/
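The persistent_term idea discussed above could look roughly like this - a minimal sketch, assuming the pattern is compiled once at application startup; the key, function names, and module layout here are illustrative, not the library's actual API:

```erlang
%% Hypothetical sketch: compile the special-character match pattern once
%% at startup and fetch it from persistent_term on every call. The key
%% and function names are assumptions for illustration.
-define(PATTERN_KEY, {prometheus_text_format, special_chars}).

%% Called once, e.g. from the application supervisor's init.
init_pattern() ->
    Pattern = binary:compile_pattern([<<"\\">>, <<"\n">>, <<"\"">>]),
    persistent_term:put(?PATTERN_KEY, Pattern).

%% Hot path: persistent_term:get/1 and binary:match/2 with a compiled
%% pattern do not allocate on the process heap.
has_special_char(LabelValue) ->
    Pattern = persistent_term:get(?PATTERN_KEY),
    binary:match(LabelValue, Pattern) =/= nomatch.
```

Because persistent_term values are global and never garbage collected, the compiled pattern is built exactly once for the VM's lifetime.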

@the-mikedavis the-mikedavis force-pushed the md/opt branch 2 times, most recently from 850a174 to f7dc0e1 Compare November 19, 2025 22:27
Member

@NelsonVides NelsonVides left a comment


There are so many checks for whether something is a binary (and, if not, an iolist_to_binary call) that I really wish everything had been just binaries to begin with. But that's a breaking change for another day 😄

Anyway, I have a couple of comments regarding some crazier performance optimisations and making tracing more readable. They're not really important, and if they're too annoying I could just shuffle this code myself in a couple of weeks. But maybe you like the ideas, or you didn't know those tricks and want to do it yourself, so I'm just sharing. WDYT? 🙂

`prometheus_text_format:has_special_char/1` is called very often when
a registry contains many metrics with label pairs. We can use
`binary:match/2` to search within a label binary for the special
characters (newline, backslash and double-quote) without allocation.

The old code using bit-syntax matching creates a match context every
time the function is called (except when recursing - then the match
context is reused). A match context allocates 5 words on the process
heap when it is created. When matching many binaries this scales to
create a noticeable amount of short-lived garbage.

In comparison `binary:match/2` with a precompiled match pattern does not
allocate. The BIF for it is also very well optimized, using `memchr`
since OTP 22.
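The two approaches described in this commit message can be sketched roughly as follows - a simplified illustration, not the library's exact code; the function names are hypothetical:

```erlang
%% Before: bit-syntax matching. Each top-level call allocates a fresh
%% match context (5 words) on the process heap.
has_special_char_old(<<C:8, _/bitstring>>) when C =:= $\\; C =:= $\n; C =:= $" ->
    true;
has_special_char_old(<<_:8, Rest/bitstring>>) ->
    %% Recursive calls reuse the existing match context.
    has_special_char_old(Rest);
has_special_char_old(<<>>) ->
    false.

%% After: binary:match/2 with a precompiled pattern does not allocate,
%% and the BIF has used memchr under the hood since OTP 22.
special_chars_pattern() ->
    binary:compile_pattern([<<"\\">>, <<"\n">>, <<"\"">>]).

has_special_char_new(Bin, Pattern) ->
    binary:match(Bin, Pattern) =/= nomatch.
```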
@the-mikedavis
Contributor Author

All of that sounds good to me - I applied all of the suggestions. For the render_metrics/3 one I tweaked it slightly so that the binary is always the first arg

Member

@NelsonVides NelsonVides left a comment


Loving it. Got one more question :D

Member

@NelsonVides NelsonVides left a comment


Just dirty stuff, this was there before your code changes anyway, it just now shows on the diff.

@the-mikedavis the-mikedavis force-pushed the md/opt branch 2 times, most recently from 06af086 to 715663d Compare November 21, 2025 16:33
The formatting callback for a registry can build each metrics family as
a single binary in order to reduce garbage. This mainly involves passing
the accumulator binary through all functions that append to it.

It's more efficient to append to the resulting binary than to allocate
smaller binaries and then append them. For example:

    <<Blob/binary, Name/binary, "_", Suffix/binary>>.
    %% versus
    Combined = <<Name/binary, "_", Suffix/binary>>,
    <<Blob/binary, Combined/binary>>.

The first expression generates less garbage than the second. A good
example of this was the `add_brackets/1` function. Unfortunately,
according to the compiler, inlining does not turn the second expression
(above) into the first, so we pay the cost of creating a small binary
with brackets and then copying it into the larger blob, rather than
appending the bytes directly. This change manually inlines
`add_brackets/1` into its caller `render_series/4`.

This change also converts some list strings into binaries. Especially
for ASCII, binaries are _far_ more compact than lists. Lists need two
words per ASCII character: one for the character and one for the tail
pointer. So it's like UTF-32 but worse, basically UTF-128 on a 64-bit
machine. ASCII or UTF-8 text in a binary takes one byte per character
in the binary's array, plus a word or two of metadata. E.g.
`<<"hello">>` allocates three words while `"hello"` allocates ten.

Building on the work in the parent commit, now that the data being
passed to the `ram_file` is a binary, we can instead build the entire
output gradually within the process. Previously we paid I/O overhead
from writing to and then reading from the `ram_file`, since `ram_file`
is a port: all data is copied between the VM and the port driver. The
memory consumed by a port driver is also invisible to the VM's
allocator, so large port driver resource usage should be avoided where
possible.

Instead, this change refactors the `registry_collect_callback` to fold
over collectors and build an accumulator. Because the `create_mf`
callback returns `ok`, we have to store the accumulator rather than
pass and return it. It's a little less hygienic, but more efficient
than passing data in and out of a port.

This also introduces a function `format_into/3` which uses this folding
function directly. It can be used to avoid collecting the entire
response in one binary; instead the response can be streamed with
`cowboy_req:stream_body/3`, for example.
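A streaming handler along those lines might look roughly like this - a hypothetical sketch that assumes `format_into/3` takes a registry, a fold function (state first, chunk second), and an initial state; the exact signature may differ from the library's:

```erlang
%% Hypothetical Cowboy handler streaming each rendered chunk to the
%% client instead of accumulating one large response binary. The
%% format_into/3 signature shown here is an assumption.
stream_metrics(Req0) ->
    Req = cowboy_req:stream_reply(
            200, #{<<"content-type">> => <<"text/plain">>}, Req0),
    prometheus_text_format:format_into(
      default,
      fun(State, Chunk) ->
              %% Send each chunk as soon as it is produced.
              ok = cowboy_req:stream_body(Chunk, nofin, Req),
              State
      end,
      undefined),
    %% Signal the end of the response body.
    ok = cowboy_req:stream_body(<<>>, fin, Req),
    Req.
```

This keeps the process's peak heap usage bounded by the largest single metrics family rather than the whole scrape response.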
Member

@NelsonVides NelsonVides left a comment


A blast of a change! Thank you so much for all the effort and also, for keeping such a tidy git history :)

@NelsonVides NelsonVides merged commit 566e985 into prometheus-erl:master Nov 21, 2025
5 checks passed
@NelsonVides
Member

@the-mikedavis gonna get a release to hex done over the weekend 👍🏽

@the-mikedavis the-mikedavis deleted the md/opt branch November 21, 2025 21:00
@the-mikedavis
Contributor Author

Sweet, thanks @NelsonVides!

@NelsonVides
Member

Published https://hex.pm/packages/prometheus/6.1.0 🎉

"\n"
>>,
Bin = render_metrics(Prologue, Name, Metrics),
put(?MODULE, Fmt(Bin, erase(?MODULE)))
Contributor Author

@the-mikedavis the-mikedavis Nov 22, 2025


Bah, I made a mistake here with the order of the arguments. The function should take the state as the first argument and then the new data as the second argument. It doesn't end up making a difference for format/1 because it just changes the order that the metrics families are formatted in - it's just concatenating the wrong way. But using format_into/3 with a custom formatting function doesn't work properly. I'll send a follow-up PR (edit: #197)
