Description
Background
Service graph metrics are identified by the (service.name, service.namespace) tuples of the client and of the server.
If there are multiple instances of a client or a server, these instances are aggregated into a single service graph time series.
In a traditional setup, service graph metrics are generated from Spans. Metrics generation is performed by a central instance, such as Tempo or a central OpenTelemetry Collector. In that case, aggregating over instances is no problem, because the metrics generator has full access to the Spans from all instances.
Problem
OBI's application_service_graph feature allows service graph metrics to be exposed directly, without the need to generate them from Spans.
However, that also means there is no central instance providing service graph metrics: each Beyla instance exposes its metrics independently.
This is an issue if there is more than one instance of a client or a server on different hosts. In that case, the Beyla instances on the different hosts expose the same time series, identified by the same (service.name, service.namespace) tuples, and these time series will overwrite each other in the metrics backend.
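The collision can be illustrated with a minimal sketch. It assumes a deliberately simplified backend that keys each series by its full label set and lets the last write win; the `series_key` and `write` helpers, the service names, and the values are all hypothetical and only serve the illustration.

```python
# Simplified model (an assumption, not a real backend): each series is keyed
# by its full label set, and a later write with the same labels replaces the
# earlier one.
def series_key(labels):
    return tuple(sorted(labels.items()))


def write(store, labels, value):
    store[series_key(labels)] = value


store = {}
labels = {"client": "frontend", "server": "checkout"}

# Two instances on different hosts report the same client/server pair
# with identical labels.
write(store, labels, 120.0)  # reported from host A
write(store, labels, 80.0)   # reported from host B

print(len(store))                 # number of distinct series kept
print(store[series_key(labels)])  # value that survived
```

Only one series survives in this model, and it holds the value written last, so the contribution of the other host is lost.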
Proposal
Add a unique Beyla instance identifier as a label on service graph metrics, so that the time series exposed by different Beyla instances can no longer overwrite each other.
The current service graph visualization in Grafana Tempo Explore uses this query:
sum by (client, server) (rate(traces_service_graph_request_server_seconds_sum[$__range]))
So adding a unique Beyla identifier should not break this visualization, because the query's sum by (client, server) aggregates the extra label away.
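As a rough sketch of why the aggregation hides the extra label, the snippet below mimics `sum by (client, server)` over samples that carry an additional per-instance label. The label name `beyla_instance`, the service names, and the sample values (standing in for per-series rates) are assumptions made for illustration; the issue does not specify the actual label name.

```python
from collections import defaultdict


def sum_by(samples, keep):
    """Sum sample values grouped by the labels in `keep`, dropping all others."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(labels[name] for name in keep)
        totals[key] += value
    return dict(totals)


# Hypothetical samples: `beyla_instance` is an assumed label name.
samples = [
    ({"client": "frontend", "server": "checkout", "beyla_instance": "host-a"}, 1.5),
    ({"client": "frontend", "server": "checkout", "beyla_instance": "host-b"}, 2.5),
    ({"client": "checkout", "server": "db", "beyla_instance": "host-a"}, 1.0),
]

result = sum_by(samples, ("client", "server"))
print(result)
```

The two `frontend` to `checkout` series stay distinct at write time because their label sets differ, yet they collapse into a single edge once grouped by `client` and `server`.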