Description
I'm observing a somewhat similar situation to grafana/beyla#1228 in a test opentelemetry-demo deployment + Beyla.
Rough timeline:
- Beyla starts (privileged container)
- otel demo starts, including (amongst other things) a `valkey-cart` container
- initially, spans from this container are sent with `service.name == "otel-demo/Deployment/valkey-cart"`
- after quite a long time (I believe around 1-2 days) of both processes running uninterrupted, Beyla appeared to "lose" the tagging information, and spans were sent with `service.name == "valkey-cart"`
- this persisted until the otel-demo deployment was fully restarted.
Extra context is that this test deployment was a little "troubled" initially, so valkey-cart was crash-looping and was redeployed multiple times. In Beyla logs, I can see this situation (limited to valkey-cart PIDs for clarity):
- PID 786043 instrumented: Sep 16 at 11:54:34 <- stable valkey pid, tracked for 2 days
- PID 825135 ended: Sep 16 at 15:49:28 <- first "lingering" stuck pod dies
- PID 912010 ended: Sep 17 at 02:05:51
- PID 913843 ended: Sep 17 at 02:17:19
- PID 1000051 ended: Sep 17 at 11:02:35
- PID 786043 ended: Sep 18 at 12:54:28 <- full otel demo restart
Tracking this change, it appears that the "cutoff" happens at exactly 15:49 on Sep 16th, coinciding with the first "lingering" PID seen as ended:
From a (rather uneducated) reading of the code, this appears to be a suspect:
Reading previous pull requests, it appears that a similar bit of code was previously removed to fix a bug:
https://github.com/grafana/beyla/pull/1156/files
Perhaps reintroducing this behavior caused a regression?
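To make the suspected failure mode concrete, here is a minimal Go sketch. It is not Beyla's actual code: `workloadStore` and all of its methods are hypothetical, and it assumes (without confirmation) that the decorated name is cached per container name and dropped whenever any PID with that name ends, even if another PID still shares the entry.

```go
package main

import "fmt"

// workloadStore is a HYPOTHETICAL metadata cache used only for illustration.
// It maps a container name to its decorated "namespace/Kind/name" service name.
type workloadStore struct {
	decorated map[string]string
}

func newWorkloadStore() *workloadStore {
	return &workloadStore{decorated: map[string]string{}}
}

// processStarted records the decorated name when a PID is instrumented.
func (s *workloadStore) processStarted(pid int, container, decorated string) {
	s.decorated[container] = decorated
}

// processEnded models the suspected bug: the entry is deleted by container
// name without checking whether other live PIDs still rely on it.
func (s *workloadStore) processEnded(pid int, container string) {
	delete(s.decorated, container)
}

// serviceName returns the decorated name when present and otherwise falls
// back to the bare container name.
func (s *workloadStore) serviceName(container string) string {
	if d, ok := s.decorated[container]; ok {
		return d
	}
	return container
}

func main() {
	s := newWorkloadStore()
	// Stable PID and one "lingering" PID share the same container name.
	s.processStarted(786043, "valkey-cart", "otel-demo/Deployment/valkey-cart")
	s.processStarted(825135, "valkey-cart", "otel-demo/Deployment/valkey-cart")
	fmt.Println(s.serviceName("valkey-cart")) // otel-demo/Deployment/valkey-cart

	// The lingering PID ends while the stable PID keeps running.
	s.processEnded(825135, "valkey-cart")
	fmt.Println(s.serviceName("valkey-cart")) // valkey-cart
}
```

If something along these lines is happening, it would match the timeline above: the name flips at the moment the first lingering PID ends and stays wrong until the stable PID is restarted and re-registered.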
Please let me know if I can provide any more information that would be helpful. Sudden service name changes are quite detrimental to building e.g. meaningful alerting on top of them without workarounds. 🙏