K8s namespace and deployment type information lost when previous pod exits #656

@paweljw

I'm observing a somewhat similar situation to grafana/beyla#1228 in a test opentelemetry-demo deployment + Beyla.

Rough timeline:

  • Beyla starts (privileged container)
  • otel demo starts, including, among other things, a valkey-cart container
  • initially, spans from this container are sent with service.name == "otel-demo/Deployment/valkey-cart"
  • after quite a long time (I believe around 1-2 days) of both processes running uninterrupted, Beyla appeared to "lose" the tagging information, and spans were sent with service.name == "valkey-cart"
  • this persisted until the otel-demo deployment was fully restarted.

Extra context: this test deployment was a little "troubled" initially, so valkey-cart was crash-looping and was redeployed multiple times. In the Beyla logs, I can see the following sequence (limited to valkey-cart PIDs for clarity):

  • PID 786043 instrumented: Sep 16 at 11:54:34 <- stable valkey pid, tracked for 2 days
  • PID 825135 ended: Sep 16 at 15:49:28 <- first "lingering" stuck pod dies
  • PID 912010 ended: Sep 17 at 02:05:51
  • PID 913843 ended: Sep 17 at 02:17:19
  • PID 1000051 ended: Sep 17 at 11:02:35
  • PID 786043 ended: Sep 18 at 12:54:28 <- full otel demo restart

Tracking this change, it appears that the "cutoff" happens at exactly 15:49 on Sep 16th, coinciding with the first "lingering" PID seen as ended:

Image

From a (rather uneducated) reading of the code, this looks like a likely suspect:

https://github.com/grafana/opentelemetry-ebpf-instrumentation/blob/df9d1603b2bf061d059e2c213025cfdfde92fd23/pkg/components/kube/store.go#L229-L239

Reading previous pull requests, it appears that a similar bit of code was previously removed to fix a bug:

https://github.com/grafana/beyla/pull/1156/files

Perhaps this behavior was reintroduced, which would make this a regression?
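
To make the suspected failure mode concrete, here is a minimal sketch in Go. It is not the actual store.go code: every type and function name below (`podInfo`, `naiveStore`, `refCountedStore`, `simulate`) is hypothetical, and it assumes the metadata is indexed by a key that several pods of the same workload share. Under that assumption, deleting the entry when any one pod exits would strip the namespace and owner kind from a pod that is still running, which matches the symptom above. The reference-counted variant is one possible mitigation, shown only for comparison.

```go
package main

import "fmt"

// podInfo is a hypothetical stand-in for the Kubernetes metadata of a pod.
type podInfo struct {
	Namespace string
	OwnerKind string
	OwnerName string
}

// naiveStore indexes metadata by a key shared between pods (assumed here
// to be the owner name). Deleting on pod exit drops the entry even if
// another pod with the same key is still alive.
type naiveStore struct {
	byOwner map[string]podInfo
}

func (s *naiveStore) addPod(owner string, info podInfo) { s.byOwner[owner] = info }
func (s *naiveStore) deletePod(owner string)            { delete(s.byOwner, owner) }

// serviceName returns "namespace/kind/name" when metadata is available
// and falls back to the bare name otherwise.
func (s *naiveStore) serviceName(owner string) string {
	if info, ok := s.byOwner[owner]; ok {
		return info.Namespace + "/" + info.OwnerKind + "/" + info.OwnerName
	}
	return owner
}

// refCountedStore keeps the entry until the last pod referencing it exits.
type refCountedStore struct {
	naiveStore
	refs map[string]int
}

func (s *refCountedStore) addPod(owner string, info podInfo) {
	s.byOwner[owner] = info
	s.refs[owner]++
}

func (s *refCountedStore) deletePod(owner string) {
	s.refs[owner]--
	if s.refs[owner] <= 0 {
		delete(s.byOwner, owner)
		delete(s.refs, owner)
	}
}

// simulate replays the timeline: a stable pod and a lingering crash-looped
// pod share a key, then the lingering pod exits.
func simulate() (naive, counted string) {
	info := podInfo{Namespace: "otel-demo", OwnerKind: "Deployment", OwnerName: "valkey-cart"}

	n := &naiveStore{byOwner: map[string]podInfo{}}
	n.addPod("valkey-cart", info) // stable pod
	n.addPod("valkey-cart", info) // lingering pod
	n.deletePod("valkey-cart")    // lingering pod exits

	r := &refCountedStore{
		naiveStore: naiveStore{byOwner: map[string]podInfo{}},
		refs:       map[string]int{},
	}
	r.addPod("valkey-cart", info)
	r.addPod("valkey-cart", info)
	r.deletePod("valkey-cart")

	return n.serviceName("valkey-cart"), r.serviceName("valkey-cart")
}

func main() {
	naive, counted := simulate()
	fmt.Println("naive store:      ", naive)
	fmt.Println("ref-counted store:", counted)
}
```

In this sketch the naive store falls back to the bare `valkey-cart` name once the lingering pod is removed, while the reference-counted store keeps reporting `otel-demo/Deployment/valkey-cart` for the pod that is still running. Whether the real store behaves like either variant is exactly the open question of this issue.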

Please let me know if I can provide any more information that would be helpful. Sudden service name changes are quite detrimental to building e.g. meaningful alerting on top of them without workarounds. 🙏
