OpenCensus tracing support added to Kubelet #1
Monkeyanator wants to merge 2 commits into master-freeze
Conversation
dashpole left a comment
A couple of nits, and a question about how to deal with spans that don't actually "do" anything.
pkg/kubelet/kubelet.go
Outdated
Use kl.nodeName here instead of getting it from the OS.
pkg/kubelet/kubelet.go
Outdated
nit (meaning a small change that doesn't block LGTM): no spaces between error-generating functions and their error handling.
pkg/kubelet/kubelet.go
Outdated
Does this belong inside or outside the if updateType == kubetypes.SyncPodKill { statement? I.e., is it permissible to skip a child span entirely if it isn't a kill pod operation?
If my pod doesn't use any volumes, will I still get a tiny span that says "Mounting Volumes"? That would be confusing to me if I didn't know how this was implemented.
That's a great point, definitely would be confusing to get a tiny irrelevant span. My thought process was that there might be some logic associated with deciding whether various processes should be carried out (determining whether volume needs mounting, determining whether some pod should be killed) that might be useful to trace, but probably not in this case.
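One way to avoid the tiny, empty span discussed above is to start a span only on the branch that actually does the work. The following is a minimal sketch under stated assumptions: `recorder`, `startSpan`, the simplified `syncPod` signature, and the span name `kubernetes.mountvolumes` are hypothetical stand-ins written against the standard library only, not the Kubelet's code or the OpenCensus API.

```go
package main

import "fmt"

// recorder is a hypothetical stand-in for a trace exporter: it only
// remembers the names of the spans that were started.
type recorder struct{ names []string }

// startSpan records a span name and returns the function that ends it.
func (r *recorder) startSpan(name string) (end func()) {
	r.names = append(r.names, name)
	return func() {}
}

// syncPod is a simplified, hypothetical sync routine. Each span is
// started inside the branch that performs the work, so a skipped
// branch produces no span at all.
func syncPod(r *recorder, killPod bool, volumes []string) {
	if killPod {
		end := r.startSpan("kubernetes.podkill")
		// ... kill the pod here ...
		end()
		return
	}
	if len(volumes) > 0 {
		end := r.startSpan("kubernetes.mountvolumes")
		// ... mount the volumes here ...
		end()
	}
}

func main() {
	r := &recorder{}
	syncPod(r, false, nil)              // no volumes: nothing is traced
	syncPod(r, false, []string{"data"}) // one volume: one span
	fmt.Println(r.names)
}
```

The trade-off raised in the reply still applies: if the decision logic itself is worth measuring, the span would have to start before the condition is evaluated.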
pkg/kubelet/kubelet.go
Outdated
nit: generally, only modify lines where you are making code changes. Also, compare diffs yourself before sending them as a PR.
pkg/kubelet/kubelet.go
Outdated
same question about span being inside the if statement
pkg/kubelet/kubelet.go
Outdated
Instead of producing a new context, can we overwrite the ctx variable each time? That would make the code less brittle and make it easier to add new spans if we need to in the future.
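A sketch of what overwriting a single ctx variable could look like. It uses a stdlib-only stand-in for the tracer: `span`, `startSpan`, `chainedParents`, and the span names are hypothetical, and `startSpan` only mimics the shape of `trace.StartSpan(ctx, name)`. One consequence worth keeping in mind is that each new span is created from the context returned by the previous call, so it is recorded as a child of the previous span.

```go
package main

import (
	"context"
	"fmt"
)

type spanKey struct{}

// span is a hypothetical stand-in for an OpenCensus span.
type span struct{ name, parent string }

// End is a no-op here; a real span would record its end time.
func (s *span) End() {}

// startSpan mimics the shape of trace.StartSpan(ctx, name): the new
// span becomes a child of whatever span ctx already carries.
func startSpan(ctx context.Context, name string) (context.Context, *span) {
	s := &span{name: name}
	if p, ok := ctx.Value(spanKey{}).(*span); ok {
		s.parent = p.name
	}
	return context.WithValue(ctx, spanKey{}, s), s
}

// syncPod reuses one ctx variable, so adding a step never requires
// renaming ctx2, ctx3, and so on.
func syncPod(ctx context.Context) []*span {
	ctx, netSpan := startSpan(ctx, "kubernetes.networkcheck")
	netSpan.End()

	ctx, cgroupSpan := startSpan(ctx, "kubernetes.cgroups")
	cgroupSpan.End()

	ctx, dirSpan := startSpan(ctx, "kubernetes.poddatadirs")
	dirSpan.End()

	return []*span{netSpan, cgroupSpan, dirSpan}
}

// chainedParents returns the parent name recorded for each span.
func chainedParents() []string {
	var parents []string
	for _, s := range syncPod(context.Background()) {
		parents = append(parents, s.parent)
	}
	return parents
}

func main() {
	fmt.Printf("%q\n", chainedParents())
}
```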
minor note, but it looks like you didn't regenerate the dependency Licenses. Should be some script in
pkg/kubelet/kubelet.go
Outdated
This is just tracing the creation of the kubernetes event, which probably isn't useful. I would probably remove this altogether, or put the tracing within the networkErrors() function.
or rather, it doesn't need to be in the function, but around it here.
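To illustrate the difference between tracing the event creation and tracing the check itself, here is a small sketch in which the span brackets only the call being measured and closes before the event is recorded. Everything in it is a hypothetical stand-in: `tracer`, `startSpan`, this `networkErrors` signature, `checkNetwork`, and the span name are assumptions for illustration and are not taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

var errNetworkNotReady = errors.New("network is not ready")

// tracer is a hypothetical stand-in that logs span boundaries and work
// in the order they happen.
type tracer struct{ log []string }

// startSpan logs the start of a span and returns the function that ends it.
func (t *tracer) startSpan(name string) (end func()) {
	t.log = append(t.log, "start "+name)
	return func() { t.log = append(t.log, "end "+name) }
}

// networkErrors is a hypothetical stand-in for the network check.
func networkErrors(t *tracer, ready bool) error {
	t.log = append(t.log, "check network")
	if !ready {
		return errNetworkNotReady
	}
	return nil
}

// checkNetwork wraps only the check in a span. The span is ended on
// both the success and the failure path, and before the event is
// recorded, so its duration reflects the check and nothing else.
func checkNetwork(t *tracer, ready bool) error {
	end := t.startSpan("kubernetes.networkcheck")
	err := networkErrors(t, ready)
	end()
	if err != nil {
		t.log = append(t.log, "record event")
		return fmt.Errorf("sync aborted: %w", err)
	}
	return nil
}

func main() {
	t := &tracer{}
	err := checkNetwork(t, false)
	fmt.Println(t.log, err)
}
```

Ending the span before the early return also avoids leaving a span open when the function exits on the error path.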
pkg/kubelet/kubelet.go
Outdated
does this still work if we don't overwrite the ctx? Previously, you passed the context returned from the StartSpan for kill pod to the next StartSpan call.
In this case, since we return no matter what at the end of this block, I believe we can just use the existing context to begin the span and don't need to redefine ctx.
pkg/kubelet/kubelet.go
Outdated
again, this is just tracing the creation of the event. It needs to be around the makePodDataDirs function.
OK, this LGTM. Can you squash your commits down to two? I.e., one with all of the godep and license changes, and one with the code changes. For documentation purposes (and so you can link it for people who ask), adding a screenshot or two of what these traces look like in Stackdriver would also be helpful.
to reflect these dependencies
…traces into Stackdriver backend
15939e8 to 6ffa27d
// if we want to kill a pod, do it now!
if updateType == kubetypes.SyncPodKill {
	_, podKillSpan := trace.StartSpan(ctx, "Pod kill logic")
Don't use regular sentences as span names.
A span name should be a statistically interesting identifier that represents a task.
For example,
_, podKillSpan := trace.StartSpan(ctx, "kubernetes.podkill")
would be a better fit. The comment applies to all the spans.
This is the right way to construct a span name.
	kl.recorder.Eventf(pod, v1.EventTypeWarning, events.NetworkNotReady, "%s: %v", NetworkNotReadyErrorMsg, rs)
	return fmt.Errorf("%s: %v", NetworkNotReadyErrorMsg, rs)
}
networkConditionSpan.End()
You probably also need to revert ctx to its previous value here; otherwise the next span, cgroupSpan, will be a child of networkConditionSpan. The same applies everywhere this pattern occurs.
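A sketch of one way to keep consecutive spans as siblings: leave the parent ctx untouched and discard (or separately name) the context returned for each child. The tracer below is a stdlib-only stand-in; `span`, `startSpan`, `siblingParents`, and the span names are hypothetical, and `startSpan` only mimics the shape of `trace.StartSpan(ctx, name)`.

```go
package main

import (
	"context"
	"fmt"
)

type spanKey struct{}

// span is a hypothetical stand-in for an OpenCensus span.
type span struct{ name, parent string }

// End is a no-op here; a real span would record its end time.
func (s *span) End() {}

// startSpan mimics the shape of trace.StartSpan(ctx, name): the new
// span becomes a child of whatever span ctx already carries.
func startSpan(ctx context.Context, name string) (context.Context, *span) {
	s := &span{name: name}
	if p, ok := ctx.Value(spanKey{}).(*span); ok {
		s.parent = p.name
	}
	return context.WithValue(ctx, spanKey{}, s), s
}

// syncPod starts one root span and keeps ctx pointing at it. Each
// child's context is discarded, so the next child is started from the
// root context again and becomes a sibling rather than a grandchild.
func syncPod(ctx context.Context) []*span {
	ctx, root := startSpan(ctx, "kubernetes.syncpod")
	defer root.End()

	_, netSpan := startSpan(ctx, "kubernetes.networkcheck")
	netSpan.End()

	_, cgroupSpan := startSpan(ctx, "kubernetes.cgroups")
	cgroupSpan.End()

	return []*span{netSpan, cgroupSpan}
}

// siblingParents returns the parent name recorded for each child span.
func siblingParents() []string {
	var parents []string
	for _, s := range syncPod(context.Background()) {
		parents = append(parents, s.parent)
	}
	return parents
}

func main() {
	fmt.Printf("%q\n", siblingParents())
}
```

If the traced work needs the child span in its context, the returned context can be kept in its own variable (for example a hypothetical `netCtx`) and passed only to that call.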
What this PR does / why we need it:
This PR introduces a proof of concept implementation for tracing internal components of Kubernetes with OpenCensus. This small sample constructs a somewhat arbitrary trace from the syncPod routine of the Kubelet, tags the trace with the host being run on, and annotates various events within the syncPod routine.
While the implementation in the PR is limited to Stackdriver, the use of OpenCensus makes it trivial to port these traces to Zipkin, Jaeger, or various other supported backends.
The traces for the Kubelet, in their current state, look as follows:
These traces can be queried per node or per latency percentile. Stackdriver provides the following view, which could prove useful in diagnosing common causes of high latency:
Which issue(s) this PR fixes
There have been various issues related to the need for tracing within Kubernetes components:
kubernetes#26507
kubernetes#8806
kubernetes#815
Special notes for your reviewer:
Had to make version changes to protobuf and oauth to make the newly added dependencies work; this PR is more of a sample than something to be merged at the moment.