add OTEL tracing + make targets + scripts#139
Conversation
There was a problem hiding this comment.
Pull request overview
Adds OpenTelemetry (OTel) tracing across the VM/container startup path, including relaying VM-originated spans back to the host, plus developer tooling (Jaeger + Mermaid Gantt) and an integration test to visualize traces.
Changes:
- Introduce a VM-side TTRPC traces service + channel-backed OTel span exporter to stream spans out of the VM.
- Instrument key host/shim + VM code paths with spans and propagate context over TTRPC.
- Add tracing integration test, Jaeger/Makefile helpers, Mermaid conversion script, and tracing documentation.
Reviewed changes
Copilot reviewed 20 out of 21 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| plugins/vminit/tracing/plugin.go | Registers VM tracing plugin and installs OTel tracer provider/exporter. |
| internal/vminit/tracing/service.go | TTRPC streaming service that forwards exported spans to the host. |
| internal/vminit/tracing/exporter.go | Channel-backed OTel SpanExporter implementation for VM span buffering. |
| internal/vminit/task/service.go | Adds VM-side spans around Create/Start and networking wait/connect. |
| internal/vminit/runc/container.go | Adds VM-side spans around container creation and mount setup. |
| internal/vminit/process/init.go | Adds VM-side spans around crun create/start operations. |
| internal/vminit/ctrnetworking/ctrnetworking.go | Adds VM-side spans per-network setup within errgroup goroutines. |
| internal/vm/libkrun/instance.go | Adds host-side spans for VMStart/WaitForTTRPC + TTRPC client interceptor. |
| internal/shim/task/tracing.go | Implements host-side relay of VM span stream to an OTLP HTTP exporter. |
| internal/shim/task/service.go | Adds shim spans and starts VM trace relay during Create. |
| integration/tracing_test.go | Build-tagged integration test that exports spans to Jaeger and relays VM spans. |
| hack/trace-to-mermaid.py | Fetches latest Jaeger trace and renders it as a Mermaid Gantt chart. |
| docs/tracing.md | Documentation for running tracing workflow with Jaeger + Mermaid. |
| cmd/vminitd/main.go | Ensures tracing plugin is loaded by vminitd. |
| api/proto/nerdbox/services/traces/v1/traces.proto | New Traces service + Span/KeyValue messages for VM→host span streaming. |
| api/services/traces/v1/traces.pb.go | Generated protobuf types for traces service. |
| api/services/traces/v1/traces_ttrpc.pb.go | Generated TTRPC bindings for traces service. |
| api/services/traces/v1/doc.go | Package doc for traces API. |
| Makefile | Adds test-tracing, Jaeger lifecycle, UI open, and Gantt generation targets. |
| go.mod | Adds OTel SDK + OTLP HTTP exporter dependencies. |
| go.sum | Records new module checksums for added dependencies. |
Comments suppressed due to low confidence (1)
internal/vm/libkrun/instance.go:337
- This wait loop has multiple
returnpaths before the explicitttrpcWaitSpan.End()call, so the WaitForTTRPC span may never be ended/exported. Use adefer ttrpcWaitSpan.End()right after starting the span, or restructure to end the span before each early return.
_, ttrpcWaitSpan := tracer.Start(ctx, "libkrun.WaitForTTRPC")
startedAt := time.Now()
for {
select {
case err := <-errC:
if err != nil {
return fmt.Errorf("failure running vm: %w", err)
}
case <-ctx.Done():
return ctx.Err()
case <-time.After(time.Millisecond):
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
a7fd67c to
830e59d
Compare
830e59d to
ca1697e
Compare
ca1697e to
f06a31f
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 21 out of 22 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
internal/vm/libkrun/instance.go:341
- Spans started for VMStart/WaitForTTRPC are only ended on the success path. If Start() returns early (errC error, ctx.Done(), or vmStartTimeout), these spans never end, which produces incomplete traces and can leak span state. Consider deferring span.End() immediately after each tracer.Start (or using a single parent span with defer) so all exit paths close the spans.
tracer := otel.Tracer("nerdbox")
_, krunStartSpan := tracer.Start(ctx, "libkrun.VMStart")
errC := make(chan error)
go func() {
defer close(errC)
if err := v.vmc.Start(); err != nil {
errC <- err
}
}()
v.shutdownCallbacks = []func(context.Context) error{
func(context.Context) error {
cerr := v.vmc.Shutdown()
select {
case err := <-errC:
if err != nil {
return fmt.Errorf("failure running vm: %w", err)
}
default:
}
return cerr
},
}
var conn net.Conn
// Initial TTRPC ping deadline. On Windows, the vsock listen-mode proxy
// has more overhead (host→Unix socket→vsock→guest→vsock→Unix socket→host)
// so we start with a longer deadline.
d := 2 * time.Millisecond
if runtime.GOOS == "windows" {
d = 500 * time.Millisecond
}
_, ttrpcWaitSpan := tracer.Start(ctx, "libkrun.WaitForTTRPC")
startedAt := time.Now()
for {
select {
case err := <-errC:
if err != nil {
return fmt.Errorf("failure running vm: %w", err)
}
case <-ctx.Done():
return ctx.Err()
case <-time.After(time.Millisecond):
}
if time.Since(startedAt) > vmStartTimeout {
log.G(ctx).WithField("timeout", vmStartTimeout).Warn("Timeout while waiting for VM to start")
return fmt.Errorf("VM did not start within %s", vmStartTimeout)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
f06a31f to
b45ef30
Compare
b45ef30 to
d5c5c51
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 21 out of 22 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
d5c5c51 to
f70b6f4
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 21 out of 22 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
7087d9a to
415b643
Compare
415b643 to
378cb97
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
internal/vm/libkrun/instance.go:304
errCis unbuffered and the VM start goroutine does a blockingerrC <- err. IfStartreturns early (e.g.,ctx.Done()or timeout) before receiving fromerrC, that goroutine can deadlock forever on send (and never reachclose(errC)), leaking the goroutine and potentially hanging shutdown logic that expects the channel to close. MakeerrCbuffered (size 1) and/or make the send non-blocking so errors can't block when the caller has already returned.
errC := make(chan error)
go func() {
defer close(errC)
if err := v.vmc.Start(); err != nil {
errC <- err
}
}()
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
378cb97 to
e82db3d
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 23 out of 24 changed files in this pull request and generated 16 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ffb3c23 to
99e7a9f
Compare
99e7a9f to
9c9950f
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 23 out of 24 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Add OpenTelemetry spans to the host-side VM startup path (libkrun.VMStart, libkrun.WaitForTTRPC) with developer tooling: Jaeger Makefile targets, a Mermaid Gantt chart generator, tracing integration test, and docs. Signed-off-by: David Scott <dave@recoil.org>
Add a ttrpc Traces streaming service inside the VM that exports OTel spans over vsock to the host shim. The shim relays them to the OTLP collector with clock-skew correction. Includes VM-side spans for container create, start, mount setup, and networking, plus the proto definitions and generated code. Signed-off-by: David Scott <dave@recoil.org>
Register a host-side OTel tracing plugin in the shim that configures a TracerProvider with an OTLP HTTP exporter, enabling end-to-end trace context propagation across the shim's ttrpc calls. Signed-off-by: David Scott <dave@recoil.org>
9c9950f to
82c8526
Compare
Help us understand (and later optimise) VM/container start time.
If you run this:
It produces this:
The biggest thing on the critical path is the VM start + initial connection, which needs breaking down more (future investigation)
We can also share Mermaid diagrams which can be rendered on GitHub by running:
which produces this:
gantt dateFormat x axisFormat %s.%L s title Trace for nerdbox-vm section VM.NewInstance+Start nerdbox/VM.NewInstance+Start :VM_NewInstance+Start_9f6fda, 0, 83 . nerdbox/libkrun.VMStart :libkrun_VMStart_90885a, 5, 82 . nerdbox/libkrun.WaitForTTRPC :libkrun_WaitForTTRPC_10af7c, 5, 82 . nerdbox/TTRPC.Task.Create :TTRPC_Task_Create_5e832c, 83, 85 . . nerdbox/containerd.task.v3.Task/Create :containerd_task_v3_Task_Create_3a4175, 84, 85 . . . containerd.task.v3.Task/Create :containerd_task_v3_Task_Create_61f5b3, 83, 85 . . . . task.Create :task_Create_a0b789, 84, 85 . . . . . runc.NewContainer :runc_NewContainer_f4a190, 84, 85 . . . . . . crun.create :crun_create_beaf4f, 84, 85 . nerdbox/TTRPC.Task.Start :TTRPC_Task_Start_c8de57, 85, 87 . . nerdbox/containerd.task.v3.Task/Start :containerd_task_v3_Task_Start_f0139b, 85, 87 . . . containerd.task.v3.Task/Start :containerd_task_v3_Task_Start_1d5297, 85, 87 . . . . task.Start :task_Start_84961b, 85, 87 . . . . . . container.Start :container_Start_5372d6, 85, 87 . . . . . . . crun.start :crun_start_ce160c, 85, 87The first patch adds the infra and tracing for the host part. This gets most of the benefit IMHO.
The second patch extends the tracing to the VM: this is a bit complicated because I didn't want to assume there was a network, so the spans are relayed back to the host and then forwarded to the collector.
The clock is interesting because the RTC only has 1s granularity, so to correct we send the host timestamp to the VM before the tracing.