Skip to content

add OTEL tracing + make targets + scripts#139

Open
djs55 wants to merge 3 commits intocontainerd:mainfrom
djs55:djs55/otel-tracing
Open

add OTEL tracing + make targets + scripts#139
djs55 wants to merge 3 commits intocontainerd:mainfrom
djs55:djs55/otel-tracing

Conversation

@djs55
Copy link
Copy Markdown
Contributor

@djs55 djs55 commented Mar 26, 2026

Help us understand (and later optimise) VM/container start time.

If you run this:

make test-tracing
make jaeger-open

It produces this:

Screenshot 2026-03-26 at 16 24 54

The biggest thing on the critical path is the VM start + initial connection, which needs breaking down more (future investigation)

We can also share Mermaid diagrams which can be rendered on GitHub by running:

make jaeger-gantt

which produces this:

gantt
    dateFormat x
    axisFormat %s.%L s
    title Trace for nerdbox-vm

    section VM.NewInstance+Start
    nerdbox/VM.NewInstance+Start :VM_NewInstance+Start_9f6fda, 0, 83
    . nerdbox/libkrun.VMStart :libkrun_VMStart_90885a, 5, 82
    . nerdbox/libkrun.WaitForTTRPC :libkrun_WaitForTTRPC_10af7c, 5, 82
    . nerdbox/TTRPC.Task.Create :TTRPC_Task_Create_5e832c, 83, 85
    . . nerdbox/containerd.task.v3.Task/Create :containerd_task_v3_Task_Create_3a4175, 84, 85
    . . . containerd.task.v3.Task/Create :containerd_task_v3_Task_Create_61f5b3, 83, 85
    . . . . task.Create :task_Create_a0b789, 84, 85
    . . . . . runc.NewContainer :runc_NewContainer_f4a190, 84, 85
    . . . . . . crun.create :crun_create_beaf4f, 84, 85
    . nerdbox/TTRPC.Task.Start :TTRPC_Task_Start_c8de57, 85, 87
    . . nerdbox/containerd.task.v3.Task/Start :containerd_task_v3_Task_Start_f0139b, 85, 87
    . . . containerd.task.v3.Task/Start :containerd_task_v3_Task_Start_1d5297, 85, 87
    . . . . task.Start :task_Start_84961b, 85, 87
    . . . . . . container.Start :container_Start_5372d6, 85, 87
    . . . . . . . crun.start :crun_start_ce160c, 85, 87
Loading

The first patch adds the infra and tracing for the host part. This gets most of the benefit IMHO.

The second patch extends the tracing to the VM: this is a bit complicated because I didn't want to assume there was a network, so the spans are relayed back to the host and then forwarded to the collector.

The clock is interesting because the RTC only has 1s granularity, so to correct we send the host timestamp to the VM before the tracing.

Copilot AI review requested due to automatic review settings March 26, 2026 16:32
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds OpenTelemetry (OTel) tracing across the VM/container startup path, including relaying VM-originated spans back to the host, plus developer tooling (Jaeger + Mermaid Gantt) and an integration test to visualize traces.

Changes:

  • Introduce a VM-side TTRPC traces service + channel-backed OTel span exporter to stream spans out of the VM.
  • Instrument key host/shim + VM code paths with spans and propagate context over TTRPC.
  • Add tracing integration test, Jaeger/Makefile helpers, Mermaid conversion script, and tracing documentation.

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
plugins/vminit/tracing/plugin.go Registers VM tracing plugin and installs OTel tracer provider/exporter.
internal/vminit/tracing/service.go TTRPC streaming service that forwards exported spans to the host.
internal/vminit/tracing/exporter.go Channel-backed OTel SpanExporter implementation for VM span buffering.
internal/vminit/task/service.go Adds VM-side spans around Create/Start and networking wait/connect.
internal/vminit/runc/container.go Adds VM-side spans around container creation and mount setup.
internal/vminit/process/init.go Adds VM-side spans around crun create/start operations.
internal/vminit/ctrnetworking/ctrnetworking.go Adds VM-side spans per-network setup within errgroup goroutines.
internal/vm/libkrun/instance.go Adds host-side spans for VMStart/WaitForTTRPC + TTRPC client interceptor.
internal/shim/task/tracing.go Implements host-side relay of VM span stream to an OTLP HTTP exporter.
internal/shim/task/service.go Adds shim spans and starts VM trace relay during Create.
integration/tracing_test.go Build-tagged integration test that exports spans to Jaeger and relays VM spans.
hack/trace-to-mermaid.py Fetches latest Jaeger trace and renders it as a Mermaid Gantt chart.
docs/tracing.md Documentation for running tracing workflow with Jaeger + Mermaid.
cmd/vminitd/main.go Ensures tracing plugin is loaded by vminitd.
api/proto/nerdbox/services/traces/v1/traces.proto New Traces service + Span/KeyValue messages for VM→host span streaming.
api/services/traces/v1/traces.pb.go Generated protobuf types for traces service.
api/services/traces/v1/traces_ttrpc.pb.go Generated TTRPC bindings for traces service.
api/services/traces/v1/doc.go Package doc for traces API.
Makefile Adds test-tracing, Jaeger lifecycle, UI open, and Gantt generation targets.
go.mod Adds OTel SDK + OTLP HTTP exporter dependencies.
go.sum Records new module checksums for added dependencies.
Comments suppressed due to low confidence (1)

internal/vm/libkrun/instance.go:337

  • This wait loop has multiple return paths before the explicit ttrpcWaitSpan.End() call, so the WaitForTTRPC span may never be ended/exported. Use a defer ttrpcWaitSpan.End() right after starting the span, or restructure to end the span before each early return.
	_, ttrpcWaitSpan := tracer.Start(ctx, "libkrun.WaitForTTRPC")
	startedAt := time.Now()
	for {
		select {
		case err := <-errC:
			if err != nil {
				return fmt.Errorf("failure running vm: %w", err)
			}
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Millisecond):

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@djs55 djs55 force-pushed the djs55/otel-tracing branch from a7fd67c to 830e59d Compare March 26, 2026 16:38
Copilot AI review requested due to automatic review settings March 26, 2026 16:43
@djs55 djs55 force-pushed the djs55/otel-tracing branch from 830e59d to ca1697e Compare March 26, 2026 16:43
@djs55 djs55 force-pushed the djs55/otel-tracing branch from ca1697e to f06a31f Compare March 26, 2026 16:51
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

internal/vm/libkrun/instance.go:341

  • Spans started for VMStart/WaitForTTRPC are only ended on the success path. If Start() returns early (errC error, ctx.Done(), or vmStartTimeout), these spans never end, which produces incomplete traces and can leak span state. Consider deferring span.End() immediately after each tracer.Start (or using a single parent span with defer) so all exit paths close the spans.
	tracer := otel.Tracer("nerdbox")
	_, krunStartSpan := tracer.Start(ctx, "libkrun.VMStart")
	errC := make(chan error)
	go func() {
		defer close(errC)
		if err := v.vmc.Start(); err != nil {
			errC <- err
		}
	}()

	v.shutdownCallbacks = []func(context.Context) error{
		func(context.Context) error {
			cerr := v.vmc.Shutdown()
			select {
			case err := <-errC:
				if err != nil {
					return fmt.Errorf("failure running vm: %w", err)
				}
			default:
			}
			return cerr
		},
	}

	var conn net.Conn
	// Initial TTRPC ping deadline. On Windows, the vsock listen-mode proxy
	// has more overhead (host→Unix socket→vsock→guest→vsock→Unix socket→host)
	// so we start with a longer deadline.
	d := 2 * time.Millisecond
	if runtime.GOOS == "windows" {
		d = 500 * time.Millisecond
	}
	_, ttrpcWaitSpan := tracer.Start(ctx, "libkrun.WaitForTTRPC")
	startedAt := time.Now()
	for {
		select {
		case err := <-errC:
			if err != nil {
				return fmt.Errorf("failure running vm: %w", err)
			}
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Millisecond):
		}
		if time.Since(startedAt) > vmStartTimeout {
			log.G(ctx).WithField("timeout", vmStartTimeout).Warn("Timeout while waiting for VM to start")
			return fmt.Errorf("VM did not start within %s", vmStartTimeout)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@djs55 djs55 force-pushed the djs55/otel-tracing branch from f06a31f to b45ef30 Compare March 26, 2026 16:55
Copilot AI review requested due to automatic review settings March 26, 2026 17:04
@djs55 djs55 force-pushed the djs55/otel-tracing branch from b45ef30 to d5c5c51 Compare March 26, 2026 17:04
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@djs55 djs55 force-pushed the djs55/otel-tracing branch from d5c5c51 to f70b6f4 Compare March 26, 2026 17:22
Copilot AI review requested due to automatic review settings March 26, 2026 18:47
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@djs55 djs55 force-pushed the djs55/otel-tracing branch from 7087d9a to 415b643 Compare March 26, 2026 19:06
Copilot AI review requested due to automatic review settings March 26, 2026 19:15
@djs55 djs55 force-pushed the djs55/otel-tracing branch from 415b643 to 378cb97 Compare March 26, 2026 19:15
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

internal/vm/libkrun/instance.go:304

  • errC is unbuffered and the VM start goroutine does a blocking errC <- err. If Start returns early (e.g., ctx.Done() or timeout) before receiving from errC, that goroutine can deadlock forever on send (and never reach close(errC)), leaking the goroutine and potentially hanging shutdown logic that expects the channel to close. Make errC buffered (size 1) and/or make the send non-blocking so errors can't block when the caller has already returned.
	errC := make(chan error)
	go func() {
		defer close(errC)
		if err := v.vmc.Start(); err != nil {
			errC <- err
		}
	}()

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@djs55 djs55 marked this pull request as ready for review March 26, 2026 21:47
@djs55 djs55 force-pushed the djs55/otel-tracing branch from 378cb97 to e82db3d Compare April 1, 2026 10:45
Copilot AI review requested due to automatic review settings April 1, 2026 15:01
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 16 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@djs55 djs55 force-pushed the djs55/otel-tracing branch from ffb3c23 to 99e7a9f Compare April 3, 2026 14:03
Copilot AI review requested due to automatic review settings April 3, 2026 14:17
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

djs55 added 3 commits April 3, 2026 15:29
Add OpenTelemetry spans to the host-side VM startup path (libkrun.VMStart,
libkrun.WaitForTTRPC) with developer tooling: Jaeger Makefile targets,
a Mermaid Gantt chart generator, tracing integration test, and docs.

Signed-off-by: David Scott <dave@recoil.org>
Add a ttrpc Traces streaming service inside the VM that exports OTel spans
over vsock to the host shim. The shim relays them to the OTLP collector
with clock-skew correction. Includes VM-side spans for container create,
start, mount setup, and networking, plus the proto definitions and
generated code.

Signed-off-by: David Scott <dave@recoil.org>
Register a host-side OTel tracing plugin in the shim that configures
a TracerProvider with an OTLP HTTP exporter, enabling end-to-end trace
context propagation across the shim's ttrpc calls.

Signed-off-by: David Scott <dave@recoil.org>
@djs55 djs55 force-pushed the djs55/otel-tracing branch from 9c9950f to 82c8526 Compare April 3, 2026 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants