This software is an Alpha preview only. The code may be discontinued, may include breaking changes, and may require code changes on your side to keep using it.
Provide a stable, extensible core abstraction (GenAI Types + TelemetryHandler + CompositeEmitter + Evaluator hooks) separating instrumentation capture from telemetry flavor emission so that:
- Instrumentation authors create neutral GenAI data objects once.
- Different telemetry flavors (semantic conventions, vendor enrichments, events vs attributes, aggregated evaluation results, cost / agent metrics) are produced by pluggable emitters without touching instrumentation code.
- Evaluations (LLM-as-a-judge, quality metrics) run asynchronously and re-emit results through the same handler/emitter pipeline.
- Third parties can add / replace / augment emitters in well-defined category chains.
- Configuration is primarily environment-variable driven; complexity is opt-in.
Non-goal: Replace the OpenTelemetry SDK pipeline. Emitters sit above the SDK using public Span / Metrics / Logs / Events APIs.
Implemented dataclasses (in types.py):
- `GenAI` (base class)
- `LLMInvocation`
- `EmbeddingInvocation`
- `Workflow`
- `AgentInvocation`
- `Step`
- `ToolCall`
- `EvaluationResult`
Base dataclass (`GenAI`): fields include timing (`start_time`, `end_time`), identity (`run_id`, `parent_run_id`), context (`provider`, `framework`, `agent_*`, `system`, `conversation_id`, `data_source_id`), plus `attributes: dict[str, Any]` for free-form metadata.
Semantic attributes: fields tagged with `metadata={"semconv": <attr name>}` feed `semantic_convention_attributes()`, which returns only populated values; emitters rely on this reflective approach (no hard-coded attribute lists).
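A minimal sketch of this reflective pattern follows; the field subset and the exact semconv keys shown are illustrative assumptions, not the literal types.py definitions:

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional

@dataclass
class LLMInvocation:
    # Illustrative subset of fields; the semconv attribute names are assumptions.
    request_model: Optional[str] = field(
        default=None, metadata={"semconv": "gen_ai.request.model"}
    )
    provider: Optional[str] = field(
        default=None, metadata={"semconv": "gen_ai.provider.name"}
    )
    attributes: dict[str, Any] = field(default_factory=dict)

    def semantic_convention_attributes(self) -> dict[str, Any]:
        # Reflect over dataclass fields and return only populated semconv values.
        return {
            f.metadata["semconv"]: getattr(self, f.name)
            for f in fields(self)
            if "semconv" in f.metadata and getattr(self, f.name) is not None
        }
```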
Messages: `InputMessage` / `OutputMessage` each hold `role` and `parts` (which may be `Text`, `ToolCall`, `ToolCallResponse`, or arbitrary parts). Output messages also include `finish_reason`.
`EvaluationResult` fields: `metric_name`, optional `score` (float), `label` (categorical outcome), `explanation`, `error` (contains type, message), and `attributes` (additional evaluator-specific key/values). No aggregate wrapper class yet.
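A hedged sketch of the shape those fields imply (the `Error` container for the type/message pair is an assumed name):

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class Error:
    # Assumed container for the error fields named above.
    type: str
    message: str

@dataclass
class EvaluationResult:
    metric_name: str
    score: Optional[float] = None      # numeric score, if the metric yields one
    label: Optional[str] = None        # categorical outcome (e.g. "pass")
    explanation: Optional[str] = None
    error: Optional[Error] = None
    attributes: dict[str, Any] = field(default_factory=dict)
```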
`TelemetryHandler` provides the external API for the GenAI types lifecycle.
Capabilities:
- Type-specific lifecycle: `start_llm`, `stop_llm`, `fail_llm`, plus `start`/`stop`/`fail` variants for embedding, tool call, workflow, agent, and step.
- Generic dispatchers: `start(obj)`, `finish(obj)`, `fail(obj, error)` (see the sketch after this list).
- Dynamic content capture refresh (`_refresh_capture_content`) on each LLM / agentic start (re-reads env + experimental gating).
- Delegation to `CompositeEmitter` (`on_start`, `on_end`, `on_error`, `on_evaluation_results`).
- Completion callback registry (`CompletionCallback`); the Evaluation Manager auto-registers if evaluators are present.
- Evaluation emission via `evaluation_results(invocation, list[EvaluationResult])`.
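As an illustration of the generic dispatchers, a minimal sketch (whether `fail` expects the raw exception or a wrapper type is an assumption here):

```python
from opentelemetry.util.genai.handler import get_telemetry_handler
from opentelemetry.util.genai.types import LLMInvocation

handler = get_telemetry_handler()
inv = LLMInvocation(request_model="gpt-4")
handler.start(inv)             # generic dispatch resolves to the LLM lifecycle
try:
    ...                        # call the model here
except Exception as err:
    handler.fail(inv, err)     # error path -> CompositeEmitter.on_error
    raise
else:
    handler.finish(inv)        # success path -> CompositeEmitter.on_end
```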
Invocation objects hold a span reference.
`EmitterProtocol` offers: `on_start(obj)`, `on_end(obj)`, `on_error(error, obj)`, `on_evaluation_results(results, obj=None)`.
`EmitterMeta` supplies `role`, `name`, an optional `override`, and a default `handles(obj)` returning `True`. Role names are informational and may not match category names (e.g., `MetricsEmitter.role == "metric"`).
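A third-party emitter only needs to satisfy this surface; a hypothetical, duck-typed sketch (the class name and the `output_tokens` attribute it reads are illustrative assumptions):

```python
from typing import Any

class TokenBudgetEmitter:
    # Matches EmitterProtocol / EmitterMeta by duck typing: role/name metadata,
    # a handles() predicate, and the four lifecycle hooks.
    role = "metric"
    name = "token_budget"

    def handles(self, obj: Any) -> bool:
        return True  # default behavior per EmitterMeta: accept everything

    def on_start(self, obj: Any) -> None:
        pass  # nothing to record at start

    def on_end(self, obj: Any) -> None:
        used = getattr(obj, "output_tokens", None)  # assumed field name
        if used is not None:
            print(f"tokens used: {used}")  # stand-in for a real metric record

    def on_error(self, error: Any, obj: Any) -> None:
        pass

    def on_evaluation_results(self, results: list, obj: Any = None) -> None:
        pass
```

At runtime such an emitter can be appended via `add_emitter("metrics", TokenBudgetEmitter())`, the only runtime mode currently supported (see the public API below).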
The `CompositeEmitter` defines ordered category dispatch with explicit sequences:
- Start order: `span`, `metrics`, `content_events`
- End/error order: `evaluation`, `metrics`, `content_events`, `span` (the span ends last so other emitters can enrich attributes first; evaluation emitters come first in the end sequence to allow flush behavior).
Public API (current): `iter_emitters(categories)`, `emitters_for(category)`, `add_emitter(category, emitter)`. A richer `register_emitter(..., position, mode)` API is not yet implemented.
Entry point group: `opentelemetry_util_genai_emitters` (vendor packages contribute specs).
`EmitterSpec` fields:
- `name`
- `category` (`span`, `metrics`, `content_events`, `evaluation`)
- `factory(context)`
- `mode` (`append`, `prepend`, `replace-category`, `replace-same-name`)
- `after`, `before` (ordering hints – currently unused / inert)
- `invocation_types` (allow-list; implemented via dynamic `handles` wrapping)
Ordering hints will either gain a resolver or be removed (open item).
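A vendor package would expose a factory through the entry point group above; a hedged sketch where the `EmitterSpec` import path, the `context` shape, and all vendor names are assumptions based on the fields just listed:

```python
# Hypothetical vendor module, registered in pyproject.toml under
# [project.entry-points.opentelemetry_util_genai_emitters].
from opentelemetry.util.genai.emitters.spec import EmitterSpec  # assumed path

class MyVendorMetricsEmitter:
    # Minimal stand-in; see the TokenBudgetEmitter sketch earlier for the
    # full protocol surface.
    role = "metric"
    name = "MyVendorMetrics"
    def handles(self, obj): return True
    def on_start(self, obj): pass
    def on_end(self, obj): pass
    def on_error(self, error, obj): pass
    def on_evaluation_results(self, results, obj=None): pass

def load_emitter_specs(context):
    # context carries shared construction state (its shape is an assumption).
    return [
        EmitterSpec(
            name="MyVendorMetrics",
            category="metrics",
            factory=lambda ctx: MyVendorMetricsEmitter(),
            mode="append",
            invocation_types=["LLMInvocation"],  # allow-list -> handles wrapping
        )
    ]
```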
Baseline selection: `OTEL_INSTRUMENTATION_GENAI_EMITTERS` (comma-separated tokens):
- `span` (default)
- `span_metric`
- `span_metric_event`
- Additional tokens add extra emitters (e.g. `traceloop_compat`). If the only token is `traceloop_compat`, the semconv span is suppressed (`only_traceloop_compat`).
Category overrides (`OTEL_INSTRUMENTATION_GENAI_EMITTERS_<CATEGORY>` with `<CATEGORY>` = `SPAN|METRICS|CONTENT_EVENTS|EVALUATION`) support directives: `append:`, `prepend:`, `replace:` (alias for `replace-category`), `replace-category:`, `replace-same-name:`.
Invocation-type filtering is implemented through `EmitterSpec.invocation_types`; the configuration layer replaces/augments each emitter's `handles` method so dispatch short-circuits cheaply. There is no explicit positional insertion API yet; runtime additions can call `add_emitter` (append only).
Supported modes: `append`, `prepend`, `replace-category` (alias `replace`), `replace-same-name`. Ordering hints (`after` / `before`) are present but inactive.
`CompositeEmitter` wraps all emitter calls; failures are debug-logged. An error metrics hook (`genai.emitter.errors`) is not yet implemented (planned enhancement).
The span emitter emits semantic attributes, optional input/output message content, system instructions, function definitions, token usage, and agent context. The finalization order ensures attributes are set before span closure.
The metrics emitter records durations and token usage to histograms: `gen_ai.client.operation.duration`, `gen_ai.client.token.usage`, plus agentic histograms (`gen_ai.workflow.duration`, `gen_ai.agent.duration`, `gen_ai.step.duration`). Its role string is `metric` (singular) and may diverge from the category name `metrics`.
The content events emitter emits one structured log record summarizing an entire LLM invocation (inputs, outputs, system instructions), a deliberate deviation from the earlier message-per-event concept to reduce event volume. Agent/workflow/step event emission is currently commented out (a future option).
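The rough shape of that single summary record, as implied by the captured fields (the key names follow current gen_ai semconv attribute names but are an assumption about this emitter's exact output):

```python
# Illustrative body of the one-per-invocation content event.
body = {
    "gen_ai.input.messages": [
        {"role": "user", "parts": [{"type": "text", "content": "Hello"}]},
    ],
    "gen_ai.output.messages": [
        {"role": "assistant", "parts": [{"type": "text", "content": "Hi!"}],
         "finish_reason": "stop"},
    ],
    "gen_ai.system_instructions": [],
}
```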
Always present:
- `EvaluationMetricsEmitter` – fixed histograms: `gen_ai.evaluation.relevance`, `gen_ai.evaluation.hallucination`, `gen_ai.evaluation.sentiment`, `gen_ai.evaluation.toxicity`, `gen_ai.evaluation.bias`. (Legacy dynamic `gen_ai.evaluation.score.<metric>` instruments removed.)
- `EvaluationEventsEmitter` – one event per `EvaluationResult`; optional legacy variant via `OTEL_GENAI_EVALUATION_EVENT_LEGACY`.
Aggregation flag affects batching only (emitters remain active either way).
Emitted attributes (core):
- `gen_ai.evaluation.name` – metric name
- `gen_ai.evaluation.score.value` – numeric score (events only; the histogram carries the values)
- `gen_ai.evaluation.score.label` – categorical label (pass/fail/neutral/etc.)
- `gen_ai.evaluation.score.units` – units of the numeric score (currently `score`)
- `gen_ai.evaluation.passed` – boolean derived when the label clearly indicates pass/fail (e.g. `pass`, `success`, `fail`); the numeric-only heuristic is currently disabled to prevent ambiguous semantics
- Agent/workflow identity: `gen_ai.agent.name`, `gen_ai.agent.id`, `gen_ai.workflow.id` when available.
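A minimal sketch of the label-to-`gen_ai.evaluation.passed` derivation described in the list above, assuming a simple keyword mapping (the exact keyword sets are assumptions):

```python
from typing import Optional

_PASS_LABELS = {"pass", "passed", "success"}   # assumed pass keywords
_FAIL_LABELS = {"fail", "failed", "error"}     # assumed fail keywords

def derive_passed(label: Optional[str]) -> Optional[bool]:
    # Only derive a boolean when the label is unambiguous; numeric-only
    # heuristics are deliberately not applied (see the note above).
    if label is None:
        return None
    normalized = label.strip().lower()
    if normalized in _PASS_LABELS:
        return True
    if normalized in _FAIL_LABELS:
        return False
    return None
```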
An example of a third-party emitter:
- Splunk evaluation aggregation / extra metrics (`opentelemetry-util-genai-emitters-splunk`).
| Variable | Purpose | Notes |
|---|---|---|
| `OTEL_INSTRUMENTATION_GENAI_EMITTERS` | Baseline + extras selection | Values: `span`, `span_metric`, `span_metric_event`, plus extras |
| `OTEL_INSTRUMENTATION_GENAI_EMITTERS_<CATEGORY>` | Category overrides | Directives: `append` / `prepend` / `replace` / `replace-category` / `replace-same-name` |
| `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT` | Enable/disable message capture | Truthy enables capture; default disabled |
| `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT_MODE` | `SPAN_ONLY`, `EVENT_ONLY`, `SPAN_AND_EVENT`, or `NONE` | Defaults to `SPAN_AND_EVENT` when capture is enabled |
| `OTEL_INSTRUMENTATION_GENAI_EVALS_EVALUATORS` | Evaluator config grammar | `Evaluator(Type(metric(opt=val)))` syntax supported |
| `OTEL_INSTRUMENTATION_GENAI_EVALS_RESULTS_AGGREGATION` | Aggregate vs per-evaluator emission | Boolean |
| `OTEL_INSTRUMENTATION_GENAI_EVALS_INTERVAL` | Eval worker poll interval | Default 5.0 seconds |
| `OTEL_INSTRUMENTATION_GENAI_EVALUATION_SAMPLE_RATE` | Trace-id ratio sampling | Float (0–1], default 1.0 |
| `OTEL_GENAI_EVALUATION_EVENT_LEGACY` | Emit legacy evaluation event shape | Adds a second event per result |
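For example, the evaluator grammar row admits values like the following (the evaluator, type, and metric names are taken from the quick-start later in this document; option names inside a metric, such as a threshold, are illustrative):

```python
import os

# One evaluator (Deepeval) applied to LLM invocations with two metrics;
# options would go in parentheses after a metric, e.g. "bias(threshold=0.5)".
os.environ["OTEL_INSTRUMENTATION_GENAI_EVALS_EVALUATORS"] = (
    "Deepeval(LLMInvocation(bias,toxicity))"
)
```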
- Parse baseline & extras.
- Register built-ins (span/metrics/content/evaluation).
- Load entry point emitter specs & register.
- Apply category overrides.
- Instantiate `CompositeEmitter` with the resolved category lists.
`EmitterSpec.invocation_types` drives a dynamic `handles` wrapper (fast pre-dispatch predicate). Evaluation emitters see results independently of invocation-type filtering.
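A plausible form of that wrapper (a sketch; the real configuration-layer code may differ):

```python
def wrap_handles(emitter, invocation_types):
    """Wrap emitter.handles so dispatch short-circuits on invocation type."""
    allowed = set(invocation_types)
    original = emitter.handles

    def handles(obj) -> bool:
        # Cheap type-name check first, then defer to the emitter's own logic.
        return type(obj).__name__ in allowed and original(obj)

    emitter.handles = handles
    return emitter
```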
Note: evaluators require `opentelemetry-util-genai-evals` to be installed; it registers itself as a completion callback.
Evaluator package entry point groups:
- `opentelemetry_util_genai_completion_callbacks` (completion callback plug-ins; the evaluation manager registers here).
- `opentelemetry_util_genai_evaluators` (per-evaluator factories/registrations discovered by the evaluation manager).
Default loading honours two environment variables:
- `OTEL_INSTRUMENTATION_GENAI_COMPLETION_CALLBACKS` – optional comma-separated filter applied before instantiation.
- `OTEL_INSTRUMENTATION_GENAI_DISABLE_DEFAULT_COMPLETION_CALLBACKS` – when truthy, skips loading the built-in callbacks (e.g., the evaluation manager).
Evaluation Manager behaviour (shipped from `opentelemetry-util-genai-evals`):
- Instantiated lazily when the evaluation completion callback binds to `TelemetryHandler`.
- Trace-id ratio sampling via `OTEL_INSTRUMENTATION_GENAI_EVALUATION_SAMPLE_RATE` (falls back to enqueue if the span context is missing).
- Parses the evaluator grammar into per-type plans (metric + options) sourced from registered evaluators.
- Aggregation flag merges buckets into a single list when true (`OTEL_INSTRUMENTATION_GENAI_EVALS_RESULTS_AGGREGATION`).
- Emits lists of `EvaluationResult` to `handler.evaluation_results`.
- Marks invocation `attributes["gen_ai.evaluation.executed"] = True` post emission.
```text
start_*  -> CompositeEmitter.on_start(span, metrics, content_events)
finish_* -> CompositeEmitter.on_end(evaluation, metrics, content_events, span)
         -> completion callbacks (Evaluation Manager enqueues)
Evaluation worker -> evaluate -> handler.evaluation_results(list)
                  -> CompositeEmitter.on_evaluation_results(evaluation)
```
| Scenario | Configuration | Outcome |
|---|---|---|
| Add Traceloop compat span | `OTEL_INSTRUMENTATION_GENAI_EMITTERS=span,traceloop_compat` | Semconv + compat span |
| Only Traceloop compat span | `OTEL_INSTRUMENTATION_GENAI_EMITTERS=traceloop_compat` | Compat span only |
| Replace evaluation emitters | `OTEL_INSTRUMENTATION_GENAI_EMITTERS_EVALUATION=replace:SplunkEvaluationAggregator` | Only Splunk evaluation emission |
| Prepend custom metrics | `OTEL_INSTRUMENTATION_GENAI_EMITTERS_METRICS=prepend:MyMetrics` | Custom metrics run first |
| Replace content events | `OTEL_INSTRUMENTATION_GENAI_EMITTERS_CONTENT_EVENTS=replace:VendorContent` | Vendor events only |
| Agent-only cost metrics | (future) programmatic add with `invocation_types` filter | Metrics limited to agent invocations |
- Emitters are sandboxed (exceptions suppressed & debug-logged).
- No error metric yet (planned: `genai.emitter.errors`).
- Content capture is gated by an experimental opt-in to prevent accidental large data egress.
- A single content event per invocation reduces volume.
- Invocation-type filtering occurs before heavy serialization.
`emitters/utils.py` includes: semantic attribute filtering, message serialization, enumeration builders (prompt/completion), function definition mapping, and finish-time token usage application. Truncation / hashing helpers and PII redaction are not yet implemented (privacy work deferred).
- Implement ordering resolver for `after` / `before` hints.
- Programmatic rich registration API (mode + position) & removal.
- Error metrics instrumentation.
- Aggregated `EvaluationResults` wrapper (with evaluator latency, counts).
- Privacy redaction & size-limiting/truncation helpers.
- Async emitters & dynamic hot-reload (deferred).
- Backpressure strategies for high-volume content events.
Install the packages. First set up a virtual env (note: this erases any existing .venv in the current folder):

```bash
deactivate ; rm -rf .venv; python --version ; python -m venv .venv && . .venv/bin/activate && python -m ensurepip && python -m pip install --upgrade pip && python -m pip install pre-commit -c dev-requirements.txt && pre-commit install && python -m pip install rstcheck
```

Then install the GenAI utility packages and example requirements:

```bash
pip install -e util/opentelemetry-util-genai --no-deps
pip install -e util/opentelemetry-util-genai-evals --no-deps
pip install -e util/opentelemetry-util-genai-evals-deepeval --no-deps
pip install -e util/opentelemetry-util-genai-emitters-splunk --no-deps
pip install -e instrumentation-genai/opentelemetry-instrumentation-langchain --no-deps
pip install -r dev-genai-requirements.txt
pip install -r instrumentation-genai/opentelemetry-instrumentation-langchain/examples/manual/requirements.txt
```
Configure the environment:

```bash
export OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental
export OTEL_INSTRUMENTATION_GENAI_EMITTERS=span_metric_event,splunk
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT_MODE=SPAN_AND_EVENT
export OTEL_INSTRUMENTATION_GENAI_EVALS_EVALUATORS="Deepeval(LLMInvocation(bias,toxicity))"
export OTEL_INSTRUMENTATION_GENAI_EVALS_RESULTS_AGGREGATION=true
```

Pseudo-code to add manual instrumentation to your app:
```python
from opentelemetry.util.genai.handler import get_telemetry_handler
from opentelemetry.util.genai.types import LLMInvocation, InputMessage, OutputMessage, Text

handler = get_telemetry_handler()
inv = LLMInvocation(
    request_model="gpt-4",
    input_messages=[InputMessage(role="user", parts=[Text("Hello")])],
    provider="openai",
)
handler.start_llm(inv)
inv.output_messages = [OutputMessage(role="assistant", parts=[Text("Hi!")], finish_reason="stop")]
handler.stop_llm(inv)
```

- Unit tests: env parsing, category overrides, evaluator grammar, sampling, content capture gating.
- Future: ordering hints tests once implemented.
- Smoke: vendor emitters (Traceloop + Splunk) side-by-side replacement/append semantics.