feat(opentelemetry sink): add automatic native log and trace to OTLP conversion #24621
szibis wants to merge 64 commits into vectordotdev:master from
Add conversion from Vector's native flat log format to OTLP protobuf:
- Value → PBValue converters (inverse of existing PBValue → Value)
- native_log_to_otlp_request() for full event conversion
- Safe extraction helpers with graceful error handling
- Hex validation for trace_id (16 bytes) and span_id (8 bytes)
- Severity inference from severity_text when number missing
- Support for multiple timestamp formats (chrono, epoch, RFC3339)
- Pre-allocation and inline hints for performance
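The hex validation step can be illustrated with a minimal sketch. The helper names below (`hex_nibble`, `decode_hex_id`, `decode_trace_id`, `decode_span_id`) are hypothetical and not taken from this PR; returning `None` on malformed input is also an assumption about what "graceful error handling" means here.

```rust
/// Convert one ASCII hex character to its 4-bit value.
fn hex_nibble(c: u8) -> Option<u8> {
    match c {
        b'0'..=b'9' => Some(c - b'0'),
        b'a'..=b'f' => Some(c - b'a' + 10),
        b'A'..=b'F' => Some(c - b'A' + 10),
        _ => None,
    }
}

/// Decode a hex string into exactly `expected_len` bytes.
/// Returns `None` for a wrong length or any non-hex character (assumed behavior).
fn decode_hex_id(hex: &str, expected_len: usize) -> Option<Vec<u8>> {
    let bytes = hex.as_bytes();
    if bytes.len() != expected_len * 2 {
        return None;
    }
    let mut out = Vec::with_capacity(expected_len);
    for pair in bytes.chunks_exact(2) {
        let hi = hex_nibble(pair[0])?;
        let lo = hex_nibble(pair[1])?;
        out.push((hi << 4) | lo);
    }
    Some(out)
}

/// OTLP trace IDs are 16 bytes (32 hex characters).
fn decode_trace_id(hex: &str) -> Option<Vec<u8>> {
    decode_hex_id(hex, 16)
}

/// OTLP span IDs are 8 bytes (16 hex characters).
fn decode_span_id(hex: &str) -> Option<Vec<u8>> {
    decode_hex_id(hex, 8)
}
```

Checking the length before decoding means a span ID passed where a trace ID is expected is rejected rather than zero-padded.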
Detect native log format and automatically convert to OTLP when:
- Event does not contain 'resourceLogs' field (pre-formatted OTLP)
- Works with any Vector source (file, socket, otlp with flat decoding)

Maintains backward compatibility:
- Pre-formatted OTLP events (use_otlp_decoding: true) encode via passthrough
- Native events get automatic conversion to valid OTLP protobuf

This eliminates the need for 50+ lines of complex VRL transformation.
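A rough sketch of the detection rule, assuming the event can be viewed as a map keyed by top-level field name. `EncodePath` and `choose_encode_path` are hypothetical names for illustration. The key list combines `resourceLogs` from this commit with the `resourceSpans` and `resourceMetrics` fields named in the PR description.

```rust
use std::collections::BTreeMap;

/// Top-level keys that mark an event as pre-formatted OTLP.
const PREFORMATTED_KEYS: [&str; 3] = ["resourceLogs", "resourceSpans", "resourceMetrics"];

#[derive(Debug, PartialEq)]
enum EncodePath {
    /// Event already has the nested OTLP layout; encode it unchanged.
    Passthrough,
    /// Flat native event; build the OTLP structure first.
    NativeConversion,
}

/// Pick the encode path from the event's top-level keys (hypothetical helper).
fn choose_encode_path<V>(event: &BTreeMap<String, V>) -> EncodePath {
    if PREFORMATTED_KEYS.iter().any(|key| event.contains_key(*key)) {
        EncodePath::Passthrough
    } else {
        EncodePath::NativeConversion
    }
}
```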
Add integration and E2E tests:

Unit/Integration tests (lib/codecs/tests/otlp.rs):
- Basic encoding functionality
- Error handling (invalid types, missing fields, malformed hex)
- Source compatibility (file, syslog, modified OTLP)
- Timestamp handling (seconds, nanos, RFC3339, chrono)
- Severity inference from text
- Message field fallbacks (.message, .body, .msg, .log)
- Roundtrip encode/decode verification

E2E tests (tests/e2e/opentelemetry/native/):
- Native logs convert to valid OTLP
- Service name preservation through conversion
- Log body, severity, timestamps preserved
- Custom attributes via VRL transforms
- Correct event counting metrics
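The message field fallback can be sketched as a first-match lookup over the listed fields in order. This is an illustration only: the event is simplified to a string map, and `BODY_FIELD_PRIORITY` and `pick_body` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Candidate body fields, checked in the order listed in the tests above.
const BODY_FIELD_PRIORITY: [&str; 4] = ["message", "body", "msg", "log"];

/// Return the first candidate field present on the event, if any.
fn pick_body<'a>(event: &'a BTreeMap<String, String>) -> Option<&'a str> {
    BODY_FIELD_PRIORITY
        .iter()
        .find_map(|key| event.get(*key).map(String::as_str))
}
```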
Add comprehensive benchmarks comparing encoding approaches:
1. NEW: Native → auto-convert → encode (this PR)
2. OLD: VRL transform simulation → encode (what users had before)
3. OLD: Passthrough only (pre-formatted OTLP)

Results show 4.7x throughput improvement for batch operations:
- NEW batch: 288 MiB/s
- OLD VRL: 61 MiB/s

Single event is 7.4% faster than VRL approach.
- Changelog fragment for release notes
- Comprehensive documentation with mermaid diagrams
- Before/after configuration examples
- Field mapping reference
- Performance comparison tables
Fix check-spelling CI failure by adding two domain-specific terms:
- kvlist: OpenTelemetry KeyValueList type
- xychart: Mermaid diagram chart type
Co-authored-by: May Lee <may.lee@datadoghq.com>
… encode

Apply review feedback patterns from PR vectordotdev#24621:
- Replace `as u64` with `u64::try_from().ok()` for timestamp conversion
- Replace `as u64`/`as f64` with `u64::from()`/`f64::from()` for sample.rate
- Remove unwrap() in Distribution bucket overflow guard, use saturating index clamping instead
…ecode path

Replace bare 'u64 as i64' casts with i64::try_from().ok() in timestamp conversions for logs and spans decode paths. Values above i64::MAX (year 2262+) now gracefully fall back to current time or Value::Null instead of silently wrapping to negative timestamps.

Also guards log record dropped_attributes_count with > 0 check to avoid inserting zero values, matching the scope dropped_attributes_count pattern.

Fixes internal_log_rate_secs to internal_log_rate_limit (Vector convention).
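The difference between the two conversions is easy to see in isolation. In this sketch, `wrapping_nanos` stands for the old cast pattern and `checked_nanos` for the replacement; both function names are hypothetical.

```rust
use std::convert::TryFrom;

/// Old pattern: a bare cast reinterprets the bits, so large values turn negative.
fn wrapping_nanos(nanos: u64) -> i64 {
    nanos as i64
}

/// New pattern: values above i64::MAX become `None`, so the caller can
/// fall back explicitly (for example to the current time or a null value).
fn checked_nanos(nanos: u64) -> Option<i64> {
    i64::try_from(nanos).ok()
}
```

With `u64::MAX` as input, the cast yields `-1` while the checked conversion yields `None`.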
kv_list_into_value was dropping KeyValue entries where kv.value was None (outer AnyValue wrapper missing). Now all entries are preserved as Null.
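A simplified sketch of the fixed behavior, using stand-in types rather than the real protobuf and Vector ones. `Value`, `KeyValue`, and `kv_list_into_map` here are hypothetical reductions for illustration; the actual `kv_list_into_value` works on the generated OTLP types.

```rust
use std::collections::BTreeMap;

/// Stand-in for Vector's value type, reduced to two variants.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Text(String),
}

/// Stand-in for the protobuf KeyValue: the value wrapper may be absent.
struct KeyValue {
    key: String,
    value: Option<String>,
}

/// Keep every entry; a missing value maps to `Value::Null` instead of being dropped.
fn kv_list_into_map(list: Vec<KeyValue>) -> BTreeMap<String, Value> {
    list.into_iter()
        .map(|kv| (kv.key, kv.value.map_or(Value::Null, Value::Text)))
        .collect()
}
```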
…elds in log conversion

Add namespace-aware field extraction that checks both event root (Legacy namespace) and %metadata.opentelemetry.* (Vector namespace), ensuring round-trip compatibility for logs decoded with Vector namespace.

Collect unrecognized event fields (e.g. user_id, request_id, hostname) into OTLP attributes instead of silently dropping them during native log-to-OTLP conversion.
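Collecting the remaining fields can be sketched as a filter over the event's top-level keys. The known-field list below is a subset assembled from field names that appear in this PR, not the actual list in the code. Letting explicit attributes win on a key collision is an assumption, and `collect_remaining_fields` is a hypothetical name.

```rust
use std::collections::BTreeMap;

/// Fields with a dedicated OTLP mapping (illustrative subset, not the PR's list).
const KNOWN_LOG_FIELDS: [&str; 14] = [
    "message", "body", "msg", "log", "timestamp", "severity_text", "severity_number",
    "trace_id", "span_id", "attributes", "resources", "scope", "source_type",
    "ingest_timestamp",
];

/// Merge unrecognized top-level fields into the explicit attributes.
/// Assumption: on a key collision the explicit attribute is kept.
fn collect_remaining_fields(
    event: &BTreeMap<String, String>,
    explicit: &BTreeMap<String, String>,
) -> BTreeMap<String, String> {
    let mut attributes = explicit.clone();
    for (key, value) in event {
        if KNOWN_LOG_FIELDS.iter().any(|known| *known == key.as_str()) {
            continue;
        }
        attributes.entry(key.clone()).or_insert_with(|| value.clone());
    }
    attributes
}
```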
…OTLP conversion

Add 19 new tests covering:
- Full OTLP field mapping (all fields set simultaneously)
- Attribute value types (int, float, bool, array, nested object)
- Body field priority (message > body > msg > log)
- Structured object body → KvlistValue
- Observed timestamp, flags, dropped_attributes_count
- Scope with attributes
- Remaining field dedup with explicit attributes
- Null field filtering
- All severity inference levels + case insensitivity
- RFC3339 string and float timestamp parsing
- Resource via alternative field names
- Many custom fields from JSON/k8s sources
- Vector namespace full metadata roundtrip
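Case-insensitive severity inference might look like the sketch below. The numbers are the first value of each range in the OpenTelemetry log data model (TRACE 1, DEBUG 5, INFO 9, WARN 13, ERROR 17, FATAL 21). The function name, the `WARNING` alias, and the whitespace trimming are assumptions, not confirmed details of this PR.

```rust
/// Infer an OTLP severity number from severity text, ignoring case.
/// Returns `None` for unrecognized text so the caller can leave it unspecified.
fn infer_severity_number(text: &str) -> Option<i32> {
    match text.trim().to_ascii_uppercase().as_str() {
        "TRACE" => Some(1),
        "DEBUG" => Some(5),
        "INFO" => Some(9),
        // "WARNING" as an alias is an assumption for illustration.
        "WARN" | "WARNING" => Some(13),
        "ERROR" => Some(17),
        "FATAL" => Some(21),
        _ => None,
    }
}
```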
Mirror the log fix: collect unknown trace event fields (deployment_id, tenant, environment, etc.) as span attributes to prevent silent data loss during native→OTLP conversion. Add KNOWN_OTLP_SPAN_FIELDS list and collect_trace_remaining_fields helper. Include ingest_timestamp as known to avoid re-encoding the decode-path timestamp. Add 6 tests: unknown fields collected, known fields excluded, merge with explicit attributes, null filtering, type preservation, and ingest_timestamp exclusion.
…avior

Fix scope.dropped_attributes_count: read from event/metadata instead of hard-coding 0, preserving round-trip fidelity.

Add source_type and ingest_timestamp to known OTLP log fields to prevent Vector operational metadata from spilling into OTLP attributes.

Document the automatic remaining-fields-to-attributes behavior in both the OtlpSerializer doc comments and the sink how_it_works section.
Extract scope.schema_url, resource schema_url, resource_dropped_attributes_count, and scope.dropped_attributes_count in the native-to-OTLP encode path. These fields are produced by the decode fix in vectordotdev#24905 — the encode now reads them when present and falls back to defaults (empty/0) when absent, ensuring full round-trip fidelity once vectordotdev#24905 merges while remaining backward-compatible before it does. Also fixes schema_url mapping: root "schema_url" now correctly maps to ResourceLogs/ResourceSpans.schema_url (resource level), while "scope.schema_url" maps to ScopeLogs/ScopeSpans.schema_url (scope level).
PR24905 changed resource metadata to flat paths (resource_schema_url, resource_dropped_attributes_count) instead of nesting under resources to avoid user attribute collisions. Update the encode path to read from the new flat metadata paths in Vector namespace.
Add edge_case_tests module to logs.rs with 10 roundtrip tests covering:
- Array, nested kvlist, bool/int/double body types
- Empty body, severity_number=0 (UNSPECIFIED)
- Observed timestamp without timestamp
- Unicode/emoji in body and attribute keys/values
- Multiple log records per scope
- u64::MAX timestamp overflow safety

Fix clippy approx_constant errors in common.rs and logs.rs test values (3.14→1.23, 2.71828→9.81, 3.14159→1.23456).
Local Integration & Edge Case Testing
Set up a local test pipeline: Vector OTLP source on ports 4317/4318, piped through an OTLP sink to a second Vector instance on 4319/4320 for full roundtrip testing. Used grpcurl with proto files from
Fixed pre-existing clippy
Added 10 edge case tests to
Performance
Pushed 10k log events and 10k trace events in batches of 100 through the roundtrip pipeline: logs at ~4.7k events/s, traces at ~5k events/s, zero errors. These numbers mostly reflect grpcurl client overhead; a compiled gRPC client would be significantly faster. The important takeaway is stability: no crashes, no data corruption, no memory issues under sustained load.
…rdotdev#24897
- Update otlp-native-conversion.md to reflect metrics are supported via companion PR vectordotdev#24897 (was "planned for future release")
- Add native metrics auto-conversion example config
- Update CUE docs requirements to mention all three signals
- Accept DataType::all_bits() in OtlpSerializerConfig::input_type() so metrics are not rejected at config validation time
- Improve metric error message to reference vectordotdev#24897 and passthrough
Avoids rebuilding the serializer and allocating a new BytesMut per iteration. Each benchmark now creates the encoder once and clears the buffer between iterations, giving more accurate per-encode measurements.
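The buffer reuse pattern can be shown with a standard-library stand-in. This sketch uses `Vec<u8>` in place of `BytesMut` and a trivial `encode_into` in place of the real serializer; both function names are hypothetical.

```rust
/// Stand-in for a serializer call: appends the "encoded" bytes to `buf`.
fn encode_into(buf: &mut Vec<u8>, payload: &[u8]) {
    buf.extend_from_slice(payload);
}

/// Run `iterations` encodes while reusing one buffer; returns total bytes written.
fn encode_repeatedly(payload: &[u8], iterations: usize) -> usize {
    // Allocate once, outside the measured loop.
    let mut buf: Vec<u8> = Vec::with_capacity(payload.len());
    let mut total = 0;
    for _ in 0..iterations {
        // `clear` drops the contents but keeps the allocation.
        buf.clear();
        encode_into(&mut buf, payload);
        total += buf.len();
    }
    total
}
```

Because the allocation happens once, each loop iteration measures only the encode itself.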
Summary
Add automatic conversion from Vector's native flat log and trace formats to OTLP (OpenTelemetry Protocol) format in the `opentelemetry` sink's `otlp` codec.

Problem: Users currently need 50+ lines of complex VRL to manually build the nested OTLP structure (`resourceLogs` → `scopeLogs` → `logRecords` with `KeyValue` arrays), and trace events fail entirely when sent to the OTLP sink without passthrough mode.

Solution: The OTLP encoder now automatically detects native log and trace events and converts them to valid OTLP protobuf. Pre-formatted OTLP events continue to use passthrough encoding (backward compatible).
Scope
use_otlp_decoding: true)
Performance Impact (Logs)
Architecture Comparison
%%{init: {'theme': 'base', 'themeVariables': { 'lineColor': '#000000', 'primaryTextColor': '#000000'}}}%%
flowchart TB
    subgraph OLD["OLD: 50 lines VRL"]
        direction LR
        O1[Source] ==> O2[VRL Transform]
        O2 ==> O3[OTLP Encoder]
        O3 ==> O4[Collector]
    end
    subgraph NEW["NEW: Zero VRL"]
        direction LR
        N1[Source] ==> N2[OTLP Encoder]
        N2 ==> N3[Collector]
    end
    OLD ==> NEW
    style O1 fill:#ffffff,stroke:#000000,stroke-width:2px,color:#000000
    style O2 fill:#cccccc,stroke:#000000,stroke-width:3px,color:#000000
    style O3 fill:#ffffff,stroke:#000000,stroke-width:2px,color:#000000
    style O4 fill:#ffffff,stroke:#000000,stroke-width:2px,color:#000000
    style N1 fill:#ffffff,stroke:#000000,stroke-width:2px,color:#000000
    style N2 fill:#999999,stroke:#000000,stroke-width:3px,color:#000000
    style N3 fill:#ffffff,stroke:#000000,stroke-width:2px,color:#000000
    linkStyle default stroke:#000000,stroke-width:2px
Before vs After Comparison
`use_otlp_decoding: false` + `codec: otlp`
`codec: otlp`
`use_otlp_decoding: true` + `codec: otlp`
`codec: otlp`
`use_otlp_decoding: true` + `codec: otlp`
`use_otlp_decoding: true` + `codec: otlp`
`codec: otlp`

- `.message` → `body.stringValue`
- `.severity_text` → `severityText`
- `.severity_number` → `severityNumber`
- `.attributes.*` → `logRecords[].attributes[]`
- `.resources.*` → `resource.attributes[]`
- `.trace_id` → `traceId`
- `.span_id` → `spanId`
- `.timestamp` → `timeUnixNano`

- `.trace_id` → `traceId` (16 bytes)
- `.span_id` → `spanId` (8 bytes)
- `.parent_span_id` → `parentSpanId`
- `.name` → `name`
- `.kind` → `kind`
- `.start_time_unix_nano` / `.end_time_unix_nano`
- `.attributes.*` → `attributes[]`
- `.resources.*` → `resource.attributes[]`
- `.events` → `events[]` (span events)
- `.links` → `links[]` (span links)
- `.status` → `status` (message, code)

Vector configuration
Before (Complex VRL Required)
50+ lines of VRL transformation
After (Zero VRL)
Traces (Zero VRL)
Native Metrics (Auto-Convert via #24897)
With Optional Enrichment
Supported Native Log Format
{
  "message": "User login successful",
  "timestamp": "2024-01-15T10:30:00Z",
  "severity_text": "INFO",
  "severity_number": 9,
  "trace_id": "0123456789abcdef0123456789abcdef",
  "span_id": "fedcba9876543210",
  "attributes": {
    "user_id": "user-12345",
    "duration_ms": 42.5
  },
  "resources": {
    "service.name": "auth-service"
  },
  "scope": {
    "name": "auth-module",
    "version": "1.0.0"
  }
}
Supported Native Trace Format
{
  "trace_id": "0123456789abcdef0123456789abcdef",
  "span_id": "fedcba9876543210",
  "parent_span_id": "abcdef0123456789",
  "name": "HTTP GET /api/users",
  "kind": 2,
  "start_time_unix_nano": 1705312200000000000,
  "end_time_unix_nano": 1705312200042000000,
  "attributes": {
    "http.method": "GET",
    "http.status_code": 200
  },
  "resources": {
    "service.name": "api-gateway"
  },
  "status": {
    "code": 1,
    "message": "OK"
  },
  "events": [
    {
      "name": "request.start",
      "time_unix_nano": 1705312200000000000,
      "attributes": {
        "component": "handler"
      }
    }
  ],
  "links": []
}
Log Field Mapping
- `.message` / `.body` / `.msg` → `body.stringValue`
- `.timestamp` → `timeUnixNano`
- `.severity_text` → `severityText`
- `.severity_number` → `severityNumber`
- `.trace_id` → `traceId`
- `.span_id` → `spanId`
- `.attributes.*` → `attributes[]`
- `.resources.*` → `resource.attributes[]`
- `.scope.name` → `scope.name`
- `.scope.version` → `scope.version`

Trace Field Mapping
- `.trace_id` → `traceId`
- `.span_id` → `spanId`
- `.parent_span_id` → `parentSpanId`
- `.name` → `name`
- `.kind` → `kind`
- `.start_time_unix_nano` → `startTimeUnixNano`
- `.end_time_unix_nano` → `endTimeUnixNano`
- `.trace_state` → `traceState`
- `.attributes.*` → `attributes[]`
- `.resources.*` → `resource.attributes[]`
- `.events[]` → `events[]`
- `.links[]` → `links[]`
- `.status.code` → `status.code`
- `.status.message` → `status.message`
- `.dropped_attributes_count` → `droppedAttributesCount`
- `.dropped_events_count` → `droppedEventsCount`
- `.dropped_links_count` → `droppedLinksCount`

How did you test this PR?
Unit Tests
Tests cover:
E2E Tests
cargo vdev test e2e opentelemetry-native
Docker Compose with telemetrygen validates:
Benchmarks
Change Type
Is this a breaking change?
Pre-formatted OTLP events (with `resourceLogs`/`resourceSpans`/`resourceMetrics` fields) continue using existing passthrough path. Native metric events return an explicit error with a clear message (same behavior as before for unsupported types).
Does this PR include user facing changes?
`no-changelog` label to this PR.
References