Skip to content

Conversation

@mabdinur
Copy link
Contributor

@mabdinur mabdinur commented Oct 29, 2025

What does this PR do?

Adds full OpenTelemetry Metrics support to dd-trace-js with a custom Meter Provider implementation. Enable with DD_METRICS_OTEL_ENABLED=true to export metrics via OTLP protocol.

Key Features:

  • Full OpenTelemetry Metrics API compliance with all standard instrument types (Counter, UpDownCounter, Histogram, Gauge, and Observable variants)
  • OTLP export via http/protobuf (default) or http/json protocols
  • Configurable endpoint, headers, timeout, and export intervals via standard OTEL_EXPORTER_OTLP_METRICS_* environment variables
  • Periodic metric collection and aggregation with support for DELTA, CUMULATIVE, and LOWMEMORY temporality modes
  • Comprehensive test coverage (91.69%) with 39 integration tests

Configuration:

  • DD_METRICS_OTEL_ENABLED - Enable OpenTelemetry metrics (default: false)
  • OTEL_EXPORTER_OTLP_METRICS_ENDPOINT - Endpoint URL (default: http://localhost:4318/v1/metrics)
  • OTEL_EXPORTER_OTLP_METRICS_PROTOCOL - Protocol: http/protobuf or http/json
  • OTEL_METRIC_EXPORT_INTERVAL - Export interval in ms (default: 60000)
  • Additional timeout and header configuration options

Motivation

Enables customers to use OpenTelemetry Metrics API with dd-trace-js without adding the OpenTelemetry SDK as a dependency. Custom implementation provides better integration with dd-trace-js configurations, avoids vendoring grpc libraries and maintains flexibility.

Additional Notes

mabdinur and others added 30 commits October 14, 2025 12:27
- Add metrics.proto and metrics_service.proto (OTLP v1 spec)
- Update protobuf_loader to support metrics protos
- Rename protos/ -> otlp/ directory for better organization
- Create OtlpHttpExporterBase for shared HTTP export logic
- Create OtlpTransformerBase for shared transformation logic
- Refactor logs exporter/transformer to extend base classes
- Update test mocking paths
- Eliminates ~400 lines of duplication
for (const item of items) {
const instrumentationScope = item.instrumentationScope || { name: '', version: '', schemaUrl: '', attributes: {} }
const attrsKey = stableStringify(instrumentationScope.attributes || {})
const attrsKey = JSON.stringify(instrumentationScope.attributes || {})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If attributes are an object we must use a stable stringify implementation. Otherwise the keys would be stringified in insertion order and it would not match anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reintroduced the stableStringify implementation:

function stableStringify (attributes) {
  if (attributes == null || typeof attributes !== 'object') {
    return JSON.stringify(attributes)
  }
  // Attributes are sorted by key to ensure consistent serialization regardless of key order.
  // Keys are always strings and values are always strings, numbers, booleans, 
  // or arrays of strings, numbers, or booleans.
  return Object.keys(attributes)
    .sort()
    .map(key => `${key}:${JSON.stringify(attributes[key])}`)
    .join(',')
}

We typically expect each metric to have 1–5 attributes. Since Datadog bills based on the number of unique metric + attribute combinations, adding more tags can dramatically increase metric cardinality and therefore costs. To prevent this, users will likely only set a few attributes per metric. So sorting attributes by key will likely add minimal overhead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also added a test case to catch this

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the attributes are already guaranteed to be an object, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup they should be. I can get rid of the additional check

Copy link
Collaborator

@BridgeAR BridgeAR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost LGTM :)

@mabdinur mabdinur requested a review from BridgeAR November 17, 2025 19:22
@mabdinur mabdinur force-pushed the munir/add-otel-metrics-configs branch from 79e1be4 to ba3f4b1 Compare November 20, 2025 06:26
@mabdinur mabdinur enabled auto-merge (squash) November 21, 2025 00:51
Copy link
Collaborator

@BridgeAR BridgeAR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a few nits and potential follow-ups :)

const histogram = decoded.resourceMetrics[0].scopeMetrics[0].metrics[0]
assert.strictEqual(histogram.name, 'duration')
assert.strictEqual(histogram.histogram.dataPoints[0].count, 1)
assert.strictEqual(histogram.histogram.dataPoints[0].sum, 100)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think checking for all properties of the histogram would be good.

Nit: Ideally, we use our assertObjectContains helper instead of having many individual assertions.

Comment on lines +187 to +191
setTimeout(() => {
assert(validated, 'Should have validated an export with all metrics')
validator()
done()
}, 150)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using a timeout, it would be possible to just call done() from inside of the validator method. That way it is simpler and faster.

The same applies for all other methods with that pattern.

It is also more reliable, due to not being dependent on the timing anymore.

Or even better: the mockOtlpExport method returns a promise. That way you can trigger the methods sync and just return that promise afterwards. That would be very clean.

Comment on lines +577 to +584
assert(counter, 'Should have counter')
assert(histogram, 'Should have histogram')
assert.strictEqual(counter.name, 'test')
assert.strictEqual(histogram.name, 'test')
assert.strictEqual(counter.sum.dataPoints.length, 1, 'Counter should have 1 data point')
assert.strictEqual(histogram.histogram.dataPoints.length, 1, 'Histogram should have 1 data point')
assert.strictEqual(counter.sum.dataPoints[0].asInt, 5)
assert.strictEqual(histogram.histogram.dataPoints[0].sum, 100)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we have lots of code like that, while I would like to start prevent calling so many asserts on a single object. We should just use the partial helper (assertObjectContains, if I am not mistaken)

Comment on lines +498 to +504
let bucketIndex = DEFAULT_HISTOGRAM_BUCKETS.length
for (let i = 0; i < DEFAULT_HISTOGRAM_BUCKETS.length; i++) {
if (value <= DEFAULT_HISTOGRAM_BUCKETS[i]) {
bucketIndex = i
break
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, this would use a binary search.

Comment on lines +485 to +496
if (!cumulativeState.has(stateKey)) {
cumulativeState.set(stateKey, {
count: 0,
sum: 0,
min: Infinity,
max: -Infinity,
bucketCounts: new Array(DEFAULT_HISTOGRAM_BUCKETS.length + 1).fill(0),
startTime: metric.temporality === TEMPORALITY.CUMULATIVE ? this.#startTime : timestamp
})
}

const state = cumulativeState.get(stateKey)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit

Ideally, it would just get the state if it was defined. If it was not defined, we could combine the handling to just set the first value.

That would be a tad faster.

const state = cumulativeState.get(stateKey)
state.value += value

const dataPoint = this.#findOrCreateDataPoint(metric, attributes, attrKey, () => ({
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the idea around the method is an performance optimization to prevent the object from being created?

Instead of an object, we are now creating a method that is referencing these parts. I did not benchmark something like that before, I would expect the function creation still to take more time though. The pattern is only useful in case there would be many compute heavy operations.

sum: 0,
min: Infinity,
max: -Infinity,
bucketCounts: new Array(DEFAULT_HISTOGRAM_BUCKETS.length + 1).fill(0),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a compute heavy operations (referencing a comment above about findOrCreateDataPoint and the method vs object creation).

Making this lazy is a good idea. To prevent the overhead, I would still create a simple object and just use optional assignment to assign the bucketCounts after the call. That way dataPoint will be updated due to the reference and it is cheap.

@mabdinur mabdinur merged commit ef50f2e into master Nov 22, 2025
870 of 873 checks passed
@mabdinur mabdinur deleted the munir/add-otel-metrics-configs branch November 22, 2025 16:20
dd-octo-sts bot pushed a commit that referenced this pull request Nov 23, 2025
…6783)

* add otlp metrics protos

* feat: add OTLP metrics proto definitions and reorganize directory

- Add metrics.proto and metrics_service.proto (OTLP v1 spec)
- Update protobuf_loader to support metrics protos
- Rename protos/ -> otlp/ directory for better organization

* refactor: extract common OTLP logic into base classes

- Create OtlpHttpExporterBase for shared HTTP export logic
- Create OtlpTransformerBase for shared transformation logic
- Refactor logs exporter/transformer to extend base classes
- Update test mocking paths
- Eliminates ~400 lines of duplication

* fix logs

* increase test coverage

* feat(metrics): add support for otel metrics provider

* feat(metrics): add support for otlp configurations

* updates to pass system tests

* add temperolality support and clean up implementation

* add support for encoding attributes

* simplify tests

* use real values in tests

* do not encode numbers as strings

* use enum in transformation

* remove unneed fields

* add better temporality support and include encoding for async metric types

* improve test coverage for scope attributes

* simplify stubs in tests

* validate scope attributes

* ruben comments

* fix broken number test

* clean up tests

* clean up instruments and update tests

* simplify meter

* use constants and implement missing callbacks

* move otlp_transformer changes over

* clean up private/public fields

* avoid redefining constant

* clean up tests

* update typing

* linting clean ups

* first round of changes from cr

* limit number of metrics in each batch

* round 3 changes

* address more comments part 4

* update doc strings and typing

* update max batch size to operate on the aggregate metrics

* avoid converting aggregated metrics to arrays, perf

* fix regression in logs implementation

* revert closure

* cleanup max measurement queue, jdocs and typing

* log warning if value is invalid

* final set of changes

* update tests to be compatible with master branch

* remove register from meter provider and simplify export

* clean up stable stringify and update limit of max queue size

* remove @Private from metrics docs

* only implement the meterprovider api, renove shutdown and forceflush operations

* remove observableInstruments

* lint

* fix linting failures

* fix linting failures

* fix linting 3
@dd-octo-sts dd-octo-sts bot mentioned this pull request Nov 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants