66 commits
eddb812
feat: ReconcileUtils for strongly consistent updates (#3106)
csviri Jan 15, 2026
47a1614
feat: observability with otel and default grafana dashboard
csviri Feb 4, 2026
e7d6101
wip
csviri Feb 4, 2026
008dcb7
wip
csviri Feb 4, 2026
758e31d
wip
csviri Feb 4, 2026
229b310
wip
csviri Feb 8, 2026
af72325
wip
csviri Feb 8, 2026
db73e14
wip
csviri Feb 8, 2026
7dac554
wip
csviri Feb 9, 2026
8f4af67
wip
csviri Feb 9, 2026
7163cb0
wip
csviri Feb 9, 2026
770b51a
wip
csviri Feb 9, 2026
c47923f
wip
csviri Feb 9, 2026
be8ea21
wip
csviri Feb 9, 2026
6205a3e
wip
csviri Feb 9, 2026
346ee9b
wip
csviri Feb 9, 2026
726bcc1
wip
csviri Feb 9, 2026
2b814c9
improve: micrometer metrics improvements
csviri Feb 9, 2026
ea84c96
wip
csviri Feb 10, 2026
cc99410
wip
csviri Feb 10, 2026
f6899bc
wip
csviri Feb 10, 2026
66188ef
wip
csviri Feb 10, 2026
cb6b877
wip
csviri Feb 10, 2026
9752024
wip
csviri Feb 10, 2026
53542c2
wip
csviri Feb 11, 2026
f9e0163
wip
csviri Feb 11, 2026
886274c
wip
csviri Feb 11, 2026
6c80029
wip
csviri Feb 11, 2026
9ad6cab
e2e test skeleton
csviri Feb 11, 2026
a5254b3
wip
csviri Feb 12, 2026
d8777b1
wip
csviri Feb 17, 2026
f3e35c2
wip
csviri Feb 21, 2026
58862e3
wip
csviri Feb 27, 2026
a9bcc77
wip
csviri Feb 27, 2026
c3a397d
wip
csviri Feb 27, 2026
43401f1
wip
csviri Feb 27, 2026
3af9310
wip
csviri Feb 27, 2026
3ccd430
wip
csviri Feb 27, 2026
3fffab4
wip
csviri Feb 28, 2026
e52bd37
wip
csviri Mar 1, 2026
a751867
documentation update
csviri Mar 1, 2026
4fc3069
wip
csviri Mar 1, 2026
a3d935d
logging
csviri Mar 1, 2026
e6ad757
Update sample-operators/metrics-processing/src/main/java/io/javaopera…
csviri Mar 1, 2026
5f46c5c
Update sample-operators/metrics-processing/pom.xml
csviri Mar 1, 2026
23416f1
Update sample-operators/metrics-processing/pom.xml
csviri Mar 1, 2026
08f33db
Update operator-framework-core/src/main/java/io/javaoperatorsdk/opera…
csviri Mar 1, 2026
e5acc60
Update observability/install-observability.sh
csviri Mar 1, 2026
4eda27b
wip
csviri Mar 1, 2026
6282ca4
Update sample-operators/metrics-processing/src/main/resources/io/java…
csviri Mar 1, 2026
0074f84
wip
csviri Mar 1, 2026
e2c4751
wip
csviri Mar 1, 2026
0d1a23c
wip
csviri Mar 1, 2026
6ceb74d
wip
csviri Mar 1, 2026
45fc814
wip
csviri Mar 1, 2026
cf114e3
wip
csviri Mar 1, 2026
a804406
Refinements on metrics
csviri Mar 3, 2026
fbb67b6
wip
csviri Mar 3, 2026
42e2649
docs improvement
csviri Mar 3, 2026
ebac5e2
fix: add deprecation information
metacosm Mar 4, 2026
810ba90
refactor: consistent constant definition
metacosm Mar 4, 2026
995ace7
refactor: reuse available methods to help inlining
metacosm Mar 4, 2026
4a2ec96
refactor: avoid creating intermediate collection
metacosm Mar 4, 2026
05de31a
refactor: remove unused constant
metacosm Mar 4, 2026
392f40e
fixed from code review
csviri Mar 4, 2026
2b86f00
wip
csviri Mar 4, 2026
1 change: 1 addition & 0 deletions .github/workflows/e2e-test.yml
@@ -25,6 +25,7 @@ jobs:
- "sample-operators/tomcat-operator"
- "sample-operators/webpage"
- "sample-operators/leader-election"
- "sample-operators/metrics-processing"
runs-on: ubuntu-latest
steps:
- name: Checkout
1 change: 1 addition & 0 deletions .github/workflows/pr.yml
@@ -11,6 +11,7 @@ on:
paths-ignore:
- 'docs/**'
- 'adr/**'
- 'observability/**'
workflow_dispatch:
jobs:
check_format_and_unit_tests:
125 changes: 101 additions & 24 deletions docs/content/en/docs/documentation/observability.md
@@ -77,30 +77,108 @@ Metrics metrics; // initialize your metrics implementation
Operator operator = new Operator(client, o -> o.withMetrics(metrics));
```

### Micrometer implementation
### MicrometerMetricsV2

The micrometer implementation is typically created using one of the provided factory methods which, depending on which
is used, will return either a ready to use instance or a builder allowing users to customize how the implementation
behaves, in particular when it comes to the granularity of collected metrics. It is, for example, possible to collect
metrics on a per-resource basis via tags that are associated with meters. This is the default, historical behavior but
this will change in a future version of JOSDK because this dramatically increases the cardinality of metrics, which
could lead to performance issues.
[`MicrometerMetricsV2`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/micrometer-support/src/main/java/io/javaoperatorsdk/operator/monitoring/micrometer/MicrometerMetricsV2.java)
is the recommended micrometer-based implementation. It is designed with low cardinality in mind:
all meters are scoped to the controller, not to individual resources. This avoids unbounded cardinality growth as
resources come and go.
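
To make the cardinality argument concrete, the following is a minimal, self-contained sketch that only counts distinct tag combinations. It does not use Micrometer or JOSDK: the `CardinalitySketch` class and its `Resource` record are hypothetical names introduced for illustration, and the tag layout is a simplification of the two approaches described here.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CardinalitySketch {

  /** Hypothetical stand-in for a reconciled resource, reduced to possible tag values. */
  public record Resource(String controller, String namespace, String name) {}

  /** Distinct tag sets when meters are tagged with the controller name only. */
  public static int perControllerSeries(List<Resource> resources) {
    Set<String> series = new HashSet<>();
    for (Resource r : resources) {
      series.add(r.controller());
    }
    return series.size();
  }

  /** Distinct tag sets when each resource's namespace and name also become tags. */
  public static int perResourceSeries(List<Resource> resources) {
    Set<String> series = new HashSet<>();
    for (Resource r : resources) {
      series.add(r.controller() + "/" + r.namespace() + "/" + r.name());
    }
    return series.size();
  }

  public static void main(String[] args) {
    List<Resource> resources = List.of(
        new Resource("webpage", "team-a", "home"),
        new Resource("webpage", "team-a", "docs"),
        new Resource("webpage", "team-b", "home"));
    System.out.println("per controller: " + perControllerSeries(resources));
    System.out.println("per resource:   " + perResourceSeries(resources));
  }
}
```

With controller-scoped tags the number of series stays fixed as resources are created and deleted, whereas the per-resource count grows with every new resource.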

To create a `MicrometerMetrics` implementation that behaves how it has historically behaved, you can just create an
instance via:
The simplest way to create an instance:

```java
MeterRegistry registry; // initialize your registry implementation
Metrics metrics = MicrometerMetrics.newMicrometerMetricsBuilder(registry).build();
Metrics metrics = MicrometerMetricsV2.newMicrometerMetricsV2Builder(registry).build();
```

Optionally, include a `namespace` tag on per-reconciliation counters (disabled by default to avoid unexpected
cardinality increases in existing deployments):

```java
Metrics metrics = MicrometerMetricsV2.newMicrometerMetricsV2Builder(registry)
.withNamespaceAsTag()
.build();
```

You can also supply a custom timer configuration for `reconciliations.execution.duration`:

```java
Metrics metrics = MicrometerMetricsV2.newMicrometerMetricsV2Builder(registry)
.withExecutionTimerConfig(builder -> builder.publishPercentiles(0.5, 0.95, 0.99))
.build();
```

The class provides factory methods which either return a fully pre-configured instance or a builder object that will
allow you to configure more easily how the instance will behave. You can, for example, configure whether the
implementation should collect metrics on a per-resource basis, whether associated meters should be removed when a
resource is deleted and how the clean-up is performed. See the relevant classes documentation for more details.
#### MicrometerMetricsV2 metrics

All meters use `controller.name` as their primary tag. Counters optionally carry a `namespace` tag when
`withNamespaceAsTag()` is enabled.

| Meter name (Micrometer) | Type | Tags | Description |
|--------------------------------------|---------|---------------------------------------------------|------------------------------------------------------------------|
| `reconciliations.active` | gauge | `controller.name` | Number of reconciler executions currently executing |
| `reconciliations.queue` | gauge | `controller.name` | Number of resources currently queued for reconciliation |
| `custom_resources` | gauge | `controller.name` | Number of custom resources tracked by the controller |
| `reconciliations.execution.duration` | timer | `controller.name` | Reconciliation execution duration with explicit bucket histogram |
| `reconciliations.started.total` | counter | `controller.name`, `namespace`* | Number of reconciliations started (including retries) |
| `reconciliations.success.total` | counter | `controller.name`, `namespace`* | Number of successfully finished reconciliations |
| `reconciliations.failure.total` | counter | `controller.name`, `namespace`* | Number of failed reconciliations |
| `reconciliations.retries.total` | counter | `controller.name`, `namespace`* | Number of reconciliation retries |
| `events.received` | counter | `controller.name`, `event`, `action`, `namespace` | Number of Kubernetes events received by the controller |

\* `namespace` tag is only included when `withNamespaceAsTag()` is enabled.

The execution timer uses explicit boundaries (10ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s, 10s, 30s) to ensure
compatibility with `histogram_quantile()` queries in Prometheus. This is important when using the OpenTelemetry Protocol (OTLP) registry, where
`publishPercentileHistogram()` would otherwise produce Base2 Exponential Histograms that are incompatible with classic
`_bucket` queries.
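
As an illustration of what explicit boundaries mean for the exported data, here is a minimal sketch in plain Java (no Micrometer involved) that sorts a few sample durations into cumulative buckets, the shape that classic `_bucket` series have. The class and method names are hypothetical; only the boundary values come from the paragraph above.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExplicitBucketSketch {

  // Upper bounds listed in the documentation: 10ms up to 30s.
  public static final List<Duration> BOUNDARIES = List.of(
      Duration.ofMillis(10), Duration.ofMillis(50), Duration.ofMillis(100),
      Duration.ofMillis(250), Duration.ofMillis(500), Duration.ofSeconds(1),
      Duration.ofSeconds(2), Duration.ofSeconds(5), Duration.ofSeconds(10),
      Duration.ofSeconds(30));

  /** Returns cumulative counts keyed by upper bound, followed by a final "+Inf" bucket. */
  public static Map<String, Long> cumulativeBuckets(List<Duration> samples) {
    Map<String, Long> buckets = new LinkedHashMap<>();
    for (Duration bound : BOUNDARIES) {
      // A cumulative bucket counts every sample less than or equal to its bound.
      long count = samples.stream().filter(s -> s.compareTo(bound) <= 0).count();
      buckets.put(bound.toMillis() + "ms", count);
    }
    buckets.put("+Inf", (long) samples.size());
    return buckets;
  }

  public static void main(String[] args) {
    List<Duration> samples = List.of(
        Duration.ofMillis(7), Duration.ofMillis(120),
        Duration.ofSeconds(3), Duration.ofSeconds(45));
    cumulativeBuckets(samples)
        .forEach((le, count) -> System.out.println("le=" + le + " count=" + count));
  }
}
```

Because the bounds are fixed up front, every exported series has the same bucket layout, which is what `histogram_quantile()` expects when interpolating between buckets.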

> **Note on Prometheus metric names**: The exact Prometheus metric name suffix depends on the `MeterRegistry` in use.
> For `PrometheusMeterRegistry` the timer is exposed as `reconciliations_execution_duration_seconds_*`. For
> `OtlpMeterRegistry` (metrics exported via OpenTelemetry Collector), it is exposed as
> `reconciliations_execution_duration_milliseconds_*`.

#### Grafana Dashboard

A ready-to-use Grafana dashboard is available at
[`observability/josdk-operator-metrics-dashboard.json`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/observability/josdk-operator-metrics-dashboard.json).
It visualizes all of the metrics listed above, including reconciliation throughput, error rates, queue depth, active
executions, resource counts, and execution duration histograms and heatmaps.

The dashboard is designed to work with metrics exported via OpenTelemetry Collector to Prometheus, as set up by the
observability sample (see below).

#### Exploring metrics end-to-end

The
[`metrics-processing` sample operator](https://github.com/java-operator-sdk/java-operator-sdk/tree/main/sample-operators/metrics-processing)
includes a full end-to-end test,
[`MetricsHandlingE2E`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/sample-operators/metrics-processing/src/test/java/io/javaoperatorsdk/operator/sample/metrics/MetricsHandlingE2E.java),
that:

1. Installs a local observability stack (Prometheus, Grafana, OpenTelemetry Collector) via
`observability/install-observability.sh`. The script also imports the Grafana dashboards.
2. Runs two reconcilers that produce both successful and failing reconciliations over a sustained period
3. Verifies that the expected metrics appear in Prometheus

This is a good starting point for experimenting with the metrics and the Grafana dashboard in a real cluster without
having to deploy your own operator.

### MicrometerMetrics (Deprecated)

> **Deprecated**: `MicrometerMetrics` (V1) is deprecated as of JOSDK 5.3.0. Use `MicrometerMetricsV2` instead.
> V1 attaches resource-specific metadata (name, namespace, etc.) as tags to every meter, which causes unbounded
> cardinality growth and can lead to performance issues in your metrics backend.

The legacy `MicrometerMetrics` implementation is still available. To create an instance that behaves as it historically
has:

```java
MeterRegistry registry; // initialize your registry implementation
Metrics metrics = MicrometerMetrics.newMicrometerMetricsBuilder(registry).build();
```

For example, the following will create a `MicrometerMetrics` instance configured to collect metrics on a per-resource
basis, deleting the associated meters after 5 seconds when a resource is deleted, using up to 2 threads to do so.
To collect metrics on a per-resource basis, deleting the associated meters after 5 seconds when a resource is deleted,
using up to 2 threads:

```java
MicrometerMetrics.newPerResourceCollectingMicrometerMetricsBuilder(registry)
@@ -109,9 +187,9 @@ MicrometerMetrics.newPerResourceCollectingMicrometerMetricsBuilder(registry)
.build();
```

### Operator SDK metrics
#### Operator SDK metrics (V1)

The micrometer implementation records the following metrics:
The V1 micrometer implementation records the following metrics:

| Meter name | Type | Tag names | Description |
|-------------------------------------------------------------|----------------|-------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
@@ -130,12 +208,11 @@ The micrometer implementation records the following metrics:
| operator.sdk.controllers.execution.cleanup.success | counter | controller, type | Number of successful cleanups per controller |
| operator.sdk.controllers.execution.cleanup.failure | counter | controller, exception | Number of failed cleanups per controller |

As you can see all the recorded metrics start with the `operator.sdk` prefix. `<resource metadata>`, in the table above,
refers to resource-specific metadata and depends on the considered metric and how the implementation is configured and
could be summed up as follows: `group?, version, kind, [name, namespace?], scope` where the tags in square
brackets (`[]`) won't be present when per-resource collection is disabled and tags followed by a question mark are
omitted if the associated value is empty. Of note, when in the context of controllers' execution metrics, these tag
names are prefixed with `resource.`. This prefix might be removed in a future version for greater consistency.
All V1 metrics start with the `operator.sdk` prefix. `<resource metadata>` refers to resource-specific metadata and
depends on the considered metric and how the implementation is configured: `group?, version, kind, [name, namespace?],
scope` where tags in square brackets (`[]`) won't be present when per-resource collection is disabled and tags followed
by a question mark are omitted if the value is empty. In the context of controllers' execution metrics, these tag names
are prefixed with `resource.`.

### Aggregated Metrics

26 changes: 24 additions & 2 deletions docs/content/en/docs/migration/v5-3-migration.md
@@ -4,7 +4,7 @@ description: Migrating from v5.2 to v5.3
---


## Renamed JUnit Module
## Rename of JUnit module

If you use the JUnit extension in your tests, just rename it from:

@@ -26,4 +26,26 @@ to
<version>5.3.0</version>
<scope>test</scope>
</dependency>
```
```

## Metrics interface changes
Collaborator


This would technically be an API break and would require a new major version.

Collaborator Author


Strictly speaking, yes, but we do make such minor API changes from time to time (see the migration document), as other frameworks sometimes do. I think it is the better tradeoff: we don't really want to bump the major version that often, and on the other hand we have quite a lot of APIs, so it is sometimes better to evolve them this way, IMO.

I also tried to make it backwards compatible, and we still could. But in the end it looked like that would be more confusing than just having a table that makes it easy to migrate from the current implementation, if that makes sense.


The [Metrics](https://github.com/operator-framework/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/monitoring/Metrics.java)
interface changed in a non-backwards-compatible way to make the API cleaner.

The following table shows the relevant method renames:

| v5.2 method | v5.3 method |
|------------------------------------|------------------------------|
| `reconcileCustomResource` | `reconciliationSubmitted` |
| `reconciliationExecutionStarted` | `reconciliationStarted` |
| `reconciliationExecutionFinished`  | `reconciliationFinished`     |
| `failedReconciliation`             | `reconciliationFailed`       |
| `finishedReconciliation`           | `reconciliationSucceeded`    |
| `cleanupDoneFor` | `cleanupDone` |
| `receivedEvent` | `eventReceived` |


Other changes:
- `reconciliationFinished(..)` and `reconciliationFailed(..)` are extended with a `RetryInfo` parameter.
- `monitorSizeOf(..)` method is removed.
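
To get a feel for how the v5.3 callback names relate to each other, here is a minimal, self-contained sketch. It does not implement the real `Metrics` interface: `ReconciliationCallbacks` is a hypothetical, simplified stand-in whose parameters are reduced to a `String`, and the call order in `reconcile` is an assumption made for illustration rather than the SDK's actual dispatch logic.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

public class LifecycleSketch {

  /** Hypothetical stand-in; the real methods take HasMetadata, RetryInfo and a metadata map. */
  public interface ReconciliationCallbacks {
    void reconciliationStarted(String resource);
    void reconciliationSucceeded(String resource);
    void reconciliationFailed(String resource, Exception exception);
    void reconciliationFinished(String resource);
  }

  /** Tracks in-flight executions as a gauge-like value and outcomes as counters. */
  public static class CountingCallbacks implements ReconciliationCallbacks {
    public final AtomicInteger active = new AtomicInteger();
    public final LongAdder succeeded = new LongAdder();
    public final LongAdder failed = new LongAdder();

    @Override
    public void reconciliationStarted(String resource) {
      active.incrementAndGet();
    }

    @Override
    public void reconciliationSucceeded(String resource) {
      succeeded.increment();
    }

    @Override
    public void reconciliationFailed(String resource, Exception exception) {
      failed.increment();
    }

    @Override
    public void reconciliationFinished(String resource) {
      active.decrementAndGet();
    }
  }

  /** Assumed order: started, then succeeded or failed, then always finished. */
  public static void reconcile(ReconciliationCallbacks callbacks, String resource, Runnable body) {
    callbacks.reconciliationStarted(resource);
    try {
      body.run();
      callbacks.reconciliationSucceeded(resource);
    } catch (RuntimeException e) {
      callbacks.reconciliationFailed(resource, e);
    } finally {
      callbacks.reconciliationFinished(resource);
    }
  }

  public static void main(String[] args) {
    CountingCallbacks callbacks = new CountingCallbacks();
    reconcile(callbacks, "page-a", () -> {});
    reconcile(callbacks, "page-b", () -> { throw new IllegalStateException("boom"); });
    System.out.println("active=" + callbacks.active.get()
        + " succeeded=" + callbacks.succeeded.sum()
        + " failed=" + callbacks.failed.sum());
  }
}
```

Pairing the started and finished callbacks around every execution keeps the in-flight value balanced regardless of the outcome, while the success and failure callbacks only feed counters.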
Original file line number Diff line number Diff line change
@@ -39,6 +39,10 @@

import static io.javaoperatorsdk.operator.api.reconciler.Constants.CONTROLLER_NAME;

/**
* @deprecated Use {@link MicrometerMetricsV2} instead
*/
@Deprecated(forRemoval = true)
public class MicrometerMetrics implements Metrics {

private static final String PREFIX = "operator.sdk.";
@@ -68,7 +72,6 @@ public class MicrometerMetrics implements Metrics {
private static final String EVENTS_RECEIVED = "events.received";
private static final String EVENTS_DELETE = "events.delete";
private static final String CLUSTER = "cluster";
private static final String SIZE_SUFFIX = ".size";
private static final String UNKNOWN_ACTION = "UNKNOWN";
private final boolean collectPerResourceMetrics;
private final MeterRegistry registry;
@@ -182,7 +185,7 @@ public <T> T timeControllerExecution(ControllerExecution<T> execution) {
}

@Override
public void receivedEvent(Event event, Map<String, Object> metadata) {
public void eventReceived(Event event, Map<String, Object> metadata) {
if (event instanceof ResourceEvent) {
incrementCounter(
event.getRelatedCustomResourceID(),
@@ -201,14 +204,14 @@ public void receivedEvent(Event event, Map<String, Object> metadata) {
}

@Override
public void cleanupDoneFor(ResourceID resourceID, Map<String, Object> metadata) {
public void cleanupDone(ResourceID resourceID, Map<String, Object> metadata) {
incrementCounter(resourceID, EVENTS_DELETE, metadata);

cleaner.removeMetersFor(resourceID);
}

@Override
public void reconcileCustomResource(
public void reconciliationSubmitted(
HasMetadata resource, RetryInfo retryInfoNullable, Map<String, Object> metadata) {
Optional<RetryInfo> retryInfo = Optional.ofNullable(retryInfoNullable);
incrementCounter(
@@ -228,19 +231,20 @@ public void reconcileCustomResource(
}

@Override
public void finishedReconciliation(HasMetadata resource, Map<String, Object> metadata) {
public void reconciliationSucceeded(HasMetadata resource, Map<String, Object> metadata) {
incrementCounter(ResourceID.fromResource(resource), RECONCILIATIONS_SUCCESS, metadata);
}

@Override
public void reconciliationExecutionStarted(HasMetadata resource, Map<String, Object> metadata) {
public void reconciliationStarted(HasMetadata resource, Map<String, Object> metadata) {
var reconcilerExecutions =
gauges.get(RECONCILIATIONS_EXECUTIONS + metadata.get(CONTROLLER_NAME));
reconcilerExecutions.incrementAndGet();
}

@Override
public void reconciliationExecutionFinished(HasMetadata resource, Map<String, Object> metadata) {
public void reconciliationFinished(
HasMetadata resource, RetryInfo retryInfo, Map<String, Object> metadata) {
var reconcilerExecutions =
gauges.get(RECONCILIATIONS_EXECUTIONS + metadata.get(CONTROLLER_NAME));
reconcilerExecutions.decrementAndGet();
@@ -251,8 +255,8 @@ public void reconciliationExecutionFinished(HasMetadata resource, Map<String, Ob
}

@Override
public void failedReconciliation(
HasMetadata resource, Exception exception, Map<String, Object> metadata) {
public void reconciliationFailed(
HasMetadata resource, RetryInfo retry, Exception exception, Map<String, Object> metadata) {
var cause = exception.getCause();
if (cause == null) {
cause = exception;
@@ -266,11 +270,6 @@ public void failedReconciliation(
Tag.of(EXCEPTION, cause.getClass().getSimpleName()));
}

@Override
public <T extends Map<?, ?>> T monitorSizeOf(T map, String name) {
return registry.gaugeMapSize(PREFIX + name + SIZE_SUFFIX, Collections.emptyList(), map);
}

private void addMetadataTags(
ResourceID resourceID, Map<String, Object> metadata, List<Tag> tags, boolean prefixed) {
if (collectPerResourceMetrics) {