MON-4414: Optional Monitoring Capability #1880

rexagod · 2025-10-28T02:04:36Z

Goes over the details for the OptionalMonitoring capability, which targets putting the in-cluster monitoring stack in a telemetry-only state. Note that the metric targets are not modified under this capability itself, but only when the telemetry collection profile is enabled.

openshift-ci-robot · 2025-10-28T02:04:40Z

@rexagod: This pull request references MON-4414 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set.

openshift-ci · 2025-10-28T02:04:40Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2025-10-28T02:04:48Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign simonpasquier for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

enhancements/monitoring/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

simonpasquier · 2025-11-05T16:21:20Z

/unassign @moadz
/cc @jan--f

rexagod · 2025-12-02T08:44:55Z

enhancements/monitoring/optional-monitoring-capability.md

+4. Should we downscale Prometheus to a single replica when the
+capability is disabled?
+    > Yes, since the monitoring footprint is reduced significantly
+    when the capability is disabled, moving away from an HA setup
+    to a single replica setup makes sense from a resource consumption
+    perspective. All components across OpenShift that rely on Thanos
+    will need to be "taught" to query the single Prometheus replica
+    directly instead of going through Thanos Querier.


(noting here that in this case, we will not support Genie; PTAL at this discussion)

simonpasquier

Good start! Let's focus on the overview and external picture first before going to the implementation details :)

simonpasquier · 2025-12-02T15:24:46Z

enhancements/monitoring/optional-monitoring-capability.md

+(a) by "optional" we mean components, or parts of components, that
+are not required for [telemetry operations], and,
+(b) the capability will be enabled by default (implicitly enabled),
+to preserve the historical UX.


Suggested change

-to preserve the historical UX.
+for backward compatibility and because monitoring is a key feature of OpenShift.

simonpasquier · 2025-12-02T15:26:29Z

enhancements/monitoring/optional-monitoring-capability.md

+Based on this information, we can code certain behaviors in the
+monitoring operator that wouldn't otherwise make sense, and would
+help reduce the overall monitoring footprint not just across the
+stack, but the cluster itself (since we'd be sure of the intent),


(nit) "platform" (in the sense of OpenShift Container Platform) rather than "cluster"?

simonpasquier · 2025-12-02T15:26:58Z

enhancements/monitoring/optional-monitoring-capability.md

+life-cycled, monitored and remediated at scale.
+-->
+
+> As a cluster administrator, I want to be able to disable as much


(nit)

Suggested change

-> As a cluster administrator, I want to be able to disable as much
+> As a cluster administrator, I want to disable as much

simonpasquier · 2025-12-02T15:27:58Z

enhancements/monitoring/optional-monitoring-capability.md

+
+> As a cluster administrator, I want to be able to disable as much
+of the monitoring footprint as possible without breaking the cluster,
+including any managed manifests, so that I can minimize the resource


Not sure what you mean by "managed manifests"? I'd suggest removing "without breaking the cluster", since we never want that to happen in any case.

simonpasquier · 2025-12-02T15:29:11Z

enhancements/monitoring/optional-monitoring-capability.md

+of the monitoring footprint as possible without breaking the cluster,
+including any managed manifests, so that I can minimize the resource
+consumption of monitoring on my cluster.
+> As a cluster administrator, I want to be able to disable as much


maybe merge this intent with the previous one?
can we add an "openshift dev" story about not losing the telemetry signal?

simonpasquier · 2025-12-02T15:56:53Z

enhancements/monitoring/optional-monitoring-capability.md

+    only, it makes sense to enable the "telemetry" collection profile
+    when the capability is disabled. This allows us to actually regulate
+    the metrics ingestion from exporters that would otherwise push data
+    into Prometheus that is not telemetry-related.


related to my question above on layered operators, I wonder what will be the consequences when monitoring=disabled and telemetry=disabled. Do they need to implement all profiles (including telemetry) to get a signal?
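To make the question concrete, here is a minimal sketch of how monitor selection could behave once a `telemetry` profile exists. It assumes selection keeps working the way it is understood to work for the existing `full` and `minimal` profiles, i.e. a monitor is excluded only when it carries the `monitoring.openshift.io/collection-profile` label with a different profile value, and unlabeled monitors are always selected. The function name and the dictionary shape are hypothetical and are not taken from CMO.

```python
# Hypothetical sketch, not CMO code. The label key and the "full"/"minimal"
# values follow the existing collection profiles; "telemetry" is the profile
# proposed in this PR.
PROFILE_LABEL = "monitoring.openshift.io/collection-profile"
KNOWN_PROFILES = {"full", "minimal", "telemetry"}


def selected_monitors(monitors, active_profile):
    """Return the names of the monitors picked up under active_profile.

    Assumption: selection is "NotIn"-style, so a monitor is skipped only when
    it is labeled with a different known profile. Unlabeled monitors match
    every profile.
    """
    if active_profile not in KNOWN_PROFILES:
        raise ValueError(f"unknown collection profile: {active_profile}")
    excluded = KNOWN_PROFILES - {active_profile}
    return [
        m["name"]
        for m in monitors
        if m.get("labels", {}).get(PROFILE_LABEL) not in excluded
    ]
```

Under that assumption, a layered operator that ships a single unlabeled ServiceMonitor would still be scraped in full under the `telemetry` profile, so it would only need to implement the profile if it wants to reduce what is ingested.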

simonpasquier · 2025-12-02T15:58:35Z

enhancements/monitoring/optional-monitoring-capability.md

+    when the capability is disabled, moving away from an HA setup
+    to a single replica setup makes sense from a resource consumption
+    perspective. All components across OpenShift that rely on Thanos
+    will need to be "taught" to query the single Prometheus replica


Assuming that the data they're interested in is available. I'd prefer to declare that when monitoring is disabled, we pretend to expose no service.

simonpasquier · 2025-12-02T15:59:08Z

enhancements/monitoring/optional-monitoring-capability.md

+what should be the behavior?
+    > Since telemetry is orthogonal to the capability itself, opting
+    out of telemetry should lead to disabling all telemetry-related
+    components (e.g., exporters, telemetry-specific `PrometheusRules`


one could argue that exporters should be on.

simonpasquier · 2025-12-02T15:59:54Z

enhancements/monitoring/optional-monitoring-capability.md

+    exception, and will remain enabled to ensure that the autoscaling
+    pipelines are not affected.
+
+6. Should the capability be named `OptionalMonitoring` or just `Monitoring`?


I'm still not convinced by OptionalMonitoring. What about PlatformMonitoring?

simonpasquier · 2025-12-02T16:04:41Z

enhancements/monitoring/optional-monitoring-capability.md

+
+4. Should we downscale Prometheus to a single replica when the
+capability is disabled?
+    > Yes, since the monitoring footprint is reduced significantly


Are we ok with the fact that we may have lower availability for subscription usage?

openshift-ci · 2025-12-16T15:16:04Z

@rexagod: all tests passed!

simonpasquier · 2025-12-24T14:46:01Z

enhancements/monitoring/optional-monitoring-capability.md

+Disabling the capability translates to hypershift clusters pointing
+[`METRICS_SET` environment variable] to `Telemetry`, in order to
+minimize the monitoring footprint while ensuring that telemetry
+operations are not affected.


I'm not sure that disabling capability should have an impact on the METRICS_SET parameter. An operator may decide to turn off OCP monitoring and collect hosted cluster metrics by other means?

simonpasquier · 2025-12-24T14:47:53Z

enhancements/monitoring/optional-monitoring-capability.md

+behaviors also be exposed to MicroShift admins through the
+configuration file for MicroShift?
+-->
+


(nit) can we add a separate section for single-node OpenShift? Basically we can state that it would help reduce the monitoring resource consumption on SNO clusters.

simonpasquier · 2025-12-24T14:55:07Z

enhancements/monitoring/optional-monitoring-capability.md

+DPUs exist specifically to offload infrastructure overhead from host x86
+servers, achieving up to 70% CPU savings on the host. Running full monitoring
+stacks on the DPU ARM cores would defeat this purpose by consuming the limited
+resources meant for high-performance networking and infrastructure services.


"defeat" is a strong statement IMHO. From our conversations, I have the feeling that the current architecture has limited resources for workloads other than pure networking functions, which drives the need to limit the resource usage of platform components (especially with hosted control planes).

The consequence for this proposal is that when both RH telemetry and the capability are disabled, CMO shouldn't deploy Prometheus at all.
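The combinations discussed in this thread can be summarized as a small decision table. The sketch below is illustrative only: `PrometheusPlan` and `plan_prometheus` are hypothetical names, the two-replica HA default and the `full` profile are assumed to reflect today's behavior, and the single-replica downscale is still an open question in this review rather than a settled decision.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PrometheusPlan:
    deploy: bool                       # whether Prometheus is deployed at all
    replicas: int                      # number of Prometheus replicas
    collection_profile: Optional[str]  # None when nothing is deployed


def plan_prometheus(capability_enabled: bool, telemetry_enabled: bool) -> PrometheusPlan:
    """Hypothetical mapping from (capability, telemetry) to a deployment plan."""
    if capability_enabled:
        # Assumed current behavior: HA pair with the default profile,
        # regardless of the telemetry setting.
        return PrometheusPlan(deploy=True, replicas=2, collection_profile="full")
    if telemetry_enabled:
        # Proposed telemetry-only state: single replica, telemetry profile.
        return PrometheusPlan(deploy=True, replicas=1, collection_profile="telemetry")
    # Capability and telemetry both disabled: no Prometheus at all.
    return PrometheusPlan(deploy=False, replicas=0, collection_profile=None)
```

Keeping the mapping in one pure function would make each combination easy to unit test, including the case where nothing is deployed.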

simonpasquier · 2025-12-24T14:56:07Z

enhancements/monitoring/optional-monitoring-capability.md

+        - `ClusterOperator` status: Expose whether the capability is
+          enabled or disabled through a `ClusterOperator` condition.
+* Teach other components about the capability:
+    - Hypershift: Setting the `METRICS_SET` environment variable


see comment above

openshift-ci-robot added the jira/valid-reference label Oct 28, 2025

openshift-ci bot added the do-not-merge/work-in-progress label Oct 28, 2025

rexagod changed the title ~~MON-4414: OEP for optional monitoring~~ MON-4414: Optional Monitoring Capability Nov 3, 2025

rexagod marked this pull request as ready for review November 3, 2025 21:03

openshift-ci bot removed the do-not-merge/work-in-progress label Nov 3, 2025

openshift-ci bot requested review from moadz and simonpasquier November 3, 2025 21:04

openshift-ci bot requested a review from jan--f November 5, 2025 16:21

rexagod commented Dec 2, 2025

simonpasquier reviewed Dec 2, 2025

rexagod added 4 commits December 16, 2025 17:06

fixup! MON-4414: OEP for optional monitoring

1afc638

fixup! fixup! MON-4414: OEP for optional monitoring

6989b1f

fixup! fixup! fixup! MON-4414: OEP for optional monitoring

36cc1fe

rexagod force-pushed the MON-4414 branch from 07e8cb9 to 36cc1fe Compare December 16, 2025 13:36

OCPBUGS-X: Add Telemetry to CP set

6759c72

Add `Telemetry` collection profile to the set of offered collection profiles in CMO.

simonpasquier mentioned this pull request Dec 17, 2025

Enable Network Observability on Day 0 #1908

Open

simonpasquier reviewed Dec 24, 2025
