-
Notifications
You must be signed in to change notification settings - Fork 529
MON-4414: Optional Monitoring Capability #1880
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
@rexagod: This pull request references MON-4414 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| 4. Should we downscale Prometheus to a single replica when the | ||
| capability is disabled? | ||
| > Yes, since the monitoring footprint is reduced significantly | ||
| when the capability is disabled, moving away from an HA setup | ||
| to a single replica setup makes sense from a resource consumption | ||
| perspective. All components across OpenShift that rely on Thanos | ||
| will need to be "taught" to query the single Prometheus replica | ||
| directly instead of going through Thanos Querier. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(noting here that in this case, we will not support Genie; PTAL at this discussion)
simonpasquier
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start! Let's focus on the overview and external picture first before going to the implementation details :)
| (a) by "optional" we mean components, or parts of components, that | ||
| are not required for [telemetry operations], and, | ||
| (b) the capability will be enabled by default (implicity enabled), | ||
| to preserve the historical UX. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| to preserve the historical UX. | |
| for backward compatibility and because monitoring is a key feature of OpenShift. |
| Based on this information, we can code certain behaviors in the | ||
| monitoring operator that wouldn't otherwise make sense, and would | ||
| help reduce the overall monitoring footprint not just across the | ||
| stack, but the cluster itself (since we'd be sure of the intent), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) "platform" (in the sense of OpenShift Container Platform) rather than "cluster"?
| life-cycled, monitored and remediated at scale. | ||
| --> | ||
|
|
||
| > As a cluster administrator, I want to be able to disable as much |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit)
| > As a cluster administrator, I want to be able to disable as much | |
| > As a cluster administrator, I want to disable as much |
|
|
||
| > As a cluster administrator, I want to be able to disable as much | ||
| of the monitoring footprint as possible without breaking the cluster, | ||
| including any managed manifests, so that I can minimize the resource |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what you mean by "managed manifests"? I'd suggest to remove the "without breaking the cluster" since in all cases we don't want that to happen.
| of the monitoring footprint as possible without breaking the cluster, | ||
| including any managed manifests, so that I can minimize the resource | ||
| consumption of monitoring on my cluster. | ||
| > As a cluster administrator, I want to be able to disable as much |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe merge this intent with the previous one?
can we add an "openshift dev" story about not losing the telemetry signal?
| only, it makes sense to enable the "telemetry" collection profile | ||
| when the capability is disabled. This allows us to actually regulate | ||
| the metrics ingestion from exporters that would otherwise push data | ||
| into Prometheus that is not telemetry-related. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
related to my question above on layered operators, I wonder what will be the consequences when monitoring=disabled and telemetry=disabled. Do they need to implement all profiles (including telemetry) to get a signal?
| when the capability is disabled, moving away from an HA setup | ||
| to a single replica setup makes sense from a resource consumption | ||
| perspective. All components across OpenShift that rely on Thanos | ||
| will need to be "taught" to query the single Prometheus replica |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming that the data they're interested in is available. I'd prefer to declare that when monitoring is disabled, we pretend to expose no service.
| what should be the behavior? | ||
| > Since telemetry is orthogonal to the capability itself, opting | ||
| out of telemetry should lead to disabling all telemetry-related | ||
| components (e.g., exporters, telemetry-specific `PrometheusRules` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one could argue that exporters should be on.
| exception, and will remain enabled to ensure that the autoscaling | ||
| pipelines are not affected. | ||
|
|
||
| 6. Should the capability be named `OptionalMonitoring` or just `Monitoring`? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not convinced by OptionalMonitoring. What about PlatformMonitoring?
|
|
||
| 4. Should we downscale Prometheus to a single replica when the | ||
| capability is disabled? | ||
| > Yes, since the monitoring footprint is reduced significantly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we ok with the fact that we may have lower availability for subscription usage?
Goes over the details for the `OptionalMonitoring` capability, which targets putting the in-cluster monitoring stack in a telemetry-only state. Note that the metric targets are not modified under this capability itself, but only when the telemetry collection profile is enabled.
Add `Telemetry` collection profile to the set of offered collection profiles in CMO.
|
@rexagod: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| Disabling the capability translates to hypershift clusters pointing | ||
| [`METRICS_SET` environment variable] to `Telemetry`, in order to | ||
| minimize the monitoring footprint while ensuring that telemetry | ||
| operations are not affected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that disabling capability should have an impact on the METRICS_SET parameter. An operator may decide to turn off OCP monitoring and collect hosted cluster metrics by other means?
| behaviors also be exposed to MicroShift admins through the | ||
| configuration file for MicroShift? | ||
| --> | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) can we add a separate section for single-node openshift? Basically we can state that it would help reduce consumption of SNO clusters.
| DPUs exist specifically to offload infrastructure overhead from host x86 | ||
| servers, achieving up to 70% CPU savings on the host. Running full monitoring | ||
| stacks on the DPU ARM cores would defeat this purpose by consuming the limited | ||
| resources meant for high-performance networking and infrastructure services. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"defeat" is a strong statement IMHO. From our conversations, I have the feeling that the current architecture has limited resources for workloads other than pure networking functions which drives a need for limiting the resource usage of platform components (especially with hosted control planes).
The consequence for this proposal is that when both RH telemetry and the capability are disabled, CMO shouldn't deploy Prometheus at all.
| - `ClusterOperator` status: Expose whether the capability is | ||
| enabled or disabled through a `ClusterOperator` condition. | ||
| * Teach other components about the capability: | ||
| - Hypershift: Setting the `METRICS_SET` environment variable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment above
Goes over the details for the
OptionalMonitoringcapability, which targets putting the in-cluster monitoring stack in a telemetry-only state. Note that the metric targets are not modified under this capability itself, but only when the telemetry collection profile is enabled.