diff --git a/docs/production-deployment/cloud/metrics/general-setup.mdx b/docs/production-deployment/cloud/metrics/general-setup.mdx
index f9cccce8d1..553a042f61 100644
--- a/docs/production-deployment/cloud/metrics/general-setup.mdx
+++ b/docs/production-deployment/cloud/metrics/general-setup.mdx
@@ -42,8 +42,7 @@ To view and manage third-party integration settings, your user account must have
 To assign a certificate and generate your metrics endpoint, follow these steps:
 
 1. Log in to Temporal Cloud UI with an Account Owner or Global Admin [role](/cloud/users#account-level-roles).
-2. Go to **Settings** and select **Integrations**.
-3. Select **Configure Observability** (if you're setting it up for the first time) or click **Edit** in the Observability section (if it was already configured before).
-4. Add your root CA certificate (.pem) and save it. Note that if an observability endpoint is already set up, you can append your root CA certificate here to use the generated observability endpoint in your observability tool.
+2. Go to **Settings** and select **Observability**.
+3. Add your root CA certificate (.pem) and save it. If an observability endpoint is already set up, you can append your root CA certificate here to use the generated observability endpoint in your observability tool.
 
-5. To test your endpoint, run the following command on your host:
+4. To test your endpoint, run the following command on your host:
diff --git a/docs/production-deployment/cloud/metrics/openmetrics/metrics-reference.mdx b/docs/production-deployment/cloud/metrics/openmetrics/metrics-reference.mdx
index b6e7311bfc..1a79a991ec 100644
--- a/docs/production-deployment/cloud/metrics/openmetrics/metrics-reference.mdx
+++ b/docs/production-deployment/cloud/metrics/openmetrics/metrics-reference.mdx
@@ -318,6 +318,12 @@ The total number of actions performed per second. Actions with `is_background=fa
 **Type**: Rate
 
+#### temporal\_cloud\_v1\_total\_action\_throttled\_count
+
+The total number of actions throttled per second.
+
+**Type**: Rate
+
 #### temporal\_cloud\_v1\_operations\_count
 
 Operations performed per second.
diff --git a/docs/production-deployment/cloud/service-health.mdx b/docs/production-deployment/cloud/service-health.mdx
index a2f457f6f9..9a0027514d 100644
--- a/docs/production-deployment/cloud/service-health.mdx
+++ b/docs/production-deployment/cloud/service-health.mdx
@@ -135,15 +135,14 @@ See [operations and metrics](/cloud/high-availability) for Namespaces with High
 - [temporal\_cloud\_v1\_replication\_lag\_p95](/production-deployment/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_replication_lag_p95)
 - [temporal\_cloud\_v1\_replication\_lag\_p50](/production-deployment/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_replication_lag_p50)
 
-## Usage and Detecting Resource Exhaustion & Namespace RPS and APS Rate Limits
+## Detecting Resource Exhaustion
 
 The Cloud metric `temporal_cloud_v1_resource_exhausted_error_count` is the primary indicator for Cloud-side throttling, signaling that namespace limits are being hit and
 `ResourceExhausted` gRPC errors are occurring. This generally does not break workflow processing due to how resources are prioritized. In fact, some workloads often run
 with high amounts of resource exhaustion errors because they are not latency sensitive. Being APS or RPS resource
-constrained can slow down throughput and is a good indicator that you should request additional capacity. 
+constrained can slow down throughput and is a good indicator that you should request additional capacity.
 
-To specifically identify whether RPS or APS limits are being hit, this metric can be filtered using the `resource_exhausted_cause` label, which will show values
-like `ApsLimit` or `RpsLimit`. This label also helps identify the specific operation that was throttled (e.g., polling, respond activity tasks).
+This metric can be filtered using the `resource_exhausted_cause` label. Values other than `APSLimit`, `OPSLimit`, or `RPSLimit` are unexpected.
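+
+As a sketch, assuming these metrics are scraped into Prometheus under these names (the `temporal_namespace` label is illustrative, not confirmed here), a query like the following breaks the errors down by cause:
+
+```promql
+# ResourceExhausted errors per second, broken down by namespace and cause.
+sum by (temporal_namespace, resource_exhausted_cause) (
+  temporal_cloud_v1_resource_exhausted_error_count
+)
+```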
 
 ## Monitoring Trends Against Limits
 
@@ -158,4 +157,11 @@ metrics with their corresponding count metrics to monitor general trends against
 The [Grafana dashboard example](https://github.com/grafana/jsonnet-libs/blob/master/temporal-mixin/dashboards/temporal-overview.json) includes a Usage & Quotas section that creates demo charts for these limits and count metrics respectively.
 
+The limit metrics and count metrics are directly comparable as per-second rates. Keep in mind that each `count` metric is reported as a per-second rate averaged
+over each minute. For example, to get the total number of Actions in a given minute, multiply this metric by 60.
+
+When setting alerts against limits, consider whether your workload is spiky or sensitive to throttling (e.g., does latency matter?). If your workload is sensitive, consider alerting
+on `temporal_cloud_v1_total_action_count` at 50% of `temporal_cloud_v1_action_limit`. If your workload is not sensitive, consider an alert at 90% of the limit,
+or alert directly when throttling is detected, that is, when `temporal_cloud_v1_total_action_throttled_count` is greater than zero. This logic can also be used to automatically scale Temporal
+Resource Units up or down as needed.
+
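+The expressions below are a sketch of these alerts in PromQL, assuming the metrics are scraped into Prometheus under the names used on this page; the `temporal_namespace` grouping label is an assumption for illustration:
+
+```promql
+# Approximate Actions in the last minute: count metrics are per-second rates averaged per minute.
+temporal_cloud_v1_total_action_count * 60
+
+# Latency-sensitive workloads: alert when Actions exceed 50% of the limit.
+sum by (temporal_namespace) (temporal_cloud_v1_total_action_count)
+  > 0.5 * sum by (temporal_namespace) (temporal_cloud_v1_action_limit)
+
+# Throttle-tolerant workloads: alert at 90% of the limit instead.
+sum by (temporal_namespace) (temporal_cloud_v1_total_action_count)
+  > 0.9 * sum by (temporal_namespace) (temporal_cloud_v1_action_limit)
+
+# Or alert as soon as any throttling is observed.
+temporal_cloud_v1_total_action_throttled_count > 0
+```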