@@ -42,8 +42,7 @@ To view and manage third-party integration settings, your user account must have
To assign a certificate and generate your metrics endpoint, follow these steps:

1. Log in to Temporal Cloud UI with an Account Owner or Global Admin [role](/cloud/users#account-level-roles).
-2. Go to **Settings** and select **Integrations**.
-3. Select **Configure Observability** (if you're setting it up for the first time) or click **Edit** in the Observability section (if it was already configured before).
+2. Go to **Settings** and select **Observability**.
4. Add your root CA certificate (.pem) and save it.
Note that if an observability endpoint is already set up, you can append your root CA certificate here to use the generated observability endpoint in your observability tool.
5. To test your endpoint, run the following command on your host:
@@ -318,6 +318,12 @@ The total number of actions performed per second. Actions with `is_background=fa

**Type**: Rate

#### temporal\_cloud\_v1\_total\_action\_throttled\_count

The total number of actions throttled per second.

**Type**: Rate
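
A minimal PromQL sketch for surfacing throttled Namespaces (the `temporal_namespace` label is an assumption and is not confirmed by this reference):

```promql
# Namespaces with a non-zero per-second rate of throttled Actions.
sum by (temporal_namespace) (temporal_cloud_v1_total_action_throttled_count) > 0
```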

#### temporal\_cloud\_v1\_operations\_count

Operations performed per second.
14 changes: 10 additions & 4 deletions docs/production-deployment/cloud/service-health.mdx
@@ -135,15 +135,14 @@ See [operations and metrics](/cloud/high-availability) for Namespaces with High
- [temporal\_cloud\_v1\_replication\_lag\_p95](/production-deployment/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_replication_lag_p95)
- [temporal\_cloud\_v1\_replication\_lag\_p50](/production-deployment/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_replication_lag_p50)

-## Usage and Detecting Resource Exhaustion & Namespace RPS and APS Rate Limits
+## Detecting Resource Exhaustion

The Cloud metric `temporal_cloud_v1_resource_exhausted_error_count` is the primary indicator for Cloud-side throttling, signaling that namespace limits
are being hit and `ResourceExhausted` gRPC errors are occurring. This generally does not break workflow processing due to how resources are prioritized.
In fact, some workloads routinely run with a high rate of resource exhaustion errors because they are not latency sensitive. Being APS or RPS resource
constrained can slow down throughput and is a good indicator that you should request additional capacity.

-To specifically identify whether RPS or APS limits are being hit, this metric can be filtered using the `resource_exhausted_cause` label, which will show values
-like `ApsLimit` or `RpsLimit`. This label also helps identify the specific operation that was throttled (e.g., polling, respond activity tasks).
+This metric can be filtered using the `resource_exhausted_cause` label. When this label shows a value other than `APSLimit`, `OPSLimit`, or `RPSLimit`, it is unexpected.
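
As a rough illustration, a PromQL sketch that breaks this error rate down by cause (assuming the metric is scraped into a Prometheus-compatible backend and carries a `temporal_namespace` label, which this page does not confirm):

```promql
# Per-second rate of ResourceExhausted errors, grouped by Namespace and cause.
sum by (temporal_namespace, resource_exhausted_cause) (
  temporal_cloud_v1_resource_exhausted_error_count
)
```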

## Monitoring Trends Against Limits

@@ -158,4 +157,11 @@ metrics with their corresponding count metrics to monitor general trends against
The [Grafana dashboard example](https://github.com/grafana/jsonnet-libs/blob/master/temporal-mixin/dashboards/temporal-overview.json) includes a Usage & Quotas section
with example charts for these limit and count metrics.

The limit metrics and count metrics are already directly comparable as per-second rates. Keep in mind that each `count` metric is a per-second rate averaged
over each minute, so to get the total number of Actions in a given minute, multiply the metric by 60.
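
For example, a minimal PromQL sketch of that conversion (assuming the metrics are scraped into a Prometheus-compatible backend):

```promql
# Approximate Actions consumed per minute across all Namespaces:
# the count metric is a per-second rate averaged over one minute, so multiply by 60.
sum(temporal_cloud_v1_total_action_count) * 60
```
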
When setting alerts against limits, consider whether your workload is spiky or sensitive to throttling (for example, does latency matter?). If your workload is sensitive, consider alerting
when `temporal_cloud_v1_total_action_count` reaches 50% of `temporal_cloud_v1_action_limit`. If your workload is not sensitive, consider alerting at 90% of the limit, or alerting directly
when throttling is detected, that is, when `temporal_cloud_v1_total_action_throttled_count` is greater than zero (see the sketch below). The same logic can be used to automatically scale Temporal
Resource Units up or down as needed.
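
A minimal PromQL sketch of the 50% alert condition described above (the `temporal_namespace` label and one-to-one label matching are assumptions, not confirmed by this page):

```promql
# Fires per Namespace when Action usage exceeds 50% of the Namespace Action limit.
# Both metrics are already per-second rates, so they are directly comparable.
sum by (temporal_namespace) (temporal_cloud_v1_total_action_count)
  > on (temporal_namespace)
0.5 * max by (temporal_namespace) (temporal_cloud_v1_action_limit)
```

For the less sensitive case, the same expression with `0.9` in place of `0.5`, or a simple check that `temporal_cloud_v1_total_action_throttled_count` is greater than zero, can serve as the alert trigger.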