[RFC] Per-Workload-Group Request Throttling for Workload Management #21064

@dzane17

Description

Is your feature request related to a problem? Please describe

Summary

Add per-workload-group request throttling to OpenSearch's Workload Management (WLM) feature. This introduces a proactive mechanism that limits search request throughput per workload group, complementing the existing reactive cancellation and rejection mechanisms.

Motivation

WLM currently provides two enforcement actions when resource limits are breached:

  1. Rejection — New requests are rejected at admission time when the workload group's last-recorded resource usage exceeds the normalized rejection threshold (rejectIfNeeded in WorkloadGroupService).
  2. Cancellation — A periodic background loop (default 1s interval) cancels the highest-resource-consuming tasks when usage exceeds cancellation thresholds (WorkloadGroupTaskCancellationService).

Both mechanisms are reactive — they respond after resource limits are already breached. This creates several gaps:

  • Burst vulnerability: A workload group can submit thousands of requests in a sub-second burst. By the time the 1s enforcement loop detects the resource spike, significant damage is done.
  • No concurrency control: There is no mechanism to cap how many requests a workload group can have in-flight simultaneously, independent of resource usage. An operator may want to limit a low-priority analytics group to 20 concurrent searches regardless of whether resource thresholds are breached.

Throttling fills this gap by providing a preventive layer that reduces how often cancellation and rejection need to fire.

Design

Throttle Metric

The throttle needs a measurable trigger to determine when a workload group has exceeded its limit. Two primary candidates are discussed below. The choice between them (or a combination) is an open design decision.

Option A: In-flight (concurrent) request count

  • Correlates with resource usage: each in-flight search request holds threads, memory (aggregation buffers, field data), and CPU. Counting active requests is a direct proxy for resource consumption.
  • Self-regulating: if queries are fast, the in-flight count stays low and more requests are admitted. If queries are slow, the count rises and new requests are throttled. The system automatically adapts to query complexity without configuration changes.
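
Under Option A, the admission check reduces to an atomic in-flight counter with a compare-and-set loop. The sketch below is illustrative only; `InFlightThrottle`, `tryAdmit`, and `release` are hypothetical names, not actual WLM APIs:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a per-workload-group concurrency throttle.
class InFlightThrottle {
    private final int limit;
    private final AtomicInteger inFlight = new AtomicInteger();

    InFlightThrottle(int limit) {
        this.limit = limit;
    }

    /** Admit a request only if the group is under its concurrency limit. */
    boolean tryAdmit() {
        while (true) {
            int current = inFlight.get();
            if (current >= limit) {
                return false; // over limit: caller rejects or queues
            }
            if (inFlight.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Must run on request completion (success, failure, or cancellation). */
    void release() {
        inFlight.decrementAndGet();
    }
}
```

Note that correctness depends on `release()` being invoked on every completion path; a leaked slot permanently shrinks the group's effective limit.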

Option B: Requests per second (rate-based)

  • Provides a deterministic, predictable knob: "this group gets at most N requests/sec."
  • Better suited for protecting against high-frequency floods of cheap queries where in-flight count stays low because each finishes quickly.
  • Cost-blind: treats a trivial match_all the same as an expensive multi-aggregation query. Requires the operator to estimate the right rate for their workload mix.
  • Requires a time-windowing mechanism (e.g., token bucket) which adds implementation complexity.
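
For Option B, the time-windowing mechanism can be sketched as a standard token bucket. All names here are illustrative, and a real implementation would use the node's clock abstraction and settings infrastructure rather than raw nanosecond timestamps:

```java
// Hypothetical token-bucket sketch for a rate-based throttle.
class TokenBucket {
    private final double ratePerSec;   // refill rate (requests/sec)
    private final double burst;        // bucket capacity
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double ratePerSec, double burst, long nowNanos) {
        this.ratePerSec = ratePerSec;
        this.burst = burst;
        this.tokens = burst;           // start full
        this.lastRefillNanos = nowNanos;
    }

    /** Returns true if a request may proceed at time nowNanos. */
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSec = (nowNanos - lastRefillNanos) / 1e9;
        tokens = Math.min(burst, tokens + elapsedSec * ratePerSec);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

The `burst` parameter is what distinguishes "at most N/sec sustained" from "at most N in any instant": it bounds how many requests can pass back-to-back after an idle period.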

Over-Limit Behavior

When a workload group exceeds its throttle limit, the system must decide what to do with the excess requests.

Option A: Immediate rejection

  • Client gets an immediate error (HTTP 429 / OpenSearchRejectedExecutionException) and can back off or retry.
  • No server-side resource cost for holding excess requests.
  • Consistent with existing WLM rejection behavior and with how other systems (the Kubernetes API server, AWS API Gateway) handle over-limit requests.
  • Provides a clear, actionable signal to the client.
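
Wiring-wise, immediate rejection is a fail-fast admission check. The exception type and message format below are hypothetical placeholders for whatever the transport layer would map to HTTP 429:

```java
// Hypothetical fail-fast admission sketch for the immediate-rejection option.
class ThrottleRejectionSketch {
    static class ThrottledException extends RuntimeException {
        ThrottledException(String group, int limit) {
            super("workload group [" + group + "] exceeded throttle limit [" + limit + "]");
        }
    }

    /** Admission check: returns normally, or throws immediately when over limit. */
    static void checkAdmission(String group, int inFlight, int limit) {
        if (inFlight >= limit) {
            throw new ThrottledException(group, limit);
        }
    }
}
```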

Option B: Queue then reject

  • Excess requests are held in a bounded queue and processed when the throttle allows.
  • Friendlier for simple clients (dashboards, ad-hoc queries) that lack retry logic.
  • Adds complexity: queue depth limits, timeout handling, memory pressure from held requests.
  • Each queued request holds an open HTTP connection, consuming a network socket, memory for buffers, and a slot in the Netty event loop.
  • Risk of queues that don't drain: unlike thread pool queues that absorb microsecond-level bursts, a WLM throttle limit is intentional policy, so excess requests may accumulate indefinitely.
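
A minimal queue-then-reject sketch, assuming a bounded per-group queue (names and structure illustrative; a real implementation would also need per-request timeouts so held requests eventually fail):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical bounded queue for over-limit requests.
class QueueThenReject<R> {
    private final BlockingQueue<R> waiting;

    QueueThenReject(int queueDepth) {
        this.waiting = new ArrayBlockingQueue<>(queueDepth);
    }

    /** Enqueue an over-limit request; false means reject (queue full). */
    boolean tryEnqueue(R request) {
        return waiting.offer(request);
    }

    /** Called when the throttle frees a slot; null if nothing is waiting. */
    R pollNext() {
        return waiting.poll();
    }
}
```

The bounded `offer` gives the "then reject" half: once `queueDepth` requests are held, further arrivals fail immediately rather than growing the queue without limit.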

Interaction with Resiliency Modes

The throttle should respect the existing resiliency mode semantics:

  • ENFORCED: Throttle limit is always enforced.
  • SOFT: Throttle limit is enforced only when the node is in duress (consistent with how SOFT mode handles rejection and cancellation today).
  • MONITOR: Throttle violations are logged but requests are not rejected (consistent with monitor-only behavior).
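
The mode-dependent decision can be sketched as a small predicate. The enum and method names are illustrative, mirroring the semantics above rather than the actual WLM types:

```java
// Hypothetical sketch of resiliency-mode-aware throttle enforcement.
class ModeAwareThrottle {
    enum ResiliencyMode { ENFORCED, SOFT, MONITOR }

    /** Returns true if an over-limit request should actually be rejected. */
    static boolean shouldReject(ResiliencyMode mode, boolean overLimit, boolean nodeInDuress) {
        if (!overLimit) {
            return false;
        }
        switch (mode) {
            case ENFORCED:
                return true;          // always enforce
            case SOFT:
                return nodeInDuress;  // enforce only when the node is in duress
            case MONITOR:
            default:
                return false;         // log-only: never reject
        }
    }
}
```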

Configuration

The throttle limit should be configurable per workload group as a top-level field in the workload group definition. The exact setting name and semantics depend on the chosen throttle metric:

PUT _wlm/workload_group/analytics
{
  "resiliency_mode": "enforced",
  "resource_limits": {
    "cpu": 0.4,
    "memory": 0.4
  },
  "throttle_limit": 50
}

Note: This initial design uses a single throttle_limit value that applies to search requests (the only request type WLM currently manages). If WLM later tracks indexing or other request types, this field should be restructured into a per-type map (e.g., "throttle_limits": {"search": 50, "indexing": 200}).

Observability

A new counter totalThrottled should be added to WorkloadGroupState alongside the existing totalRejections and totalCancellations, and exposed in the WLM stats API. This allows operators to distinguish between throttle rejections (concurrency limit) and resource-based rejections (CPU/memory threshold).
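
A sketch of the counter addition, assuming per-group state holds simple atomic counters (the class shape is illustrative, not the actual WorkloadGroupState):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of per-group counters exposed via the WLM stats API.
class GroupStatsSketch {
    private final AtomicLong totalRejections = new AtomicLong();    // existing
    private final AtomicLong totalCancellations = new AtomicLong(); // existing
    private final AtomicLong totalThrottled = new AtomicLong();     // proposed

    void onThrottled() { totalThrottled.incrementAndGet(); }
    long throttled()   { return totalThrottled.get(); }
}
```

Keeping `totalThrottled` separate from `totalRejections` is what lets operators tell "the concurrency/rate limit fired" apart from "a CPU/memory threshold fired" when diagnosing rejected traffic.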

Throttling in Other Systems

Several systems implement request throttling with design choices relevant to this feature.

NGINX limit_req uses a token bucket algorithm with a configurable rate and optional burst buffer. Requests within the rate pass through. Requests above the rate but within the burst are queued and delayed. Requests exceeding both are rejected with HTTP 503. A nodelay option lets burst requests through immediately without queuing, then rejects once the bucket is empty.

AWS API Gateway sets rate limits (requests/sec) with a configurable burst allowance. Over-limit requests receive an immediate HTTP 429 with no server-side queuing. Clients are expected to implement exponential backoff.

Snowflake Virtual Warehouses allow operators to set concurrency limits per warehouse. When the limit is reached, queries are queued up to a configurable timeout, after which they are rejected. This is a queue-then-reject model applied to an analytical query engine — a workload pattern similar to OpenSearch search.

Open Questions

  1. Throttle metric: Should the throttle be based on in-flight request count, requests-per-second, or both? In-flight count is self-regulating and correlates with resource usage; req/sec is more predictable and better for cheap-query floods. See the Throttle Metric section for tradeoffs.
  2. Over-limit behavior: Should excess requests be rejected immediately or queued with a bounded buffer? Immediate rejection is simpler and avoids server-side resource cost; queuing is friendlier for simple clients. See the Over-Limit Behavior section for tradeoffs.
  3. Should the throttle be tracked per-node or cluster-wide? Cluster-wide tracking is likely infeasible — it would require either a centralized counter (adding a network round-trip to every request), eventually-consistent gossip (allowing overshoot during bursts), or static quota partitioning across nodes (wasting capacity when traffic is unevenly distributed). Per-node is the expected approach.

Describe the solution you'd like

See above

Related component

Search:Resiliency

Describe alternatives you've considered

No response

Additional context

No response
