feat: ai rate limiting standard headers #13049
Conversation
When using openai-compatible provider with Anthropic-format endpoints (e.g. DeepSeek's /anthropic/v1/messages), the response returns input_tokens/output_tokens instead of prompt_tokens/completion_tokens. This patch adds fallback support for both field names in both streaming and non-streaming paths, so token usage statistics work correctly regardless of which format the upstream LLM returns. Fixes token stats being 0 when proxying to Anthropic-compatible endpoints.
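The fallback described above can be sketched as a small normalization step (illustrated here in Python for clarity; the actual patch is Lua in `openai-base.lua`):

```python
def normalize_usage(usage: dict) -> dict:
    """Accept both OpenAI-style and Anthropic-style usage fields."""
    # OpenAI returns prompt_tokens/completion_tokens; Anthropic-format
    # endpoints (e.g. DeepSeek's /anthropic/v1/messages) return
    # input_tokens/output_tokens instead.
    pt = usage.get("prompt_tokens") or usage.get("input_tokens") or 0
    ct = usage.get("completion_tokens") or usage.get("output_tokens") or 0
    return {
        "prompt_tokens": pt,
        "completion_tokens": ct,
        # derive total_tokens when the upstream omits it
        "total_tokens": usage.get("total_tokens") or (pt + ct),
    }

# Anthropic-format usage: token stats no longer collapse to 0
print(normalize_usage({"input_tokens": 12, "output_tokens": 30}))
# → {'prompt_tokens': 12, 'completion_tokens': 30, 'total_tokens': 42}
```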
…uter-compatible rate-limit headers
```lua
core.log.info("got token usage from ai service: ",
              core.json.delay_encode(data.usage))
ctx.llm_raw_usage = data.usage
local pt = data.usage.prompt_tokens or data.usage.input_tokens or 0
```
These changes appear to be unrelated to this PR; please split them into separate PRs.
```lua
    default = "total_tokens",
    description = "The strategy to limit the tokens"
},
-- use OpenRouter/OpenAI-compatible standard header names; IDE plugins (Cursor/Continue) can recognize them directly
```
Please use English comments throughout.
Pull request overview
Adds an opt-in `standard_headers` flag to `ai-rate-limiting` so it can emit OpenAI/OpenRouter-style `X-RateLimit-*` response headers (for better client/IDE backoff behavior), along with tests and supporting documentation notes.
Changes:
- Add `standard_headers` schema option and map `limit_strategy` to `X-RateLimit-*-{Tokens|PromptTokens|CompletionTokens}` header names.
- Add a new Test::Nginx suite validating standard header behavior, suffix mapping, and backward compatibility.
- Extend the OpenAI driver token usage parsing to support `input_tokens`/`output_tokens` with a `total_tokens` fallback, and add a documentation patch file describing `standard_headers`.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `apisix/plugins/ai-rate-limiting.lua` | Adds `standard_headers` option and header-name switching logic in `transform_limit_conf()`. |
| `t/plugin/ai-rate-limiting-standard-headers.t` | New end-to-end tests covering standard headers + legacy headers behavior. |
| `docs/en/latest/plugins/ai-rate-limiting-standard-headers-patch.md` | New documentation content describing `standard_headers` and expected response headers. |
| `apisix/plugins/ai-drivers/openai-base.lua` | Token usage parsing expanded to accept `input_tokens`/`output_tokens` and derive `total_tokens`. |
# ai-rate-limiting — `standard_headers` Parameter

## Overview

The `standard_headers` option makes `ai-rate-limiting` emit rate-limit response
headers that follow the [OpenRouter / OpenAI convention][openrouter-headers],
so IDE extensions such as **Cursor** and **Continue** can detect quota exhaustion
and apply automatic back-off without any custom configuration.

[openrouter-headers]: https://openrouter.ai/docs/api-reference/limits

## New Parameter

| Parameter | Type | Default | Description |
|---|---|---|---|
| `standard_headers` | boolean | `false` | When `true`, emit OpenAI/OpenRouter-compatible rate-limit headers instead of the legacy `X-AI-RateLimit-*` headers. |
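With `standard_headers: true` and the default `limit_strategy` of `total_tokens`, a response would carry headers along these lines (the values shown are illustrative, not from the PR):

```
X-RateLimit-Limit-Tokens: 300
X-RateLimit-Remaining-Tokens: 285
X-RateLimit-Reset-Tokens: 42
```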
This file documents `standard_headers`, but the main plugin documentation (`docs/en/latest/plugins/ai-rate-limiting.md`) is still the canonical reference and currently has no mention of `standard_headers` (and its header names) while it explicitly documents `X-AI-RateLimit-*`. Please update `ai-rate-limiting.md` directly (and consider removing this "patch" file or clearly integrating/linking it) so users can discover the option in the standard docs.
```diff
+local pt = data.usage.prompt_tokens or data.usage.input_tokens or 0
+local ct = data.usage.completion_tokens or data.usage.output_tokens or 0
 ctx.ai_token_usage = {
-    prompt_tokens = data.usage.prompt_tokens or 0,
-    completion_tokens = data.usage.completion_tokens or 0,
-    total_tokens = data.usage.total_tokens or 0,
+    prompt_tokens = pt,
+    completion_tokens = ct,
+    total_tokens = data.usage.total_tokens or (pt + ct),
```
This PR introduces support for `input_tokens`/`output_tokens` (and a derived `total_tokens`) in the OpenAI driver, but that behavior change isn't mentioned in the PR description. Please either update the PR description/scope to include it or move this change into a separate PR; also consider adding/adjusting tests that cover responses using `input_tokens`/`output_tokens` to prevent regressions.
```lua
-- First request: should succeed and consume the 1-token budget
local res1, err = httpc:request_uri("http://127.0.0.1:" .. ngx.var.server_port .. "/anything", {
    method = "POST",
    headers = {
        ["Content-Type"] = "application/json",
        ["apikey"] = "test-key-123",
    },
    body = [[{"model":"gpt-4o-mini","messages":[{"role":"user","content":"hi"}]}]],
```
The comment says the first request will “consume the 1-token budget”, but the mocked upstream returns total_tokens: 15. This can confuse future readers when interpreting why the second request is rate-limited. Consider rewording the comment to reflect that the first request exceeds the configured limit and drives Remaining to 0 for subsequent requests.
```lua
-- standard_headers=true emits OpenRouter/OpenAI-compatible header names
-- that IDE plugins (Cursor/Continue) can recognize directly and back off on
if plugin_conf.standard_headers then
    local strategy = plugin_conf.limit_strategy or "total_tokens"
    local suffix = strategy == "total_tokens" and "Tokens"
                   or strategy == "prompt_tokens" and "PromptTokens"
                   or "CompletionTokens"
    limit_conf.limit_header = "X-RateLimit-Limit-" .. suffix
    limit_conf.remaining_header = "X-RateLimit-Remaining-" .. suffix
    limit_conf.reset_header = "X-RateLimit-Reset-" .. suffix
else
    limit_conf.limit_header = "X-AI-RateLimit-Limit-" .. name
    limit_conf.remaining_header = "X-AI-RateLimit-Remaining-" .. name
    limit_conf.reset_header = "X-AI-RateLimit-Reset-" .. name
end
```
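The branch above reduces to a small pure mapping from configuration to header names. A Python sketch of the same logic (the plugin itself is Lua; names and defaults taken from the diff):

```python
# suffix per limit_strategy, mirroring the Lua conditional chain
SUFFIX = {
    "total_tokens": "Tokens",
    "prompt_tokens": "PromptTokens",
    "completion_tokens": "CompletionTokens",
}

def header_names(standard_headers: bool, limit_strategy: str = "total_tokens",
                 instance_name: str = "") -> dict:
    """Standard OpenRouter/OpenAI names vs legacy X-AI-RateLimit-* names."""
    if standard_headers:
        prefix = "X-RateLimit"
        # unknown strategies fall through to CompletionTokens, as in the Lua code
        suffix = SUFFIX.get(limit_strategy, "CompletionTokens")
    else:
        prefix = "X-AI-RateLimit"
        suffix = instance_name
    return {
        "limit": f"{prefix}-Limit-{suffix}",
        "remaining": f"{prefix}-Remaining-{suffix}",
        "reset": f"{prefix}-Reset-{suffix}",
    }

print(header_names(True, "prompt_tokens")["remaining"])
# → X-RateLimit-Remaining-PromptTokens
```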
The newly added standard_headers option does not take effect when conf.rules is configured. transform_limit_conf() returns early for rules, and limit-count always sets rule.header_prefix (index if unset), which forces X-{prefix}-RateLimit-* header names and ignores conf.limit_header/remaining_header/reset_header. Consider either (1) rejecting standard_headers=true when rules is set (schema/custom validation) and documenting the limitation, or (2) reworking the rules path so standard header names can be emitted.
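Option (1) from the comment above amounts to one extra check during schema validation. A hedged Python sketch of that validation (the real plugin would implement this in Lua inside `check_schema`; the error message is an assumption):

```python
def check_conf(conf: dict):
    """Reject standard_headers=true when rules is configured, since the
    rules path in limit-count forces X-{prefix}-RateLimit-* header names
    and would silently ignore the standard names."""
    if conf.get("standard_headers") and conf.get("rules"):
        return False, "standard_headers is not supported together with rules"
    return True, None

ok, err = check_conf({"standard_headers": True, "rules": [{"limit": 100}]})
print(ok, err)
# → False standard_headers is not supported together with rules
```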
```lua
-- use OpenRouter/OpenAI-compatible standard header names; IDE plugins (Cursor/Continue) can recognize them directly
-- true: X-RateLimit-Limit-Tokens / X-RateLimit-Remaining-Tokens / X-RateLimit-Reset-Tokens
-- false: X-AI-RateLimit-Limit-{instance} (original behavior)
standard_headers = {type = "boolean", default = false},
```
The new inline comments for standard_headers are in Chinese, while the surrounding file and English docs use English. To keep the codebase consistent and accessible to all contributors, please translate these comments (or move the explanation into the schema description field in English).
Suggested change:

```diff
--- use OpenRouter/OpenAI-compatible standard header names; IDE plugins (Cursor/Continue) can recognize them directly
--- true: X-RateLimit-Limit-Tokens / X-RateLimit-Remaining-Tokens / X-RateLimit-Reset-Tokens
--- false: X-AI-RateLimit-Limit-{instance} (original behavior)
-standard_headers = {type = "boolean", default = false},
+standard_headers = {
+    type = "boolean",
+    default = false,
+    description = "Use OpenRouter/OpenAI-compatible standard rate limit header names (true: X-RateLimit-Limit-Tokens / X-RateLimit-Remaining-Tokens / X-RateLimit-Reset-Tokens; false: keep original behavior with X-AI-RateLimit-Limit-{instance})"
+},
```
PR: feat(ai-rate-limiting): add `standard_headers` option for OpenAI/OpenRouter-compatible rate-limit headers

Summary
Add a `standard_headers` boolean option to the `ai-rate-limiting` plugin.
When enabled, the plugin emits rate-limit response headers that follow the
OpenAI / OpenRouter convention, allowing IDE extensions (Cursor, Continue, etc.)
to detect quota exhaustion and apply automatic back-off without any custom
client-side configuration.
Issue / Motivation

The current `ai-rate-limiting` plugin outputs headers in the APISIX-specific
`X-AI-RateLimit-{Limit|Remaining|Reset}-{instance_name}` format, which is not
recognized by popular AI IDE extensions such as Cursor and Continue. These
tools look for the OpenAI/OpenRouter standard headers
(`X-RateLimit-Limit-*`, `X-RateLimit-Remaining-*`, `X-RateLimit-Reset-*`).

Without these headers, IDE extensions cannot detect that they are being
rate-limited and will keep retrying immediately, causing a poor developer
experience and wasting quota.
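The back-off those extensions apply boils down to reading the standard headers. A minimal client-side sketch (header names are the ones this PR emits; the retry policy and the reset-header semantics of "seconds until the window resets" are assumptions about typical clients, not part of the PR):

```python
def backoff_seconds(headers: dict) -> float:
    """Return how long a client should wait before retrying, based on
    OpenAI/OpenRouter-style rate-limit headers. 0 means quota remains.
    Defaults when a header is missing are assumed, not specified by the PR."""
    remaining = int(headers.get("X-RateLimit-Remaining-Tokens", 1))
    if remaining > 0:
        return 0.0
    # assume the reset header carries seconds until the window resets
    return float(headers.get("X-RateLimit-Reset-Tokens", 1.0))

print(backoff_seconds({"X-RateLimit-Remaining-Tokens": "0",
                       "X-RateLimit-Reset-Tokens": "30"}))
# → 30.0
```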
Changes

`apisix/plugins/ai-rate-limiting.lua`

- Added a `standard_headers` field to the JSON Schema (boolean, default `false`).
- In `transform_limit_conf()`: when `standard_headers` is `true`, the `limit_header`, `remaining_header`, and `reset_header` fields passed to `limit-count` are set to the standard names with a suffix derived from `limit_strategy`:

| `limit_strategy` | Header suffix |
|---|---|
| `total_tokens` | `Tokens` |
| `prompt_tokens` | `PromptTokens` |
| `completion_tokens` | `CompletionTokens` |

- When `standard_headers` is `false` (default), the original `X-AI-RateLimit-*-{instance_name}` headers are used, so the change is fully backward compatible.

New / updated files

- `apisix/plugins/ai-rate-limiting.lua`
- `t/plugin/ai-rate-limiting-standard-headers.t`
- `docs/en/latest/plugins/ai-rate-limiting.md`

Test Cases
The new test file `t/plugin/ai-rate-limiting-standard-headers.t` covers:

- `standard_headers: true` is accepted by `check_schema`.
- `standard_headers` defaults to `false`.
- `standard_headers: true` returns all three `X-RateLimit-*-Tokens` headers with numeric values.
- A rate-limited request returns `X-RateLimit-Remaining-Tokens: 0`.
- `prompt_tokens` suffix: `limit_strategy: prompt_tokens` produces `X-RateLimit-*-PromptTokens` headers.
- `completion_tokens` suffix: `limit_strategy: completion_tokens` produces `X-RateLimit-*-CompletionTokens` headers.
- `standard_headers: false` still produces the legacy `X-AI-RateLimit-*-{instance_name}` headers.

Running the tests locally
Documentation

See `docs/en/latest/plugins/ai-rate-limiting-standard-headers-patch.md` for the
parameter reference table, configuration example, and sample response headers.
Checklist

- Backward compatible (`standard_headers` defaults to `false`)
- `CHANGELOG` entry (to be added before merge)

Related