
Conversation

@trial2onyx trial2onyx (Collaborator) commented Oct 24, 2025

Description

Updates the invoke llm and stream llm log events to include token counts. For streaming events, the relevant tokenizer is used to estimate the token counts, and is_estimate metadata is included so the estimates can be filtered out in situations where accuracy matters; exact counts via include_usage are used when applicable.

I don't think the braintrust SDK upgrade is strictly relevant here, but it is being upgraded in #5918 anyway, so it is included for convenience.

How Has This Been Tested?

Confirmed that disabling Braintrust does not cause errors.

Confirmed that the values appear in Braintrust as expected, with reasonable numbers.

[Screenshot taken 2025-10-24 at 3:45 PM]

Additional Options

  • Override Linear Check

Summary by cubic

Add token usage metrics to LLM tracing. “invoke llm” logs exact counts when available; “stream llm” logs estimated counts with an is_estimate flag.

  • New Features

    • Log prompt_tokens, completion_tokens, and total_tokens via current_span for invoke responses (from response.usage when present); see the sketch after this list.
    • Estimate streaming token counts with the model’s tokenizer; include metadata is_estimate: true and estimate_method: tokenizer_estimate.
    • Logging is best-effort and never impacts the request path.
  • Dependencies

    • Upgrade braintrust[openai-agents] to 0.3.5.
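To make the invoke-path change concrete, here is a minimal sketch of logging usage from a LiteLLM response to the active Braintrust span. This is not the PR's exact code, and `log_invoke_usage` is a made-up helper name for illustration.

```python
# Minimal sketch of the invoke-path logging; not the PR's exact code.
# Assumes a LiteLLM ModelResponse whose `usage` follows the OpenAI shape.
from braintrust import current_span


def log_invoke_usage(response) -> None:
    """Best-effort token logging; must never break the request path."""
    try:
        usage = getattr(response, "usage", None)
        if usage is None:
            return
        current_span().log(
            metrics={
                "prompt_tokens": usage.prompt_tokens,
                "completion_tokens": usage.completion_tokens,
                "total_tokens": usage.total_tokens,
            }
        )
    except Exception:
        # Swallow tracing errors so observability never impacts LLM calls.
        pass
```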

@trial2onyx trial2onyx requested a review from a team as a code owner October 24, 2025 22:44
@vercel vercel bot commented Oct 24, 2025

The latest updates on your projects.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| internal-search | Ready | Preview | Comment | Oct 25, 2025 1:41am |

@greptile-apps greptile-apps bot (Contributor) left a comment

Greptile Overview

Greptile Summary

Updates LLM tracing to include token counts in both invoke llm and stream llm events. For non-streaming invocations, it extracts actual token usage from the model response. For streaming responses where usage data isn't readily available, it estimates token counts using the model-specific tokenizer and marks them with is_estimate: true metadata.

Key changes:

  • Added current_span().log() calls in _invoke_implementation to log actual prompt/completion/total tokens from response usage
  • Added token estimation logic in _stream_implementation that tokenizes prompts and output to approximate token counts (see the sketch after this list)
  • Upgraded braintrust SDK from 0.2.6 to 0.3.5
  • Wrapped both logging calls in try/except to ensure logging failures don't break LLM calls
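As an illustration only, a rough sketch of the streaming estimation follows, assuming a tokenizer object that exposes an `encode()` method (the PR itself goes through Onyx's get_tokenizer / check_number_of_tokens helpers, per the sequence diagram below); `log_stream_usage_estimate` is a hypothetical name.

```python
# Rough sketch of tokenizer-based estimation for streamed responses.
# Assumes `tokenizer` exposes an encode() method; the exact Onyx helpers differ.
from collections.abc import Sequence

from braintrust import current_span


def log_stream_usage_estimate(
    prompt: str | Sequence, output_text: str, tokenizer
) -> None:
    try:
        if isinstance(prompt, str):
            prompt_tokens = len(tokenizer.encode(prompt))
        else:
            # Message sequence: sum a per-message estimate of the content.
            prompt_tokens = sum(
                len(tokenizer.encode(str(m.content))) for m in prompt
            )
        completion_tokens = len(tokenizer.encode(output_text))
        current_span().log(
            metrics={
                "prompt_tokens": prompt_tokens,
                "completion_tokens": completion_tokens,
                "total_tokens": prompt_tokens + completion_tokens,
            },
            metadata={
                "is_estimate": True,
                "estimate_method": "tokenizer_estimate",
            },
        )
    except Exception:
        # Estimation failures must not affect the streamed response.
        pass
```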

Confidence Score: 4/5

  • This PR is safe to merge with low risk - changes are well-isolated to tracing/logging and properly wrapped in error handlers
  • Score reflects defensive error handling (all tracing wrapped in try/except), non-critical observability improvements, and reasonable approach to token estimation. Minor concern is that current_span() assumes braintrust is initialized, but this is mitigated by the try/except blocks. The braintrust upgrade is noted by the author as part of PR #5918 and appears stable.
  • No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| backend/onyx/llm/chat_llm.py | 4/5 | Adds token count logging to the invoke and stream LLM methods: invoked responses use actual usage data, streaming responses use tokenizer estimates with is_estimate metadata |
| backend/requirements/default.txt | 5/5 | Updates braintrust[openai-agents] from 0.2.6 to 0.3.5 |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant DefaultMultiLLM
    participant LiteLLM
    participant Braintrust
    participant Tokenizer

    Note over Client,Tokenizer: Non-Streaming Invocation Flow
    Client->>DefaultMultiLLM: invoke(prompt, tools, ...)
    DefaultMultiLLM->>LiteLLM: completion(stream=False)
    LiteLLM-->>DefaultMultiLLM: ModelResponse with usage
    DefaultMultiLLM->>Braintrust: current_span().log(metrics={prompt_tokens, completion_tokens, total_tokens})
    Note over Braintrust: Logs actual token counts from response.usage
    DefaultMultiLLM-->>Client: BaseMessage (output)

    Note over Client,Tokenizer: Streaming Invocation Flow
    Client->>DefaultMultiLLM: stream(prompt, tools, ...)
    DefaultMultiLLM->>LiteLLM: completion(stream=True)
    loop For each chunk
        LiteLLM-->>DefaultMultiLLM: StreamChunk
        DefaultMultiLLM->>DefaultMultiLLM: Accumulate chunks into output
        DefaultMultiLLM-->>Client: BaseMessageChunk
    end
    
    DefaultMultiLLM->>Tokenizer: get_tokenizer(model_name, provider_type)
    Tokenizer-->>DefaultMultiLLM: BaseTokenizer
    
    alt prompt is str
        DefaultMultiLLM->>Tokenizer: check_number_of_tokens(prompt)
        Tokenizer-->>DefaultMultiLLM: prompt_tokens_est
    else prompt is list/Sequence
        loop For each message
            DefaultMultiLLM->>Tokenizer: check_message_tokens(msg)
            Tokenizer-->>DefaultMultiLLM: message_tokens
        end
    end
    
    DefaultMultiLLM->>Tokenizer: check_number_of_tokens(output.content)
    Tokenizer-->>DefaultMultiLLM: completion_tokens_est
    
    DefaultMultiLLM->>Braintrust: current_span().log(metrics={...}, metadata={is_estimate: true})
    Note over Braintrust: Logs estimated token counts
```

2 files reviewed, no comments


@cubic-dev-ai cubic-dev-ai bot (Contributor) left a comment

No issues found across 2 files

@trial2onyx trial2onyx (Collaborator, Author) commented

Replaced the estimated token counting with provider usage data. Confirmed it works with stream llm events using mistralai/devstral-small & OpenRouter, and with clarifier stream and process events using gpt-5-nano & OpenRouter. Also confirmed the option is ignored when streaming is disabled (including the case DISABLE_LITELLM_STREAMING=True).
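For context on what "provider usage data" for streams could look like, here is a hedged sketch using LiteLLM's `stream_options` pass-through. `stream_with_provider_usage` is a hypothetical helper, the field names follow the OpenAI-compatible usage shape, and provider support for `include_usage` varies; this is not the PR's exact implementation.

```python
# Hedged sketch: request usage data on a streamed completion via LiteLLM.
# Not the PR's exact code; provider support for include_usage varies.
import litellm
from braintrust import current_span


def stream_with_provider_usage(model: str, messages: list[dict]) -> str:
    response = litellm.completion(
        model=model,
        messages=messages,
        stream=True,
        stream_options={"include_usage": True},  # ask for a final usage chunk
    )
    output = ""
    usage = None
    for chunk in response:
        if chunk.choices and chunk.choices[0].delta.content:
            output += chunk.choices[0].delta.content
        # The usage-bearing chunk typically arrives last with empty choices.
        if getattr(chunk, "usage", None):
            usage = chunk.usage
    try:
        if usage is not None:
            current_span().log(
                metrics={
                    "prompt_tokens": usage.prompt_tokens,
                    "completion_tokens": usage.completion_tokens,
                    "total_tokens": usage.total_tokens,
                }
            )
    except Exception:
        # Keep tracing best-effort.
        pass
    return output
```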

