Replies: 3 comments
This appears to be related to how the LlamaIndex OpenInference instrumentor emits spans, rather than a Langfuse-specific issue. The Langfuse SDK integrates with third-party OTEL-based instrumentation libraries such as OpenInference. From your screenshot, it looks like both the parent and the child spans report token usage, which is why the totals come out inflated. Potential approaches:
```python
from langfuse import Langfuse
from langfuse.span_filter import is_default_export_span

langfuse = Langfuse(
    should_export_span=lambda span: (
        is_default_export_span(span)
        and not span.name.endswith(".astream")  # Example filter logic
    )
)
```
You can also enable debug logging:

```python
from langfuse import Langfuse

langfuse = Langfuse(debug=True)
```

This will help you identify exactly which spans from the LlamaIndex instrumentor are carrying the usage details, so you can determine whether to filter them or raise the issue with the OpenInference maintainers. The core issue is that the LlamaIndex instrumentor appears to be setting usage attributes on more than one level of the nested spans.

Have another question? Just tag @inkeep.
Hi @rayw-lr, in this case it's best to contact OpenInference regarding their LlamaIndexInstrumentor and ask for support there. That said, the code proposed above is a good workaround for a quick fix.
Problem: duplicated token counts on generation spans when using the LlamaIndex instrumentor with Langfuse. Langfuse aggregates tokens across generation spans, and because both the parent and child spans report token usage, the totals are double counted. This is likely due to the instrumentation emitting usage at multiple levels (e.g., astream and astream_chat). Possible remedies: disable token reporting on parent spans, configure Langfuse not to sum tokens from certain spans, or adjust the instrumentation to avoid the duplication. OpenInference's LlamaIndex instrumentation captures token usage from LLM calls and records it as span attributes, so the duplication likely originates there.
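The first remedy above (disabling token reporting on parent spans) can be sketched as a post-processing step over span records before aggregation. Everything below — the dict shape, the `usage` key, the `strip_parent_usage` helper — is a hypothetical illustration, not a real Langfuse or OpenInference API:

```python
def strip_parent_usage(spans):
    """Drop the token-usage attribute from any span that is the parent
    of another span, so aggregation counts each LLM call exactly once."""
    parent_names = {s["parent"] for s in spans if s.get("parent") is not None}
    cleaned = []
    for span in spans:
        span = dict(span)  # copy, to avoid mutating the caller's records
        if span["name"] in parent_names:
            span.pop("usage", None)  # parent duplicates the child's usage
        cleaned.append(span)
    return cleaned

trace = [
    {"name": "AzureOpenAI.astream", "parent": None, "usage": 250},
    {"name": "AzureOpenAI.astream_chat", "parent": "AzureOpenAI.astream", "usage": 250},
]
total = sum(s.get("usage", 0) for s in strip_parent_usage(trace))
print(total)  # 250, the actual usage of the single LLM call
```

The same leaf-only rule generalizes to deeper nesting, since any span that appears as another span's parent loses its usage attribute.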
Describe your question
Hi there,
I'm using the following to generate answers based on a user's query:
- RetrieverQueryEngine
- Response Synthesizer with streaming=True and use_async=True

We use a combination of Langfuse's OTEL instrumentation (the observe decorator and/or context manager). However, this alone does not capture the OTEL spans emitted from LlamaIndex-specific function calls. For example, any step conducted from RetrieverQueryEngine.aquery would be missed if we do not use LlamaIndexInstrumentor().instrument().

In the attached screenshot, see the AzureOpenAI.* generation spans, which report the tokens used at those steps. Langfuse seems to automatically aggregate all tokens reported from generation spans, resulting in over-inflated reported token usage. I'm not sure these parent spans (i.e., astream) should be counting these tokens when the actual usage seems to come from the child span (astream_chat). I believe this affects all nested generation spans, and I'm not sure whether the issue is inherent to Langfuse or to LlamaIndex's instrumentor.
Could I get some help navigating this issue?
Langfuse Cloud or Self-Hosted?
Self-Hosted
If Self-Hosted
3.162.0
If Langfuse Cloud
No response
SDK and integration versions
Langfuse Python SDK v4.01
Latest versions of these
Pre-Submission Checklist