Description
Why are we doing this?
Feedback from Release 1 showed that some query flows took too long to respond. In Release 2, we want to ensure that simple RAG queries return results within acceptable latency for large-scale usage.
What is it?
Defines a structured performance test that measures end-to-end response times under load, helping to identify bottlenecks and guide tuning of orchestration and chunking if needed.
Technical Guidelines
Use a load-testing tool such as Azure Load Testing with a scenario of ~1,000 concurrent users and an index of 5k+ documents. Measure response times across models (gpt-4.1, gpt-4o, gpt-4o-mini), aiming for <10s on simple retrieval-based queries. Document latency metrics (p50/p95) and optimize if targets are not met.
Start with a small number of virtual users (for example, 100 users asking questions within a one-minute window) and scale up toward the full 1,000-user scenario once the baseline is stable.
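As a rough illustration of what the test needs to measure (not the Azure Load Testing configuration itself, which is defined in its own tooling), the sketch below runs a number of concurrent virtual users against a query function for a fixed window and reports p50/p95 latency. The endpoint call is stubbed out; in a real run it would be an HTTP request to the RAG query API.

```python
import concurrent.futures
import statistics
import time


def run_load_test(query_fn, num_users=100, duration_s=60):
    """Fire queries from num_users concurrent workers for duration_s seconds
    and collect one end-to-end latency sample per request."""
    deadline = time.monotonic() + duration_s

    def worker():
        samples = []
        while time.monotonic() < deadline:
            start = time.monotonic()
            query_fn()  # in a real test: POST the question to the RAG endpoint
            samples.append(time.monotonic() - start)
        return samples

    latencies = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_users) as pool:
        futures = [pool.submit(worker) for _ in range(num_users)]
        for fut in concurrent.futures.as_completed(futures):
            latencies.extend(fut.result())
    return latencies


def summarize(latencies):
    """Return the p50/p95 latency metrics named in the guidelines."""
    qs = statistics.quantiles(latencies, n=100)
    return {"count": len(latencies), "p50": qs[49], "p95": qs[94]}


if __name__ == "__main__":
    # Stub query simulating a ~50 ms response; replace with a real API call.
    results = run_load_test(lambda: time.sleep(0.05), num_users=10, duration_s=1)
    print(summarize(results))
```

The pass/fail check for this issue would then be `summary["p95"] < 10.0` on simple retrieval-based queries, recorded per model (gpt-4.1, gpt-4o, gpt-4o-mini).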
References: