Skip to content

Performance Analysis and Improvements #371

@placerda

Description

@placerda

Why are we doing this?
Feedback from Release 1 showed that some query flows took too long to respond. In Release 2, we want to ensure that simple RAG queries return results within acceptable latency for large-scale usage.

What is it?
Defines a structured performance test that measures end-to-end response times under load, helping to identify bottlenecks and guide tuning of orchestration and chunking if needed.

Technical Guidelines
Use a load-testing tool such as Azure Load Testing with a scenario of ~1,000 concurrent users and an index of 5k+ documents. Measure response times across models (gpt-4.1, gpt-4o, gpt-4o-mini), aiming for <10s on simple retrieval-based queries. Document latency metrics (p50/p95) and optimize if targets are not met.

Let's start with a small number of virtual users (100 users asking questions in a 1 minute timeframe)

References:

cc: @cmw2 @jaspecla

Metadata

Metadata

Assignees

Labels

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions