Description
Why are we doing this?
Feedback from Release 1 showed that some query flows took too long to respond. In Release 2, we want to ensure that simple RAG queries return results within acceptable latency for large-scale usage.
What is it?
Defines a structured performance test that measures end-to-end response times under load, helping to identify bottlenecks and guide tuning of orchestration and chunking if needed.
Technical Guidelines
Use a load-testing tool such as Azure Load Testing with a scenario of ~1,000 concurrent users and an index of 5k+ documents. Measure response times across models (gpt-4.1, gpt-4o, gpt-4o-mini), aiming for <10s on simple retrieval-based queries. Document latency metrics (p50/p95) and optimize if targets are not met.
Start with a small number of virtual users (for example, 100 users asking questions within a one-minute window) and scale up toward the full 1,000-user scenario once the baseline is stable.
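As a rough illustration of what the test needs to measure (not the Azure Load Testing configuration itself, which is defined in its own tooling), the sketch below runs a number of concurrent virtual users against a query function for a fixed window and reports p50/p95 latency. The endpoint call is stubbed out; in a real run it would be an HTTP request to the RAG query API.

```python
import concurrent.futures
import statistics
import time


def run_load_test(query_fn, num_users=100, duration_s=60):
    """Fire queries from num_users concurrent workers for duration_s seconds
    and collect one end-to-end latency sample per request."""
    deadline = time.monotonic() + duration_s

    def worker():
        samples = []
        while time.monotonic() < deadline:
            start = time.monotonic()
            query_fn()  # in a real test: POST the question to the RAG endpoint
            samples.append(time.monotonic() - start)
        return samples

    latencies = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_users) as pool:
        futures = [pool.submit(worker) for _ in range(num_users)]
        for fut in concurrent.futures.as_completed(futures):
            latencies.extend(fut.result())
    return latencies


def summarize(latencies):
    """Return the p50/p95 latency metrics named in the guidelines."""
    qs = statistics.quantiles(latencies, n=100)
    return {"count": len(latencies), "p50": qs[49], "p95": qs[94]}


if __name__ == "__main__":
    # Stub query simulating a ~50 ms response; replace with a real API call.
    results = run_load_test(lambda: time.sleep(0.05), num_users=10, duration_s=1)
    print(summarize(results))
```

The pass/fail check for this issue would then be `summary["p95"] < 10.0` on simple retrieval-based queries, recorded per model (gpt-4.1, gpt-4o, gpt-4o-mini).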
References: