This workshop teaches systematic approaches to evaluating Generative AI workloads for production use. You'll learn to build evaluation frameworks that go beyond basic metrics to ensure reliable results while optimizing cost and performance.
Click here for a slide deck that covers the basics of evaluations and includes an overview of this workshop.
We strongly recommend working through the first three modules in order; they cover the core of generative AI evaluations, which is critical for all workloads. After that, feel free to pick up any of the workload- and framework-specific modules in any order, according to what is most relevant to you.
- 01 Operational Metrics: evaluate how your workload is running in terms of cost and performance (see the sketch after this module list).
- 02 Quality Metrics: evaluate and tune the quality of your results.
- 03 Agentic Metrics: evaluate your agents and use agents for evaluation.
- 04 Workload Specific Metrics:
  - Intelligent Document Processing
  - Guardrails
  - Basic RAG
  - Multi-modal RAG
  - Synthetic Data Generation (Coming soon!)
  - Speech 2 Speech (Coming soon!)
- 05 Framework and Tool Specific Implementations:
  - PromptFoo
  - AgentCore
  - BrainTrust (Coming soon!)
  - Bedrock Evaluations (Coming soon!)
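
To give a flavor of what module 01 covers, here is a minimal, illustrative sketch (not taken from the workshop notebooks) that records client-side latency and token usage for a single Amazon Bedrock call with boto3. The region, model ID, and prompt below are placeholder assumptions, not workshop defaults.

```python
import time
import boto3

# Illustrative only: region, model ID, and prompt are placeholders, not workshop defaults.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

start = time.perf_counter()
response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "In one sentence, what is model evaluation?"}]}],
)
client_latency_s = time.perf_counter() - start

# The Converse API also returns token usage and server-side latency,
# which feed directly into cost and performance metrics.
usage = response["usage"]
print(f"client latency: {client_latency_s:.2f}s")
print(f"server latency: {response['metrics']['latencyMs']} ms")
print(f"input tokens: {usage['inputTokens']}, output tokens: {usage['outputTokens']}")
```

The workshop builds on measurements like these, aggregating them across many requests to track cost and performance over time.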
Prerequisites:
- AWS account with Amazon Bedrock enabled
- Basic Python and ML familiarity
- No security expertise required
To get started:
- Clone the repository
- Configure AWS credentials (a quick verification sketch follows this list)
- Start with module 01-Operational-Metrics
- Complete the rest of the core modules, 02 and 03
- Review the workload- and framework-specific modules; choose any, in any order
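
Before diving into module 01, a quick way to confirm your credentials and Bedrock access are configured is a check along these lines (a sketch assuming boto3 is installed and the default AWS credential chain is set up; the region is an assumption, adjust as needed):

```python
import boto3

# Assumes the default AWS credential chain; the region is a placeholder.
bedrock = boto3.client("bedrock", region_name="us-east-1")

# Listing foundation models only succeeds if credentials and Bedrock access are in place.
models = bedrock.list_foundation_models()["modelSummaries"]
print(f"Bedrock access OK: {len(models)} foundation models visible")
```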
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.
