Engineering reliable LLM systems through reproducible evaluation, adversarial testing, behavioral drift monitoring, and deterministic pipelines.

A production-style laboratory for evaluating the reliability of LLM pipelines.
It simulates a realistic LLM system (retrieval + tools + validation) and measures behavior under:
- tool failures
- schema violations
- prompt injection attacks
- behavioral drift between models
- retry and recovery policies
The goal is to make LLM system reliability measurable and reproducible.
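One way such tool failures can be made measurable and reproducible is with a seeded fault-injecting wrapper. The sketch below is illustrative, not the lab's actual implementation; the `FlakyTool` name and its parameters are assumptions.

```python
import random

class FlakyTool:
    """Wraps a tool function and fails a configurable fraction of calls.

    A seeded RNG makes the injected failures deterministic, so a run
    can be replayed with the exact same failure pattern.
    """
    def __init__(self, fn, failure_rate: float, seed: int = 0):
        self.fn = fn
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded -> reproducible failures

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise RuntimeError("injected tool failure")
        return self.fn(*args, **kwargs)
```

Because the RNG is seeded, two runs with the same seed see failures on the same calls, which is what makes fault-injection results comparable across runs.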
LLM systems often perform well in demos but fail in production environments.
Typical failure modes include:
- malformed tool outputs
- schema-breaking responses
- prompt injection attacks
- tool hijacking
- behavioral drift between model versions
- lack of reproducible evaluation
This project provides a controlled environment to test those failures systematically.
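A recovery policy of the kind tested here can be as simple as bounded retries. This is a minimal sketch under assumed names (`call_with_retry`, `max_attempts`), not the framework's actual retry logic.

```python
def call_with_retry(tool, *args, max_attempts: int = 3):
    """Call a tool, retrying on failure up to max_attempts times.

    Raises the last error if every attempt fails, so unrecoverable
    failures remain visible to the caller.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(*args)
        except RuntimeError as err:  # injected or real tool failure
            last_error = err
    raise last_error
```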
The framework provides:
- contract-first LLM pipelines using strict JSON schemas
- tool-calling robustness evaluation
- fault injection for tool failures
- retry and recovery policy testing
- adversarial prompt testing (red teaming)
- behavioral drift monitoring across models
- deterministic execution manifests
- structured artifacts for inspection and reproducibility
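A contract-first output check can be sketched as follows. The field names mirror the example output shown later in this README; a real deployment would likely use a full JSON Schema validator rather than this hand-rolled check.

```python
def is_contract_compliant(output: dict) -> bool:
    """Return True iff `output` has exactly the contracted fields with correct types."""
    required = {"answer": str, "citations": list, "tool_calls": list}
    if set(output) != set(required):  # reject missing AND extra fields
        return False
    return all(isinstance(output[k], t) for k, t in required.items())
```

Rejecting extra fields (not just missing ones) is what makes the contract strict: a model that appends unexpected keys is counted as a schema violation rather than silently tolerated.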
```mermaid
flowchart LR
    A[Benchmark Cases] --> B[Evaluation Runner]
    B --> C[SUT Pipeline]
    C --> D[Prompt Builder]
    C --> E[Retriever]
    C --> F[Tools]
    F --> G[Tool Executor]
    C --> H[Output Validator]
    H --> I[Schema Validation]
    H --> J[Policy Guardrails]
    H --> K[Run Artifacts]
    K --> L[Metrics]
    K --> M[Events Log]
    K --> N[Execution Manifest]
```
More details in:
- docs/architecture.md
- docs/contracts.md
Run the deterministic demo pipeline:

```shell
python -m llm_lab.cli demo --backend mock
```

Artifacts are stored in `runs/<timestamp>/`. Example structure:

```
runs/<timestamp>/
  output.json
  metrics.json
  events.jsonl
  run_manifest.json
```
Example output:

```
schema_compliance_rate: 1.0
tool_success_rate: 1.0
success_rate: 1.0
```
```shell
python -m llm_lab.cli eval --suite reliability
```

Example output from a real evaluation run:

```
schema_compliance_rate: 1.0
tool_success_rate: 0.5
tool_retry_rate: 1.25
recovery_rate: 1.0
```
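The relationship between these three tool metrics can be sketched from a list of tool-call events. The event shape here (`tool`, `call_id`, `attempt`, `ok`) is hypothetical; the lab's actual `events.jsonl` schema may differ.

```python
def tool_metrics(events: list[dict]) -> dict:
    """Derive tool reliability metrics from per-attempt tool events."""
    attempts = len(events)
    successes = sum(1 for e in events if e["ok"])
    # A logical call is a unique (tool, call_id); retries share the call_id.
    calls = {(e["tool"], e["call_id"]) for e in events}
    failed_first = {(e["tool"], e["call_id"]) for e in events
                    if e["attempt"] == 1 and not e["ok"]}
    recovered = {c for c in failed_first
                 if any(e["ok"] and (e["tool"], e["call_id"]) == c for e in events)}
    return {
        "tool_success_rate": successes / attempts,
        "tool_retry_rate": attempts / len(calls),  # >1.0 means retries happened
        "recovery_rate": len(recovered) / len(failed_first) if failed_first else 1.0,
    }
```

Under this reading, a `tool_retry_rate` above 1.0 with a `recovery_rate` of 1.0 (as in the run above) means retries occurred and every initial failure was eventually recovered.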
Artifacts generated:

```
runs/20260311-122013-a0ede4b3/
  metrics.json
  reliability_report.json
```
Run adversarial prompt testing:

```shell
python -m llm_lab.cli redteam
```

Run behavioral drift analysis:

```shell
python -m llm_lab.cli drift --matrix configs/drift_matrix.yaml
```

Full methodology in:
- docs/evaluation.md
Example benchmark case:

```json
{
  "question": "What is the capital of France?",
  "expected_answer": "Paris"
}
```

Run the demo:

```shell
python -m llm_lab.cli demo --backend mock
```

Example output:

```json
{
  "answer": "Paris",
  "citations": ["wiki_france"],
  "tool_calls": []
}
```

Example metrics:

```json
{
  "schema_compliance_rate": 1.0,
  "tool_success_rate": 1.0,
  "success_rate": 1.0
}
```

Artifacts generated per run:
| Artifact | Purpose |
|---|---|
| metrics.json | reliability metrics |
| events.jsonl | execution trace |
| output.json | final model output |
| run_manifest.json | reproducibility metadata |
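Loading a run's artifacts for inspection can look like the sketch below. Paths follow the layout shown above; the exact keys inside each file are assumptions.

```python
import json
from pathlib import Path

def load_run(run_dir: str) -> dict:
    """Load all artifacts of one run into a single dict for inspection."""
    run = Path(run_dir)
    return {
        "output": json.loads((run / "output.json").read_text()),
        "metrics": json.loads((run / "metrics.json").read_text()),
        "manifest": json.loads((run / "run_manifest.json").read_text()),
        # events.jsonl is line-delimited JSON: one event per line
        "events": [json.loads(line)
                   for line in (run / "events.jsonl").read_text().splitlines()
                   if line.strip()],
    }
```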
Reproducibility details:
- docs/reproducibility.md
| Metric | Description |
|---|---|
| schema_compliance_rate | share of outputs valid against the JSON schema |
| tool_success_rate | share of tool executions that succeed |
| recovery_rate | share of tool failures recovered via retry |
| attack_success_rate | share of adversarial prompts that bypass defenses |
| drift_score | magnitude of behavioral change between models |
| answer_hash_stability_rate | share of repeated runs producing the same answer hash |
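The stability metric can be sketched as the share of repeated runs whose normalized answer hashes to the same value as the first run. The normalization rules here (lowercasing, collapsed whitespace) are assumptions, not necessarily those the lab applies.

```python
import hashlib

def answer_hash(answer: str) -> str:
    """Hash an answer after normalizing case and whitespace."""
    normalized = " ".join(answer.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def hash_stability_rate(answers: list[str]) -> float:
    """Fraction of answers whose hash matches the first run's hash."""
    reference = answer_hash(answers[0])
    return sum(answer_hash(a) == reference for a in answers) / len(answers)
```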
Backend `mock`: used for automated testing and deterministic execution.

Backend `ollama` example:

```shell
python -m llm_lab.cli demo --backend ollama --model qwen2.5:7b-instruct
```

| Project | Focus |
|---|---|
| OpenAI Evals | model reasoning benchmarks |
| HELM | holistic model evaluation |
| Guardrails AI | output validation |
| LangChain | LLM application framework |
| LLM Systems Reliability Lab | reliability evaluation of LLM pipelines |
- contract-first outputs
- deterministic execution
- observable pipelines
- failure-aware design
- reproducible evaluation
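Deterministic execution hinges on recording everything needed to replay a run. The manifest fields below (seed, config hash, model id, platform) are illustrative assumptions; see docs/reproducibility.md for the lab's actual manifest contents.

```python
import hashlib
import json
import platform
import sys

def build_manifest(config: dict, seed: int, model: str) -> dict:
    """Capture the inputs and environment that determine a run's output."""
    # sort_keys makes the hash independent of dict insertion order
    config_bytes = json.dumps(config, sort_keys=True).encode("utf-8")
    return {
        "seed": seed,
        "model": model,
        "config_sha256": hashlib.sha256(config_bytes).hexdigest(),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }
```

Hashing the canonicalized config (rather than storing it verbatim) lets two manifests be compared for equivalence with a single string comparison.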
```
src/llm_lab/
  pipeline/    pipeline orchestration
  tools/       tool schemas and execution
  retrieval/   contextual retrieval
  llm/         model backends
  evals/       evaluation logic
  redteam/     adversarial testing
  drift/       model drift analysis
configs/       pipeline configuration
data/          benchmark datasets
tests/         unit tests
runs/          generated artifacts
docs/          technical documentation
```
- docs/architecture.md
- docs/contracts.md
- docs/evaluation.md
- docs/failure_modes.md
- docs/reproducibility.md
- docs/design_decision.md
- docs/development.md
- Mock mode guarantees deterministic execution but does not represent real model quality.
- Model-based grading introduces subjective evaluation signals.
- Drift measurements depend on the selected benchmark cases.
- This framework evaluates pipeline reliability, not general LLM capability.
Apache License 2.0
Copyright (c) 2026 Malena Pérez Sevilla
Licensed under the Apache License, Version 2.0.