The AI is fake. The API is fake. The responses are fake. But your code? That's real. Or is it? Welcome to the simulation.
FakeAI simulates the complete OpenAI API, as well as numerous NVIDIA AI services (NIM, AI-Dynamo, DCGM, Cosmos) with instant feedback and reproducible results. Develop and optimize your applications locally with realistic service behavior, then deploy to production infrastructure when ready.
- Millisecond response times - Test and debug without waiting for infrastructure
- Reproducible results - Consistent behavior across development, CI/CD, and testing
- Performance optimization - Profile and tune before production deployment
- Local development - Full-featured testing environment on any machine
- NIM (NVIDIA Inference Microservices) - Reranking API and optimized model endpoints
- AI-Dynamo - KV cache management, smart routing, and prefix caching
- DCGM - 100+ GPU telemetry metrics for A100, H100, H200, Blackwell
- Cosmos - Video understanding with token calculation
- Real implementations - Actual service logic, not mocks or stubs
- 100+ endpoints - Chat, embeddings, images, audio, fine-tuning, vector stores
- Streaming support - Realistic TTFT and ITL with 37+ model-specific profiles
- Advanced features - Function calling, structured outputs, vision, reasoning models
- Drop-in replacement - Works with OpenAI SDK, LangChain, LlamaIndex
- AIPerf integration - Industry-standard performance profiling
- KV cache metrics - Analyze cache hit rates and optimization opportunities
- Load testing - Validate behavior under various concurrency levels
- Latency profiling - Realistic timing for capacity planning
- Key Features
- Quick Start
- API Endpoints
- NVIDIA Features
- AIPerf Benchmarking
- Advanced Features
- Configuration
- Installation
- Use Cases
- Documentation
- Chat Completions - Streaming/non-streaming with 62 parameters
- Text Completions - Legacy endpoint support
- Embeddings - L2-normalized vectors with semantic similarity
- Image Generation - DALL-E compatible with actual PNG generation
- Audio (TTS) - Text-to-speech with multiple voices and formats
- Audio (STT) - Whisper-compatible transcription
- Moderation - 11-category content safety
- File Management - Upload, retrieve, delete with metadata
- Batch Processing - Async job execution with status tracking
- Realtime API - WebSocket bidirectional streaming
- Responses API - Stateful conversation management
- Function Calling - Parallel tool execution
- Structured Outputs - JSON Schema validation
- Vision - Multi-modal image input
- Video - Multi-modal video input (Cosmos)
- Reasoning Models - O1-style chain-of-thought
- Predicted Outputs - EAGLE speculative decoding (3-5× speedup)
- Fine-tuning - Complete job lifecycle with LoRA
- Vector Stores - RAG infrastructure
- Organization Management - Users, roles, invites
- Project Management - Multi-tenancy with isolation
- Service Accounts - API key management
- Usage Tracking - Detailed usage metrics by endpoint
- Cost Analytics - Estimated costs with breakdowns
- Rate Limiting - Per-key RPM, TPM, RPD, TPD with tiers
- API Key Authentication - Bearer token with SHA-256 hashing
- Rate Limiting - Configurable tiers (Free, Tier 1-5)
- Abuse Detection - Anomaly detection and IP banning
- Input Validation - Injection attack detection
- Error Injection - Configurable failure simulation
- CORS Configuration - Cross-origin control
pip install fakeai

# Basic startup (localhost:8000)
fakeai server
# Custom configuration
fakeai server --port 9000 --host 0.0.0.0
# Zero latency for maximum throughput
fakeai server --ttft 0 --itl 0

from openai import OpenAI

client = OpenAI(
    api_key="any-key-works",
    base_url="http://localhost:8000"
)

# Chat completion
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Count to 10"}],
    stream=True
)
for chunk in stream:
    print(chunk.choices[0].delta.content, end="", flush=True)

# Health check
curl http://localhost:8000/health
# Server metrics
curl http://localhost:8000/metrics
# KV cache stats
curl http://localhost:8000/kv-cache/metrics
# DCGM GPU metrics
curl http://localhost:8000/dcgm/metrics/json
# Dynamo inference metrics
curl http://localhost:8000/dynamo/metrics/json

| Endpoint | Methods | Description |
|---|---|---|
| /v1/models | GET | List available models |
| /v1/models/{id} | GET | Get model details |
| /v1/models/{id}/capabilities | GET | Get model capabilities (context, pricing, features) |
| /v1/chat/completions | POST | Chat completions (streaming/non-streaming) |
| /v1/completions | POST | Text completions (legacy) |
| /v1/embeddings | POST | Generate embeddings |
| /v1/images/generations | POST | Generate images |
| /v1/audio/speech | POST | Text-to-speech synthesis |
| /v1/audio/transcriptions | POST | Audio transcription |
| /v1/moderations | POST | Content moderation |
| /images/{id}.png | GET | Retrieve generated image |
| Endpoint | Methods | Description |
|---|---|---|
| /v1/files | GET, POST | File management |
| /v1/files/{id} | GET, DELETE | File operations |
| /v1/files/{id}/content | GET | Download file content |
| /v1/batches | POST, GET | Batch processing |
| /v1/batches/{id} | GET | Batch status |
| /v1/batches/{id}/cancel | POST | Cancel batch |
| Endpoint | Methods | Description |
|---|---|---|
| /v1/fine_tuning/jobs | POST, GET | Create and list fine-tuning jobs |
| /v1/fine_tuning/jobs/{id} | GET | Get job details |
| /v1/fine_tuning/jobs/{id}/cancel | POST | Cancel job |
| /v1/fine_tuning/jobs/{id}/events | GET | Stream job events (SSE) |
| /v1/fine_tuning/jobs/{id}/checkpoints | GET | List checkpoints |
| Endpoint | Methods | Description |
|---|---|---|
| /v1/vector_stores | POST, GET | Create and list vector stores |
| /v1/vector_stores/{id} | GET, POST, DELETE | Vector store operations |
| /v1/vector_stores/{id}/files | POST, GET | File management |
| /v1/vector_stores/{id}/files/{file_id} | GET, DELETE | File operations |
| /v1/vector_stores/{id}/file_batches | POST, GET | Batch file operations |
| /v1/vector_stores/{id}/file_batches/{batch_id} | GET, POST | Batch operations |
| /v1/vector_stores/{id}/file_batches/{batch_id}/files | GET | List files in batch |
| Endpoint | Methods | Description |
|---|---|---|
| /v1/organization/users | GET, POST | User management |
| /v1/organization/users/{id} | GET, POST, DELETE | User operations |
| /v1/organization/invites | GET, POST | Invitation management |
| /v1/organization/invites/{id} | GET, DELETE | Invite operations |
| /v1/organization/projects | GET, POST | Project management |
| /v1/organization/projects/{id} | GET, POST | Project operations |
| /v1/organization/projects/{id}/archive | POST | Archive project |
| /v1/organization/projects/{id}/users | GET, POST | Project user management |
| /v1/organization/projects/{id}/users/{user_id} | GET, POST, DELETE | User operations |
| /v1/organization/projects/{id}/service_accounts | GET, POST | Service account management |
| /v1/organization/projects/{id}/service_accounts/{sa_id} | GET, DELETE | Service account operations |
| Endpoint | Methods | Description |
|---|---|---|
| /v1/organization/usage/completions | GET | Completions usage by time bucket |
| /v1/organization/usage/embeddings | GET | Embeddings usage by time bucket |
| /v1/organization/usage/images | GET | Images usage by time bucket |
| /v1/organization/usage/audio_speeches | GET | TTS usage by time bucket |
| /v1/organization/usage/audio_transcriptions | GET | STT usage by time bucket |
| /v1/organization/costs | GET | Cost data with grouping |
| Endpoint | Protocol | Description |
|---|---|---|
| /v1/realtime | WebSocket | Real-time bidirectional streaming |
| /v1/responses | POST | Stateful conversation API |
| /v1/ranking | POST | NVIDIA NIM reranking |
| /v1/text/generation | POST | Azure text generation compatibility |
| /rag/api/prompt | POST | Solido RAG retrieval-augmented generation |
| Endpoint | Methods | Description |
|---|---|---|
| /health | GET | Basic health check |
| /health/detailed | GET | Detailed health with metrics summary |
| /dashboard | GET | Interactive metrics dashboard |
| /dashboard/dynamo | GET | Advanced Dynamo dashboard |
| Endpoint | Methods | Description |
|---|---|---|
| /metrics | GET | Server metrics (JSON) |
| /metrics/prometheus | GET | Prometheus metrics format |
| /metrics/csv | GET | CSV export |
| /metrics/stream | WebSocket | Real-time metrics streaming |
| Endpoint | Methods | Description |
|---|---|---|
| /metrics/by-model | GET | All models stats (JSON) |
| /metrics/by-model/prometheus | GET | Per-model Prometheus metrics |
| /metrics/by-model/{id} | GET | Specific model stats |
| /metrics/compare | GET | Compare two models (query params) |
| /metrics/ranking | GET | Rank models by metric |
| /metrics/costs | GET | Cost breakdown by model |
| /metrics/multi-dimensional | GET | 2D breakdowns (model×endpoint, model×user, model×time) |
| Endpoint | Methods | Description |
|---|---|---|
| /kv-cache/metrics | GET | KV cache and smart routing stats |
| /dynamo/metrics | GET | AI-Dynamo metrics (Prometheus) |
| /dynamo/metrics/json | GET | AI-Dynamo metrics (JSON) |
| Endpoint | Methods | Description |
|---|---|---|
| /dcgm/metrics | GET | DCGM GPU metrics (Prometheus) |
| /dcgm/metrics/json | GET | DCGM GPU metrics (JSON) |
| Endpoint | Methods | Description |
|---|---|---|
| /metrics/rate-limits | GET | Comprehensive rate limiting metrics |
| /metrics/rate-limits/key/{key} | GET | Per-key statistics |
| /metrics/rate-limits/tier | GET | Per-tier aggregations |
| /metrics/rate-limits/throttle-analytics | GET | Throttling analytics |
| /metrics/rate-limits/abuse-patterns | GET | Abuse pattern detection |
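As a quick illustration of the read-only monitoring endpoints listed above, here is a minimal Python sketch that polls a few of them and prints the raw JSON. The payload shapes are defined by the server (and some endpoints may expect query parameters such as a start time), so the sketch only reports status codes and bodies.

```python
# Sketch: poll a few monitoring endpoints from the tables above and dump the
# raw JSON. Field names are server-defined; nothing is assumed about them.
import requests

BASE = "http://localhost:8000"
for path in (
    "/metrics",                            # server-wide metrics
    "/metrics/by-model",                   # per-model stats
    "/v1/organization/usage/completions",  # usage buckets (may need params)
    "/metrics/rate-limits",                # rate limiting metrics
):
    resp = requests.get(BASE + path)
    print(f"{path} -> {resp.status_code}")
    print(resp.json())
```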
FakeAI includes comprehensive NVIDIA AI infrastructure simulation with real implementations (not stubs).
Advanced KV cache management and smart routing
Features:
- Radix Tree Prefix Matching - SGLang-style efficient prefix matching
- Block-level Caching - Configurable block size (default: 16 tokens)
- Multi-worker Simulation - Simulates distributed workers
- Smart Request Routing - Cost-based routing with cache overlap scoring
- Prefix Caching - Automatic shared prompt detection
- Cache Metrics - Hit rates, token reuse, overlap statistics
Configuration:
export FAKEAI_KV_CACHE_ENABLED=true
export FAKEAI_KV_CACHE_BLOCK_SIZE=16
export FAKEAI_KV_CACHE_NUM_WORKERS=4
export FAKEAI_KV_OVERLAP_WEIGHT=1.0
fakeai server

Metrics:
curl http://localhost:8000/kv-cache/metrics

Benefits:
- Realistic TTFT speedup on cache hits (60-80% reduction)
- Simulates cache warming and reuse patterns
- Worker load balancing with cache affinity
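As a small sketch of exercising the prefix cache described above: send the same long prompt twice so the second request can reuse cached blocks, then read /kv-cache/metrics. The structure of the metrics payload is server-defined, so the sketch simply prints it.

```python
# Sketch: warm the simulated KV cache with a repeated prompt, then inspect
# the cache stats. The metrics JSON fields are not assumed here.
import requests
from openai import OpenAI

client = OpenAI(api_key="test", base_url="http://localhost:8000")
shared_prefix = "You are a helpful assistant. " * 50  # long shared prefix

for _ in range(2):  # the second request should overlap the cached prefix
    client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": shared_prefix + "Say hi."}],
    )

print(requests.get("http://localhost:8000/kv-cache/metrics").json())
```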
100+ GPU telemetry metrics in Prometheus format
Simulated Metrics:
- GPU Utilization - Compute, memory, tensor core activity
- Temperature - GPU, memory, thermal throttling
- Power - Current draw, limits, violations
- Memory - Used, free, bandwidth, ECC errors
- Clock Frequencies - SM clock, memory clock, throttling
- NVLink - Traffic, bandwidth, topology
- Health Status - Thermal violations, power throttling, ECC errors
- Multi-GPU - Coordination, load balancing
- PCIe - Replay counters, bandwidth saturation
- Process Tracking - Per-process GPU/memory usage
Supported GPU Models:
- NVIDIA A100 (80GB)
- NVIDIA H100 (80GB)
- NVIDIA H200 (141GB)
- NVIDIA B100/B200 (Blackwell)
Configuration:
export FAKEAI_DCGM_GPU_MODEL=H100-80GB
export FAKEAI_DCGM_GPU_COUNT=8
export FAKEAI_DCGM_WORKLOAD_INTENSITY=high
fakeai server

Prometheus Endpoint:
curl http://localhost:8000/dcgm/metrics

Grafana Integration:
- 100% compatible with NVIDIA DCGM dashboards
- Pre-configured Prometheus exporters
- Real-time GPU monitoring visualization
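For scripted checks outside Grafana, a minimal sketch that polls the JSON endpoint a few times; the exact field names in the payload are server-defined and not assumed here.

```python
# Sketch: poll the simulated GPU telemetry JSON endpoint and print it.
import time
import requests

for _ in range(3):
    print(requests.get("http://localhost:8000/dcgm/metrics/json").json())
    time.sleep(1)
```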
Video understanding and token calculation
Features:
- Video Token Calculation - Resolution, duration, FPS-aware
- Frame Extraction - Configurable frame sampling
- Multi-modal Input - Video + text in chat completions
- Detail Levels - Auto, low, high with token scaling
- URL Metadata - Extract video metadata from URLs
Example:
response = client.chat.completions.create(
    model="nvidia/cosmos-vision",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this video"},
            {"type": "video_url", "video_url": {
                "url": "https://example.com/video.mp4?width=512&height=288&duration=5.0&fps=4"
            }}
        ]
    }]
)

Token Calculation:
- Base tokens: 85
- Per-frame tokens: 10-50 depending on resolution and detail level
- Total = base + (frames × tokens_per_frame)
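Applied to the example URL above (width=512, height=288, duration=5.0, fps=4), a small sketch of the estimate; the per-frame value of 25 tokens is an assumed mid-range figure, since the actual count depends on resolution and detail level.

```python
# Sketch: estimate Cosmos video tokens from the formula above.
# tokens_per_frame=25 is an assumed mid-range value (actual range: 10-50).
BASE_TOKENS = 85

def estimate_video_tokens(duration_s: float, fps: float, tokens_per_frame: int = 25) -> int:
    frames = int(duration_s * fps)  # frames sampled from the clip
    return BASE_TOKENS + frames * tokens_per_frame

print(estimate_video_tokens(duration_s=5.0, fps=4))  # 85 + 20 * 25 = 585
```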
Reranking API and optimized models
Reranking Endpoint:
POST /v1/ranking

Example:
import requests

response = requests.post("http://localhost:8000/v1/ranking", json={
    "model": "nvidia/nv-rerank-qa-mistral-4b",
    "query": "What is machine learning?",
    "documents": [
        {"text": "Machine learning is a subset of AI..."},
        {"text": "Deep learning uses neural networks..."},
        {"text": "Python is a programming language..."}
    ],
    "top_n": 2
})
print(response.json())
# Returns documents ranked by relevance

NIM Models in Catalog:
- nvidia/cosmos-vision - Video understanding
- nvidia/llama-3.1-nemotron-70b-instruct - Optimized Llama 3.1 70B
- nvidia/nv-rerank-qa-mistral-4b - Reranking for Q&A
Features:
- Document reranking for RAG pipelines
- Configurable top_n results
- Query-document relevance scoring
- Compatible with NVIDIA NIM format
Comprehensive LLM inference metrics
Tracked Metrics:
- Latency Breakdown:
  - TTFT (Time To First Token)
  - ITL (Inter-Token Latency)
  - TPOT (Time Per Output Token)
  - Queue time, prefill time, decode time
- Throughput:
  - Request throughput (rps)
  - Token throughput (tokens/sec)
  - Batch efficiency
- KV Cache:
  - Cache hit rate
  - Blocks matched
  - Overlap scores
- Worker Statistics:
  - Request distribution
  - Worker utilization
  - Routing costs
Prometheus Endpoint:
curl http://localhost:8000/dynamo/metrics

JSON Endpoint:
curl http://localhost:8000/dynamo/metrics/json

37+ model-specific latency profiles with realistic TTFT/ITL
Pre-configured profiles for:
- GPT-4, GPT-4o, GPT-3.5 Turbo
- Llama 3, Llama 3.1, Llama 3.2 (8B, 70B, 405B)
- DeepSeek-V3, DeepSeek-R1
- Mixtral 8x7B, 8x22B
- Claude 3.5 Sonnet, Claude 3 Opus
- And 20+ more...
Dynamic Adjustments:
- Prompt length affects TTFT
- KV cache hits reduce TTFT by 60-80%
- Concurrent load adds queuing delays
- Temperature affects generation speed
- Model size scales latency
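To observe these profiles from the client side, here is a minimal sketch that measures TTFT and the average inter-token gap of a streaming response; it uses only wall-clock timing around the OpenAI SDK, nothing FakeAI-specific.

```python
# Sketch: client-side measurement of TTFT and average inter-token latency.
import time
from openai import OpenAI

client = OpenAI(api_key="test", base_url="http://localhost:8000")

start = time.perf_counter()
arrival_times = []

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Write one sentence about GPUs."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        arrival_times.append(time.perf_counter())

ttft_ms = (arrival_times[0] - start) * 1000
gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
itl_ms = (sum(gaps) / len(gaps) * 1000) if gaps else 0.0
print(f"TTFT: {ttft_ms:.1f} ms, avg ITL: {itl_ms:.1f} ms")
```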
FakeAI has comprehensive integration with AIPerf (NVIDIA's LLM benchmarking tool) for industry-standard performance testing.
- Full OpenAI API Compatibility - Works seamlessly with AIPerf
- Realistic Timing Simulation - 37+ model-specific latency profiles
- Comprehensive Metrics - TTFT, ITL, TPOT, throughput
- Automated Test Suites - Multi-model, multi-concurrency benchmark runner
- Detailed Reporting - JSON + Markdown reports with comparisons
- CI/CD Integration - Automated benchmarking in GitHub Actions
# Install AIPerf
pip install aiperf
# Start FakeAI with realistic latency
fakeai server --ttft 20 --itl 5
# Run benchmark
aiperf profile \
--model openai/gpt-oss-120b \
--url http://localhost:8000 \
--endpoint-type chat \
--streaming \
--concurrency 100 \
--request-count 1000

cd benchmarks
# Quick test (1 config per model)
python run_aiperf_benchmarks.py --quick
# Specific models
python run_aiperf_benchmarks.py \
--models openai/gpt-oss-120b deepseek-ai/DeepSeek-R1
# Custom concurrency levels
python run_aiperf_benchmarks.py --concurrency 50 100 250
# Full sweep
python run_aiperf_benchmarks.py --all

Latency:
- TTFT (Time To First Token) - p50, p90, p99
- ITL (Inter-Token Latency) - p50, p90, p99
- TPOT (Time Per Output Token)
- Request Latency - avg, p50, p90, p99
Throughput:
- Request throughput (requests/sec)
- Output token throughput (tokens/sec)
- Input token throughput (tokens/sec)
Token Statistics:
- Input sequence length (avg, min, max, percentiles)
- Output sequence length (avg, min, max, percentiles)
- Performance Regression Testing - Detect performance changes
- Model Comparison - Compare different model configurations
- Load Testing - Test system under various concurrency levels
- API Compatibility - Validate OpenAI API compliance
- CI/CD Integration - Automated performance testing
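For the regression-testing use case above, a hedged pytest sketch that fails if client-observed request latency drifts past a budget; the 500 ms threshold is an arbitrary placeholder, and AIPerf reports give much finer-grained percentiles.

```python
# Sketch: coarse latency regression gate against a running FakeAI server.
# LATENCY_BUDGET_S is a placeholder threshold, not a recommended value.
import time
import pytest
from openai import OpenAI

LATENCY_BUDGET_S = 0.5

@pytest.fixture
def client():
    return OpenAI(api_key="test", base_url="http://localhost:8000")

def test_completion_latency_budget(client):
    start = time.perf_counter()
    client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": "ping"}],
    )
    assert time.perf_counter() - start < LATENCY_BUDGET_S
```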
Retrieval-augmented generation with document filtering
POST /rag/api/prompt

Example:
import requests
response = requests.post("http://localhost:8000/rag/api/prompt", json={
    "query": "What is PVTMC?",
    "filters": {"family": "Solido", "tool": "SDE"},
    "inference_model": "meta-llama/Llama-3.1-70B-Instruct",
    "top_k": 5
})
result = response.json()
print(result["content"])
print(f"Retrieved {len(result['retrieved_docs'])} documents")

Features:
- Document retrieval with filtering
- Context-aware response generation
- Configurable top_k results
- Multi-tool support
O1-style chain-of-thought reasoning
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Solve: 2x + 5 = 13"}]
)
print(response.choices[0].message.reasoning_content)
print(f"Reasoning tokens: {response.usage.reasoning_tokens}")

Supported Models:
- openai/gpt-oss-120b - OpenAI O1-style reasoning
- deepseek-ai/DeepSeek-R1 - DeepSeek reasoning model
Speculative decoding for 3-5× speedup
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "The capital of France is"}],
    prediction={
        "type": "content",
        "content": "Paris, and the capital of Germany is Berlin"
    }
)
print(f"Accepted: {response.usage.accepted_prediction_tokens}")
print(f"Rejected: {response.usage.rejected_prediction_tokens}")

JSON Schema validation
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Generate a person profile"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "skills": {"type": "array", "items": {"type": "string"}}
                },
                "required": ["name", "age"]
            }
        }
    }
)

Parallel tool execution
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                }
            }
        }
    }
]

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What's the weather in SF and NYC?"}],
    tools=tools,
    tool_choice="auto"
)

Multi-modal image input
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {
                "url": "https://example.com/image.jpg",
                "detail": "high"
            }}
        ]
    }]
)

Configurable failure simulation for testing
export FAKEAI_ERROR_INJECTION_ENABLED=true
export FAKEAI_ERROR_INJECTION_RATE=0.15 # 15% error rate
export FAKEAI_ERROR_INJECTION_TYPES='["internal_error", "service_unavailable"]'
fakeai server

Error Types:
- internal_error (500)
- bad_gateway (502)
- service_unavailable (503)
- gateway_timeout (504)
- rate_limit_quota (429)
- context_length_exceeded (400)
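With injection enabled, client retry logic can be exercised end-to-end. A minimal sketch using the OpenAI Python SDK's exception classes (the backoff policy shown is a toy example, not a recommendation):

```python
# Sketch: retry around injected 5xx/429 failures. InternalServerError covers
# the injected 500/502/503/504 responses; RateLimitError covers 429.
import time
import openai
from openai import OpenAI

client = OpenAI(api_key="test", base_url="http://localhost:8000")

def chat_with_retry(prompt: str, attempts: int = 5):
    for attempt in range(attempts):
        try:
            return client.chat.completions.create(
                model="openai/gpt-oss-120b",
                messages=[{"role": "user", "content": prompt}],
            )
        except (openai.InternalServerError, openai.RateLimitError):
            time.sleep(0.1 * 2 ** attempt)  # toy exponential backoff
    raise RuntimeError("all retries exhausted")

print(chat_with_retry("Hello despite injected failures").choices[0].message.content)
```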
# Server
FAKEAI_HOST=0.0.0.0 # Server host
FAKEAI_PORT=8000 # Server port
FAKEAI_DEBUG=false # Debug mode
# Authentication
FAKEAI_REQUIRE_API_KEY=true # Require API key
FAKEAI_API_KEYS=key1,key2,key3 # Comma-separated keys
FAKEAI_HASH_API_KEYS=false # SHA-256 hashing
# Timing
FAKEAI_TTFT_MS=20 # Time to first token (ms)
FAKEAI_TTFT_VARIANCE_PERCENT=10 # TTFT variance (%)
FAKEAI_ITL_MS=5 # Inter-token latency (ms)
FAKEAI_ITL_VARIANCE_PERCENT=10 # ITL variance (%)
# KV Cache (AI-Dynamo)
FAKEAI_KV_CACHE_ENABLED=true # Enable KV cache
FAKEAI_KV_CACHE_BLOCK_SIZE=16 # Block size (tokens)
FAKEAI_KV_CACHE_NUM_WORKERS=4 # Simulated workers
FAKEAI_KV_OVERLAP_WEIGHT=1.0 # Cache overlap weight
# Rate Limiting
FAKEAI_RATE_LIMIT_ENABLED=false # Enable rate limiting
FAKEAI_RATE_LIMIT_TIER=tier-1 # Tier (tier-1 through tier-5)
FAKEAI_RATE_LIMIT_RPM=500 # Requests per minute
FAKEAI_RATE_LIMIT_TPM=10000 # Tokens per minute
# Error Injection
FAKEAI_ERROR_INJECTION_ENABLED=false # Enable error injection
FAKEAI_ERROR_INJECTION_RATE=0.0 # Error rate (0.0-1.0)
# Security
FAKEAI_ENABLE_ABUSE_DETECTION=false # Enable abuse detection
FAKEAI_ENABLE_INPUT_VALIDATION=false # Enable input validation
# CORS
FAKEAI_CORS_ALLOWED_ORIGINS=* # Allowed origins
FAKEAI_CORS_ALLOW_CREDENTIALS=true # Allow credentials

fakeai server --help
Options:
--host TEXT Server host (default: 0.0.0.0)
--port INTEGER Server port (default: 8000)
--debug Enable debug mode
--ttft FLOAT Time to first token in ms (default: 20)
--itl FLOAT Inter-token latency in ms (default: 5)
--require-api-key Require API key authentication
--api-keys TEXT Comma-separated API keys
--kv-cache-enabled Enable KV cache simulation
--rate-limit-enabled Enable rate limiting

pip install fakeai

git clone https://github.com/ajcasagrande/fakeai.git
cd fakeai
pip install -e .

# Development tools
pip install -e ".[dev]"
# LLM generation (tiktoken, transformers, torch)
pip install -e ".[llm]"
# Semantic embeddings (sentence-transformers)
pip install -e ".[embeddings]"
# Vector stores (faiss)
pip install -e ".[vector]"
# All features
pip install -e ".[all]"

# Start with zero latency for fast iteration
fakeai server --ttft 0 --itl 0
# Test your application
python my_app.py

import pytest
from openai import OpenAI
@pytest.fixture
def client():
    return OpenAI(api_key="test", base_url="http://localhost:8000")

def test_chat_completion(client):
    response = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": "Test"}]
    )
    assert response.choices[0].message.content

# .github/workflows/test.yml
name: Test
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Start FakeAI
        run: |
          pip install fakeai
          fakeai server --ttft 0 --itl 0 &
          sleep 5
      - name: Run tests
        run: pytest tests/

# Establish baseline with AIPerf
aiperf profile \
--model openai/gpt-oss-120b \
--url http://localhost:8000 \
--endpoint-type chat \
--streaming \
--concurrency 100 \
--request-count 1000

# Test at various concurrency levels
for concurrency in 10 50 100 250 500; do
aiperf profile \
--model openai/gpt-oss-120b \
--url http://localhost:8000 \
--endpoint-type chat \
--concurrency $concurrency \
--request-count 500
done

- CLI Usage - Command-line interface guide
- API Key Guide - Authentication setup
- Docker - Docker deployment
- API Reference - Complete API documentation
- Endpoints - All available endpoints
- Schemas - Request/response schemas
- Examples - Code examples
- Realtime API - WebSocket streaming
- Features Overview - Complete feature list
- Reasoning Support - Advanced reasoning
- Structured Outputs - JSON schema validation
- Tool Calling - Function calling
- Multimodal - Vision, audio, video
- Image Generation - Image creation
- Semantic Embeddings - Vector embeddings
- Streaming - Advanced streaming
- Safety - Content moderation
- AWS Deployment - Deploy to AWS
- Azure Deployment - Deploy to Azure
- Cloud Run - Deploy to GCP Cloud Run
- Kubernetes - Deploy to Kubernetes
- HTTP/2 Guide - Enable HTTP/2
- Configuration Reference - All config options
- Configuration Summary - Quick reference
- Context Validator - Context length validation
- Monitoring System - Metrics and monitoring
- Metrics Streaming - Real-time metrics
- Model Metrics - Per-model tracking
- Operations - Operational guide
- Performance - Performance benchmarks
- Performance Tuning - Optimization guide
- Contributing - Contribution guidelines
- Architecture - System architecture
- Development Guide - Developer setup
- Testing - Testing guide
- CLAUDE.md - AI assistant knowledge base
- Migration Guide - Version upgrades
- Middleware Architecture - Middleware system
- Changelog - Version history
- Security - Security features
- Client SDK - SDK documentation
- Error Injection - Testing with errors
Background research and technical analysis documents:
- DCGM Health Metrics - DCGM health monitoring metrics
- DCGM Profiling - GPU profiling with DCGM
- Dynamo Inference Metrics - AI-Dynamo metrics system
- Fine-tuning - Fine-tuning API research
- GPU Architecture Metrics - Comprehensive GPU metrics catalog
- gRPC HTTP/2 - gRPC and HTTP/2 analysis
- Realtime API - OpenAI Realtime API research
- TensorRT-LLM Metrics - TensorRT-LLM performance metrics
- Triton Metrics - NVIDIA Triton metrics
- Usage Billing API - OpenAI usage tracking research
When the server is running:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- Metrics Dashboard: http://localhost:8000/dashboard
- Dynamo Dashboard: http://localhost:8000/dashboard/dynamo
# All tests (2,500+ tests)
pytest -v
# Specific module
pytest tests/test_embedding_service.py -v
# With coverage
pytest --cov=fakeai --cov-report=html
# Specific markers
pytest -m unit -v # Unit tests
pytest -m integration -v # Integration tests
pytest -m service -v # Service layer tests

FakeAI is 100% compatible with:
- OpenAI Python SDK (v1.0+)
- OpenAI Node SDK (v4.0+)
- NVIDIA AIPerf (v1.0+)
- NVIDIA NIM - Native NIM endpoint support
- LangChain (via OpenAI integration)
- LlamaIndex (via OpenAI integration)
- Any OpenAI-compatible client
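For example, a hedged LangChain sketch pointed at a local FakeAI server (uses the langchain-openai package; the model name is simply whatever the server exposes):

```python
# Sketch: LangChain via its OpenAI integration, targeting FakeAI.
# Requires: pip install langchain-openai
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="openai/gpt-oss-120b",
    api_key="any-key-works",
    base_url="http://localhost:8000",
)
print(llm.invoke("Say hello from LangChain").content)
```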
- Python 3.10+
- FastAPI - Web framework
- Pydantic v2 - Data validation
- uvicorn - ASGI server
- hypercorn - HTTP/2 support
- numpy - Numerical operations
- faker - Realistic data generation
FakeAI is built with 90+ modular components organized into:
- 4 core modules - app, service, CLI, async server
- 11 configuration modules - Type-safe, domain-specific configs
- 7 model modules - Organized by feature (chat, embeddings, images, audio, batches)
- 9 registry modules - Model catalog with fuzzy matching and capabilities
- 8 service modules - Single-responsibility business logic
- 8 shared utilities - Zero code duplication
- 18 metrics systems - Production-grade monitoring
- 6 content generation modules - Optional ML integration
- 10+ infrastructure modules - Security, rate limiting, file management
Design Principles:
- Single Responsibility - Each module has one clear purpose
- Zero Duplication - Shared utilities eliminate repetition
- Test-Driven - 2,500+ tests with behavior-driven design
- Type-Safe - Full type hints with Python 3.10+ syntax
- Thread-Safe - Singleton patterns with locks
- Async Throughout - High-performance async/await
- Production-Ready - Battle-tested patterns
Contributions are welcome! See CONTRIBUTING.md for guidelines.
# Clone repository
git clone https://github.com/ajcasagrande/fakeai.git
cd fakeai
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest -v
# Format code
black fakeai/ && isort fakeai/
# Run linters
flake8 fakeai/
mypy fakeai/

Apache-2.0
- Issues: https://github.com/ajcasagrande/fakeai/issues
- Discussions: https://github.com/ajcasagrande/fakeai/discussions
FakeAI is built with production-grade engineering practices and is actively used for development, testing, and benchmarking of AI applications. Special thanks to:
- NVIDIA AI-Dynamo - KV cache and smart routing inspiration
- NVIDIA NIM - Inference microservices standards
- NVIDIA DCGM - GPU telemetry standards
- NVIDIA Cosmos - Video understanding capabilities
- AIPerf - Comprehensive benchmarking framework
- Solido - RAG integration patterns
- OpenAI - API specification and standards
Note: FakeAI is a simulation server for testing and development. For production inference, use actual inference servers like NVIDIA Dynamo.
