A reproducible benchmarking pipeline for machine learning models, focused on analyzing inference efficiency. It captures CPU utilization, MACs, memory consumption (RSS), model size, and runtime-specific resource demands, combined with hardware-aware profiling for consistent cross-system evaluation.
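To illustrate the kind of measurements involved, here is a minimal sketch of a per-model efficiency probe, assuming a PyTorch model and using `psutil` for CPU/RSS and `thop` for MAC counting. It is not the project's actual harness; the function name and structure are illustrative.

```python
# Illustrative sketch only -- not the repository's benchmarking code.
import time

import psutil
import torch
from thop import profile  # counts MACs/params for a given input shape


def benchmark_inference(model: torch.nn.Module, example: torch.Tensor, runs: int = 50) -> dict:
    """Collect rough inference-efficiency metrics for one model and input shape."""
    model.eval()
    proc = psutil.Process()

    # MACs and parameter count for the example input.
    macs, params = profile(model, inputs=(example,), verbose=False)

    psutil.cpu_percent(interval=None)  # prime the CPU utilization counter
    latencies = []
    with torch.no_grad():
        for _ in range(runs):
            start = time.perf_counter()
            model(example)
            latencies.append(time.perf_counter() - start)

    return {
        "macs": macs,
        "params": params,
        "mean_latency_s": sum(latencies) / len(latencies),
        "cpu_percent": psutil.cpu_percent(interval=None),
        "rss_bytes": proc.memory_info().rss,  # resident set size after the run
    }
```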
Comprehensive LLM evaluation framework comparing local and cloud models with hardware-aware benchmarking. It evaluates models across code generation, document analysis, and structured output tasks using pass@k, LLM-as-Judge, and RAG metrics, and supports Ollama, Google Gemini, Anthropic, and OpenAI backends.
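As a reference point for the code-generation metric, below is a small sketch of the standard unbiased pass@k estimator (Chen et al., 2021). The framework's own metric implementation may differ; this is only for illustration.

```python
# Unbiased pass@k estimator: probability that at least one of k sampled
# completions (out of n generated, c of which passed the tests) is correct.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 3 of 10 generated solutions pass the unit tests.
print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 3))  # ~0.917
```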