Benchmark LLMs on real professional tasks, not academic puzzles. A YAML-driven experiment pipeline and live React dashboard for the GDPVal Gold Subset (220 tasks across 11 industries).
Updated Mar 31, 2026 - Python
Benchmark large language models on real expert tasks using a YAML-driven pipeline and live dashboard for the GDPVal Gold Subset.