Benchmark LLMs on real professional tasks, not academic puzzles. A YAML-driven experiment pipeline and live React dashboard for the GDPVal Gold Subset (220 tasks across 11 industries).
Updated Mar 31, 2026 - Python
Benchmark large language models on real expert tasks using a YAML-driven pipeline and live dashboard for the GDPVal Gold Subset.