"AgReasoning Benchmark", which introduces a large-scale question-answering (QA) benchmark tailored to the agricultural domain.
- Goal: Benchmark LLMs and reasoning models on domain-specific agronomic QA tasks.
- Dataset: 55K expert-in-the-loop QA pairs covering diverse agricultural question categories.
- Key Contributions:
  - A multi-stage, flowchart-driven pipeline for dataset curation.
  - An evaluation framework using LLM-as-a-Judge.
  - A distilled model that matches the performance of larger models at higher efficiency.
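
To make the LLM-as-a-Judge idea concrete, here is a minimal sketch of how such an evaluation loop might look. It is not the benchmark's actual framework: the prompt wording, the binary CORRECT/INCORRECT verdict, and the names `JUDGE_TEMPLATE`, `build_judge_prompt`, `parse_verdict`, and `judge_accuracy` are all assumptions made for illustration. The judge is passed in as a plain callable so that any model client could be plugged in.

```python
import re
from typing import Callable, Iterable, Tuple

# Hypothetical grading prompt; the benchmark's real rubric is not given in the source.
JUDGE_TEMPLATE = (
    "You are grading an answer to an agricultural question.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Reply with 'VERDICT: CORRECT' or 'VERDICT: INCORRECT'."
)


def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Fill the grading template for one QA pair."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )


def parse_verdict(reply: str) -> bool:
    """Return True only if the judge's reply contains an explicit CORRECT verdict."""
    match = re.search(r"VERDICT:\s*(CORRECT|INCORRECT)", reply, re.IGNORECASE)
    # Replies without a parseable verdict are conservatively counted as incorrect.
    return bool(match) and match.group(1).upper() == "CORRECT"


def judge_accuracy(
    examples: Iterable[Tuple[str, str, str]],
    judge: Callable[[str], str],
) -> float:
    """Fraction of (question, reference, candidate) triples the judge marks correct."""
    verdicts = [
        parse_verdict(judge(build_judge_prompt(q, ref, cand)))
        for q, ref, cand in examples
    ]
    if not verdicts:
        raise ValueError("no examples to evaluate")
    return sum(verdicts) / len(verdicts)
```

In practice `judge` would wrap a call to whichever model serves as the grader; keeping it as a callable also makes the loop easy to test with a stub.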