# Lost in the Haystack: Smaller Needles are More Difficult for LLMs to Find
## Setup

We use Conda for environment management. Create and activate the environment:

```bash
conda env create -f environment.yml
conda activate lost-in-the-haystack-env
```

Copy `example.env` to `.env` and fill in your private API keys:

```bash
cp example.env .env
# Edit .env with your credentials
```

## Data Preparation

To recreate the benchmark-specific datasets, run the following commands:
- **CARDBiomedBench (CBB):** requires retrieval data (not redistributable due to licensing).

  ```bash
  python -m scripts.prep --config configs/benchmarks/cbb.yaml
  ```
- **NaturalQuestions (NQ):** requires downloading the HELMET data from here and placing it under `data/raw/`.

  ```bash
  python -m scripts.prep --config configs/benchmarks/nq.yaml
  ```
- **NuminaMath1.5 (NM):** automatically fetched from HuggingFace.

  ```bash
  python -m scripts.prep --config configs/benchmarks/nm.yaml
  ```
## Running Experiments

Experiments are executed by pairing a model with a benchmark. Example execution:

```bash
python -m scripts.run --exp-config configs/experiments/cbb_gemini2flash.yaml
```

Replace the config path above with the desired benchmark and model combination. Available configs are located under `configs/experiments/`.
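An experiment config pairs one benchmark with one model. The sketch below shows what such a pairing might look like; the field names here are illustrative assumptions, not the repository's actual schema, so check the real files under `configs/experiments/` for the authoritative format:

```yaml
# Illustrative sketch of an experiment config (field names are hypothetical)
benchmark: configs/benchmarks/cbb.yaml      # benchmark definition to run
model: configs/models/gemini2flash.yaml     # model client to pair with it
output_dir: data/tasks/cbb_gemini2flash     # where results would be written
```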
## Analysis

Analysis can be performed at two levels:

- **Benchmark-wide analysis:**

  ```bash
  python -m scripts.analyze --bench-config configs/benchmarks/cbb.yaml
  python -m scripts.analyze --bench-config configs/benchmarks/nm.yaml
  python -m scripts.analyze --bench-config configs/benchmarks/nq.yaml
  ```

- **Model-specific experiment analysis:**

  ```bash
  python -m scripts.analyze --exp-config configs/experiments/cbb_gemini2flash.yaml
  ```

Graphs and analysis outputs are generated automatically.
## Repository Structure

```
.
├── configs
│   ├── benchmarks         # Benchmark configurations
│   ├── experiments        # Experiment-specific benchmark_model pairings
│   └── models             # Model configurations
├── data
│   ├── images             # Images and graphs
│   ├── raw                # Raw input data
│   └── tasks              # Prepared tasks for benchmarks
├── scripts
│   ├── analyze.py         # Analysis entry point
│   ├── run.py             # Experiment execution entry point
│   ├── prep.py            # Data preparation entry point
│   ├── models             # Model initialization and clients
│   │   ├── base_llm.py    # Abstract model class
│   │   ├── ...            # Client-specific LLM classes (Azure AI, Azure OpenAI, Google, and HuggingFace)
│   │   └── llm_client.py  # LLM factory
│   └── utils
│       ├── cbb_run.py     # Benchmark-specific run utils
│       ├── nq_run.py
│       ├── nm_run.py
│       ├── cbb_analyze.py # Benchmark-specific analysis utils
│       ├── nq_analyze.py
│       ├── nm_analyze.py
│       ├── metrics.py     # Metric utilities
│       ├── graph_utils.py # Visualization utilities
│       └── utils.py       # Helper utilities
├── slurm
│   ├── run_gem2lite.sh    # Example SLURM scripts for HPC execution
│   └── ...
├── .gitignore             # Gitignore file
├── environment.yml        # Conda environment specification
├── example.env            # Template for API keys
└── README.md              # This document
```
## SLURM Execution

The `slurm/` directory contains scripts configured for batch execution on HPC clusters using SLURM:

```bash
sbatch slurm/run_gem2lite.sh
```

Ensure paths and environment settings are correct for your HPC environment.
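The provided scripts follow the usual SLURM batch pattern; a minimal sketch is shown below, where the job name, time limit, and resource requests are placeholders you must adapt to your cluster (only the final `python -m scripts.run` invocation comes from this repository):

```bash
#!/bin/bash
#SBATCH --job-name=haystack-run   # job name shown in the queue (placeholder)
#SBATCH --time=12:00:00           # wall-clock limit (placeholder)
#SBATCH --gres=gpu:1              # GPU request; adjust per model
#SBATCH --output=slurm-%j.out     # per-job-id log file

# Activate the Conda environment, then launch one experiment.
source activate lost-in-the-haystack-env
python -m scripts.run --exp-config configs/experiments/cbb_gemini2flash.yaml
```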
## Adding a New Model

To add a new LLM:

1. Create a new YAML config file under `configs/models/`.
2. Extend the `base_llm.py` abstract class in `scripts/models/`.
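Step 2 might look like the following minimal sketch. This assumes `base_llm.py` exposes an abstract class with a single prompt-in, text-out method; the class and method names below (`BaseLLM`, `query`, `EchoLLM`) are illustrative stand-ins, not the repository's actual interface:

```python
from abc import ABC, abstractmethod

# Hypothetical mirror of the abstract class assumed to live in
# scripts/models/base_llm.py; names are illustrative only.
class BaseLLM(ABC):
    def __init__(self, model_name: str):
        self.model_name = model_name

    @abstractmethod
    def query(self, prompt: str) -> str:
        """Send a prompt to the backing model and return its text response."""

class EchoLLM(BaseLLM):
    """Toy client that echoes the prompt, standing in for a real API client."""
    def query(self, prompt: str) -> str:
        return f"[{self.model_name}] {prompt}"

llm = EchoLLM("my-new-model")
print(llm.query("hello"))
```

A real subclass would replace the body of `query` with a call to the provider's client (e.g. the Azure or Google SDK) using the credentials from `.env`.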
## Citation

```bibtex
@article{bianchi2025SmallerNeedles,
  title         = {Lost in the Haystack: Smaller Needles are More Difficult for LLMs to Find},
  author        = {Owen Bianchi and Mathew J. Koretsky and Maya Willey and Chelsea X. Alvarado and Tanay Nayak and Adi Asija and Nicole Kuznetsov and Mike A. Nalls and Faraz Faghri and Daniel Khashabi},
  year          = {2025},
  journal       = {arXiv preprint arXiv:2505.18148},
  volume        = {abs/2505.18148},
  url           = {https://arxiv.org/abs/2505.18148},
  eprint        = {2505.18148},
  archiveprefix = {arXiv},
  primaryclass  = {cs.CL},
  code          = {https://github.com/NIH-CARD/LostInTheHaystack},
}
```
Enjoy exploring how LLMs handle varying gold context sizes!
