This repository contains the code and datasets for evaluating large language models (LLMs) on the numeric temporal reasoning dataset TempAnswerQA, sampled from Test of Time (ToT) and TempTabQA (TTQA). This project implements a comprehensive evaluation framework using regression-like metrics.
The corresponding paper, "Time to Rethink Exact Match", has been accepted to the Findings of EMNLP 2025.
The codebase provides tools for:
- Running inference on TempAnswerQA using Hugging Face transformers
- Parsing model responses into numeric, time-aware objects
- Evaluating model responses with symmetric mean absolute percentage error (sMAPE) and mean absolute scaled error (MASE)
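For intuition, the sketch below shows common textbook forms of the two regression-style metrics in NumPy. This is an illustration only, not the repository's metrics.py: normalisation constants for sMAPE vary across the literature, and the paper computes the MASE scaling term differently (the notes at the end of this README mention clustering and centroids), whereas the sketch takes a generic baseline from the caller.

```python
import numpy as np

def smape(y_true, y_pred) -> float:
    """Symmetric MAPE: 0 for a perfect answer, 2 in the worst case."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    terms = np.zeros_like(denom)
    nonzero = denom != 0
    terms[nonzero] = 2.0 * np.abs(y_pred - y_true)[nonzero] / denom[nonzero]
    return float(terms.mean())

def mase(y_true, y_pred, y_baseline) -> float:
    """Mean absolute error scaled by the error of a baseline prediction."""
    y_true, y_pred, y_baseline = (np.asarray(a, dtype=float) for a in (y_true, y_pred, y_baseline))
    return float(np.abs(y_pred - y_true).mean() / np.abs(y_baseline - y_true).mean())

# A near-miss answer: exact match scores it 0, but sMAPE reflects that it is close.
print(smape([30.0], [29.0]))   # ~0.034
```

Unlike exact match, both metrics reward numerically close answers, which is the motivation behind the paper.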
- Python ≥ 3.11
- CUDA-compatible GPU (recommended for model inference)
- Hugging Face account with access token
- Clone the repository:
- Install dependencies using uv (recommended):
uv sync
Or using pip:
pip install -e .
- Set up environment variables:
You are expected to set your Hugging Face token in a .env file, since our experiments used access-restricted Llama models.
# Create a .env file with your Hugging Face token
echo "HF_TOKEN=your_huggingface_token_here" > .envSince TempAnswerQA consists of ToT and TTQA, we still refer to both datasets and their splits by theirs names.
Since TempAnswerQA consists of ToT and TTQA, we still refer to both datasets and their splits by their names.

Test of Time (ToT) is a synthetic temporal reasoning dataset with two categories:
- Arithmetic: Date calculations, duration computations, and temporal arithmetic
- Semantic: Temporal logic questions with a graph as context
TempTabQA (TTQA) is a dataset of temporal, entity-based questions over semi-structured Wikipedia tables, with two splits:
- Head: Questions about more prominent entities
- Tail: Questions about less prominent entities
The main interface is through the CLI using main.py:
# Few-shot prompting on arithmetic split
python main.py inference-tot "meta-llama/Llama-3.1-8B-Instruct" add_generation_prompt few-shot arithmetic
# Zero-shot prompting on semantic split
python main.py inference-tot "meta-llama/Llama-3.1-8B-Instruct" continue_final_message zero-shot semantic
# Few-shot prompting on head split
python main.py inference-ttqa "meta-llama/Llama-3.1-8B-Instruct" add_generation_prompt few-shot head
# Zero-shot prompting on tail split
python main.py inference-ttqa "meta-llama/Llama-3.1-8B-Instruct" continue_final_message zero-shot tail

The evaluation commands below calculate sMAPE, MASE, and exact match (EM) for all model responses generated in the step above.
python main.py evaluate-tot data/responses/ continue_final_message
python main.py evaluate-ttqa data/responses/ add_generation_prompt

- model_name: Hugging Face model identifier (e.g., meta-llama/Llama-3.1-8B-Instruct)
- last_token: Token handling strategy (see the sketch after this list)
  - add_generation_prompt: Adds a generation prompt to the chat template
  - continue_final_message: Continues from the final message
- prompting: Prompting strategy
  - few-shot: Uses example demonstrations
  - zero-shot: No examples provided
- split: Dataset split
  - ToT: arithmetic or semantic
  - TTQA: head or tail
- test_mode: Boolean flag for testing with a small subset of data
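The two last_token options correspond to standard arguments of transformers' apply_chat_template. A rough sketch of the difference, assuming a recent transformers version (the actual prompt construction lives in chat_builder.py, and the assistant prefix below is purely illustrative; the real prompts are under data/prompts/):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [{"role": "user", "content": "How many days are between 2021-03-01 and 2021-03-15?"}]

# add_generation_prompt: close the user turn and cue the assistant to start a new answer.
prompt_a = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# continue_final_message: leave the last (assistant) message open so the model completes it.
messages_b = messages + [{"role": "assistant", "content": "The answer is"}]
prompt_b = tok.apply_chat_template(messages_b, tokenize=False, continue_final_message=True)
```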
temp-answer-qa/
├── main.py # CLI interface
├── temp_answer_qa/ # Main package
│ ├── __init__.py # Core enums and constants
│ ├── chat_builder.py # Chat template builders
│ ├── data_loader.py # Dataset loading utilities
│ ├── evaluate.py # Evaluation pipeline
│ ├── inference.py # Model inference
│ ├── measure_error.py # Parsing and metric application
│ ├── metrics.py # Evaluation metrics
│ ├── models.py # Hugging Face model wrapper
│ └── response_processing.py # Response parsing and processing
├── data/
│ ├── prompts/ # Few-shot examples and system prompts
│ ├── questions/ # Dataset files (tot.csv, ttqa.csv)
│ ├── responses/ # Generated model responses
│ └── responses_evaluated/ # Evaluation results
└── tests/ # Unit tests
Each row in tot.csv has the following fields:
- question: Full question with formatting instructions
- label: Ground truth answer as a dictionary
- question_wo_instruct: Question without formatting instructions
- instruction: JSON formatting instructions
- answer_format: Expected answer format
- answer_temporal_unit: Type of temporal unit (date, days, months, etc.)
- split: Dataset split (arithmetic/semantic)
Each row in ttqa.csv has the following fields:
- question: Question about the table
- label: Ground truth answer
- table_context: Structured table data
- answer_format: Expected answer format
- answer_temporal_unit: Type of temporal unit
- split: Dataset split (head/tail)
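For a quick look at the raw data, the CSVs can be loaded with pandas using the column names listed above (an inspection sketch; the repository's own loading utilities are in data_loader.py):

```python
import pandas as pd

tot = pd.read_csv("data/questions/tot.csv")
ttqa = pd.read_csv("data/questions/ttqa.csv")

# Ground-truth answers and their temporal units for the ToT dataset
print(tot[["question", "label", "answer_temporal_unit", "split"]].head())

# Distribution of TTQA examples across the head and tail splits
print(ttqa["split"].value_counts())
```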
This repo underwent refactoring after submission. During that process, we found a few issues.
Clustering depends on the order of the data, which we did not adequately control for during our experiments. Reproducing the results therefore requires applying the same data ordering as used for the paper (this is handled in evaluate.py via the DataFrame's index).
Despite using the same library versions and obtaining the same clusters from the same values, this new version of the code exhibits small differences in MASE scores for the ToT dataset. We suspect numerical instabilities in the centroid calculation to be the reason.
After refactoring, we also found a mistake in the MASE calculation for 137 TTQA examples: instead of an error based on the date's timestamp, we used the number of days. The difference in scores is, however, small.
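As a toy illustration of the two error definitions for a date-valued answer (not the repository's code):

```python
from datetime import datetime, timezone

gold = datetime(2001, 1, 1, tzinfo=timezone.utc)
pred = datetime(2001, 1, 14, tzinfo=timezone.utc)

# Error measured on the dates' timestamps (seconds since epoch)
error_timestamp = abs(pred.timestamp() - gold.timestamp())   # 1_123_200.0

# Error measured as a number of days
error_days = abs((pred - gold).days)                          # 13
```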
The uploaded results are the ones we generated and used for the paper.
The code in this repository is licensed under the MIT License. See the LICENSE file for details.
The datasets in data/questions/ are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. See data/questions/LICENSE for details.
Other files under data/ may include artifacts or evaluation outputs; they retain the licenses of their respective sources unless otherwise noted.