This repository contains the BuildingQA benchmark, a dataset for Natural Language Question Answering over Building Knowledge Graphs (BKGs).
The benchmark consists of four comprehensive knowledge graphs and a curated set of natural language questions, each paired with a ground-truth SPARQL query. As a baseline for this benchmark, we also provide an implementation of a ReAct (Reasoning and Acting) agent designed to translate questions into SPARQL.
The benchmark comprises two main components:

**Knowledge graphs**
- Contents: Four distinct Building Knowledge Graphs (BKGs).
- Location: `eval_buildings/`
- Format: Turtle (`.ttl`)

**Question-answer pairs**
- Contents: A collection of natural language questions, each manually paired with a ground-truth SPARQL query.
- Location: `Benchmark_QA_pairs/`
- Format: A separate JSON file is provided for each of the four buildings, containing all of its associated questions and queries.
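For orientation, here is a minimal loading sketch in Python; the building file names are placeholders (substitute the actual files from the two directories), and the QA JSON layout is an assumption rather than a documented schema:

```python
# Hedged sketch: load one building's graph and its QA pairs.
# "building1" is a placeholder name, not necessarily a real file.
import json
from rdflib import Graph

g = Graph()
g.parse("eval_buildings/building1.ttl", format="turtle")  # placeholder path
print(f"Loaded {len(g)} triples")

with open("Benchmark_QA_pairs/building1.json") as f:      # placeholder path
    qa_pairs = json.load(f)                                # questions + gold SPARQL
```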
To evaluate performance on the benchmark, this project uses four metrics that assess query correctness with increasing strictness, from basic structural checks to exact content matching:
| Metric | What it Checks | Strictness | Column Alignment Performed? |
|---|---|---|---|
| Arity Matching F1 | The number of columns. | Low | No |
| Entity Set F1 | Sets of unique values within columns. | Medium | Yes (Finds best match) |
| Row-Matching F1 | Row-for-row content. | High | Yes (Finds best match) |
| Exact-Match F1 | Column order and row-for-row content. | Very High | No (Assumes fixed order) |
Row-Matching F1 is the primary indicator of a perfectly correct query, as it confirms correct entities and relationships, regardless of column order.
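The repository's exact scorer is not reproduced here, but a minimal sketch of a Row-Matching F1 with best-match column alignment could look like the following: rows are compared as multisets of tuples, and every column permutation of the predicted table is tried so that the score is independent of column order.

```python
# Illustrative sketch, not the benchmark's official implementation.
from collections import Counter
from itertools import permutations

def row_matching_f1(pred_rows, gold_rows):
    """pred_rows / gold_rows: lists of equal-arity tuples of cell values."""
    if not pred_rows or not gold_rows:
        return float(pred_rows == gold_rows)  # 1.0 only if both are empty
    best = 0.0
    for perm in permutations(range(len(pred_rows[0]))):
        reordered = [tuple(row[i] for i in perm) for row in pred_rows]
        # Multiset overlap between predicted and gold rows.
        overlap = sum((Counter(reordered) & Counter(gold_rows)).values())
        precision, recall = overlap / len(reordered), overlap / len(gold_rows)
        if precision + recall:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```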
We provide a baseline agentic workflow to demonstrate how to interact with the benchmark. It pairs two agents in a loop to iteratively generate and refine SPARQL queries (a minimal sketch follows the list):
- Query Writer Agent: Generates and revises SPARQL queries based on a question and graph context.
- Critique Agent: Evaluates the query's correctness and results. It either approves the query (`FINAL`) or provides specific feedback for improvement (`IMPROVE`).
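As a rough illustration of this loop (not the repository's implementation), the sketch below uses hypothetical `write_query` and `critique_query` helpers standing in for the two LLM calls; each is assumed to return one of the Pydantic models shown.

```python
# Schematic two-agent loop; write_query/critique_query are hypothetical stand-ins.
from pydantic import BaseModel

class QueryDraft(BaseModel):
    sparql: str

class Critique(BaseModel):
    verdict: str        # "FINAL" or "IMPROVE"
    feedback: str = ""

def react_loop(question: str, context: str, max_rounds: int = 5) -> str:
    feedback = ""
    for _ in range(max_rounds):
        draft = write_query(question, context, feedback)   # Query Writer Agent
        review = critique_query(question, draft.sparql)    # Critique Agent
        if review.verdict == "FINAL":
            return draft.sparql
        feedback = review.feedback                         # revise next round
    return draft.sparql  # give up after max_rounds, keep the last draft
```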
- ReAct Framework: Implements a two-agent loop for query generation and refinement.
- Structured I/O: Uses Pydantic models for reliable JSON-based LLM communication.
- Flexible Data Sources: Natively supports both remote SPARQL endpoints and the local `.ttl` graph files provided in this benchmark (see the sketch after this list).
- Context-Aware: Injects a `num_triples` subset of the knowledge graph into the agent's context.
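To make the two data-source modes concrete, here is a hedged sketch; the endpoint URL and file name are placeholders, not values shipped with the repository.

```python
# Run a SPARQL query against either a local .ttl file or a remote endpoint.
from rdflib import Graph
from SPARQLWrapper import SPARQLWrapper, JSON

def run_query(sparql: str, use_local: bool):
    if use_local:
        g = Graph()
        g.parse("eval_buildings/building1.ttl", format="turtle")  # placeholder
        return list(g.query(sparql))
    endpoint = SPARQLWrapper("http://localhost:7200/repositories/building1")  # placeholder
    endpoint.setQuery(sparql)
    endpoint.setReturnFormat(JSON)
    return endpoint.query().convert()["results"]["bindings"]
```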
- Python 3.9+
- Access to an LLM API (e.g., OpenAI)
- A running SPARQL endpoint (like GraphDB) or the local `.ttl` graph files.
- Clone this repository.
- Install the required Python packages (e.g., `rdflib`, `sparqlwrapper`, `openai`, `pydantic`):

```
pip install rdflib sparqlwrapper openai pydantic numpy pandas
```

Open `ReAct_demo.ipynb` and configure the main parameters in the code cells:
- API Key: Set your `OPENAI_API_KEY` and initialize the `client`.
- Experiment Parameters: Set variables like `MODEL_NAME`, `BUILDING_NAME`, `num_triples`, `USE_LOCAL_TTL_FILE`, `REMOTE_ENDPOINT_URL`, and `LOG_FILE` (example settings below).
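An illustrative configuration cell might look like this; every value below is an example, not a required default.

```python
# Example settings only; adjust to your environment.
MODEL_NAME = "gpt-4o"            # any chat model your API key can access
BUILDING_NAME = "building1"      # placeholder building identifier
num_triples = 500                # how many triples to inject as graph context
USE_LOCAL_TTL_FILE = True        # False to query REMOTE_ENDPOINT_URL instead
REMOTE_ENDPOINT_URL = "http://localhost:7200/repositories/building1"  # placeholder
LOG_FILE = "results.csv"         # per-question results are written here
```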
- Open `ReAct_demo.ipynb` in Jupyter.
- Execute the cells. The `run_single_question()` function is configured for a quick test.
- Execute the `run_all_buildings()` cell to run the full benchmark against all questions.
- Results will be saved to the CSV file specified in your `LOG_FILE` variable.
If you use this work in your research, please cite the following paper:
@inproceedings{Mulayim2025BuildingQA,
  author    = {Mulayim, Ozan Baris and Anwar, Avia and Saka, Mete and Paul, Lazlo and Prakash, Anand Krishnan and Fierro, Gabe and Pritoni, Marco and Berg\'{e}s, Mario},
  title     = {{BuildingQA: A Benchmark for Natural Language Question Answering over Building Knowledge Graphs}},
  booktitle = {The 12th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation (BUILDSYS '25)},
  series    = {BUILDSYS '25},
  year      = {2025},
  month     = nov,
  publisher = {ACM},
  address   = {New York, NY, USA},
  location  = {Golden, CO, USA},
  pages     = {1--11},
  numpages  = {11},
  doi       = {10.1145/3736425.3770097},
  url       = {https://doi.org/10.1145/3736425.3770097}
}