To appear at SIGMOD '26: 2026 International Conference on Management of Data, Bengaluru, India, May 31–June 5, 2026.
To be published in Proceedings of the ACM on Management of Data, Vol. 4, No. 3 (SIGMOD), Article 240 (Jun 2026), 27 pages. https://doi.org/10.1145/3802117.
Full technical report available at technical-report.pdf / arXiv:2603.15465 / EPFL Infoscience 20.500.14299/261601.
- Scala 3.3.1 with sbt 1.6.2
- For DPconv, DuckDB, and Yannakakis+: CMake 4.0.3, GNU Make 3.81, clang 20.1.3
- For join tree enumeration with traditional GYO algorithms (implemented by SparkSQL+): Maven 3.8.6
- For LearnedRewrite, LLM-R2, and reproducing the figures: Python 3.10.1
Please execute all the following commands from the root directory of the repository. The commands and scripts are written on macOS Tahoe 26.0.1 with zsh 5.9 / GNU bash 3.2.57(1). For other operating systems or shells, you may need to adapt the commands accordingly.
Because the datasets must be converted to DuckDB format and the required files are large, we provide a setup script that downloads everything. Please run the following script to set up the folder structure and download the files. Note that these files take approximately 26 GB of storage space in total.
bash src/main/scripts/setup.sh
This script will download and unzip the files for the benchmarks, and create folders to be used to store experiment results. After running this script, the folder structure should look like the following:
metaDecomp (root)
|- benchmarks
|- dsb
|- job-large
|- job-original
|- musicbrainz
|- datasets
|- dsb
|- imdb
|- musicbrainz
|- experiment-results
|- figures
|- join-trees
...
If this is not the case (e.g., unzipping created a new top-level folder), please move the folders accordingly.
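As a hypothetical example (the folder name job-large and its target location are illustrative assumptions; adapt them to whatever stray folder you actually see), a stray top-level folder can be moved into place like this:

```shell
# Move a stray folder "$1" under directory "$2", but only if it exists
# at the root and is not already in place.
place_folder() {
  if [ -d "$1" ] && [ ! -d "$2/$1" ]; then
    mkdir -p "$2"
    mv "$1" "$2/$1"
  fi
}

# e.g., if unzipping left job-large at the root instead of under benchmarks/:
place_folder job-large benchmarks
```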
The DSB dataset and queries are generated using the code in https://github.com/microsoft/dsb. The IMDB dataset (for JOB and JOBLarge) comes directly from http://event.cwi.nl/da/job/imdb.tgz. The Musicbrainz dataset was downloaded from https://data.metabrainz.org/pub/musicbrainz/data/fullexport/ on September 19, 2025 and converted to DuckDB format. The queries are generated using the code in https://github.com/Manciukic/postgres-gpuqo/tree/master/scripts/databases/musicbrainz. All cardinality information is formatted in the same way as in https://github.com/utndatasystems/DPconv. All files used in the experiments can be downloaded with the setup script above, so no further action is required.
We adapted the DPconv code for our experiments and added an implementation of UnionDP on top of it. The adapted version is given in another repository, which is a fork of the original DPconv repository.
First clone the repository:
git clone https://github.com/zhekai-jiang/DPconv.git
There should then be a folder DPconv under the root directory of this repository:
metaDecomp (root)
|- DPconv
|- queries
|- src
|- ...
...
Then, build DPconv, as per the instructions in the original DPconv repository:
cd DPconv/src
mkdir -p build
cd build
cmake ..
make
We slightly modified the code of DuckDB to measure the optimization time. The modified code is given in another repository, which is a fork of the original DuckDB repository.
First clone the repository:
# Remember to change back to the root of the project repository
git clone https://github.com/zhekai-jiang/duckdb.git
There should then be a folder duckdb under the root directory of this repository:
metaDecomp (root)
|- duckdb
|- benchmark
|- data
|- examples
|- ...
...
Then, simply run make to build the project:
cd duckdb
make
Simply clone the DuckDBYanPlus repository as is and check out the commit we use in our experiments:
# Remember to change back to the root of the project repository
git clone https://github.com/ChampionNan/DuckDBYanPlus.git
cd DuckDBYanPlus
git checkout 38d165e
This is implemented on top of the original DuckDB repository. As with DuckDB, simply run make to build the project:
cd DuckDBYanPlus
make
We use the code from the authors of LLM-R2, which contains the models and scripts for both LLM-R2 and LearnedRewrite. We adapted the code to run the experiments on the benchmarks used in this paper.
First download the code:
# Remember to change back to the root of the project repository
wget "https://drive.usercontent.google.com/download?id=1z4pPsONM99HoVmpSSOyugVf62MslEUbE&export=download&authuser=0&confirm=t&uuid=e3209971-f046-41d9-80b5-1e3b083e31bf&at=AGN2oQ1V7RFhqf15WVor24ve7tBj:1773678735707" -O "llm-r2.zip"
mkdir llm-r2
tar -xzf "llm-r2.zip" -C llm-r2
rm llm-r2.zip
There should then be a folder llm-r2 under the root directory of this repository:
metaDecomp (root)
|- llm-r2
|- data
|- rules_for_selected
|- src
|- ...
...
Before running the experiments, enter your API key on line 33 of llm_r2_rewriter_pg.py.
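Before editing blindly, it may help to look at line 33 first. The path below and the exact shape of that line are assumptions (the file's location inside llm-r2 may differ); inspect it and adapt the substitution to what you actually see there:

```shell
# Assumed location of the rewriter script; adjust if it sits elsewhere.
f=llm-r2/llm_r2_rewriter_pg.py

if [ -f "$f" ]; then
  # Show line 33 so you know what to replace.
  sed -n '33p' "$f"
fi
```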
Clone the SparkSQLPlus repository, checkout the commit we use in our experiments, and build the project:
# Remember to change back to the root of the project repository
git clone https://github.com/hkustDB/SparkSQLPlus.git
cd SparkSQLPlus
git checkout f22188bba4e971da6defb97c983e06e18e66fd7a
mvn clean package -DskipTests=true
Create and activate a virtual environment:
# Remember to change back to the root of the project repository
python -m venv src/main/python/venv
source src/main/python/venv/bin/activate
Install the required dependencies:
pip install -r src/main/python/requirements.txt
sbt -mem 32000 "runMain experiments.runner.MetaDecompRunner"
(-mem 32000 sets a limit of 32,000 MB of memory for the JVM. Adjust this based on the amount of memory available on your system. For certain queries in JOBLarge, this much memory may be required due to the need to load the cardinality estimates of a very large number of possible intermediate results.)
The results are stored in experiment-results/metadecomp-opt-{dsb, job-original, musicbrainz, job-large}-<timestamp>.csv
sbt "runMain experiments.runner.DPconvRunner"
The results are stored in experiment-results/dpconv-opt-{dsb, job-original, musicbrainz, job-large}-<timestamp>.csv
sbt "runMain experiments.runner.DuckDBRunner"
The results are stored in experiment-results/duckdb-{dsb, job-original, musicbrainz, job-large}-<timestamp>.csv
sbt "runMain experiments.runner.DPconvRunner UnionDP"
The results are stored in experiment-results/uniondp-opt-{dsb, job-original, musicbrainz, job-large}-<timestamp>.csv
sbt "runMain experiments.runner.DuckDBYanPlusRunner"
The results are stored in experiment-results/yanplus-opt-{dsb, job-original, musicbrainz, job-large}-<timestamp>.csv
sbt "runMain experiments.runner.LLMR2Runner"
The results are stored in experiment-results/llm-r2-opt-{dsb, job-original, musicbrainz, job-large}-<timestamp>.csv
sbt "runMain experiments.runner.LearnedRewriteRunner"
The results are stored in experiment-results/learned-rewrite-opt-{dsb, job-original, musicbrainz, job-large}-<timestamp>.csv
bash src/main/scripts/enumerate-join-trees-metadecomp.sh
The results are stored in experiment-results/metadecomp-enum-{dsb, job-original, musicbrainz, job-large}.csv
bash src/main/scripts/enumerate-join-trees-sparksqlplus.sh
The results are stored in experiment-results/sparksqlplus-enum-{dsb, job-original, musicbrainz, job-large}.csv
After obtaining experiment results, the figures in the paper can be reproduced as follows. All figures will be stored in subdirectories under experiment-results/figures as PDF files.
Before executing the scripts below, please remove the timestamps from the generated CSV files. For example, rename experiment-results/metadecomp-opt-dsb-2025-10-16T17-24-04.csv to experiment-results/metadecomp-opt-dsb.csv.
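The renaming can be done in one pass with a small shell loop. This is a sketch: the timestamp pattern is an assumption inferred from the example name above (e.g., -2025-10-16T17-24-04), so adjust it if your filenames differ.

```shell
# Strip an ISO-like timestamp suffix (-YYYY-MM-DDTHH-MM-SS) from every
# generated CSV under experiment-results/.
strip_timestamps() {
  for f in experiment-results/*-20[0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]-[0-9][0-9]-[0-9][0-9].csv; do
    [ -e "$f" ] || continue  # glob matched nothing
    mv "$f" "$(printf '%s' "$f" | sed -E 's/-20[0-9]{2}-[0-9]{2}-[0-9]{2}T[0-9]{2}-[0-9]{2}-[0-9]{2}\.csv$/.csv/')"
  done
}

strip_timestamps
```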
python src/main/python/plot-opt-time.py
Comparison of the costs of optimal width-1 query plans and those of the globally optimal plans – Figure 9
python src/main/python/plot-cost-ratio.py
python src/main/python/plot-overall-speedups-stacked.py
python src/main/python/plot-speedups-individual.py
python src/main/python/plot-scatter-all.py
python src/main/python/plot-scatter-individual.py
python src/main/python/plot-enum-time.py