This project uses off-the-shelf large language models for text-to-SQL program synthesis. After experimenting with various models, fine-tuning hyperparameters, and training datasets, an optimal solution was identified: fine-tuning the WizardLM/WizardCoder-15B-V1.0 base model with QLoRA on a customized Spider training dataset. The resulting model, richardr1126/spider-skeleton-wizard-coder-merged, achieves 63.7% execution accuracy when evaluated. The project uses a custom validation dataset that incorporates database context into the question. A live demonstration of the model is available on Hugging Face Spaces, with a user-friendly GUI built using the Gradio library.
Spider Skeleton WizardCoder - test-suite-sql-eval Results
With `temperature` set to 0.0, `top_p` set to 0.9, and `top_k` set to 0, the model achieves 63.7% execution accuracy on the Spider dev set with database context.
Note:
- ChatGPT was evaluated with the default hyperparameters and with the system message: "You are a sophisticated AI assistant capable of converting text into SQL queries. You can only output SQL, don't add any other text."
- Both models were evaluated with `--plug_value` in `evaluation.py` using the Spider dev set with database context.
- `--plug_value`: If set, the gold value will be plugged into the predicted query. This is suitable if your model does not predict values. This is set to `False` by default.
Spider is a large-scale, complex, and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases.
This dataset was used to finetune this model.
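Each Spider example pairs a natural-language question with a SQL query over a named database. As a minimal sketch, a record looks roughly like the following (field names follow the public Spider release; the values, and the exact way database context is prepended to the question, are illustrative assumptions, not this project's actual preprocessing):

```python
# Illustrative Spider-style record; field names (db_id, question, query)
# follow the public Spider release, values are examples.
example = {
    "db_id": "concert_singer",          # which database the question targets
    "question": "How many singers do we have?",
    "query": "SELECT count(*) FROM singer",
}

# This project's customized dataset incorporates database context into
# the question; the exact prompt format below is a hypothetical sketch.
def with_db_context(record: dict, schema: str) -> str:
    """Build a prompt that includes the database schema alongside the question."""
    return f"{schema}\n\n{record['question']}"

prompt = with_db_context(example, "CREATE TABLE singer (singer_id INT, name TEXT)")
```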
```
git lfs install && \
git clone https://github.com/cuplv/text-to-sql-wizardcoder.git
cd text-to-sql-wizardcoder/sql-skeleton-wizardcoder-demo && \
git submodule update --init --recursive
```

- The best way to run this model locally is to use the 4-bit GGML version on koboldcpp, with CuBLAS support.
- With 8 GB of GPU VRAM on an NVIDIA GPU and 16 GB of CPU RAM, I stably offloaded 20 layers (half of the model) into VRAM, which helps the prompt-processing speed tremendously.
- Using `koboldcpp` will create a local REST API that you can use to generate predictions. If you want to use a separate computer to generate predictions, you can use Ngrok to create a public URL for your local REST API.
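As a sketch, a prediction request against the local REST API could look like the following. The `/api/v1/generate` endpoint and payload fields follow koboldcpp's KoboldAI-compatible API, and the sampling settings mirror the ones reported above; the base URL, prompt, and `max_length` are placeholder assumptions (an Ngrok URL can be substituted for localhost):

```python
import json
from urllib import request

# Placeholder endpoint: koboldcpp's default local REST API (or an Ngrok URL).
API_URL = "http://localhost:5001/api/v1/generate"

def build_payload(prompt: str) -> dict:
    """Decoding settings matching the ones reported above (temperature 0.0)."""
    return {
        "prompt": prompt,
        "temperature": 0.0,
        "top_p": 0.9,
        "top_k": 0,
        "max_length": 256,  # assumed generation budget
    }

def generate(prompt: str) -> str:
    # Posts the payload to the koboldcpp server; requires the server
    # to be running, so it is defined but not called here.
    req = request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["results"][0]["text"]

payload = build_payload("How many singers do we have?")
```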
Evaluating the model on Spider validation set
To install the necessary dependencies, create a new Conda environment and install the required packages from the requirements.txt file:
```
conda create -n text-to-sql && \
conda activate text-to-sql && \
pip install -r requirements.txt
```
The requirements.txt file contains the following packages:
```
transformers
datasets
tqdm
torch
numpy
scipy
gradio_client
python-dotenv
```

The generate-finetune-data.py script generates fine-tuning data for the model.
This script allows you to select the mode of data generation (train, validation, both), the SQL type (natsql, sql), and whether to use the SQL skeleton in the output sequence.
```
python generate-finetune-data.py --mode [MODE] --sql_type [SQL_TYPE] --skeleton
```

- `--mode [MODE]`: Specifies the mode of data generation. Replace `[MODE]` with one of `train`, `validation`, or `both`. By default, the mode is set to `both`.
- `--sql_type [SQL_TYPE]`: Specifies the SQL type used. Replace `[SQL_TYPE]` with either `natsql` or `sql`.
- `--skeleton`: Use the SQL skeleton in the output sequence.
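The flag handling described above can be sketched with argparse (a hypothetical reconstruction of the documented CLI, not the script's actual source):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the documented flags of generate-finetune-data.py.
    parser = argparse.ArgumentParser(description="Generate fine-tuning data")
    parser.add_argument("--mode", choices=["train", "validation", "both"],
                        default="both", help="which splits to generate")
    parser.add_argument("--sql_type", choices=["natsql", "sql"],
                        help="target SQL dialect")
    parser.add_argument("--skeleton", action="store_true",
                        help="include the SQL skeleton in the output sequence")
    return parser

# Parse an explicit argument list rather than sys.argv, for illustration.
args = build_parser().parse_args(
    ["--mode", "both", "--sql_type", "natsql", "--skeleton"]
)
```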
Generate training and validation data for a natsql type model with skeleton:
```
python generate-finetune-data.py --mode both --sql_type natsql --skeleton
```

Generate training data for a sql type model without skeleton:

```
python generate-finetune-data.py --mode train --sql_type sql
```

Use the gen_predictions_hf_spaces.ipynb notebook to generate predictions from the spider-skeleton-wizard-coder model using the Hugging Face Space API.
Use the gen_predictions_koboldcpp.ipynb notebook to generate predictions from a model using a local Ngrok REST API.
The evaluation.py script is used to evaluate the quality of the predictions generated by the model. To evaluate your predictions, use the following command:
```
cd eval
python evaluation.py --plug_value --input predictions/temp0_skeleton_best.txt
```
- `--input`: Specifies the path to the input file that contains the predicted queries. This argument is required.
- `--gold`: Specifies the path to the gold queries. This argument is optional and defaults to an empty string.
- `--db`: Specifies the directory that contains all the databases and test suites. By default, it points to the `./data/database` directory.
- `--table`: Specifies the `tables.json` schema file. By default, this argument is an empty string.
- `--etype`: Specifies the evaluation type. It can be `all`, `exec` for test-suite accuracy, or `match` for the original exact set match accuracy. The default value is `exec`.
- `--plug_value`: If set, the gold value will be plugged into the predicted query. This is suitable if your model does not predict values. This is set to `False` by default.
- `--keep_distinct`: If set, the DISTINCT keyword will be kept during evaluation. This is set to `False` by default.
- `--progress_bar_for_each_datapoint`: If set, a progress bar for running test inputs for each datapoint will be displayed. This is set to `False` by default.
- `--natsql`: If set, the script will convert NatSQL to SQL and evaluate the converted SQL. This is set to `False` by default.
If the input file name contains "natsql", the `--natsql` flag is automatically set to `True`. When `--natsql` is true, the output file path is formed by appending "2sql" before ".txt", the gold and table paths are adjusted accordingly, and the predicted queries are first converted to SQL by running the convert_natsql_to_sql.py script in a subprocess.
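The filename-driven behavior described above can be sketched as follows (a hypothetical reconstruction of the logic, not evaluation.py's actual source):

```python
def natsql_paths(input_path: str) -> tuple:
    """Detect NatSQL predictions by filename and derive the converted-SQL path."""
    is_natsql = "natsql" in input_path
    if is_natsql:
        # Append "2sql" before the ".txt" extension, as described above.
        output_path = input_path[: -len(".txt")] + "2sql.txt"
    else:
        output_path = input_path
    return is_natsql, output_path

# e.g. natsql_paths("predictions/temp0_natsql.txt")
```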
@misc{luo2023wizardcoder,
title={WizardCoder: Empowering Code Large Language Models with Evol-Instruct},
author={Ziyang Luo and Can Xu and Pu Zhao and Qingfeng Sun and Xiubo Geng and Wenxiang Hu and Chongyang Tao and Jing Ma and Qingwei Lin and Daxin Jiang},
year={2023},
}

@article{yu2018spider,
title={Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task},
author={Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and Yao, Qingning and Roman, Shanelle and others},
journal={arXiv preprint arXiv:1809.08887},
year={2018}
}

@inproceedings{gan-etal-2021-natural-sql,
title = "Natural {SQL}: Making {SQL} Easier to Infer from Natural Language Specifications",
author = "Gan, Yujian and
Chen, Xinyun and
Xie, Jinxia and
Purver, Matthew and
Woodward, John R. and
Drake, John and
Zhang, Qiaofu",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-emnlp.174",
doi = "10.18653/v1/2021.findings-emnlp.174",
pages = "2030--2042",
}

@article{dettmers2023qlora,
title={QLoRA: Efficient Finetuning of Quantized LLMs},
author={Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
journal={arXiv preprint arXiv:2305.14314},
year={2023}
}

@inproceedings{li2022resdsql,
author = {Haoyang Li and Jing Zhang and Cuiping Li and Hong Chen},
title = "RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL",
booktitle = "AAAI",
year = "2023"
}