A model distillation pipeline for automatically generating assertions from test methods. This project is part of a Bachelor's Thesis at Delft University of Technology for the year 2025.
This project implements a knowledge-distillation approach for a CodeT5-based model that generates assertions for test methods. The pipeline includes training, evaluation, and inference-speed measurement components, as well as auxiliary code for logit decompression and plotting of results.
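For intuition, a typical distillation objective combines a hard-label cross-entropy term on the ground-truth assertion tokens with a temperature-scaled KL-divergence term that pulls the student's output distribution toward the teacher's. The sketch below is a minimal, generic version of that loss; the weighting `alpha`, the temperature, and the tensor shapes are assumptions, and the project's actual loss lives in `src/pipeline/model.py`.

```python
import torch.nn.functional as F
from torch import Tensor

def distillation_loss(student_logits: Tensor, teacher_logits: Tensor, labels: Tensor,
                      alpha: float = 0.5, temperature: float = 2.0) -> Tensor:
    """Generic distillation loss sketch; shapes assumed to be (batch, seq_len, vocab)."""
    vocab_size = student_logits.size(-1)
    # Hard-label term: cross-entropy against the ground-truth assertion tokens.
    ce = F.cross_entropy(
        student_logits.reshape(-1, vocab_size),
        labels.reshape(-1),
        ignore_index=-100,  # conventional padding label, assumed here
    )
    # Soft-label term: KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1.0 - alpha) * kl
```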
- Python 3.13.2
- The remaining requirements are specified in the `requirements.txt` file, along with their respective versions.
- (Optional) CUDA GPU with CUDA 12.2 support for faster training and evaluation.
To install the required packages, run:
```bash
git clone https://github.com/AndreyVLD/DistilT5
cd DistilT5
pip install -r requirements.txt
```

- `src/` - Source code directory.
  - `pipeline/` - Contains the main pipeline for training and evaluation.
    - `dataset.py` - Dataset handling.
    - `model.py` - Student model and loss function implementation.
    - `train.py` - Training and evaluation logic for distillation, together with its configuration class.
  - `plots/` - Contains the code for plotting results.
    - `metrics.py` - Runnable file for generating plots.
  - `utils/` - Contains utility functions.
    - `decompression.py` - Logit decompression utilities (see the sketch after this list).
    - `evaluation.py` - Class for tracking evaluation metrics, plus teacher evaluation utilities.
    - `split_json.py` - Utilities for splitting JSON files into training and validation sets.
  - `__main__.py` - Entry point with training and evaluation functions.
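The exact compressed format of the teacher logits is defined by `utils/decompression.py` and is not described here; purely as an illustration of the general idea, the sketch below assumes a common top-k scheme in which each sequence position stores the indices and values of its k largest logits, and every other vocabulary entry is filled with a large negative constant before the softmax.

```python
import torch

def expand_topk_logits(indices: torch.Tensor, values: torch.Tensor,
                       vocab_size: int, fill_value: float = -1e4) -> torch.Tensor:
    """Rebuild dense (seq_len, vocab_size) logits from top-k (indices, values) pairs.

    `indices` (int64) and `values` both have shape (seq_len, k). Positions outside
    the stored top-k receive `fill_value`, so they get near-zero probability after
    softmax. This storage format is an assumption, not necessarily the project's own.
    """
    seq_len = indices.size(0)
    dense = torch.full((seq_len, vocab_size), fill_value, dtype=values.dtype)
    dense.scatter_(dim=1, index=indices, src=values)
    return dense
```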
Before running the training and evaluation process, ensure:
- You have the required datasets in the `data/` directory:
  - Training data: `DistilT5/data/distillation_data_training.jsonl`
  - Validation data: `DistilT5/data/distillation_data_validation.jsonl`
- You can customize the distillation process by modifying fields in the `DistillationConfig` class (an illustrative sketch follows this list).
- In the evaluation functions from `__main__.py`, you can specify the path to the model you want to evaluate by changing the `model_path` variable. This path needs to point to the directory where the model is saved. If the model was trained by our script, it needs to point to the `DistilT5/distillation_output/epoch_<placeholder_for_epoch_number>` directory.

The output will be saved to `DistilT5/distillation_output` by default, but this can be changed in the configuration.
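The actual `DistillationConfig` is defined in `pipeline/train.py`; the field names and default values below are hypothetical placeholders, shown only as a sketch of the kind of settings you would adjust there.

```python
from dataclasses import dataclass

@dataclass
class DistillationConfig:
    # Hypothetical fields and defaults -- consult pipeline/train.py for the real ones.
    train_data_path: str = "data/distillation_data_training.jsonl"
    validation_data_path: str = "data/distillation_data_validation.jsonl"
    output_dir: str = "distillation_output"  # checkpoints are written to epoch_<n> subdirectories
    num_epochs: int = 10
    batch_size: int = 8
    learning_rate: float = 5e-5
    alpha: float = 0.5        # weight between hard-label CE and soft-label KL terms
    temperature: float = 2.0  # softening temperature applied to teacher logits
```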
To train the model, edit the `__main__.py` file and uncomment the `train()` function call in the `main()` function:

```python
def main() -> None:
    set_seed(42)
    train()
    # evaluate()
    # evaluate_with_time()
```

Then run the script:

```bash
python -m src
```

To evaluate a trained model, uncomment the `evaluate()` function call:

```python
def main() -> None:
    set_seed(42)
    # train()
    evaluate()
    # evaluate_with_time()
```

To measure the inference speed of the model, uncomment the `evaluate_with_time()` function call:

```python
def main() -> None:
    set_seed(42)
    # train()
    # evaluate()
    evaluate_with_time()
```

The model and training settings can be configured by modifying the `DistillationConfig` class in `pipeline/train.py`.
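For intuition about what inference-speed measurement involves, the sketch below times a single `model.generate()` call with `time.perf_counter()`. The checkpoint path, input string, and generation settings are assumptions for illustration; the project's actual measurement lives in `evaluate_with_time()`.

```python
import time

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical checkpoint path; point it at one of your epoch_<n> directories.
model_path = "distillation_output/epoch_10"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path).eval()

# Made-up test method used purely as an example input.
test_method = "public void testAdd() { int result = calculator.add(2, 3); }"
inputs = tokenizer(test_method, return_tensors="pt", truncation=True)

with torch.no_grad():
    start = time.perf_counter()
    outputs = model.generate(**inputs, max_new_tokens=64)
    elapsed = time.perf_counter() - start

print("Generated assertion:", tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"Inference time: {elapsed:.3f} s")
```

A real benchmark would warm the model up first and average over many test methods (and, on a GPU, call `torch.cuda.synchronize()` before reading the clock).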
This project is released under the MIT License; see the LICENSE file for details.
This project is part of a Bachelor Thesis at Delft University of Technology. Special thanks to the Software Engineering Research Group (SERG) from the EEMCS Faculty at TU Delft and to my supervisors: Professor Mitchell Olsthoorn and Professor Annibale Panichella for their guidance and support throughout the project.
```bibtex
@software{distilt5_2025,
  author    = {Andrei Vlad Nicula},
  title     = {DistilT5: Distilled CodeT5 for Assertion Generation},
  version   = {v1.0.2},
  month     = jun,
  year      = 2025,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.15707159},
  url       = {https://doi.org/10.5281/zenodo.15707159},
}
```