
DistilT5

A model distillation pipeline for automatically generating assertions from test methods. This project is part of a 2025 Bachelor's Thesis at Delft University of Technology.

Overview

This project implements a knowledge distillation approach for training a smaller CodeT5-based model to generate assertions for test methods. The pipeline includes training, evaluation, and inference speed measurement components, together with auxiliary code for logit decompression and plotting of results.
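
As background on the distillation objective, the following is a minimal sketch of a standard soft-target distillation loss (temperature-scaled KL divergence combined with token-level cross-entropy); the temperature, alpha weighting, and exact formulation are assumptions, and the actual loss implemented in pipeline/model.py may differ:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    # Soft-target term: KL divergence between temperature-scaled distributions.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Hard-target term: cross-entropy against the reference assertion tokens.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1), ignore_index=-100)

    # Weighted combination of the soft and hard terms.
    return alpha * kd + (1.0 - alpha) * ce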

Requirements

  • Python 3.13.2
  • All remaining dependencies, with pinned versions, are listed in the requirements.txt file.
  • (Optional) A CUDA-capable GPU with CUDA 12.2 support for faster training and evaluation.

Installation

To install the required packages, run:

git clone https://github.com/AndreyVLD/DistilT5
cd DistilT5
pip install -r requirements.txt

Project Structure

  • src/ - Source code directory.
    • pipeline/ - Contains the main pipeline for training and evaluation.
      • dataset.py - Dataset handling.
      • model.py - Student model and loss function implementation.
      • train.py - Training and evaluation logic for distillation, together with its configuration class (DistillationConfig).
    • plots/ - Contains the code for plotting results.
      • metrics.py - Runnable file for generating plots.
    • utils/ - Contains utility functions.
      • decompression.py - Logit decompression utilities.
      • evaluation.py - Class for keeping track of evaluation metrics and teacher evaluation utilities.
      • split_json.py - Utilities for splitting JSON files into training and validation sets (a minimal sketch of the idea follows this list).
    • __main__.py - Entry point with training and evaluation functions.
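
To illustrate the splitting utility referenced above, here is a minimal sketch of a random JSONL split; the function name, split ratio, and strategy are assumptions, and the actual logic lives in utils/split_json.py:

import random

def split_jsonl(path: str, train_path: str, val_path: str,
                val_fraction: float = 0.1, seed: int = 42) -> None:
    # Read all JSONL records, shuffle them deterministically, and write the split.
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)
    n_val = int(len(lines) * val_fraction)
    with open(val_path, "w", encoding="utf-8") as f:
        f.writelines(lines[:n_val])
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(lines[n_val:])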

Usage

Instructions

Before running training or evaluation, note the following:

  1. The required datasets must be present in the data/ directory:
    • Training data: DistilT5/data/distillation_data_training.jsonl
    • Validation data: DistilT5/data/distillation_data_validation.jsonl
  2. You can customize the distillation process by modifying fields in the DistillationConfig class.
  3. In the evaluation functions in __main__.py, specify which model to evaluate by setting the model_path variable. This path must point to the directory where the model is saved; for a model trained with this script, that is the DistilT5/distillation_output/epoch_<placeholder_for_epoch_number> directory (see the loading sketch after this list).
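
As referenced in item 3, a minimal sketch of loading a saved checkpoint, assuming it was written in the standard Hugging Face save_pretrained format (the tokenizer location is an assumption; keep the epoch placeholder as it appears above):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical path; replace the placeholder with the actual epoch number.
model_path = "distillation_output/epoch_<placeholder_for_epoch_number>"

# Assumes the tokenizer was saved alongside the model; otherwise load it
# from the original CodeT5 checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)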

The output will be saved to DistilT5/distillation_output by default, but this can be changed in the configuration.

Training

To train the model, edit the __main__.py file and uncomment the train() function call in the main() function:

def main() -> None:
    set_seed(42)
    train()
    # evaluate()
    # evaluate_with_time()

Then run the script:

python -m src

Evaluation

To evaluate a trained model, uncomment the evaluate() function call:

def main() -> None:
    set_seed(42)
    # train()
    evaluate()
    # evaluate_with_time()

Measuring Generation Speed

To measure the inference speed of the model, uncomment the evaluate_with_time() function call:

def main() -> None:
    set_seed(42)
    # train()
    # evaluate()
    evaluate_with_time()
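
For intuition, a minimal sketch of how per-example generation latency could be measured; the helper name is hypothetical, and the actual evaluate_with_time() may time batches or report different statistics:

import time
import torch

def timed_generation(model, tokenizer, test_method: str, max_new_tokens: int = 64):
    # Tokenize one test method and time a single generation pass.
    inputs = tokenizer(test_method, return_tensors="pt", truncation=True)
    start = time.perf_counter()
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    assertion = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # For an encoder-decoder model, output_ids contains only decoder tokens.
    return assertion, elapsed, output_ids.shape[-1] / elapsed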

Configuring Distillation

The model and training settings can be configured by modifying the DistillationConfig class in pipeline/train.py.
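
For orientation, a hypothetical sketch of what such a configuration class could look like; every field name and default below is an assumption, so consult the real DistillationConfig in pipeline/train.py for the actual options:

from dataclasses import dataclass

@dataclass
class DistillationConfig:
    # Illustrative fields only; the real class defines the actual options.
    output_dir: str = "distillation_output"
    num_epochs: int = 10
    batch_size: int = 8
    learning_rate: float = 5e-5
    temperature: float = 2.0  # softening factor applied to teacher logits
    alpha: float = 0.5        # weight between distillation and cross-entropy loss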

License

This project is released under the MIT License; see the LICENSE file for details.

Acknowledgements

This project is part of a Bachelor's Thesis at Delft University of Technology. Special thanks to the Software Engineering Research Group (SERG) of the EEMCS Faculty at TU Delft and to my supervisors, Professor Mitchell Olsthoorn and Professor Annibale Panichella, for their guidance and support throughout the project.

Citation

@software{distilt5_2025,
    author    = {Andrei Vlad Nicula},
    title     = {DistilT5: Distilled CodeT5 for Assertion Generation},
    version   = {v1.0.2},
    month     = jun,
    year      = 2025,
    publisher = {Zenodo},
    doi       = {10.5281/zenodo.15707159},
    url       = {https://doi.org/10.5281/zenodo.15707159},
}
