Scripts to reproduce the results of the paper "Practical guidelines for the use of Gradient Boosting for molecular property prediction".

sieber-lab/GBM_Benchmarking

 
 

GBM Benchmarking

Python 3.6 | License: MIT

This repository contains the code and datasets needed to reproduce the results of the paper "Practical guidelines for the use of Gradient Boosting for molecular property prediction".

Repository structure

  • Datasets: Contains all datasets used in this study as .csv files (must be unzipped first)
  • Scripts: Contains all scripts and utility functions used to reproduce the results
  • Results: Contains the outputs from the pipeline functions
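
Since the datasets must be unzipped before use, the extraction step can be sketched in Python as follows. This is a minimal sketch, not part of the repository: it assumes the datasets are shipped as .zip archives placed directly inside the Datasets folder, and the function name unzip_datasets is hypothetical.

```python
import zipfile
from pathlib import Path


def unzip_datasets(datasets_dir):
    """Extract every .zip archive found directly inside datasets_dir.

    Assumption: the datasets are stored as .zip files at the top level
    of the folder. Returns the sorted names of the extracted archives.
    """
    datasets_dir = Path(datasets_dir)
    extracted = []
    for archive in sorted(datasets_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # Extract next to the archive so the .csv files land in the same folder
            zf.extractall(datasets_dir)
        extracted.append(archive.name)
    return extracted
```

Calling unzip_datasets("Datasets") from the repository root would then extract all archives in place and return their names.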

Installation

The necessary environment can be configured via conda from the environment.yml file.

git clone https://github.com/dahvida/GBM_Benchmarking
cd GBM_Benchmarking
conda env create --name GBM --file=environment.yml
conda activate GBM

Additionally, you need to install the fANOVA package:

git clone https://github.com/automl/fanova.git
cd fanova/
pip install -r requirements.txt
python setup.py install
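
After installation, a quick way to confirm that the extra package is visible from the active environment is to check whether it can be located for import. The sketch below is illustrative only: it assumes the fANOVA package is importable under the module name fanova, and the helper missing_modules is a hypothetical name.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be located for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]
```

With the GBM environment active, missing_modules(["fanova"]) should return an empty list if the installation succeeded.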

Usage

All results can be reproduced by executing the respective pipeline_x.py files in the Scripts folder. The outputs from each script will be saved in the Results folder, either as .csv, .pkl or .txt files.

  • pipeline_benchmark.py: Returns ROC-AUC, PR-AUC, training time and Shapley overlap for all GBM implementations on the selected datasets
  • pipeline_hyperparam.py: Evaluates the importance of each hyperparameter using fANOVA on the selected datasets
  • pipeline_grid.py: Compares the performance obtained by optimizing a grid restricted to the most important hyperparameters against optimizing all possible hyperparameters
  • pipeline_fragments.py: Draws the top 20 most important ECFP bits for a given dataset for all GBM implementations. Currently supports only the BACE, BBBP and HIV datasets
  • pipeline_reproducibility.py: Evaluates the Shapley overlap on the selected datasets for two independent optimization and training runs with LightGBM
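
Because the scripts write their outputs as .csv, .pkl or .txt files, a small loader can collect everything in the Results folder for inspection. The following is a rough sketch under stated assumptions: the helper names load_result and load_results are hypothetical, the .csv files are assumed to have a header row, and the .pkl files may need the libraries used to create them (for example those in the conda environment) to be importable.

```python
import csv
import pickle
from pathlib import Path


def load_result(path):
    """Load one output file based on its extension (.csv, .pkl or .txt)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        # Assumption: the first row of each .csv file is a header
        with path.open(newline="") as handle:
            return list(csv.DictReader(handle))
    if suffix == ".pkl":
        # Only unpickle files you produced yourself: pickle can run arbitrary code
        with path.open("rb") as handle:
            return pickle.load(handle)
    if suffix == ".txt":
        return path.read_text()
    raise ValueError(f"Unsupported result file type: {path.name}")


def load_results(results_dir):
    """Map each supported file name in results_dir to its loaded content."""
    results = {}
    for path in sorted(Path(results_dir).iterdir()):
        if path.suffix.lower() in {".csv", ".pkl", ".txt"}:
            results[path.name] = load_result(path)
    return results
```

Calling load_results("Results") from the repository root returns a dictionary keyed by file name; files with other extensions are skipped.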

Tutorial

By default, each script uses the same parameters as in the paper. For example, here is the code to run the script that produces the classification performance results:

cd /GBM_Benchmarking/Scripts
python3 pipeline_benchmarking.py

Here is the code to execute the same script, changing the number of iterations for optimization and evaluation:

cd /GBM_Benchmarking/Scripts
python3 pipeline_benchmarking.py --opt_iters 30 --iters_moleculenet 10 --iters_moldata 3

Check the pipeline_x.py files in the Scripts folder for a description of the available arguments for each script.
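
The commands above can also be launched from Python, which may be convenient when running several configurations in a row. This is a minimal sketch rather than part of the repository: build_command and run_pipeline are hypothetical helper names, and the only flags used are the ones shown in the tutorial (--opt_iters, --iters_moleculenet, --iters_moldata). The script is run from inside the Scripts folder, mirroring the cd step above.

```python
import subprocess
import sys


def build_command(script, **options):
    """Build the argument list for one pipeline script.

    Keyword arguments become command-line flags, so opt_iters=30
    turns into the two arguments "--opt_iters" and "30".
    """
    command = [sys.executable, script]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


def run_pipeline(script, scripts_dir, **options):
    """Run a pipeline script from inside scripts_dir and raise on failure."""
    subprocess.run(build_command(script, **options), cwd=scripts_dir, check=True)
```

For instance, run_pipeline("pipeline_benchmarking.py", "Scripts", opt_iters=30, iters_moleculenet=10, iters_moldata=3), called from the repository root, corresponds to the second command in the tutorial. Omitting the keyword arguments runs the script with its defaults.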

How to cite

Please refer to the following publication: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-023-00743-7
