GBM Benchmarking

Repository containing the code and the datasets to reproduce the results from the paper "Practical guidelines for the use of Gradient Boosting for molecular property prediction".

Repository structure

Datasets: Contains all datasets used in this study as .csv files (must be unzipped first)
Scripts: Contains all scripts and utility functions used to reproduce the results
Results: Contains the outputs from the pipeline functions
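Since the datasets must be unzipped before any script can read them, a minimal Python sketch of that step follows. It assumes the archive is named Datasets.zip and sits in the repository root; the helper name `unzip_datasets` is hypothetical and not part of the repository. Any standard unzip tool works equally well.

```python
import zipfile
from pathlib import Path


def unzip_datasets(archive, target_dir):
    """Extract every member of `archive` into `target_dir` and return the member names."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
        return zf.namelist()


if __name__ == "__main__":
    # Assumption: this is run from the repository root, next to Datasets.zip.
    archive = Path("Datasets.zip")
    if archive.exists():
        names = unzip_datasets(archive, Path("."))
        print(f"Extracted {len(names)} entries")
    else:
        print("Datasets.zip not found; run this from the repository root")
```

The folder layout produced by the extraction depends on how the archive was packed, so check that the .csv files end up where the scripts expect them.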

Installation

The necessary environment can be configured via conda from the environment.yml file.

git clone https://github.com/dahvida/GBM_Benchmarking
cd GBM_Benchmarking
conda env create --name GBM --file=environment.yml
conda activate GBM

Additionally, you need to install the fANOVA package:

git clone https://github.com/automl/fanova.git
cd fanova/
pip install -r requirements.txt
python setup.py install

Usage

All results can be reproduced by executing the respective pipeline_x.py files in the Scripts folder. The outputs from each script are saved in the Results folder as .csv, .pkl or .txt files.

pipeline_benchmark.py: Returns ROC-AUC, PR-AUC, training time and Shapley overlap for all GBM implementations on the selected datasets
pipeline_hyperparam.py: Evaluates the importance of each hyperparameter using fANOVA on the selected datasets
pipeline_grid.py: Evaluates the performance of the grid with the most important hyperparameters versus optimizing all possible hyperparameters
pipeline_fragments.py: Draws the top 20 most important ECFP bits for a given dataset for all GBM implementations. Currently supports only the BACE, BBBP and HIV datasets
pipeline_reproducibility.py: Evaluates the Shapley overlap on the selected datasets for two independent optimization and training runs with LightGBM
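This README does not define the "Shapley overlap" mentioned in the script descriptions above. As an illustration only, one plausible reading is the fraction of features shared between the top-k features of two models when ranked by importance (for example, mean absolute SHAP value). The function name `top_k_overlap`, the choice of k, and the definition itself are assumptions; the paper and the scripts are the reference for the actual metric.

```python
def top_k_overlap(importance_a, importance_b, k=20):
    """Fraction of shared indices among the k largest entries of two importance vectors."""
    if len(importance_a) != len(importance_b):
        raise ValueError("importance vectors must have the same length")
    if not 0 < k <= len(importance_a):
        raise ValueError("k must be between 1 and the number of features")

    def top_k(values):
        # Sort feature indices by decreasing importance and keep the first k.
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        return set(order[:k])

    return len(top_k(importance_a) & top_k(importance_b)) / k


# Toy importances for five features (made-up numbers, not results from the paper).
model_a = [0.40, 0.10, 0.30, 0.05, 0.15]
model_b = [0.35, 0.05, 0.10, 0.30, 0.20]
print(top_k_overlap(model_a, model_b, k=2))  # prints 0.5
```

In this toy case both models rank feature 0 first but disagree on the second feature, so one of the two top features is shared.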

Tutorial

By default, each script uses the same parameters as in the paper. For example, here is the code to execute the script for obtaining the classification performance results:

cd /GBM_Benchmarking/Scripts
python3 pipeline_benchmark.py

Here is the code to execute the same script, changing the number of iterations for optimization and evaluation:

cd /GBM_Benchmarking/Scripts
python3 pipeline_benchmark.py --opt_iters 30 --iters_moleculenet 10 --iters_moldata 3

Check the pipeline_x.py files in the Scripts folder for a description of the available arguments for each script.
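For readers unfamiliar with how such command-line flags are typically consumed, here is a small argparse sketch. Only the three flag names are taken from the command above; the default values, the help texts, and the `build_parser` helper are placeholders invented for illustration and may not match what the scripts actually define.

```python
import argparse


def build_parser():
    """Build a parser with the three flags shown in the tutorial command."""
    parser = argparse.ArgumentParser(description="Hypothetical sketch of a pipeline CLI")
    # Defaults below are placeholders, NOT the values used in the paper.
    parser.add_argument("--opt_iters", type=int, default=1,
                        help="number of optimization iterations (assumed meaning)")
    parser.add_argument("--iters_moleculenet", type=int, default=1,
                        help="evaluation iterations, MoleculeNet datasets (assumed meaning)")
    parser.add_argument("--iters_moldata", type=int, default=1,
                        help="evaluation iterations, MolData datasets (assumed meaning)")
    return parser


# Parse the same flags as in the tutorial command instead of reading sys.argv.
args = build_parser().parse_args(
    ["--opt_iters", "30", "--iters_moleculenet", "10", "--iters_moldata", "3"]
)
print(args.opt_iters, args.iters_moleculenet, args.iters_moldata)  # prints 30 10 3
```

Flags that are omitted fall back to their defaults, which is how running a script without arguments can reproduce a fixed configuration.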

How to cite

Please refer to the following publication: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-023-00743-7
