The ADvisor tool lets users explore different applicability domain (AD) strategies and find the most suitable one for each use case. Once the preferred AD methodology has been identified, it can be applied to any query dataset. This repository provides the Python scripts needed for AD search and application, along with a predefined environment and its dependencies. A notebook is also included as an example of how to analyze the results.
If you use ADvisor, please cite:
Piazza L., Poles C., Bononi G., Granchi C., Di Stefano M., Poli G., Macchia M., Tuccinardi T. ADvisor: an open-source tool for Applicability Domain definition and optimization in molecular predictive modeling. J. Chem. Inf. Model. 2025, ASAP. http://pubs.acs.org/doi/abs/10.1021/acs.jcim.5c01672
To set up the required environment, use the provided YAML file:

    conda env create -f env.yml

Then, activate the environment:

    conda activate env

Run the AD search script with Python, specifying the required inputs:

    python Compare_AD_Strategies.py -train Train.csv -test Test.csv -repres RDKit-descriptors -mt regressor -test_tvc True -train_tvc True -test_pvc Pred -nj 4 -out Out1.csv

Please note that, within the ADvisor AD strategy, the similarity formula used for regressors and for classifiers is the one that performed best on average for each model type (we refer the user to the paper for further details).
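When the AD search has to be repeated over several datasets, it can help to assemble the command line programmatically. The following is only a minimal sketch: the flags are copied from the example command above, while the function name, the parameter names and the reading of the flags (`-train_tvc`/`-test_tvc` as the true-value column, `-test_pvc` as the predicted-value column, `-nj` as the number of parallel jobs) are assumptions inferred from that example, not documented behavior.

```python
import sys


def build_search_command(train, test, out,
                         representation="RDKit-descriptors",
                         model_type="regressor",
                         true_col="True", pred_col="Pred", n_jobs=4):
    """Return the argument list for one Compare_AD_Strategies.py run.

    Hypothetical helper: only the flag names come from the README example;
    the parameter names and their interpretation are assumptions.
    """
    return [
        sys.executable, "Compare_AD_Strategies.py",
        "-train", train,
        "-test", test,
        "-repres", representation,
        "-mt", model_type,
        "-test_tvc", true_col,
        "-train_tvc", true_col,
        "-test_pvc", pred_col,
        "-nj", str(n_jobs),   # command-line arguments must be strings
        "-out", out,
    ]


cmd = build_search_command("Train.csv", "Test.csv", "Out1.csv")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root, with the conda environment active.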
Run the AD application script with Python, specifying the required inputs:

    python Calculate_AD.py -train Train.csv -test Test.csv -query Query.csv -repres RDKit-descriptors -ad ComAD_AD_th-0.6_a-1_b-0.5_c-0.5_d-0.5 -mt regressor -test_tvc True -train_tvc True -test_pvc Pred -query_pvc Pred -nj 4 -out Out2.csv

Please note that the AD strategy to apply must be written in the same format as returned by the AD search.
- The input files must be in CSV format and contain a column named SMILES, which represents the molecular structures. This column is necessary in all input files.
- For the AD search, the train and test sets used to derive and validate the model must be used. For the AD application, both the train and test sets are necessary, together with the desired query set containing the compounds to label as IN or OUT of the AD according to the selected strategy.
- The train set must contain a column storing the experimental value or class of compounds. The test set must contain a column storing the experimental value or class, and a column storing the predicted value or class. The query set must contain a column storing the predicted value or class.
- Three CSV files (Train.csv, Test.csv, Query.csv) are included in the repository for testing purposes.
- The AD search script will generate a CSV file with the evaluated AD methodologies ranked, together with their performance on IN and OUT subsets.
- The AD application script will generate a CSV file containing the query SMILES, the prediction for each query compound, and a column storing the IN/OUT AD label according to the selected strategy.
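The column requirements listed above can be checked before launching either script. The sketch below is not part of ADvisor: it uses only the Python standard library, and it assumes the experimental and predicted columns are named `True` and `Pred`, as in the example commands. Adjust `REQUIRED` if your files use other names.

```python
import csv
import io

# Assumed column names, taken from the example commands in this README.
REQUIRED = {
    "train": ["SMILES", "True"],
    "test": ["SMILES", "True", "Pred"],
    "query": ["SMILES", "Pred"],
}


def missing_columns(csv_file, required):
    """Return the required column names that are absent from the CSV header."""
    header = csv.DictReader(csv_file).fieldnames or []  # None for an empty file
    return [name for name in required if name not in header]


def check_inputs(paths):
    """Map each role ('train', 'test', 'query') to its missing columns.

    `paths` maps a role to a CSV file path; roles with no problems are omitted.
    """
    problems = {}
    for role, path in paths.items():
        with open(path, newline="") as handle:
            missing = missing_columns(handle, REQUIRED[role])
        if missing:
            problems[role] = missing
    return problems


# In-memory example: a test set that lacks the prediction column.
example = io.StringIO("SMILES,True\nCCO,1.2\n")
print(missing_columns(example, REQUIRED["test"]))
```

For the files shipped with the repository, a call such as `check_inputs({"train": "Train.csv", "test": "Test.csv", "query": "Query.csv"})` returns an empty dictionary when every required column is present under the assumed names.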
All necessary dependencies are included in env.yml.