The config file is divided into sections. Only the parameters relevant to this work are described here; information about other parameters and options can be found in the REINVENT4 repository.
- `run_type`: Set to `staged_learning` to run the inverse design for molecular generation guided by a reward function.
- `device`: Set to `cpu` or `cuda:0`, depending on the available hardware.
- `summary_csv_prefix`: Prefix for the summary CSV file containing the results of the run.
- `prior_file`: Path to a pretrained prior model. The priors used in this work can be found in the `REINVENT4/priors/` directory. For this work, the `reinvent.prior`, `reinvent_zinc.prior`, and `reinvent_tadf.prior` models were used.
- `agent_file`: Path to the pretrained agent model. Generally, prior and agent refer to the same model; however, the prior is a model trained to generate molecules according to some training distribution, whereas the agent is a fine-tuned model trained to generate molecules according to a property distribution defined by the reward function.
- `batch_size`: Number of molecules generated per batch.
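For orientation, the general options above could be assembled as in the following Python sketch, which renders them as TOML text. The layout (top-level `run_type` and `device`, followed by a `[parameters]` table), the helper `render_config`, and the values chosen for the prefix and batch size are assumptions made for illustration; the REINVENT4 repository remains the authoritative reference for the file structure.

```python
import json


def render_config(top_level: dict, parameters: dict) -> str:
    """Render two flat dicts as TOML text (strings, ints and bools only)."""

    def fmt(key, value):
        # json.dumps yields double-quoted strings, bare integers and
        # true/false, which are also valid TOML for these simple types.
        return f"{key} = {json.dumps(value)}"

    lines = [fmt(k, v) for k, v in top_level.items()]
    lines.append("")
    lines.append("[parameters]")  # assumed section name
    lines.extend(fmt(k, v) for k, v in parameters.items())
    return "\n".join(lines)


top_level = {"run_type": "staged_learning", "device": "cuda:0"}
parameters = {
    "summary_csv_prefix": "inverse_design",  # hypothetical prefix
    "prior_file": "REINVENT4/priors/reinvent.prior",
    "agent_file": "REINVENT4/priors/reinvent.prior",
    "batch_size": 64,  # hypothetical value
}

config_text = render_config(top_level, parameters)
print(config_text)
```

Here prior and agent point to the same file, which matches the common case of starting the fine-tuning from the prior itself.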
This set of parameters is optional and can be used to increase the diversity of generated molecules.
- `type`: Choose between `IdenticalTopologicalScaffold`, `ScaffoldSimilarity`, `PenalizeSameSmiles`, and `IdenticalMurckoScaffold`. In this work, `ScaffoldSimilarity` was used exclusively.
- `bucket_size`: Number of molecules with the same scaffold/similarity that can be accepted before further molecules are penalized.
- `minscore`: Minimum score a molecule must have to be registered in the diversity filter.
- `minsimilarity`: Minimum similarity threshold for scaffold similarity.
- `penalty_multiplier`: Penalty factor for `PenalizeSameSmiles`.
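The interplay of `bucket_size` and `minscore` can be illustrated with a small Python sketch. It is a toy model, not the REINVENT4 implementation: the class `BucketFilter` is hypothetical, scaffolds are represented by plain strings, and it is assumed that a full bucket sets the score of further molecules to zero.

```python
from collections import defaultdict


class BucketFilter:
    """Toy scaffold-bucket filter; not the REINVENT4 implementation."""

    def __init__(self, bucket_size: int, minscore: float):
        self.bucket_size = bucket_size
        self.minscore = minscore
        self.buckets = defaultdict(int)  # scaffold key -> registered molecules

    def score(self, scaffold: str, raw_score: float) -> float:
        # Molecules scoring below minscore are not registered in the filter.
        if raw_score < self.minscore:
            return raw_score
        # Assumption: once a bucket is full, further molecules score zero.
        if self.buckets[scaffold] >= self.bucket_size:
            return 0.0
        self.buckets[scaffold] += 1
        return raw_score


filt = BucketFilter(bucket_size=2, minscore=0.4)
scores = [filt.score("c1ccccc1", 0.8) for _ in range(3)]
print(scores)  # the third molecule with the same scaffold is penalized
```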
- `chkpt_file`: Name of the checkpoint file in which the agent model is stored after the run.
- `max_score`: Score threshold used as the termination criterion for the run.
- `min_steps`: Minimum number of steps to run.
- `max_steps`: Maximum number of steps to run.
- `type`: Aggregation method for the individual components of the reward function. Available options are `custom_product`, `custom_sum`, `hypervolume`, and `prod_plus_hypervolume`. For the aggregation methods that use the hypervolume, it is important that the individual scoring components are scaled between 0 and 1 and that all components have the same weight.
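The two preconditions for the hypervolume-based aggregation methods can be checked before a run. The following Python sketch shows one way to do so; the function `check_hypervolume_inputs` and the example scores and weights are hypothetical and not part of the code base.

```python
import math


def check_hypervolume_inputs(scores: dict, weights: dict, tol: float = 1e-9) -> list:
    """Return a list of problems; an empty list means the inputs look usable."""
    problems = []
    # Every transformed component score has to lie in the interval [0, 1].
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            problems.append(f"{name}: score {value} is outside [0, 1]")
    # All components have to carry the same weight.
    values = list(weights.values())
    if values and any(not math.isclose(w, values[0], abs_tol=tol) for w in values):
        problems.append("component weights are not all equal")
    return problems


ok = check_hypervolume_inputs(
    {"EnTdecker": 0.7, "Conjugation": 0.4},
    {"EnTdecker": 1.0, "Conjugation": 1.0},
)
bad = check_hypervolume_inputs(
    {"EnTdecker": 1.3, "Conjugation": 0.4},
    {"EnTdecker": 1.0, "Conjugation": 0.5},
)
print(ok)
print(bad)
```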
- Custom alerts

  The custom alerts component can be used to penalize molecules containing unwanted substructures.
  - `name`: Set to `Unwanted SMARTS`.
  - `weight`: Weight to fine-tune the relevance of this component.
  - `params.smarts`: List of SMARTS strings defining unwanted substructures.
- Triplet energy prediction

  The EnTdecker model is used to predict the triplet energy of generated molecules as described in this paper.
  - `name`: Set to a name for the scoring component, e.g., `"EnTdecker"`.
  - `weight`: Weight of the component in the overall score.
  - `params.checkpoint_dir`: Path to the directory containing the pretrained EnTdecker model. A downloaded model can be found in `Models/triplet_energy/model_42.pt`.
  - `params.rdkit_2d_normalized`: Set to `true` to use normalized 2D descriptors. Required for the EnTdecker model.
  - `params.target_column`: Set to `"e_t"`. Required for the EnTdecker model.
- ML-based absorption wavelength prediction

  The multi-fidelity model described in the paper by Greenman et al. is used to predict the maximum absorption wavelength of generated molecules.
  - `name`: Set to a name for the scoring component, e.g., `"ChemProp_uvvis"`.
  - `weight`: Weight of the component in the overall score.
  - `params.checkpoint4featuregen`: Path to the ensemble of ChemProp models used for predicting the S1 excitation energy. The models used in this work can be found in `Models/uvvis/lambda_max_abs_wb97xd3/chemprop/all_wb97xd3/production/fold_0`.
  - `params.checkpoint_dir`: Path to the ChemProp model checkpoint used for the low-fidelity prediction. The model used in this work can be found in `Models/uvvis/lambda_max_abs/chemprop_tddft/combined/production/fold_0`.
  - `params.tmp_dir`: Path to a temporary directory for storing intermediate files.
  - `params.target_column`: Set to `"peakwavs_max"`.
  - `params.rdkit_2d_normalized`: Set to `false`.
- Semi-empirical absorption wavelength prediction

  The semi-empirical excited state calculation component uses xtb and stda to calculate the maximum absorption wavelength of generated molecules.
  - `name`: Set to a name for the scoring component, e.g., `"SQM_lambda_max"`.
  - `weight`: Weight of the component in the overall score.
  - `params.tmp_dir`: Name of a temporary directory for storing intermediate files. Because this directory is created for the intermediate files and deleted afterwards, the path must not point to an existing directory. For example, with `/InvEnT/tmp/multiwfn_calcdir`, the directory `multiwfn_calcdir` is created during the generative process and must not exist before the code is run. The run terminates if the provided path is an existing directory, to prevent unintentional deletion of a folder.
  - `params.path_to_xtb`: Path to the xtb executable.
  - `params.path_to_stda`: Path to the directory containing the xtb4stda binary.
  - `params.maximum_waiting_time`: Maximum waiting time for the geometry optimization, in seconds.
  - `params.use_stddft`: Set to `false` to use stda instead of stddft for excited state calculations.
  - `params.use_gfnff`: Set to `true` to use gfn-ff instead of gfn2 for geometry optimization.
  - `params.target_property`: Set to `lambda_max`.
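The requirement that the temporary directory must not exist beforehand can be pictured with the following Python sketch. It illustrates the guard described for `params.tmp_dir` and is not the code used by the component; the helper names `create_scratch_dir` and `remove_scratch_dir` are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def create_scratch_dir(path: str) -> Path:
    """Create a fresh scratch directory; refuse to reuse an existing one."""
    scratch = Path(path)
    if scratch.exists():
        # Aborting here protects an existing folder from the later cleanup.
        raise FileExistsError(f"{scratch} already exists; choose a new path")
    scratch.mkdir(parents=True)
    return scratch


def remove_scratch_dir(scratch: Path) -> None:
    """Delete the scratch directory and everything inside it."""
    shutil.rmtree(scratch)


# Demonstration inside a throwaway parent directory.
with tempfile.TemporaryDirectory() as parent:
    target = Path(parent) / "multiwfn_calcdir"
    scratch = create_scratch_dir(str(target))
    created = scratch.is_dir()
    try:
        create_scratch_dir(str(target))  # the second call hits the guard
        refused = False
    except FileExistsError:
        refused = True
    remove_scratch_dir(scratch)
    removed = not target.exists()

print(created, refused, removed)
```

Because the cleanup removes the whole directory tree, failing early on an existing path is the safer choice compared with silently reusing it.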
- Excited state character

  Computes the HOMO-LUMO overlap and estimates the nature of the excited state (CT or LE) using xtb and Multiwfn.
  - `name`: Set to `"Overlap_quick"`.
  - `weight`: Weight of the component.
  - `params.dir4tempfiles`: Name of a temporary directory for storing intermediate files. Because this directory is created for the intermediate files and deleted afterwards, the path must not point to an existing directory. For example, with `/InvEnT/tmp/multiwfn_calcdir`, the directory `multiwfn_calcdir` is created during the generative process and must not exist before the code is run. The run terminates if the provided path is an existing directory, to prevent unintentional deletion of a folder.
  - `params.path_to_xtb`: Path to the xtb executable.
  - `params.path_to_multiwfn`: Path to the Multiwfn executable.
  - `params.calculation_mode`: Set to `multiwfn_quick` (only the singlet geometry is used for the FMOs) or `multiwfn` (both singlet and triplet states are optimized).
  - `params.use_gfn2`: Set to `false` (the optimization uses GFN-FF).
  - `params.aggregation_mode`: Choose between `formula` (weighted sum) and `threshold` (binary score).
    - `formula`: The `score` is the sum of the singlet part ($S_{\text{part}}$) and the triplet part ($T_{\text{part}}$):

      $$\text{score} = S_{\text{part}} + T_{\text{part}}$$

      The parts are calculated using the following parameters:

      | Variable | Parameter in config file | Description |
      | --- | --- | --- |
      | $w_{\text{singlet}}$, $w_{\text{triplet}}$ | `params.Singlet_param`, `params.Triplet_param` | Weights for the overall singlet and triplet contributions. |
      | $S_{\text{overlap}}$, $S_{\text{distance}}$ | `params.S_overlap`, `params.S_distance` | Weights for the singlet overlap value ($O_{S_1}$) and the distance ($D_{S_1}$) between the HOMO and LUMO centers. |
      | $T_{\text{overlap}}$, $T_{\text{distance}}$ | `params.T_overlap`, `params.T_distance` | Weights for the triplet overlap value ($O_{T_1}$) and the distance ($D_{T_1}$) between the HOMO and LUMO centers. |

    - `threshold`: The `score` is a weighted sum of two binary components ($S_{\text{part}}$ and $T_{\text{part}}$), each of which is either 0 or 1:

      $$\text{score} = w_{\text{singlet}} \cdot S_{\text{part}} + w_{\text{triplet}} \cdot T_{\text{part}}$$

      The binary parts are determined by the following conditions:

      | Component | Condition | Character represented |
      | --- | --- | --- |
      | $S_{\text{part}}$ | 1 if $O_{S_1}$ < `params.S_overlap`, 0 otherwise | CT (charge transfer) |
      | $T_{\text{part}}$ | 1 if $O_{T_1}$ > `params.T_overlap`, 0 otherwise | LE (locally excited) |
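The `threshold` aggregation described above can be written out in a few lines of Python. This is a sketch of the stated formula only; the function name `threshold_score` and all numerical values are chosen for illustration and are not taken from the code base.

```python
def threshold_score(o_s1, o_t1, s_overlap, t_overlap, w_singlet, w_triplet):
    """Weighted sum of the two binary parts of the threshold mode."""
    # S_part is 1 if the singlet overlap is below its threshold (CT character).
    s_part = 1 if o_s1 < s_overlap else 0
    # T_part is 1 if the triplet overlap is above its threshold (LE character).
    t_part = 1 if o_t1 > t_overlap else 0
    return w_singlet * s_part + w_triplet * t_part


# Hypothetical thresholds and weights.
settings = dict(s_overlap=0.3, t_overlap=0.6, w_singlet=0.5, w_triplet=0.5)

both = threshold_score(0.2, 0.8, **settings)          # CT singlet and LE triplet
triplet_only = threshold_score(0.5, 0.8, **settings)  # only the triplet condition holds
neither = threshold_score(0.5, 0.1, **settings)       # neither condition holds
print(both, triplet_only, neither)  # 1.0 0.5 0.0
```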
- Conjugation

  Computes the degree of conjugation in molecules.
  - `name`: Set to `"Conjugation"`.
  - `weight`: Weight of the component.
  - `params.mode`: Choose between `fraction` (fraction of conjugated bonds) and `largest_conjugated_fragment` (size of the largest fragment).
  - `params.exclude_split_system`: Set to `true` to exclude molecules with disconnected conjugated structures.
To define the target values for each scoring component, the corresponding transformation functions must be set using the transforms keywords in the config file.