Predict the Synthetic Accessibility (SA) of molecules like an experienced synthetic chemist
To mirror how chemists think, SynFrag approaches SA prediction through Fragment Assembly autoregressive Generation pretraining. It translates the synthesis workflow, in which chemists start from commercially available building blocks and assemble the target stepwise through systematic reactions, into a machine learning task for SA prediction. SynFrag achieves state-of-the-art (SOTA) performance across three benchmarks and two real-world scenario test sets. It also offers chemical interpretability through fragment assembly patterns and attention heatmaps that correspond to reactive sites. In addition, it can identify "synthetic difficulty cliffs" and the relative SA of intermediates in multi-step synthetic reactions, a capability that is key to bridging in silico and in-lab drug discovery.
- Stage 1: Pretrain on 9.18M unlabeled molecules to learn fragment assembly patterns.
- Stage 2: Finetune on 800K labeled molecules to transfer this knowledge to SA prediction.
- ⚡ Accuracy and Robustness at High Speed.
- 🧩 Chemical Intuition and Interpretability.
- 📦 Easy Integration via Online Platform.
A free, open-access web platform combining rapid SA screening with interpretable attention heatmaps and an integrated retrosynthesis tool.
# Clone repository
git clone https://github.com/simmzx/SynFrag.git
# Create environment and install dependencies
conda create -n SynFrag python=3.8
conda activate SynFrag
pip install -r requirements.txt

Create a CSV file with a "smiles" field:
| molecule_id | SMILES |
|---|---|
| Palbociclib | CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C |
| (+)-Eburnamonine | [C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H] |
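
An input file like the one above can also be written programmatically. The following is a minimal sketch using Python's standard `csv` module. The helper name `write_input_csv` and the file name `my_molecules.csv` are illustrative and not part of SynFrag. The example table uses the header `SMILES`, so that is the default here; the `smiles_column` parameter is an assumption that lets you switch if your copy of `synfrag.py` expects a different header.

```python
import csv


def write_input_csv(path, molecules, smiles_column="SMILES"):
    """Write (molecule_id, smiles) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["molecule_id", smiles_column])
        for molecule_id, smiles in molecules:
            writer.writerow([molecule_id, smiles])


# The two molecules from the example table above.
molecules = [
    ("Palbociclib",
     "CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C"),
    ("(+)-Eburnamonine",
     "[C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H]"),
]

# Illustrative output file name; any path works.
write_input_csv("my_molecules.csv", molecules)
```

The resulting file can then be passed to `synfrag.py` through the `--input_file` option.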
CSV Mode
cd ../SynFrag/model/
python synfrag.py --input_file example.csv

SMILES Mode
# Single molecule
python synfrag.py --smiles "CCO"
# Multiple molecules
python synfrag.py --smiles "CCO" "CC(=O)O" "c1ccccc1"

The output file will contain the "SynFrag" predictions:
| molecule_id | SMILES | SynFrag |
|---|---|---|
| Palbociclib | CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C | 0.9453 |
| (+)-Eburnamonine | [C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H] | 0.0286 |
SynFrag Interpretation:
- Score ≥ 0.5: Easy to Synthesize (ES)
- Score < 0.5: Hard to Synthesize (HS)
- Score close to 0.5: decision-boundary molecules
Note: While 0.5 serves as the default binary cutoff, users can adjust the threshold based on application needs:
- High-throughput screening (0.3-0.4): Prioritize recall, reduce false negatives of synthesizable candidates.
- Resource-limited synthesis (0.6-0.7): Prioritize precision, avoid investing in difficult targets.
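
The interpretation rules above can be applied to predicted scores in a few lines of Python. This is a minimal sketch, not part of SynFrag itself: the function names `classify` and `near_boundary` are illustrative, and the `margin` of 0.05 used to flag decision-boundary molecules is an assumption, since the text does not define how close to the cutoff counts as "close".

```python
def classify(score, threshold=0.5):
    """Label a score 'ES' (easy to synthesize) if it is at or above the
    threshold, otherwise 'HS' (hard to synthesize)."""
    return "ES" if score >= threshold else "HS"


def near_boundary(score, threshold=0.5, margin=0.05):
    """Flag scores within `margin` of the threshold.

    The margin value is an assumption chosen for illustration.
    """
    return abs(score - threshold) <= margin


# Scores taken from the example output table.
scores = {"Palbociclib": 0.9453, "(+)-Eburnamonine": 0.0286}

for name, score in scores.items():
    default_label = classify(score)               # default cutoff of 0.5
    strict_label = classify(score, threshold=0.7)  # resource-limited setting
    print(name, score, default_label, strict_label, near_boundary(score))
```

Lowering the threshold (for example to 0.3) labels more molecules as ES, which favors recall; raising it (for example to 0.7) labels fewer molecules as ES, which favors precision.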
Custom Pretraining and Finetuning Workflow
For researchers working with specialized chemical spaces or proprietary compound libraries, SynFrag supports custom training pipelines.
python synfrag_pretrain.py \
--dataset smiles.txt \
--vocab fragment.txt

Input:
- smiles.txt: Text file containing one SMILES string per line. These molecules are unlabeled; no annotations are required.
- fragment.txt: Fragment vocabulary file generated by BRICS+2 fragmentation. Create it using:
python ./scripts/utils/mol/cls.py --input smiles.txt

This vocabulary defines the label space for the autoregressive assembly task, capturing common structural motifs in your dataset.
Output:
- Pretrained AttentiveFP encoder saved as gnn_pretrained.pth, ready for downstream finetuning.
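
The `smiles.txt` input for pretraining (one SMILES string per line, no labels) can be assembled from any collection of SMILES strings. The sketch below uses only the Python standard library; the helper name `write_smiles_file` is illustrative. It removes blank entries and exact textual duplicates only. It does not validate or canonicalize the structures, so two different SMILES strings for the same molecule are both kept; a cheminformatics toolkit such as RDKit would be needed for that.

```python
def write_smiles_file(path, smiles_iterable):
    """Write unique, non-empty SMILES strings to `path`, one per line.

    Returns the number of lines written. Deduplication is purely textual.
    """
    seen = set()
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for smiles in smiles_iterable:
            smiles = smiles.strip()
            if not smiles or smiles in seen:
                continue  # skip blanks and exact repeats
            seen.add(smiles)
            handle.write(smiles + "\n")
            count += 1
    return count


# Small illustrative input containing a blank entry and a repeat.
n_written = write_smiles_file(
    "smiles.txt", ["CCO", "CC(=O)O", "", "CCO ", "c1ccccc1"]
)
print(n_written)
```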
python synfrag_finetune.py \
--input_model_file gnn_pretrained.pth \
--dataset dataset.csv

Input:
- gnn_pretrained.pth: Model checkpoint from the pretraining stage, containing molecular representations with fragment assembly patterns.
- dataset.csv: CSV file with a 'smiles' column and a binary label column for your specific task.
Output:
- Finetuned model saved as synfrag_finetuned.pth, ready for inference on the target chemical space.
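
Before launching a finetuning run, it can help to sanity-check the labeled CSV. The sketch below is a standalone check written with the standard `csv` module and is not part of SynFrag. The column name `label` and the `0`/`1` label encoding are assumptions, since the text above specifies only a 'smiles' column and a binary label; adjust both to match your data.

```python
import csv


def check_finetune_csv(path, smiles_column="smiles", label_column="label"):
    """Return a list of problems found in a finetuning CSV.

    An empty list means the basic checks passed. The label column name and
    the 0/1 encoding are assumptions made for this sketch.
    """
    problems = []
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames or []
        for column in (smiles_column, label_column):
            if column not in fieldnames:
                problems.append(f"missing column: {column}")
        if problems:
            return problems
        # The header is line 1, so data rows start at line 2.
        for row_number, row in enumerate(reader, start=2):
            if not (row[smiles_column] or "").strip():
                problems.append(f"row {row_number}: empty SMILES")
            if (row[label_column] or "").strip() not in {"0", "1"}:
                problems.append(f"row {row_number}: label is not 0 or 1")
    return problems


# Build a small demo file with one deliberately bad row.
with open("dataset_demo.csv", "w", newline="", encoding="utf-8") as handle:
    csv.writer(handle).writerows(
        [["smiles", "label"], ["CCO", "1"], ["c1ccccc1", "0"], ["", "2"]]
    )

for problem in check_finetune_csv("dataset_demo.csv"):
    print(problem)
```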
- Python ≥ 3.8
- ≥ 4 NVIDIA A100 GPUs (recommended)
- Key dependencies: PyTorch, Miniforge, RDKit, DGL, DGL-Life, DeepChem
If you find this repository and our paper useful, please cite our work.
@article{zhang2025synfrag,
title = {SynFrag: Synthetic Accessibility Predictor based on Fragment Assembly Generation in Drug Discovery},
author = {Zhang, Xiang and Liu, Jia and Xu, Bufan and Zhang, Zihan and Huang, Zifu and Chen, Kaixian and Wang, Dingyan and Li, Xutong},
journal = {ChemRxiv},
year = {2025}
}

For technical support, please contact: Xiang Zhang (Email: zhangxiang@simm.ac.cn)
⭐ Like this project? Give us a Star
