SynFrag

Synthetic Accessibility Predictor based on Fragment Assembly Generation
in Drug Discovery

📃 DOI · 📕 PDF

💗 What Makes SynFrag Different?

Predict the Synthetic Accessibility (SA) of molecules like an experienced synthetic chemist

To mirror how chemists think, SynFrag approaches SA prediction through fragment-assembly autoregressive generation pretraining. This recasts the synthesis workflow, in which chemists start from commercially available building blocks and assemble a molecule stepwise through systematic reactions, as a machine learning task. SynFrag achieves state-of-the-art performance on three benchmarks and two real-world scenario test sets. It also offers chemical interpretability through fragment assembly patterns and attention heatmaps that correspond to reactive sites. In addition, it can identify "synthetic difficulty cliffs" and rank the relative SA of intermediates in multi-step synthetic routes, an ability that is key to bridging in silico and in-lab drug discovery.

SA-aware Learning:

Stage 1: Pretrain on 9.18M unlabeled molecules to learn fragment assembly patterns.
Stage 2: Finetune on 800K labeled molecules to transfer this knowledge to SA prediction.

🎇 Key Features

⚡ Accuracy and robustness at high speed.
🧩 Chemical intuition and interpretability.
📦 Easy integration through the online platform.

🌐 SynFrag Online Service

A free, open-access web platform that combines rapid SA screening with interpretable attention heatmaps and an integrated retrosynthesis tool.

🚀 Quick Start

1. Installation

    # Clone repository
    git clone https://github.com/simmzx/SynFrag.git

    # Create environment and install dependencies
    conda create -n SynFrag python=3.8
    conda activate SynFrag
    pip install -r requirements.txt

2. Prepare Data

Create a CSV file with a "smiles" field:

molecule_id	SMILES
Palbociclib	CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C
(+)-Eburnamonine	[C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H]
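
The input file described above can also be generated programmatically. The following is a minimal sketch using only the Python standard library. It assumes a comma-separated file whose SMILES column is named "smiles", as the instruction states; the example table shows the header as "SMILES", so check which spelling synfrag.py expects. The helper name write_input_csv is hypothetical and not part of SynFrag.

```python
import csv

# The two example molecules from the table above.
MOLECULES = [
    ("Palbociclib", "CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C"),
    ("(+)-Eburnamonine", "[C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H]"),
]


def write_input_csv(path, molecules, smiles_column="smiles"):
    """Write (molecule_id, SMILES) pairs to a comma-separated file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Header row; the SMILES column name is an assumption (see the text above).
        writer.writerow(["molecule_id", smiles_column])
        writer.writerows(molecules)
    return path
```

Calling write_input_csv("example.csv", MOLECULES) would produce a file with the same name as the one used in the prediction step.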

3. Run Prediction

CSV Mode

    cd ../SynFrag/model/
    python synfrag.py --input_file example.csv

SMILES Mode

    # Single molecule
    python synfrag.py --smiles "CCO"
    # Multiple molecules
    python synfrag.py --smiles "CCO" "CC(=O)O" "c1ccccc1"
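
When predictions are needed from inside another Python program, the two documented modes can be wrapped with subprocess. This is a sketch under stated assumptions: it uses only the --input_file and --smiles flags shown above, it must be run from the directory containing synfrag.py, and it makes no assumption about what the script prints. The function names build_synfrag_command and run_synfrag are hypothetical.

```python
import subprocess
import sys


def build_synfrag_command(smiles=None, input_file=None, script="synfrag.py"):
    """Build the argument list for CSV mode or SMILES mode."""
    if (smiles is None) == (input_file is None):
        raise ValueError("pass exactly one of smiles or input_file")
    command = [sys.executable, script]
    if input_file is not None:
        command += ["--input_file", input_file]
    else:
        # Each SMILES string is its own argument, so no shell quoting is needed.
        command += ["--smiles", *smiles]
    return command


def run_synfrag(**kwargs):
    """Run the script and return whatever it wrote to standard output."""
    result = subprocess.run(
        build_synfrag_command(**kwargs),
        check=True,            # raise CalledProcessError on a non-zero exit code
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with SMILES characters such as parentheses and brackets.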

4. View Results

The output file will contain "SynFrag" predictions:

molecule_id	SMILES	SynFrag
Palbociclib	CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C	0.9453
(+)-Eburnamonine	[C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H]	0.0286

SynFrag Interpretation:

Score ≥ 0.5: Easy to Synthesize (ES)

Score < 0.5: Hard to Synthesize (HS)

Score close to 0.5: decision-boundary molecules

Note: 0.5 is the default binary cutoff, but users can adjust the threshold to suit the application:

High-throughput screening (0.3-0.4): Prioritize recall to reduce false negatives among synthesizable candidates.

Resource-limited synthesis (0.6-0.7): Prioritize precision to avoid investing in difficult targets.
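
The interpretation rules above can be applied to a score column with a small helper. This is an illustrative sketch, not part of SynFrag: the function name classify_sa is hypothetical, and the margin parameter is an assumed width for "close to the threshold", since no numeric definition of a decision-boundary molecule is given here.

```python
def classify_sa(score, threshold=0.5, margin=0.05):
    """Map a SynFrag score to "ES" or "HS" and flag borderline cases.

    threshold defaults to the 0.5 cutoff described above.
    margin is a hypothetical width for the decision-boundary band.
    """
    label = "ES" if score >= threshold else "HS"
    borderline = abs(score - threshold) <= margin
    return label, borderline


if __name__ == "__main__":
    # Scores from the example output table, at the default cutoff.
    print(classify_sa(0.9453))
    print(classify_sa(0.0286))
    # The same borderline score under a lenient and a strict cutoff.
    print(classify_sa(0.45, threshold=0.35))
    print(classify_sa(0.45, threshold=0.65))
```

Lowering the threshold labels more molecules ES (higher recall), while raising it labels fewer (higher precision), which is the trade-off the two application ranges describe.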

🛸 Advanced Usage

Custom Pretraining and Finetuning Workflow

For researchers working with specialized chemical spaces or proprietary compound libraries, SynFrag supports custom training pipelines.

1. Pretrain

    python synfrag_pretrain.py \
        --dataset smiles.txt \
        --vocab fragment.txt

Input:

smiles.txt: Text file containing one SMILES string per line. These molecules are unlabeled; no annotations are required.

fragment.txt: Fragment vocabulary file generated by BRICS+2 fragmentation. Create using:

    python ./scripts/utils/mol/cls.py --input smiles.txt

This vocabulary defines the label space for the autoregressive assembly task, capturing common structural motifs in your dataset.
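
A smiles.txt file in the one-SMILES-per-line layout described above can be written from any Python iterable of strings. The sketch below is a convenience helper under stated assumptions, not part of SynFrag: the name write_smiles_file is hypothetical, and dropping blank and duplicate entries is an optional cleanup that the pretraining script is not known to require. Duplicates are detected by exact string match only, so two different SMILES spellings of the same molecule are both kept.

```python
def write_smiles_file(path, smiles_iterable):
    """Write one SMILES string per line, skipping blanks and exact duplicates."""
    seen = set()
    kept = []
    for raw in smiles_iterable:
        smiles = raw.strip()
        if smiles and smiles not in seen:
            seen.add(smiles)
            kept.append(smiles)
    with open(path, "w", encoding="utf-8") as handle:
        handle.writelines(smiles + "\n" for smiles in kept)
    # Number of molecules actually written.
    return len(kept)
```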

Output:

Pretrained AttentiveFP encoder saved as gnn_pretrained.pth, ready for downstream finetuning.

2. Finetune

    python synfrag_finetune.py \
        --input_model_file gnn_pretrained.pth \
        --dataset dataset.csv

Input:

gnn_pretrained.pth: Model checkpoint from pretraining stage containing molecular representations with fragment assembly patterns.

dataset.csv: CSV file with a 'smiles' column and a binary label column for your specific task.
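
Before starting a long finetuning run, it can help to check that dataset.csv has the expected shape. The following is a minimal sketch using the standard library. The label column name "label" and the encoding of labels as 0 and 1 are assumptions, since only the 'smiles' column name is given here, and check_finetune_csv is a hypothetical helper.

```python
import csv


def check_finetune_csv(path, smiles_column="smiles", label_column="label"):
    """Return (n_label_0, n_label_1), or raise ValueError on a malformed file."""
    counts = {"0": 0, "1": 0}
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fields = reader.fieldnames or []
        for column in (smiles_column, label_column):
            if column not in fields:
                raise ValueError(f"missing column: {column}")
        # Row 1 is the header, so data rows are numbered from 2.
        for row_number, row in enumerate(reader, start=2):
            smiles = (row[smiles_column] or "").strip()
            label = (row[label_column] or "").strip()
            if not smiles:
                raise ValueError(f"row {row_number}: empty SMILES")
            if label not in counts:
                raise ValueError(f"row {row_number}: label {label!r} is not 0 or 1")
            counts[label] += 1
    return counts["0"], counts["1"]
```

The returned pair also gives a quick view of class balance, which is worth knowing before choosing a decision threshold.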

Output:

Finetuned model saved as synfrag_finetuned.pth, ready for inference on target chemical space.

🛠️ Requirements

Python ≥ 3.8
≥ 4 NVIDIA A100 GPUs (recommended)
Key dependencies: PyTorch, Miniforge, RDKit, DGL, DGL-LifeSci, DeepChem

🤗 Citation

If you find this repository and our paper useful, please cite our work.

@article{zhang2025synfrag,
  title     = {SynFrag: Synthetic Accessibility Predictor based on Fragment Assembly Generation in Drug Discovery},
  author    = {Zhang, Xiang and Liu, Jia and Xu, Bufan and Zhang, Zihan and Huang, Zifu and Chen, Kaixian and Wang, Dingyan and Li, Xutong},
  journal   = {ChemRxiv},
  year      = {2025}
}

💌 Contact

For technical support, please contact: Xiang Zhang (Email: zhangxiang@simm.ac.cn)

⭐ Like this project? Give us a Star
