AIDD PyPI GitHubEmail License: MIT

SynFrag

Synthetic Accessibility Predictor based on Fragment Assembly Generation
in Drug Discovery

📃 DOI · 📕 PDF


💗 What Makes SynFrag Different?

Predict the Synthetic Accessibility (SA) of molecules like an experienced synthetic chemist

To mirror how chemists think, SynFrag approaches SA prediction through Fragment Assembly autoregressive Generation pretraining. It recasts the synthesis workflow, in which chemists start from commercially available building blocks and assemble them stepwise through systematic reactions, as a machine learning task for SA prediction. SynFrag achieves state-of-the-art (SOTA) performance across three benchmarks and two real-world scenario test sets. It also offers chemical interpretability through fragment assembly patterns and attention heatmaps that correspond to reactive sites. In addition, it can identify "synthetic difficulty cliffs" and rank the relative SA of intermediates in multi-step synthetic reactions, an ability that is key to bridging in silico and in-lab drug discovery.

SA-aware Learning:

  • Stage 1: Pretrain on 9.18M unlabeled molecules to learn fragment assembly patterns.
  • Stage 2: Finetune on 800K labeled molecules to transfer that knowledge to SA prediction.

🎇 Key Features

  • ⚡ Accuracy and robustness at high speed.
  • 🧩 Chemical intuition and interpretability.
  • 📦 Easy integration and an online platform.

A free, open-access web platform combines rapid SA screening with interpretable attention heatmaps and an integrated retrosynthetic tool.


🚀 Quick Start

1. Installation

    # Clone repository
    git clone https://github.com/simmzx/SynFrag.git

    # Create environment and install dependencies
    conda create -n SynFrag python=3.8
    conda activate SynFrag
    pip install -r requirements.txt

2. Prepare Data

Create a CSV file with a "smiles" field:

molecule_id        SMILES
Palbociclib        CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C
(+)-Eburnamonine   [C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H]
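
The input file can also be written programmatically. The following is a minimal sketch using only the Python standard library. The helper name `write_input_csv` and the comma delimiter are assumptions, and the header spelling (`SMILES`) simply follows the example table rather than a confirmed requirement of `synfrag.py`.

```python
import csv

# Example molecules copied from the table above.
MOLECULES = [
    ("Palbociclib",
     "CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C"),
    ("(+)-Eburnamonine",
     "[C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H]"),
]


def write_input_csv(path, molecules):
    """Write (molecule_id, SMILES) pairs to a comma-separated file.

    Hypothetical helper: the column names mirror the example table.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["molecule_id", "SMILES"])
        writer.writerows(molecules)
```

Calling `write_input_csv("example.csv", MOLECULES)` would produce a file matching the name used in the CSV Mode command.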

3. Run Prediction

CSV Mode

    cd ../SynFrag/model/
    python synfrag.py --input_file example.csv

SMILES Mode

    # Single molecule
    python synfrag.py --smiles "CCO"
    # Multiple molecules
    python synfrag.py --smiles "CCO" "CC(=O)O" "c1ccccc1"
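
When predictions are launched from another Python program, the two modes above can be wrapped in a small helper. This is a sketch under stated assumptions: `build_command` and `run_synfrag` are hypothetical names, only the `--input_file` and `--smiles` flags shown above are used, and the working directory is assumed to be the one containing `synfrag.py`.

```python
import shlex
import subprocess
import sys


def build_command(smiles=None, input_file=None):
    """Build the argument list for synfrag.py in CSV mode or SMILES mode."""
    if (smiles is None) == (input_file is None):
        raise ValueError("provide exactly one of smiles or input_file")
    command = [sys.executable, "synfrag.py"]
    if input_file is not None:
        command += ["--input_file", input_file]
    else:
        # Each SMILES string is passed as its own argument, so no shell
        # quoting is needed for characters such as parentheses.
        command += ["--smiles", *smiles]
    return command


def run_synfrag(**kwargs):
    """Run synfrag.py and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(**kwargs), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    print(shlex.join(build_command(smiles=["CCO", "CC(=O)O", "c1ccccc1"])))
```

Passing the arguments as a list avoids shell interpretation of SMILES characters, which is the main reason to prefer this over a formatted command string.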

4. View Results

The output file will contain "SynFrag" predictions:

molecule_id        SMILES                                                              SynFrag
Palbociclib        CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C    0.9453
(+)-Eburnamonine   [C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H]               0.0286

SynFrag Interpretation:

  • Score ≥ 0.5: Easy to Synthesize (ES)
  • Score < 0.5: Hard to Synthesize (HS)
  • Score close to 0.5: decision-boundary molecules

Note: While 0.5 serves as the default binary cutoff, users can adjust the threshold based on application needs:

  • High-throughput screening (0.3-0.4): Prioritize recall and reduce false negatives among synthesizable candidates.
  • Resource-limited synthesis (0.6-0.7): Prioritize precision and avoid investing in difficult targets.
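
The interpretation rules above can be applied to predicted scores with a few lines of Python. This is an illustrative sketch, not part of the SynFrag code base: the function names `classify` and `near_boundary` are hypothetical, and the `margin` that defines "close to the threshold" is an arbitrary choice made for the example.

```python
def classify(score, threshold=0.5):
    """Return "ES" (easy to synthesize) if score >= threshold, else "HS"."""
    return "ES" if score >= threshold else "HS"


def near_boundary(score, threshold=0.5, margin=0.05):
    """Flag decision-boundary molecules; the margin is an assumed value."""
    return abs(score - threshold) <= margin


# Scores taken from the example output table above.
scores = {"Palbociclib": 0.9453, "(+)-Eburnamonine": 0.0286}

for name, score in scores.items():
    default_label = classify(score)               # default cutoff of 0.5
    strict_label = classify(score, threshold=0.7)  # resource-limited setting
    print(name, default_label, strict_label, near_boundary(score))
```

Lowering the threshold (for example to 0.3) labels more molecules as ES, which matches the recall-oriented screening setting, while raising it has the opposite effect.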

🛸 Advanced Usage

Custom Pretraining and Finetuning Workflow
For researchers working with specialized chemical spaces or proprietary compound libraries, SynFrag supports custom training pipelines.

1. Pretrain

    python synfrag_pretrain.py \
        --dataset smiles.txt \
        --vocab fragment.txt 

Input:

  • smiles.txt: Text file containing one SMILES string per line. These molecules are unlabeled; no annotations are required.
  • fragment.txt: Fragment vocabulary file generated by BRICS+2 fragmentation. Create it using:
    python ./scripts/utils/mol/cls.py --input smiles.txt

This vocabulary defines the label space for the autoregressive assembly task, capturing common structural motifs in your dataset.
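
A `smiles.txt` file in the one-SMILES-per-line layout described above can be assembled from any iterable of strings. The sketch below uses only the standard library. The helper name `write_smiles_file` is hypothetical, and skipping blank entries and duplicates is an assumption made here for tidiness, not a documented requirement of `synfrag_pretrain.py`.

```python
def write_smiles_file(path, smiles_iterable):
    """Write unique, non-empty SMILES strings to `path`, one per line.

    Returns the number of lines written. First-seen order is preserved.
    """
    seen = set()
    with open(path, "w", encoding="utf-8") as handle:
        for raw in smiles_iterable:
            smiles = raw.strip()
            if smiles and smiles not in seen:
                seen.add(smiles)
                handle.write(smiles + "\n")
    return len(seen)
```

For example, `write_smiles_file("smiles.txt", ["CCO", "CC(=O)O", "CCO"])` would write two lines, since the repeated entry is dropped.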

Output:

  • Pretrained AttentiveFP encoder saved as gnn_pretrained.pth, ready for downstream finetuning.

2. Finetune

    python synfrag_finetune.py \
        --input_model_file gnn_pretrained.pth \
        --dataset dataset.csv

Input:

  • gnn_pretrained.pth: Model checkpoint from the pretraining stage, containing molecular representations that encode fragment assembly patterns.
  • dataset.csv: CSV file with a 'smiles' column and a binary label column for your specific task.

Output:

  • Finetuned model saved as synfrag_finetuned.pth, ready for inference on target chemical space.
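
Before finetuning, it can help to check that `dataset.csv` has the expected shape. The following is a minimal sketch, assuming the label column is named `label` and that labels are encoded as `0` or `1`; neither detail is specified above, so adjust both to match your data. The function name `validate_finetune_csv` is hypothetical.

```python
import csv


def validate_finetune_csv(path, label_column="label"):
    """Check for a 'smiles' column and 0/1 labels; return the row count.

    The label column name and the 0/1 encoding are assumptions.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = {"smiles", label_column} - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        count = 0
        # Data rows start on line 2 because line 1 holds the header.
        for line_number, row in enumerate(reader, start=2):
            if not (row.get("smiles") or "").strip():
                raise ValueError(f"line {line_number}: empty SMILES")
            if (row.get(label_column) or "").strip() not in {"0", "1"}:
                raise ValueError(f"line {line_number}: label is not 0 or 1")
            count += 1
    return count
```

Running this check first surfaces malformed rows early, before a training run is started.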

🛠️ Requirements

🤗 Citation

If you find this repository and our paper useful, please cite our work.

@article{zhang2025synfrag,
  title     = {SynFrag: Synthetic Accessibility Predictor based on Fragment Assembly Generation in Drug Discovery},
  author    = {Zhang, Xiang and Liu, Jia and Xu, Bufan and Zhang, Zihan and Huang, Zifu and Chen, Kaixian and Wang, Dingyan and Li, Xutong},
  journal   = {ChemRxiv},
  year      = {2025}
}

💌 Contact

For technical support, please contact: Xiang Zhang (Email: zhangxiang@simm.ac.cn)


Like this project? Give us a Star
