Skip to content

PharMolix/OCSU

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery

arXiv ckpts dataset

Molecules represent tokens of the language of chemistry, which underlies not only chemistry itself, but also scientific fields that use chemical information such as pharmacy, material science, and molecular biology. Existing molecular information is distributed across text books, publications, and patents. To describe structural information (spatial arrangement of atoms), molecules are commonly drawn as 2D images in such documents, which makes Optical Chemical Structure Understanding (OCSU) play an important role in molecule-centric scientific discovery.

Optical chemical structure recognition (OCSR) is a well-explored task in the molecular information extraction field, which takes the image as input and predicts the SMILES representation of the molecule. Nevertheless, SMILES representation is machine readable strings, while it is not chemist-friendly. This hampers overall understanding of the optical chemical structure and hinders further application of cutting-edge natural language processing approaches, e.g. LLM, for moleculecentric scientific discovery. Our work expands low-level recognition to multilevel understanding and aims to translate chemical structure diagrams into readable strings for both machine and chemist.

OCSU aims to automatically translate chemical structure diagrams into chemist-readable or machine-readable strings that describe the molecule from motif level to molecule level and abstract level. Typically, it includes four subtasks, that is, functional group caption, molecular description, chemist-readable IUPAC naming, and machine-readable SMILES naming (OCSR). On the basis of these, molecular structural information can be fully extracted to support downstream tasks, such as moleculecentric chat, property prediction, and molecule editing.

More information can be found at Arxiv. OCSU will be added to OpenBioMed toolkit. Stay tuned!

News

Setup

  • Setup environment

    # Virtual environment
    conda create -n ocsu python=3.10
    conda activate ocsu
    
    # llama_factory (0.9.1.dev0) setup
    cd OCSU
    pip install -r requirements.txt
    pip install -e .[metrics]
    llamafactory-cli version
    
    # Optional installation
    pip install deepspeed
    pip install flash-attn --no-build-isolation
  • Setup dataset

    • Download Vis-CheBI20 Dataset

    • Create the symlinks

      # Training set
      ln -s ${Vis-CheBI20_ROOT}/train.json data/Vis-CheBI20
      # Test set
      ln -s ${Vis-CheBI20_ROOT}/test.json data/Vis-CheBI20

Experiments

  • Training

    # Here we provide a script for training with Vis-CheBI20.
    CUDA_VISIBLE_DEVICES=0,1 llamafactory-cli train config/mol-vl_7b.yaml
  • Inference

    Mol-VL-7B is released @ HuggingFace

    • Inference service

      # Setup service @ port 8000
      CUDA_VISIBLE_DEVICES=0 API_PORT=8000 llamafactory-cli api config/inference/mol-vl_7b.yaml
    • Predict

      # Here we provide a script for evaluation on Vis-CheBI20.
      python scripts/inference.py
  • Evaluation

    # Here we provide a script for evaluation on Vis-CheBI20.
    python scripts/evaluate.py

    scibert_scivocab_uncased is adopted for evaluation. Please download to ckpts/scibert_scivocab_uncased.

  • Demo

    Checkout our Jupytor notebooks for a quick start!

Acknowledgment

This repo benefits from LLaMA_Factory group. Thanks for their wonderful works.

Citation

If you find our work helpful to your research, please consider giving this repository a 🌟star and 📎citing the following article. Thank you for your support!

@article{fan2025ocsu,
  title={OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery},
  author={Fan, Siqi and Xie, Yuguang and Cai, Bowen and Xie, Ailin and Liu, Gaochao and Qiao, Mu and Xing, Jie and Nie, Zaiqing},
  journal={arXiv preprint arXiv:2501.15415},
  year={2025}
}

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors