biodatageeks/genomic-publications-agent

Genomic Publications Agent

A comprehensive framework for biomedical literature analysis, focusing on Genomic Region Entity Extension and Hybrid Relation Extraction. This repository implements state-of-the-art Natural Language Processing (NLP) techniques, including Large Language Model (LLM) augmentation and fine-tuning, to extract precise relationships between genomic coordinates, variants, genes, and diseases.

🔬 Core Components

The repository is structured around three main scientific modules:

1. BioNER Entity Extension Framework

Located in: src/genomic_publication_agent/entity_extension/

A robust framework designed to extend pre-trained biomedical Named Entity Recognition (NER) models (e.g., PubMedBERT, BioBERT) with new entity types without suffering from catastrophic forgetting.

  • Methodology: Utilizes a teacher-student architecture where the original model (Teacher) generates soft labels to preserve knowledge of original entities (Genes, Diseases, Chemicals), while the Student model learns the new entity type (e.g., GenomicRegion) from a small set of annotated examples.
  • Data Augmentation: Integrates an LLM-based augmentation pipeline (using GPT-4 or Claude) to generate high-quality synthetic training data, ensuring diversity and coverage.
  • Performance: Achieves >0.96 F1-score on new entities while maintaining >0.95 F1-score on original entities (see Experiment Results).
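
The teacher-student objective described above can be sketched as a per-token loss. This is a minimal illustration, not the repository's actual trainer: the function names, the `alpha` weighting, the temperature, and the choice to compare teacher and student only over the original label set are all assumptions made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, gold_label,
                      alpha=0.5, temperature=2.0):
    """Hypothetical per-token loss: alpha * CE(gold) + (1 - alpha) * KL(teacher || student).

    student_logits covers the original labels followed by the new label(s);
    teacher_logits covers the original labels only.
    """
    n_old = len(teacher_logits)

    # Hard-label term: cross-entropy against the annotated label,
    # computed over every class the student knows (old and new).
    student_probs = softmax(student_logits)
    cross_entropy = -math.log(student_probs[gold_label])

    # Soft-label term: KL divergence on the original labels only, so the
    # student is pulled toward the teacher's distribution over old entities.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits[:n_old], temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(teacher_soft, student_soft) if t > 0)

    return alpha * cross_entropy + (1 - alpha) * kl
```

When the student's logits for the original labels match the teacher's, the KL term is zero and only the hard-label term remains; the further the student drifts on old entities, the larger the penalty.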

2. Biomedical Relation Extraction Pipeline

Located in: src/genomic_publication_agent/relation_extraction/

A hybrid pipeline combining rule-based precision with LLM-based semantic understanding to extract complex biological relationships.

  • Hybrid Architecture:
    • Entity Extraction: Combines PubTator3 (API), GLiNER-BioMed (Zero-shot NER), and Regex (Genomic Coordinates) for comprehensive entity detection.
    • Relation Validation: Uses LLMs (via LangChain) to validate potential relationships between entities (e.g., Variant causes Disease) by analyzing the full publication context.
  • Features:
    • Genome Build Detection: Automatically identifies genome builds (hg19/hg38) to normalize coordinates.
    • Context-Aware: Validates relations based on sentence-level and abstract-level context.
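
As a rough illustration of the regex and genome-build steps, the sketch below extracts coordinates such as "chr17:41234567" and tags them with a detected build. The pattern, the function names, and the mapping of GRCh37/GRCh38 onto hg19/hg38 are assumptions for this example; the real entity_extractor.py may work differently.

```python
import re

# Assumed pattern: a mandatory "chr" prefix, a chromosome name, a start
# position, and an optional end position. Thousands separators are allowed.
COORD_RE = re.compile(
    r"\bchr(?P<chrom>[1-9]|1\d|2[0-2]|X|Y|M)"
    r":(?P<start>\d+(?:,\d{3})*)"
    r"(?:-(?P<end>\d+(?:,\d{3})*))?",
    re.IGNORECASE,
)

# Assumption: GRCh37/GRCh38 mentions are treated as hg19/hg38 respectively.
BUILD_ALIASES = {"hg19": "hg19", "grch37": "hg19",
                 "hg38": "hg38", "grch38": "hg38"}
BUILD_RE = re.compile(r"\b(hg19|hg38|GRCh37|GRCh38)\b", re.IGNORECASE)


def detect_build(text):
    """Return 'hg19' or 'hg38' if exactly one build is mentioned, else None."""
    found = {BUILD_ALIASES[name.lower()] for name in BUILD_RE.findall(text)}
    return found.pop() if len(found) == 1 else None


def extract_regions(text):
    """Find genomic coordinates in text and attach the detected build."""
    build = detect_build(text)
    regions = []
    for match in COORD_RE.finditer(text):
        start = int(match.group("start").replace(",", ""))
        end_raw = match.group("end")
        end = int(end_raw.replace(",", "")) if end_raw else start
        regions.append({
            "chrom": "chr" + match.group("chrom").upper(),
            "start": start,
            "end": end,
            "build": build,          # None when absent or ambiguous
            "span": match.span(),    # character offsets in the input text
        })
    return regions
```

Returning None for an ambiguous build (both hg19 and hg38 mentioned) is a deliberate choice in this sketch: it leaves the decision to a later step rather than guessing which build a coordinate belongs to.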

3. AIONER Fine-tuning for Genomic Regions

Located in: experiments/genomic_region_2025_11_02/

An experimental module focused on fine-tuning the AIONER (All-In-One Entity Recognizer) model to recognize GenomicRegion entities.

  • Goal: To create a fallback model capable of detecting specific genomic coordinates (e.g., "chr17:41234567") alongside standard biomedical entities.
  • Results (Experiment 2025_11_02):
    • GenomicRegion F1: 0.978
    • Gene F1: 0.983
    • Disease F1: 0.978
    • Overall F1: 0.968
  • Strategy: Fine-tuned microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext using a dataset augmented with LLM-generated examples, employing a replay buffer to prevent forgetting of AIONER's original 6 entity types.
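
The replay-buffer idea can be illustrated with a small helper that mixes stored examples of the original entity types into each round of new-entity training data. The function name, the `replay_ratio` parameter, and the example format are hypothetical; the experiment's actual sampling scheme is not described here.

```python
import random


def mix_with_replay(new_examples, replay_buffer, replay_ratio=0.5, seed=0):
    """Return a shuffled training set of new examples plus replayed old ones.

    replay_ratio is the number of replayed examples per new example
    (capped by the size of the buffer).
    """
    rng = random.Random(seed)
    n_replay = min(len(replay_buffer),
                   round(len(new_examples) * replay_ratio))
    # Sample without replacement so no old example is repeated in one round.
    replayed = rng.sample(replay_buffer, n_replay)
    mixed = list(new_examples) + replayed
    rng.shuffle(mixed)
    return mixed


# Toy data: (sentence, entity type) pairs.
new_examples = [(f"region sentence {i}", "GenomicRegion") for i in range(4)]
replay_buffer = [(f"gene sentence {i}", "Gene") for i in range(5)] + \
                [(f"disease sentence {i}", "Disease") for i in range(5)]

training_set = mix_with_replay(new_examples, replay_buffer, replay_ratio=0.5)
```

With four new examples and a ratio of 0.5, the resulting set holds the four GenomicRegion examples plus two examples drawn from the original entity types.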

🚀 Installation

Prerequisites

  • Python 3.9+
  • Git (with submodule support)

Setup

  1. Clone the repository:

    git clone https://github.com/biodatageeks/genomic-publications-agent.git
    cd genomic-publications-agent
  2. Initialize Submodules (AIONER):

    git submodule update --init --recursive
  3. Create Virtual Environment:

    python -m venv venv
    source venv/bin/activate
  4. Install Dependencies:

    pip install -r requirements.txt
  5. Download AIONER Models (for Experimentation): Follow the instructions in experiments/genomic_region_2025_11_02/AIONER_QUICKSTART.md to download pretrained models to external/AIONER/pretrained_models/.

💻 Usage

1. Entity Extension Training

Train a model to recognize a new entity type (e.g., GenomicRegion):

python -m src.genomic_publication_agent.entity_extension.cli.main train \
  path/to/annotations.json \
  --model "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext" \
  --entity-type "GenomicRegion" \
  --output experiments/output_dir \
  --prevent-forgetting \
  --fallback-ner-model "aioner"

2. Relation Extraction

Run the pipeline to extract entities and relations from a set of PMIDs:

# Example usage via CLI (if implemented) or Python script
python -m src.cli.analyze --pmids 32735606 --output results.csv

3. Running Tests

The test suite lives in tests/ and runs with pytest:

pytest tests/

📂 Project Structure

genomic-publications-agent/
├── src/
│   └── genomic_publication_agent/
│       ├── entity_extension/       # BioNER Extension Framework
│       │   ├── training/           # Training logic (Trainer, Loss)
│       │   ├── augmentation/       # LLM Data Augmentation
│       │   └── inference/          # Inference pipelines
│       └── relation_extraction/    # Hybrid Relation Extraction
│           ├── entity_extractor.py # Regex + PubTator + GLiNER
│           └── relation_builder.py # LLM Relation Validation
├── experiments/                    # Experiment configs and results
│   └── genomic_region_2025_11_02/  # AIONER Fine-tuning Experiment
├── external/
│   └── AIONER/                     # AIONER Submodule
└── tests/                          # Comprehensive test suite

📊 Experiment Results

Results from the Genomic Region Extension experiment (2025-11-02):

Entity Type      Precision   Recall   F1 Score
GenomicRegion    0.956       1.000    0.978
Gene             0.967       1.000    0.983
Disease          1.000       0.958    0.978
Overall          0.972       0.964    0.968

Note: The model successfully learned to identify genomic regions without degrading performance on standard entities.
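
For readers who want to check the table, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short script below recomputes it from the reported precision and recall values. Because those inputs are already rounded to three decimals, the recomputed F1 can differ from the reported one in the last digit, so the check uses a tolerance of 0.001 (an assumption of this sketch, not a statement about how the experiment computed its metrics).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table.
reported = {
    "GenomicRegion": (0.956, 1.000, 0.978),
    "Gene": (0.967, 1.000, 0.983),
    "Disease": (1.000, 0.958, 0.978),
    "Overall": (0.972, 0.964, 0.968),
}

for name, (precision, recall, f1) in reported.items():
    computed = f1_score(precision, recall)
    consistent = abs(computed - f1) <= 0.001
    print(f"{name:<14} computed={computed:.4f} reported={f1:.3f} consistent={consistent}")
```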

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

📚 Citation

If you use this work, please cite:

[Author Name], et al. "A Hybrid Framework for Genomic Coordinate Extraction and Relation Mining in Biomedical Literature." 2025.
