A comprehensive framework for biomedical literature analysis, focusing on Genomic Region Entity Extension and Hybrid Relation Extraction. This repository implements state-of-the-art Natural Language Processing (NLP) techniques, including Large Language Model (LLM) augmentation and fine-tuning, to extract precise relationships between genomic coordinates, variants, genes, and diseases.
The repository is structured around three main scientific modules:
Located in: src/genomic_publication_agent/entity_extension/
A robust framework designed to extend pre-trained biomedical Named Entity Recognition (NER) models (e.g., PubMedBERT, BioBERT) with new entity types without suffering from catastrophic forgetting.
- Methodology: Utilizes a teacher-student architecture where the original model (Teacher) generates soft labels to preserve knowledge of original entities (Genes, Diseases, Chemicals), while the Student model learns the new entity type (e.g., `GenomicRegion`) from a small set of annotated examples.
- Data Augmentation: Integrates an LLM-based augmentation pipeline (using GPT-4 or Claude) to generate high-quality synthetic training data, ensuring diversity and coverage.
- Performance: Achieves >0.96 F1-score on new entities while maintaining >0.95 F1-score on original entities (see Experiment Results).
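The teacher-student idea can be illustrated with a per-token loss that combines a distillation term on the original labels with a cross-entropy term on the new label. This is a minimal sketch in plain Python, not the repository's training code: the function names `softmax` and `token_loss`, the weighting parameter `alpha`, and the choice of KL divergence for the distillation term are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_loss(student_logits, teacher_probs, gold_new_label=None, alpha=0.5):
    """Hypothetical per-token loss for extending a tagger with a new label.

    student_logits: scores over the original labels followed by the new ones.
    teacher_probs:  teacher soft labels over the original labels only.
    gold_new_label: index of the annotated new label, or None if unannotated.
    alpha:          assumed weight of the new-label term (not from the source).
    """
    student_probs = softmax(student_logits)
    n_old = len(teacher_probs)

    # Distillation term: KL(teacher || student), with the student distribution
    # renormalised over the original label set so both sum to one.
    old_mass = sum(student_probs[:n_old])
    distill = 0.0
    for t, s in zip(teacher_probs, student_probs[:n_old]):
        if t > 0:
            distill += t * math.log(t / (s / old_mass))

    if gold_new_label is None:
        return distill

    # Supervised term: cross-entropy on the annotated new entity label.
    cross_entropy = -math.log(student_probs[gold_new_label])
    return alpha * cross_entropy + (1 - alpha) * distill


teacher = softmax([2.0, 0.5, 0.1])              # soft labels over 3 old labels
unlabelled = token_loss([2.0, 0.5, 0.1, -1.0], teacher)
labelled = token_loss([2.0, 0.5, 0.1, 5.0], teacher, gold_new_label=3)
```

When the student agrees with the teacher on the original labels, the distillation term is close to zero, so the remaining loss comes only from the new entity type. That is the mechanism by which soft labels are meant to limit forgetting.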
Located in: src/genomic_publication_agent/relation_extraction/
A hybrid pipeline combining rule-based precision with LLM-based semantic understanding to extract complex biological relationships.
- Hybrid Architecture:
  - Entity Extraction: Combines PubTator3 (API), GLiNER-BioMed (Zero-shot NER), and Regex (Genomic Coordinates) for comprehensive entity detection.
  - Relation Validation: Uses LLMs (via LangChain) to validate potential relationships between entities (e.g., Variant causes Disease) by analyzing the full publication context.
- Features:
  - Genome Build Detection: Automatically identifies genome builds (hg19/hg38) to normalize coordinates.
  - Context-Aware: Validates relations based on sentence-level and abstract-level context.
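The regex and genome build steps above can be sketched as follows. This is an illustrative example only, not the code in `entity_extractor.py`: the pattern, the helper names `detect_build` and `extract_regions`, and the decision to treat GRCh37/GRCh38 as aliases of hg19/hg38 are assumptions. The pattern requires a `chr` prefix to avoid matching things like times or ratios.

```python
import re

# Assumed pattern: "chr" + chromosome, ":" + start, optional "-" + end.
# Positions may use thousands separators (e.g. 41,234,567).
COORD_RE = re.compile(
    r"\bchr(\d{1,2}|X|Y|MT?):(\d+(?:,\d{3})*)(?:-(\d+(?:,\d{3})*))?"
)
BUILD_RE = re.compile(r"\b(hg19|hg38|GRCh37|GRCh38)\b", re.IGNORECASE)
# Simplification: GRCh names are mapped onto the matching UCSC names.
BUILD_ALIASES = {"hg19": "hg19", "grch37": "hg19", "hg38": "hg38", "grch38": "hg38"}


def detect_build(text):
    """Return 'hg19' or 'hg38' if exactly one build is mentioned, else None."""
    found = {BUILD_ALIASES[name.lower()] for name in BUILD_RE.findall(text)}
    return found.pop() if len(found) == 1 else None


def _to_int(number):
    """Convert '41,234,567' or '41234567' to an integer."""
    return int(number.replace(",", ""))


def extract_regions(text):
    """Find coordinate mentions and tag each with the detected build."""
    build = detect_build(text)
    regions = []
    for match in COORD_RE.finditer(text):
        chrom, start, end = match.groups()
        start_pos = _to_int(start)
        end_pos = _to_int(end) if end else start_pos  # single position
        regions.append(
            {"chrom": "chr" + chrom, "start": start_pos, "end": end_pos, "build": build}
        )
    return regions


abstract = (
    "Coordinates follow GRCh37. The deletion spans "
    "chr17:41,234,567-41,240,000; a second variant lies at chrX:1000."
)
for region in extract_regions(abstract):
    print(region)
```

Returning `None` when a text mentions both builds (or neither) leaves the decision to a later step instead of guessing.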
Located in: experiments/genomic_region_2025_11_02/
An experimental module focused on fine-tuning the AIONER (All-In-One Entity Recognizer) model to recognize GenomicRegion entities.
- Goal: To create a fallback model capable of detecting specific genomic coordinates (e.g., "chr17:41234567") alongside standard biomedical entities.
- Results (Experiment `2025_11_02`):
  - GenomicRegion F1: 0.978
  - Gene F1: 0.983
  - Disease F1: 0.978
  - Overall F1: 0.968
- Strategy: Fine-tuned `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext` using a dataset augmented with LLM-generated examples, employing a replay buffer to prevent forgetting of AIONER's original 6 entity types.
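A replay buffer, in its simplest form, mixes a sample of original-task examples into the new training set so the model keeps seeing the old entity types. The sketch below shows that idea only. The function name `build_replay_mix`, the `replay_ratio` parameter, and the tuple-based example format are hypothetical and are not taken from the experiment code.

```python
import random


def build_replay_mix(new_examples, original_examples, replay_ratio=0.5, seed=0):
    """Mix new-entity examples with a random sample of original examples.

    replay_ratio is the assumed number of replayed examples per new example
    (0.5 means one replayed example for every two new ones).
    """
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    n_replay = min(len(original_examples), int(len(new_examples) * replay_ratio))
    replayed = rng.sample(list(original_examples), n_replay)
    mixed = list(new_examples) + replayed
    rng.shuffle(mixed)
    return mixed


# Toy data: 10 GenomicRegion sentences and 100 sentences with original labels.
new_data = [("GenomicRegion", i) for i in range(10)]
old_data = [("Original", i) for i in range(100)]
training_set = build_replay_mix(new_data, old_data, replay_ratio=0.5)
print(len(training_set))
```

The ratio controls the trade-off: more replayed examples protect the original entity types but dilute the signal for the new one.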
- Python 3.9+
- Git (with submodule support)
- Clone the repository:

      git clone https://github.com/yourusername/genomic-publications-agent.git
      cd genomic-publications-agent

- Initialize Submodules (AIONER):

      git submodule update --init --recursive

- Create Virtual Environment:

      python -m venv venv
      source venv/bin/activate

- Install Dependencies:

      pip install -r requirements.txt

- Download AIONER Models (for Experimentation): Follow the instructions in `experiments/genomic_region_2025_11_02/AIONER_QUICKSTART.md` to download pretrained models to `external/AIONER/pretrained_models/`.
Train a model to recognize a new entity type (e.g., GenomicRegion):

    python -m src.genomic_publication_agent.entity_extension.cli.main train \
        path/to/annotations.json \
        --model "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext" \
        --entity-type "GenomicRegion" \
        --output experiments/output_dir \
        --prevent-forgetting \
        --fallback-ner-model "aioner"

Run the pipeline to extract entities and relations from a set of PMIDs:

    # Example usage via CLI (if implemented) or Python script
    python -m src.cli.analyze --pmids 32735606 --output results.csv

The project maintains high test coverage. Run the test suite with:

    pytest tests/

    genomic-publications-agent/
    ├── src/
    │   └── genomic_publication_agent/
    │       ├── entity_extension/        # BioNER Extension Framework
    │       │   ├── training/            # Training logic (Trainer, Loss)
    │       │   ├── augmentation/        # LLM Data Augmentation
    │       │   └── inference/           # Inference pipelines
    │       └── relation_extraction/     # Hybrid Relation Extraction
    │           ├── entity_extractor.py  # Regex + PubTator + GLiNER
    │           └── relation_builder.py  # LLM Relation Validation
    ├── experiments/                     # Experiment configs and results
    │   └── genomic_region_2025_11_02/   # AIONER Fine-tuning Experiment
    ├── external/
    │   └── AIONER/                      # AIONER Submodule
    └── tests/                           # Comprehensive test suite

Results from the Genomic Region Extension experiment (2025-11-02):
| Entity Type | Precision | Recall | F1 Score |
|---|---|---|---|
| GenomicRegion | 0.956 | 1.000 | 0.978 |
| Gene | 0.967 | 1.000 | 0.983 |
| Disease | 1.000 | 0.958 | 0.978 |
| Overall | 0.972 | 0.964 | 0.968 |
Note: The model successfully learned to identify genomic regions without degrading performance on standard entities.
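For readers checking the table, F1 is the harmonic mean of precision and recall, $F_1 = 2PR / (P + R)$. The short sketch below recomputes two rows from the published precision and recall values. The helper name `f1_score` is illustrative. Because the table's inputs are already rounded to three decimals, a recomputed value can differ from a reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows taken from the results table above.
genomic_region_f1 = round(f1_score(0.956, 1.000), 3)
overall_f1 = round(f1_score(0.972, 0.964), 3)
print(genomic_region_f1, overall_f1)
```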
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this work, please cite:
[Author Name], et al. "A Hybrid Framework for Genomic Coordinate Extraction and Relation Mining in Biomedical Literature." 2025.