Genomic Publications Agent

A framework for biomedical literature analysis, focused on two tasks: extending Named Entity Recognition (NER) models with genomic-region entities, and hybrid relation extraction. The repository combines Natural Language Processing (NLP) techniques, including Large Language Model (LLM) data augmentation and fine-tuning, to extract relationships between genomic coordinates, variants, genes, and diseases.

🔬 Core Components

The repository is structured around three main scientific modules:

1. BioNER Entity Extension Framework

Located in: src/genomic_publication_agent/entity_extension/

A robust framework designed to extend pre-trained biomedical Named Entity Recognition (NER) models (e.g., PubMedBERT, BioBERT) with new entity types without suffering from catastrophic forgetting.

- Methodology: Uses a teacher-student architecture in which the original model (Teacher) generates soft labels to preserve knowledge of the original entities (Genes, Diseases, Chemicals), while the Student model learns the new entity type (e.g., GenomicRegion) from a small set of annotated examples.
- Data Augmentation: Integrates an LLM-based augmentation pipeline (using GPT-4 or Claude) to generate synthetic training data that broadens diversity and coverage.
- Performance: Achieves >0.96 F1-score on new entities while maintaining >0.95 F1-score on original entities (see Experiment Results).
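The teacher-student objective described above can be sketched in a few lines. This is a minimal, framework-free illustration and not the repository's actual loss: the function names (`distillation_loss`, `token_loss`), the temperature, the `alpha` weighting, and the choice to renormalize the student over the teacher's labels are all assumptions made for the example. A real trainer would compute the same quantities on batched tensors.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over the teacher's (original) label set.

    Assumption: the student's labels are the teacher's labels followed by
    the new ones, so the first len(teacher_logits) student logits line up
    with the teacher's.
    """
    n_old = len(teacher_logits)
    p = softmax(teacher_logits, temperature)           # soft labels
    q = softmax(student_logits[:n_old], temperature)   # student, old labels only
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(student_logits, gold_index):
    """Standard cross-entropy against a hard (annotated) label."""
    return -math.log(softmax(student_logits)[gold_index])


def token_loss(teacher_logits, student_logits, gold_index=None, alpha=0.5):
    """Combine soft-label and hard-label terms for one token.

    Tokens without a new-entity annotation use only the distillation term,
    which keeps the student close to the teacher on the original entities.
    """
    kd = distillation_loss(teacher_logits, student_logits)
    if gold_index is None:
        return kd
    return alpha * kd + (1 - alpha) * cross_entropy(student_logits, gold_index)


# Teacher knows 3 labels; the student adds a 4th (e.g. GenomicRegion).
teacher = [1.0, 2.0, 3.0]
student = [1.0, 2.0, 3.0, 0.5]
print(token_loss(teacher, student))                 # unannotated token
print(token_loss(teacher, student, gold_index=3))   # token annotated with the new label
```

The distillation term is zero whenever the student reproduces the teacher's distribution over the original labels, so any penalty in that case comes only from the annotated new-entity tokens.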

2. Biomedical Relation Extraction Pipeline

Located in: src/genomic_publication_agent/relation_extraction/

A hybrid pipeline combining rule-based precision with LLM-based semantic understanding to extract complex biological relationships.

Hybrid Architecture:

- Entity Extraction: Combines PubTator3 (API), GLiNER-BioMed (Zero-shot NER), and Regex (Genomic Coordinates) for comprehensive entity detection.
- Relation Validation: Uses LLMs (via LangChain) to validate potential relationships between entities (e.g., Variant causes Disease) by analyzing the full publication context.

Features:

- Genome Build Detection: Automatically identifies genome builds (hg19/hg38) to normalize coordinates.
- Context-Aware: Validates relations based on sentence-level and abstract-level context.
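The regex and genome-build steps can be illustrated with a short sketch. It is not the code in `entity_extractor.py`: the pattern, the names `COORD_RE`, `BUILD_RE`, `detect_build`, and `extract_coordinates`, and the "most frequently mentioned build wins" rule are assumptions chosen for the example. The only domain fact relied on is that GRCh37 and GRCh38 are the assembly names corresponding to hg19 and hg38.

```python
import re
from collections import Counter

# Matches "chr17:41234567" and ranges like "chr17:41,196,312-41,277,500".
# The "chr" prefix is required so that ratios such as "1:100" are not matched.
COORD_RE = re.compile(
    r"\bchr(?P<chrom>1\d|2[0-2]|[1-9]|X|Y|MT?)"
    r":(?P<start>\d+(?:,\d{3})*)"
    r"(?:[-–](?P<end>\d+(?:,\d{3})*))?",
    re.IGNORECASE,
)

BUILD_RE = re.compile(r"\b(hg19|hg38|GRCh37|GRCh38)\b", re.IGNORECASE)
BUILD_ALIASES = {"hg19": "hg19", "grch37": "hg19", "hg38": "hg38", "grch38": "hg38"}


def detect_build(text):
    """Return the most frequently mentioned build, or None if none is named."""
    counts = Counter(BUILD_ALIASES[name.lower()] for name in BUILD_RE.findall(text))
    return counts.most_common(1)[0][0] if counts else None


def _to_int(number):
    """Parse a coordinate that may contain thousands separators."""
    return int(number.replace(",", ""))


def extract_coordinates(text):
    """Find coordinate mentions and tag each with the detected build."""
    build = detect_build(text)
    regions = []
    for match in COORD_RE.finditer(text):
        start = _to_int(match.group("start"))
        end = _to_int(match.group("end")) if match.group("end") else start
        regions.append({
            "chrom": "chr" + match.group("chrom").upper(),
            "start": start,
            "end": end,
            "build": build,          # None when the text names no build
            "span": match.span(),    # character offsets in the source text
        })
    return regions


text = ("A deletion at chr17:41,196,312-41,277,500 (GRCh37) and "
        "a variant at chrX:153296777 (hg19) were reported.")
for region in extract_coordinates(text):
    print(region)
```

Tagging each region with a build only records which assembly the coordinates refer to. Converting between hg19 and hg38 requires a liftover step, which this sketch does not attempt.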

3. AIONER Fine-tuning for Genomic Regions

Located in: experiments/genomic_region_2025_11_02/

An experimental module focused on fine-tuning the AIONER (All-In-One Entity Recognizer) model to recognize GenomicRegion entities.

- Goal: To create a fallback model capable of detecting specific genomic coordinates (e.g., "chr17:41234567") alongside standard biomedical entities.
- Results (Experiment 2025_11_02):
  - GenomicRegion F1: 0.978
  - Gene F1: 0.983
  - Disease F1: 0.978
  - Overall F1: 0.968
- Strategy: Fine-tuned microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext using a dataset augmented with LLM-generated examples, employing a replay buffer to prevent forgetting of AIONER's original 6 entity types.
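A replay buffer, as mentioned in the strategy, mixes examples of the original entity types back into each training batch so the model keeps seeing them while it learns GenomicRegion. The sketch below shows one simple way to do that. The function name `mix_with_replay`, the `replay_ratio` parameter, and the fixed-ratio sampling scheme are assumptions for illustration, not the experiment's actual data loader.

```python
import random


def mix_with_replay(new_examples, replay_buffer, replay_ratio=0.5,
                    batch_size=8, seed=0):
    """Yield batches in which a fixed share of examples is replayed.

    new_examples  -- examples annotated with the new entity type
    replay_buffer -- examples covering the original entity types
    replay_ratio  -- fraction of each batch drawn from the replay buffer
    """
    if not 0 <= replay_ratio < 1:
        raise ValueError("replay_ratio must be in [0, 1)")
    rng = random.Random(seed)
    n_replay = int(round(batch_size * replay_ratio))
    n_new = batch_size - n_replay
    if n_new <= 0:
        raise ValueError("batch must contain at least one new example")

    shuffled = list(new_examples)
    rng.shuffle(shuffled)
    for i in range(0, len(shuffled), n_new):
        batch = shuffled[i:i + n_new]
        # Sample replayed examples without replacement within a batch.
        batch += rng.sample(replay_buffer, min(n_replay, len(replay_buffer)))
        rng.shuffle(batch)
        yield batch


new = [("GenomicRegion", i) for i in range(8)]
old = [("original", i) for i in range(20)]
for batch in mix_with_replay(new, old, replay_ratio=0.5, batch_size=8):
    print([source for source, _ in batch])
```

A higher `replay_ratio` protects the original entity types more strongly but slows learning of the new type, so the value is usually tuned against both sets of F1 scores.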

🚀 Installation

Prerequisites

- Python 3.9+
- Git (with submodule support)

Setup

Clone the repository:

```
git clone https://github.com/yourusername/genomic-publications-agent.git
cd genomic-publications-agent
```

Initialize Submodules (AIONER):
```
git submodule update --init --recursive
```

Create Virtual Environment:

```
python -m venv venv
source venv/bin/activate
```

Install Dependencies:
```
pip install -r requirements.txt
```
Download AIONER Models (for Experimentation): Follow the instructions in experiments/genomic_region_2025_11_02/AIONER_QUICKSTART.md to download pretrained models to external/AIONER/pretrained_models/.

💻 Usage

1. Entity Extension Training

Train a model to recognize a new entity type (e.g., GenomicRegion):

```
python -m src.genomic_publication_agent.entity_extension.cli.main train \
  path/to/annotations.json \
  --model "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext" \
  --entity-type "GenomicRegion" \
  --output experiments/output_dir \
  --prevent-forgetting \
  --fallback-ner-model "aioner"
```

2. Relation Extraction

Run the pipeline to extract entities and relations from a set of PMIDs:

```
# Example usage via CLI (if implemented) or Python script
python -m src.cli.analyze --pmids 32735606 --output results.csv
```

3. Running Tests

The project maintains high test coverage. Run the full suite from the repository root:

```
pytest tests/
```

📂 Project Structure

```
genomic-publications-agent/
├── src/
│   └── genomic_publication_agent/
│       ├── entity_extension/       # BioNER Extension Framework
│       │   ├── training/           # Training logic (Trainer, Loss)
│       │   ├── augmentation/       # LLM Data Augmentation
│       │   └── inference/          # Inference pipelines
│       └── relation_extraction/    # Hybrid Relation Extraction
│           ├── entity_extractor.py # Regex + PubTator + GLiNER
│           └── relation_builder.py # LLM Relation Validation
├── experiments/                    # Experiment configs and results
│   └── genomic_region_2025_11_02/  # AIONER Fine-tuning Experiment
├── external/
│   └── AIONER/                     # AIONER Submodule
└── tests/                          # Comprehensive test suite
```

📊 Experiment Results

Results from the Genomic Region Extension experiment (2025-11-02):

| Entity Type | Precision | Recall | F1 Score |
|---|---|---|---|
| GenomicRegion | 0.956 | 1.000 | 0.978 |
| Gene | 0.967 | 1.000 | 0.983 |
| Disease | 1.000 | 0.958 | 0.978 |
| Overall | 0.972 | 0.964 | 0.968 |

Note: The model successfully learned to identify genomic regions without degrading performance on standard entities.
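For readers who want to relate the columns, F1 is the harmonic mean of precision and recall. The snippet below is a small illustration using the values reported above. The helper name `f1_score` is introduced here for the example and is not part of the repository.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the results reported above.
reported = {
    "GenomicRegion": (0.956, 1.000),
    "Gene": (0.967, 1.000),
    "Disease": (1.000, 0.958),
    "Overall": (0.972, 0.964),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.3f}")
```

Because the reported precision and recall are themselves rounded to three decimals, a value recomputed this way can differ from the reported F1 in the last digit.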

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

📚 Citation

If you use this work, please cite:

[Author Name], et al. "A Hybrid Framework for Genomic Coordinate Extraction and Relation Mining in Biomedical Literature." 2025.

Repository: biodatageeks/genomic-publications-agent
