AAAIM is an LLM-powered system for annotating biosimulation models with standardized ontology terms. It currently supports chemical, gene, and protein annotation for species, and KEGG annotation for reactions, in SBML models.
```shell
# python = 3.12

# Install dependencies
pip install -r requirements.txt
```

Set up your LLM provider API keys:

```shell
# For OpenAI models (gpt-4o-mini, gpt-4.1-nano)
export OPENAI_API_KEY="your-openai-key"

# For OpenRouter models (meta-llama/llama-3.3-70b-instruct:free)
export OPENROUTER_API_KEY="your-openrouter-key"

# For Llama models (Llama-3.3-70B-Instruct)
export LLAMA_API_KEY="your-llama-key"
```

Alternatively, you can set up a `.env` file that looks like the following:

```
OPENAI_API_KEY=<your-openai-api-key-here>
OPENROUTER_API_KEY=<your-openrouter-api-key-here>
LLAMA_API_KEY=<your-llama-api-key-here>
```

AAAIM currently provides two main workflows for model annotation:
- Purpose: Annotate models with no or limited existing annotations
- Input: All species in the model
- Output: Annotation recommendations for all species
- Metrics: Accuracy is NA when no existing annotations are available
- Large Models: Automatically splits models with >50 entities into chunks to avoid LLM context limits
```python
from core import annotate_model

# Annotate all chemical species in a model
recommendations_df, metrics = annotate_model(
    model_file="path/to/model.xml",
    entity_type="chemical",
    database="chebi"
)

# Save results
recommendations_df.to_csv("chemical_annotation_results.csv", index=False)
```

```python
from core import annotate_model

# Annotate all gene species in a model
recommendations_df, metrics = annotate_model(
    model_file="path/to/model.xml",
    entity_type="gene",
    database="ncbigene",
    tax_id="9606"  # for human
)

# Save results
recommendations_df.to_csv("gene_annotation_results.csv", index=False)
```

```python
from core import annotate_model

# Annotate all protein species in a model
recommendations_df, metrics = annotate_model(
    model_file="path/to/model.xml",
    entity_type="protein",
    database="uniprot",
    tax_id="9606"  # for human
)

# Save results
recommendations_df.to_csv("protein_annotation_results.csv", index=False)
```

AAAIM supports automatic detection of entity types (chemical, gene, protein, complex, or unknown) for models with mixed entity types:
```python
from core import annotate_model

# Annotate all species in a model by automatic type detection
# The LLM will determine the entity type for each species
recommendations_df, metrics = annotate_model(
    model_file="path/to/model.xml",
    entity_type="auto",
    database=["chebi", "uniprot"]  # choose from these databases
)

# The results will include a 'type' column indicating the detected entity type
# Species with unknown types are included in results but with empty predictions
print(recommendations_df[['species_id', 'type', 'synonyms_LLM', 'predictions_names']])

# Save results
recommendations_df.to_csv("auto_annotation_results.csv", index=False)
```

How it works:
- The LLM analyzes each species in context (display names, reactions, model notes) to determine its type
- Detected types: `chemical`, `gene`, `protein`, `complex`, or `unknown`
- Database matching is performed using the appropriate database for each detected type
- The `database` parameter accepts a list to specify which databases to use:
  - Chemicals → ChEBI
  - Genes → NCBI Gene
  - Proteins → UniProt
  - Complexes → ChEBI, UniProt, or NCBI Gene
- Species with `unknown` type are included in the results with their LLM-suggested synonyms but no database matches
- Purpose: Evaluate and improve existing annotations
- Input: Only species that already have annotations
- Output: Validation and improvement recommendations
- Metrics: Accuracy calculated against existing annotations
- Large Models: Automatically splits models with >50 entities into chunks to avoid LLM context limits
```python
from core import curate_model

# Curate existing chemical annotations
curations_df, metrics = curate_model(
    model_file="path/to/model.xml",
    entity_type="chemical",
    database="chebi"
)

print(f"Chemical entities with existing annotations: {metrics['total_entities']}")
print(f"Accuracy: {metrics['accuracy']:.1%}")
```

```python
from core import curate_model

# Curate existing gene annotations
curations_df, metrics = curate_model(
    model_file="path/to/model.xml",
    entity_type="gene",
    database="ncbigene"
)

print(f"Gene entities with existing annotations: {metrics['total_entities']}")
print(f"Accuracy: {metrics['accuracy']:.1%}")
```

```python
from core import curate_model

# Curate existing protein annotations
curations_df, metrics = curate_model(
    model_file="path/to/model.xml",
    entity_type="protein",
    database="uniprot",
    tax_id="9606"  # for human
)

print(f"Protein entities with existing annotations: {metrics['total_entities']}")
print(f"Accuracy: {metrics['accuracy']:.1%}")
```

After running `annotate_model` or `curate_model`, you can review the resulting CSV file and edit the `update_annotation` column for each entity:
- `add`: Add the recommended annotation to the model for that entity.
- `delete`: Remove the existing annotation for that entity.
- `ignore` or `keep`: Leave the annotation unchanged, either keeping the existing annotation or ignoring the new suggestion.
To apply your changes and save a new SBML model:
```python
from core.update_model import update_annotation

update_annotation(
    original_model_path="path/to/original_model.xml",
    recommendation_table="recommendations.csv",  # or a pandas DataFrame
    new_model_path="path/to/updated_model.xml",
    qualifier="is"  # (optional) bqbiol qualifier, default is 'is'
)
```

A summary of added and removed annotations will be printed after the update.
```python
# More control over parameters
recommendations_df, metrics = annotate_model(
    model_file="path/to/model.xml",
    llm_model="meta-llama/llama-3.3-70b-instruct:free",  # the LLM used to predict annotations
    max_entities=100,      # maximum number of entities to annotate (None for all)
    entity_type="gene",    # type of entities to annotate ("chemical", "gene", "protein", "auto")
    database="ncbigene",   # database to use ("chebi", "ncbigene", "uniprot") or a list for auto mode
    method="direct",       # method used to find the ontology ID ("direct", "rag")
    top_k=3,               # number of LLM synonyms and top database candidates to return per entity
    chunk_size=50          # split large models into chunks of 50 entities (None for no chunking)
)
```
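The `chunk_size` behavior can be sketched as follows. This is a hypothetical helper, not AAAIM's actual implementation; it only illustrates how a model with more entities than `chunk_size` would be split into pieces before being sent to the LLM:

```python
def chunk_entities(entity_ids, chunk_size=50):
    """Split a list of entity IDs into fixed-size chunks.

    Illustrative sketch of how large models (>50 entities) could be
    processed in pieces to stay within LLM context limits.
    """
    if chunk_size is None:
        return [entity_ids]  # no chunking requested
    return [entity_ids[i:i + chunk_size]
            for i in range(0, len(entity_ids), chunk_size)]

# A model with 120 entities yields chunks of 50, 50, and 20
print([len(c) for c in chunk_entities(list(range(120)))])  # [50, 50, 20]
```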
```python
# Direct access to qualifier tracking functions
from core.model_info import find_species_with_annotations_and_qualifiers

# Get annotations and qualifiers for any supported database
annotations, qualifiers = find_species_with_annotations_and_qualifiers(
    model_file="path/to/model.xml",
    database="chebi",  # or "ncbigene", "uniprot"
    bqbiol_qualifiers=['is', 'isVersionOf']  # optional: filter by specific qualifiers
)

print(f"Found {len(annotations)} species with annotations")
for species_id, annotation_ids in annotations.items():
    if species_id in qualifiers:
        print(f"{species_id}: {annotation_ids}")
        for ann_id, qualifier in qualifiers[species_id].items():
            print(f"  {ann_id} -> {qualifier}")
    else:
        print(f"{species_id}: {annotation_ids} (no qualifier info)")
```

```shell
# Using "tests/test_models/BIOMD0000000190.xml"
python examples/simple_example.py
```

AAAIM provides tools for evaluating annotation quality and analyzing results:
```python
from utils.evaluation import evaluate_single_model, print_evaluation_results

# Evaluate a single model and get detailed results with a 'type' column
result_df = evaluate_single_model(
    model_file="path/to/model.xml",
    llm_model="gpt-4o-mini",
    method="direct",
    top_k=3,
    entity_type="auto",  # or "chemical", "gene", "protein"
    database=["chebi", "uniprot"]  # for auto mode, or a single database for a specific type
)

# The result DataFrame includes a 'type' column showing detected entity types
print(result_df[['species_id', 'type', 'synonyms_LLM', 'predictions_names', 'accuracy']])

# Print summary statistics from a results CSV file
print_evaluation_results(
    results_csv="results.csv",
    ref_results_csv="reference_results.csv",  # optional: filter to only species in reference
    bqbiol_qualifiers=['is', 'isVersionOf'],  # optional: filter by annotation qualifiers
    entity_types=['chemical', 'protein']      # optional: filter by detected entity types
)
```

Output columns:
- `detected_entity_type`: Detected entity type (chemical, gene, protein, complex, or unknown)
- `synonyms_LLM`: LLM-suggested synonyms for the species
- `predictions`: Top-k database IDs matched for this species
- `predictions_names`: Corresponding names for the predicted IDs
- `exist_annotation_id`: Existing annotation IDs from the model (if any)
- `exist_annotation_name`: Names of existing annotations
- `accuracy`: Match accuracy between predictions and existing annotations
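As a usage sketch, the output table can be screened for entities whose predictions disagree with the existing annotations. The data here is made up; only the column names follow the description above:

```python
import pandas as pd

# Toy results table using the documented output columns
df = pd.DataFrame({
    "species_id": ["s1", "s2", "s3"],
    "detected_entity_type": ["chemical", "protein", "unknown"],
    "accuracy": [1.0, 0.0, 0.0],
})

# Flag species whose predictions did not match the existing annotation,
# skipping 'unknown' entities (which have no database matches by design)
needs_review = df[(df["accuracy"] < 1.0) &
                  (df["detected_entity_type"] != "unknown")]
print(needs_review["species_id"].tolist())  # ['s2']
```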
After the LLM performs synonym normalization, direct dictionary matching is used to look up ontology IDs for each synonym, counting how many synonyms hit each ID. The top_k candidates with the highest hit counts are returned.
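The hit-counting step can be sketched as follows. The function and dictionary names here are hypothetical; the real matching uses the compressed dictionaries described in the data files section:

```python
from collections import Counter

def direct_match(synonyms, name2ids, top_k=3):
    """Count how many LLM synonyms map to each ontology ID and
    return the top_k IDs by hit count (illustrative sketch)."""
    hits = Counter()
    for syn in synonyms:
        for ont_id in name2ids.get(syn.lower(), []):
            hits[ont_id] += 1
    return [ont_id for ont_id, _ in hits.most_common(top_k)]

# Toy dictionary in the spirit of cleannames2chebi
name2ids = {
    "glucose": ["CHEBI:17234"],
    "d-glucose": ["CHEBI:4167"],
    "dextrose": ["CHEBI:4167"],
}
print(direct_match(["D-glucose", "dextrose", "glucose"], name2ids, top_k=2))
# ['CHEBI:4167', 'CHEBI:17234']
```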
After the LLM performs synonym normalization, RAG with embeddings is used to find the top_k most similar ontology terms by cosine similarity.
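A minimal sketch of the cosine-similarity ranking, assuming precomputed term embeddings. The function name and the 2-dimensional toy vectors are illustrative only; AAAIM stores its actual embeddings in `data/chroma_storage/`:

```python
import numpy as np

def top_k_by_cosine(query_vec, term_vecs, term_ids, k=3):
    """Rank ontology terms by cosine similarity to a query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    t = term_vecs / np.linalg.norm(term_vecs, axis=1, keepdims=True)
    sims = t @ q                       # cosine similarity per term
    order = np.argsort(sims)[::-1][:k]  # indices of the k most similar terms
    return [term_ids[i] for i in order]

term_ids = ["CHEBI:17234", "CHEBI:15377", "CHEBI:16236"]
term_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(top_k_by_cosine(np.array([1.0, 0.1]), term_vecs, term_ids, k=2))
# ['CHEBI:17234', 'CHEBI:16236']
```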
To use RAG, create embeddings of the ontology first:
```shell
cd data

# for ChEBI:
python load_data.py --database chebi --model default

# for NCBI Gene, specify the taxonomy id:
python load_data.py --database ncbigene --model default --tax_id 9606

# for UniProt, specify the taxonomy id:
python load_data.py --database uniprot --model default --tax_id 9606

# for KEGG:
python load_data.py --database kegg --model default
```

- ChEBI: Chemical Entities of Biological Interest
  - Entity Type: `chemical`
  - All terms in ChEBI are included.
  - Used for: small molecules, metabolites, compounds
- NCBI Gene: Gene annotation
  - Entity Type: `gene`
  - Only genes for common species are supported (those included in BiGG models).
  - Used for: genes, DNA sequences, gene symbols
- UniProt: Protein annotation
  - Entity Type: `protein`
  - Only proteins for human (9606) and mouse (10090) are supported for now.
  - Used for: proteins, enzymes
- KEGG: Compound/reaction annotation
  - For reaction substrates and products.
When using entity_type="auto", AAAIM automatically selects the appropriate database(s) based on the detected entity type:
| Detected Type | Default Databases | Usage |
|---|---|---|
| `chemical` | ChEBI | Small molecules, metabolites, compounds |
| `gene` | NCBI Gene | Genes, DNA sequences, gene symbols |
| `protein` | UniProt | Proteins, enzymes |
| `complex` | ChEBI, UniProt, NCBI Gene | Protein complexes, chemical complexes |
| `unknown` | None | LLM synonyms included but no database matching |
You can restrict which databases are used by providing a list for the `database` parameter. For example, `database=["chebi", "uniprot"]` will use only ChEBI for chemicals and UniProt for proteins, and will not search NCBI Gene even if genes are detected.
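The routing described above can be sketched as a simple lookup. The helper names are hypothetical; AAAIM's actual selection logic lives inside `annotate_model` and may differ:

```python
# Default databases per detected entity type, mirroring the table above
DEFAULT_DATABASES = {
    "chemical": ["chebi"],
    "gene": ["ncbigene"],
    "protein": ["uniprot"],
    "complex": ["chebi", "uniprot", "ncbigene"],
    "unknown": [],
}

def select_databases(detected_type, allowed):
    """Return the databases to query for a detected type,
    restricted to the user-supplied `database` list."""
    return [db for db in DEFAULT_DATABASES.get(detected_type, []) if db in allowed]

# With database=["chebi", "uniprot"], genes are never searched
print(select_databases("complex", ["chebi", "uniprot"]))  # ['chebi', 'uniprot']
print(select_databases("gene", ["chebi", "uniprot"]))     # []
```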
- Rhea: Reaction annotation
- GO: Gene Ontology terms
- Location: `data/chebi/`
- Files:
  - `cleannames2chebi.lzma`: Mapping from clean names to ChEBI IDs
  - `chebi2label.lzma`: Mapping from ChEBI IDs to labels
  - `chebi2names.lzma`: ChEBI synonyms used for the RAG approach
- Source: ChEBI ontology downloaded from https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz.
- Location: `data/ncbigene/`
- Files:
  - `names2ncbigene_bigg_organisms_protein-coding.lzma`: Mapping from names to NCBI gene IDs; only includes protein-coding genes from the 18 species covered in BiGG models, for file-size reasons
  - `ncbigene2label_bigg_organisms_protein-coding.lzma`: Mapping from NCBI gene IDs to labels (primary name)
  - `ncbigene2names_tax{tax_id}_protein-coding.lzma`: NCBI gene synonyms for the given tax_id, used for the RAG approach
- Source: Data are obtained from the NCBI gene FTP site: https://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/.
- Location: `data/uniprot/`
- Files:
  - `names2uniprot_human+mouse.lzma`: Mapping from synonyms to UniProt IDs; only includes human and mouse proteins for now
  - `uniprot2label_human+mouse.lzma`: Mapping from UniProt IDs to labels (primary name)
  - `uniprot2names_tax{tax_id}.lzma`: UniProt synonyms for the given tax_id, used for the RAG approach
- Source: Data are obtained from the UniProt site: https://www.uniprot.org/help/downloads (Reviewed (Swiss-Prot) xml).
- Location: `data/kegg/`
- Files:
  - `chebi_to_kegg_map.lzma`: Mapping from ChEBI IDs to KEGG compound IDs
  - `parsed_kegg_reactions.lzma`: Dict of KEGG reactions and their attributes
- Source: Data are obtained from the KEGG site: https://rest.kegg.jp.
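The `.lzma` dictionaries above could be loaded with a helper like the one below. This assumes the files are LZMA-compressed pickles, which is a guess from the file extensions; check AAAIM's own loading code before relying on this convention:

```python
import lzma
import pickle

def load_dict(path):
    """Load an LZMA-compressed pickled dictionary (assumed format)."""
    with lzma.open(path, "rb") as f:
        return pickle.load(f)

# e.g. chebi2label = load_dict("data/chebi/chebi2label.lzma")
```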
```
aaaim/
├── core/
│   ├── __init__.py              # Main interface exports
│   ├── annotation_workflow.py   # Annotation workflow (models without annotations)
│   ├── curation_workflow.py     # Curation workflow (models with annotations)
│   ├── model_info.py            # Model parsing and context
│   ├── llm_interface.py         # LLM interaction
│   ├── database_search.py       # Database search functions
│   └── update_model.py          # Put annotations into the model
├── utils/
│   ├── constants.py
│   └── evaluation.py            # Functions for evaluation
├── examples/
│   └── simple_example.py        # Simple usage demo
├── data/
│   ├── chebi/                   # ChEBI compressed dictionaries
│   ├── ncbigene/                # NCBI Gene compressed dictionaries
│   ├── uniprot/                 # UniProt compressed dictionaries
│   └── chroma_storage/          # Database embeddings for RAG
└── tests/
    ├── test_models              # Test models
    └── aaaim_evaluation.ipynb   # Evaluation notebook
```
- Multi-Database Support: GO, Rhea, mapping between ontologies
- Improve RAG for NCBI Gene: Test other embedding models for genes
- Web Interface: User-friendly annotation tool