This repository provides an implementation of antibody affinity prediction using a LLaMA3 transformer backbone with 5-fold cross-validation. The model is trained and evaluated on antibody sequence data from the Observed Antibody Space (OAS), obtained via the AntiFormer GitHub repository, and benchmarked with multiple metrics: accuracy, precision, recall, F1-score, ROC-AUC, and training time.
- 5-fold stratified cross-validation for robust evaluation.
- Model built using keras-hub LLaMA3Backbone.
- Metrics: Accuracy, Precision, Recall, F1, ROC-AUC.
- Visualizations: Confusion Matrix, ROC Curve, Loss/Accuracy plots.
- Model saving and checkpointing for each fold.
- Python 3.10+
- TensorFlow 2.15+
- keras-hub, keras-nlp
- HuggingFace datasets
- RDKit
- scikit-learn, matplotlib, seaborn
```
AntiFormer/
├── data/                 # Dataset & processing scripts
│   ├── cdr1H.txt
│   ├── cdr1L.txt
│   ├── cdr2H.txt
│   ├── cdr2L.txt
│   ├── cdr3H.txt.zip
│   ├── cdr3L.txt
│   ├── manifest_230324.csv
│   ├── data_download.py
│   ├── data_process.py
│   ├── dataset_making.py
│   ├── dt_rebuild.py
│   └── protbert/         # Tokenizer files for protein language model
│
├── subdt/                # HuggingFace preprocessed dataset
│   ├── data-00000-of-00001.arrow
│   ├── dataset_info.json
│   └── state.json
```
Clone the AntiFormer and LlamaAffinity repositories and install dependencies:
```bash
git clone https://github.com/QSong-github/AntiFormer.git
cd AntiFormer
pip install -q rdkit datasets keras-nlp keras-hub tensorflow matplotlib seaborn scikit-learn
```

The dataset is stored in HuggingFace format:
```python
from datasets import load_from_disk

dataset = load_from_disk('subdt')
```

Stratified K-Fold splitting ensures balanced label distribution:
```python
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold

model = get_model()
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer=tf.keras.optimizers.Adam(0.0001),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)
history = model.fit(ds_train, epochs=10, validation_data=ds_test)
```

- Confusion Matrix
- ROC Curve & AUC
- Accuracy & Loss trends
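The stratified 5-fold procedure described above can be sketched in isolation with scikit-learn alone; the `labels` array below is synthetic stand-in data, not the OAS labels:

```python
# Minimal sketch of a 5-fold stratified split; `labels` is a synthetic
# balanced binary array used only to illustrate the mechanics.
import numpy as np
from sklearn.model_selection import StratifiedKFold

labels = np.array([0, 1] * 50)          # hypothetical binary affinity labels
indices = np.arange(len(labels))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(indices, labels)):
    # Stratification preserves the overall label ratio in every partition.
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}, "
          f"test positive fraction={labels[test_idx].mean():.2f}")
```

In the actual pipeline, each `(train_idx, test_idx)` pair would index into the HuggingFace dataset to build `ds_train` and `ds_test` before compiling and fitting a fresh model per fold.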
Weights for each fold are saved to:
`saved_models/model_fold_{fold}.weights.h5`
- Cross-validation results of the LlamaAffinity model across five folds:
| Fold | Accuracy | F1 Score | Precision | Recall | ROC AUC | Training (Minutes) |
|---|---|---|---|---|---|---|
| 0 | 0.9550 | 0.9548 | 0.9694 | 0.9406 | 0.9886 | 5.9600 |
| 1 | 0.9525 | 0.9533 | 0.9510 | 0.9557 | 0.9913 | 4.9559 |
| 2 | 0.9675 | 0.9674 | 0.9847 | 0.9507 | 0.9961 | 5.9042 |
| 3 | 0.9725 | 0.9728 | 0.9752 | 0.9704 | 0.9969 | 5.4559 |
| 4 | 0.9725 | 0.9730 | 0.9706 | 0.9754 | 0.9951 | 5.2030 |
| Average | 0.9640 | 0.9643 | 0.9702 | 0.9586 | 0.9936 | 27.4790 (total) |
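Per-fold scores like those above can be computed with `sklearn.metrics`; the arrays below are toy values for illustration, not model outputs:

```python
# Sketch of the metric computation; y_true / y_prob are illustrative toy data.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.9, 0.35, 0.3, 0.7, 0.2])  # P(class = 1)
y_pred = (y_prob >= 0.5).astype(int)  # threshold probabilities at 0.5

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("roc_auc  :", roc_auc_score(y_true, y_prob))  # AUC uses probabilities
```

Note that ROC-AUC is computed from the predicted probabilities, while the thresholded predictions feed the remaining metrics.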
- Comparison of antibody affinity prediction models across performance metrics and training time:
| Model | Accuracy | F1 Score | Precision | Recall | ROC AUC | Training (hours) |
|---|---|---|---|---|---|---|
| Transformer-6L | 0.7865 | 0.7590 | 0.8060 | 0.7990 | 0.7930 | 0.38 |
| Transformer-12L | 0.8011 | 0.7890 | 0.8310 | 0.8180 | 0.8290 | 0.63 |
| AntiBERTy | 0.8321 | 0.8510 | 0.9110 | 0.8910 | 0.9400 | 1.46 |
| AntiBERTa | 0.8796 | 0.8570 | 0.9080 | 0.9090 | 0.9340 | 2.97 |
| AntiFormer | 0.9169 | 0.8820 | 0.9630 | 0.9250 | 0.9660 | 0.76 |
| LlamaAffinity | 0.9640 | 0.9643 | 0.9702 | 0.9586 | 0.9936 | 0.46 |
- ✅ Confusion Matrix
- ✅ ROC Curve
- ✅ Loss & Accuracy plots
If you use this repository in your research, please cite:
```bibtex
@misc{<fillup>,
  author    = {Md Delower Hossain and Jake Chen},
  title     = {LlamaAffinity: a compact sequence-only model for antibody binder propensity},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://github.com/aimed-lab/LlamaAffinity/blob/main/LlamaAffinity.ipynb}
}
```

- QSong-github/AntiFormer for the original repo.
- HuggingFace `datasets` for easy dataset handling.
- keras-hub team for the LLaMA3 backbone implementation.
- Meta AI for Llama 3.