AI-powered T-cell receptor (TCR) pathology classification using ESM2 protein language models for identifying disease-reactive TCRs in pharmaceutical research and development.
This project leverages state-of-the-art protein language models (ESM2) to classify TCR sequences by their associated pathologies, enabling rapid identification of disease-reactive TCRs. The system combines deep learning on CDR3 sequences with gene-specific encoders to achieve high-accuracy pathology prediction.
Key Features:
- Protein Language Models: Fine-tuned ESM2 (650M parameters) for TCR sequence understanding
- Multi-task Learning: Binary classification and multi-class pathology prediction
- Comprehensive Pipeline: Complete workflow from EDA to deployment
- Web Application: Flask-based inference interface for easy prediction
- High Performance: State-of-the-art results on McPAS-TCR benchmark
We developed and compared three model architectures:
Model 1: Binary Classifier
File: TCR_model_modeling - Cat binary classification.ipynb
- Architecture: ESM2 embeddings + MLP classifier
- Task: Binary classification (disease-reactive vs non-reactive)
- Features:
  - TRA/TRB CDR3 sequences (ESM2 embeddings: 2560D)
  - TRAV/TRBV gene segments (learnable embeddings: 64D)
  - MLP fusion classifier (512 hidden units)
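As a quick check on the sizes listed above, the width of the classifier input follows from concatenating the two feature groups. The per-chain size of 1280 is the hidden size of the 650M-parameter ESM2 checkpoint and the 32D per-gene size is taken from the architecture details elsewhere in this README. The first-layer parameter count assumes a single fully connected layer with bias, which is an assumption and not something confirmed by the notebooks.

```latex
\begin{aligned}
d_{\text{seq}}  &= 2 \times 1280 = 2560 \\
d_{\text{gene}} &= 2 \times 32 = 64 \\
d_{\text{in}}   &= d_{\text{seq}} + d_{\text{gene}} = 2624 \\
\#\text{params}_{\text{layer 1}} &= d_{\text{in}} \times 512 + 512 = 1{,}344{,}000
\end{aligned}
```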
Model 2: Multi-class (Frozen ESM2)
File: TCR_model_modeling - Patholoty classification.ipynb
- Architecture: Frozen ESM2 + MLP classifier
- Task: Multi-class pathology prediction
- Features:
  - Pre-trained ESM2 embeddings (frozen)
  - Gene-specific encoders
  - Multi-layer perceptron with dropout
Model 3: Fine-tuned ESM2 (Best)
File: TCR_model_modeling - Patholoty classification_finetuneESM.ipynb
- Architecture: Fine-tuned ESM2 + Gene Encoder + MLP
- Task: Multi-class pathology prediction
- Innovation:
  - End-to-end fine-tuning of ESM2 on TCR-pathology data
  - Adaptive learning: ESM2 adapts to TCR-specific patterns
  - Joint optimization: Sequence and gene embeddings learned together
- Performance: Highest accuracy among all three models
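Learnable gene embeddings need each TRAV/TRBV name mapped to an integer index first. The sketch below shows one way that mapping could be built and serialized. The `<UNK>` token, the function names, and the JSON layout are hypothetical and are not taken from the project's `encoders_*.json` files.

```python
import json

UNK = "<UNK>"  # hypothetical placeholder for genes not seen during training


def build_gene_vocab(genes):
    """Map each distinct gene name to an integer index; index 0 is reserved for UNK."""
    vocab = {UNK: 0}
    for gene in sorted(set(genes)):
        vocab[gene] = len(vocab)
    return vocab


def encode_gene(vocab, gene):
    """Look up a gene's index, falling back to the UNK index for unseen names."""
    return vocab.get(gene, vocab[UNK])


# Build a vocabulary from a toy list of training genes and serialize it.
trav_vocab = build_gene_vocab(["TRAV1-2", "TRAV12-1", "TRAV1-2"])
serialized = json.dumps({"trav": trav_vocab}, indent=2)
print(serialized)
```

The resulting indices could then feed an embedding table such as `torch.nn.Embedding(len(trav_vocab), 32)`, one table per gene type.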
Model Components:
Input: TRA_CDR3, TRB_CDR3, TRAV, TRBV
│
├── SequenceEncoder (Fine-tuned ESM2)
│   ├── TRA CDR3 → 1280D embedding
│   ├── TRB CDR3 → 1280D embedding
│   └── Concatenate: 2560D
│
├── GeneEncoder (Learnable Embeddings)
│   ├── TRAV → 32D
│   ├── TRBV → 32D
│   └── Concatenate: 64D
│
└── Fusion Classifier (MLP)
    └── 2624D → 512D → Pathology Classes

McPAS-TCR Database: A manually curated database of pathology-associated TCR sequences
- Database: McPAS-TCR
- Citation: Tickotsky et al., 2017
- Sequences: Paired α/β TCR sequences with pathology annotations
- Processing: See data/README.md for details
- Exploratory Data Analysis: TCR_model_EDA.ipynb
  - Data quality assessment
  - Distribution analysis
  - Feature exploration
- Feature Engineering: TCR_model_Feature-engineering.ipynb
  - Sequence validation
  - Label encoding
  - Train/validation/test split (70/15/15)
  - Data cleaning and preprocessing
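The validation and splitting steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: the length bounds, the function names, and the seed are assumptions, and sequences are expected in uppercase one-letter amino acid code.

```python
import random
from collections import defaultdict

# The 20 standard amino acids; any other character marks a sequence as invalid here.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid_cdr3(seq, min_len=5, max_len=30):
    """Return True if seq looks like a CDR3 string (length bounds are assumed)."""
    return (
        isinstance(seq, str)
        and min_len <= len(seq) <= max_len
        and set(seq) <= AMINO_ACIDS
    )


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split row indices into train/val/test, splitting each label group separately."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy usage: drop invalid sequences, then split what remains by label.
rows = [("CASSLAPGATNEKLFF", "pos"), ("CAASRGGSYIPTF", "neg"), ("CASS*", "neg")]
clean = [(seq, label) for seq, label in rows if is_valid_cdr3(seq)]
print(len(clean), "of", len(rows), "rows kept")
```

Because counts are rounded per label, very rare pathologies can end up with an empty validation or test share, so a minimum class size filter may be worth applying first.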
TCR-Pathology-Classifier/
├── README.md  # This file
├── LICENSE  # MIT License
├── CITATION.md  # Citation information
├── requirements.txt  # Python dependencies
│
├── TCR_model_EDA.ipynb  # Exploratory data analysis
├── TCR_model_Feature-engineering.ipynb  # Feature engineering
├── TCR_model_modeling - Cat binary classification.ipynb  # Model 1
├── TCR_model_modeling - Patholoty classification.ipynb  # Model 2
├── TCR_model_modeling - Patholoty classification_finetuneESM.ipynb  # Model 3 (Best)
│
├── webapp/  # Flask web application
│   ├── app.py  # Main Flask app
│   ├── templates/  # HTML templates
│   ├── static/  # CSS/JS/images
│   ├── config.py  # Configuration
│   ├── requirements.txt  # Web app dependencies
│   ├── README.md  # Web app documentation
│   └── QUICK_START.md  # Quick start guide
│
├── data/
│   ├── README.md  # Data documentation
│   └── sample_data.csv  # Sample dataset for testing
│
├── models/  # Model artifacts
│   ├── README.md  # Model documentation
│   ├── Pathology_classification/  # Model 2 artifacts
│   │   ├── encoders_*.json  # Label encoders
│   │   └── model_summary_*.txt  # Model metadata
│   └── Patholoty_classification_finetune_ESM2/  # Model 3 artifacts (Best)
│       ├── encoders_*.json  # Label encoders
│       └── model_summary_*.txt  # Model metadata
│       # Note: .pt files (2.5GB) hosted on Hugging Face
│
└── src/  # Utility scripts
    └── inference.py  # Inference API
# Clone repository
git clone https://github.com/sys0507/TCR-Pathology-Classifier.git
cd TCR-Pathology-Classifier
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

Models are hosted on Hugging Face (2.5GB each):
# Install Hugging Face CLI
pip install huggingface_hub
# Download all model files
huggingface-cli download sys0507/tcr-pathology-classifier --local-dir ./models

Or download in Python:
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="sys0507/tcr-pathology-classifier",
    filename="tcr_pathology_classifier_finetune_best_20251003_213236.pt"
)

Direct link: View on Hugging Face (https://huggingface.co/sys0507/tcr-pathology-classifier)
cd webapp
python app.py

Then open http://localhost:5000 in a browser.
Features:
- Input TCR sequences (TRA/TRB CDR3 + genes)
- Get instant pathology predictions
- View confidence scores and probability distributions
- Interactive visualizations
See webapp/README.md for detailed instructions.
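The confidence scores and probability distributions mentioned above are typically derived from the classifier's raw outputs with a softmax. The sketch below illustrates that step in plain Python. The `summarize` helper and the `probabilities` key are hypothetical; only `predicted_class` and `confidence` appear as result keys in this README.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(logits, class_names):
    """Build a result dict with the top class, its probability, and all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "predicted_class": class_names[best],
        "confidence": probs[best],
        "probabilities": dict(zip(class_names, probs)),
    }


# Toy usage with made-up scores for three placeholder classes.
result = summarize([2.0, 0.5, 0.1], ["class_a", "class_b", "class_c"])
print(f"Predicted: {result['predicted_class']}")
print(f"Confidence: {result['confidence']:.2%}")
```

Subtracting the largest score before exponentiating leaves the probabilities unchanged but avoids overflow for large scores.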
from src.inference import TCRPathologyPredictor

# Initialize predictor
predictor = TCRPathologyPredictor(
    model_path='models/Patholoty_classification_finetune_ESM2/tcr_pathology_classifier_finetune_best_20251003_213236.pt'
)

# Make prediction
result = predictor.predict(
    tra_cdr3='CAASRGGSYIPTF',
    trb_cdr3='CASSLAPGATNEKLFF',
    trav='TRAV1-2',
    trbv='TRBV27'
)
print(f"Predicted: {result['predicted_class']}")
print(f"Confidence: {result['confidence']:.2%}")

To load the model and tokenizer directly:

import torch
from transformers import AutoTokenizer

# Load model and tokenizer (map_location lets the checkpoint load without a GPU)
model = torch.load('path/to/model.pt', map_location='cpu')
tokenizer = AutoTokenizer.from_pretrained('facebook/esm2_t33_650M_UR50D')
# Prepare input
# [See notebooks for complete examples]

| Model | Architecture | Accuracy | F1-Score | Notes |
|---|---|---|---|---|
| Binary Classifier | ESM2 + MLP | XX.X% | X.XXX | Disease vs non-disease |
| Multi-class (Frozen) | ESM2 (frozen) + MLP | XX.X% | X.XXX | Multiple pathologies |
| Fine-tuned ESM2 | ESM2 (fine-tuned) + MLP | XX.X% | X.XXX | Best performance |
See individual notebooks for detailed performance analysis and confusion matrices
Key Findings:
- Fine-tuning ESM2 significantly improves performance vs frozen embeddings
- Gene information (TRAV/TRBV) provides complementary signal to CDR3 sequences
- Model generalizes well to unseen TCR sequences
- Python 3.8+
- PyTorch 2.0+
- Transformers 4.35+ (for ESM2)
- 8GB+ RAM (16GB recommended)
- GPU with 6GB+ VRAM (optional, for training/inference)
- ESM2 Model: facebook/esm2_t33_650M_UR50D (650M parameters)
- Embedding Dimension: 1280 per chain (2560 total)
- Gene Embedding: 32D per gene (64 total)
- Hidden Layer: 512 units
- Dropout: 0.2
- Optimizer: AdamW
- Learning Rate: 5e-5 (ESM2), 1e-3 (MLP)
- Hardware: NVIDIA GPU (CUDA)
- Batch Size: 8
- Epochs: 10-20 with early stopping
- Loss Function: Cross-entropy (with class weights)
- Validation: Stratified K-fold
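The training setup above lists cross-entropy with class weights without saying how the weights are chosen. One common choice is inverse-frequency ("balanced") weighting, sketched below. Whether the notebooks use this exact formula is an assumption, and the function name and toy labels are illustrative only.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Toy label list with a 6:3:1 imbalance across three placeholder classes.
train_labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
weights = inverse_frequency_weights(train_labels)
for cls, weight in weights.items():
    print(f"{cls}: {weight:.3f}")
```

Rarer classes receive larger weights. In PyTorch the values could be passed, in class-index order, as the `weight` tensor of `torch.nn.CrossEntropyLoss`.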
- Web App: See webapp/README.md for deployment and usage
- Data Processing: See data/README.md for dataset details
- Models: See models/README.md for architecture and download links
- Notebooks: Each notebook contains detailed markdown documentation
Contributions are welcome! Please feel free to submit a Pull Request.
Areas for contribution:
- Additional model architectures
- Performance improvements
- Extended datasets
- Bug fixes and optimizations
- Documentation enhancements
This project is licensed under the MIT License - see the LICENSE file for details.
Note: Model weights are available under the same MIT License.
If you use this work in your research, please cite:
@software{tcr_pathology_classifier2024,
  title = {TCR Pathology Classifier: Disease-Reactive TCR Discovery using ESM2},
  author = {[Your Name]},
  year = {2024},
  url = {https://github.com/sys0507/TCR-Pathology-Classifier},
  note = {GitHub repository}
}

See CITATION.md for more citation formats.
For questions, issues, or collaborations:
- GitHub Issues: Report here
- McPAS-TCR Database: http://friedmanlab.weizmann.ac.il/McPAS-TCR/
- ESM2 Paper: Language models of protein sequences at the scale of evolution enable accurate structure prediction (Lin et al., 2022)
- Hugging Face Models: https://huggingface.co/sys0507/tcr-pathology-classifier
- ESM2: Meta AI Research for the protein language model
- McPAS-TCR: Tickotsky et al. for the curated TCR database
- Transformers: Hugging Face for the excellent library
- PyTorch: For the deep learning framework
Built with ❤️ for advancing TCR discovery in pharmaceutical research
Last updated: October 2024