TCR Pathology Classifier - Disease-Reactive TCR Discovery

AI-powered T-cell receptor (TCR) pathology classification using ESM2 protein language models for identifying disease-reactive TCRs in pharmaceutical research and development.

🎯 Overview

This project leverages state-of-the-art protein language models (ESM2) to classify TCR sequences by their associated pathologies, enabling rapid identification of disease-reactive TCRs. The system combines deep learning on CDR3 sequences with gene-specific encoders to achieve high-accuracy pathology prediction.

Key Features:

🧬 Protein Language Models: Fine-tuned ESM2 (650M parameters) for TCR sequence understanding
🎯 Multi-task Learning: Binary classification and multi-class pathology prediction
📊 Comprehensive Pipeline: Complete workflow from EDA to deployment
🌐 Web Application: Flask-based inference interface for easy prediction
📈 High Performance: Fine-tuned ESM2 gives the best results of the three models on the McPAS-TCR data

🏗️ Model Architectures

We developed and compared three model architectures:

1. Binary Classification Model

File: TCR_model_modeling - Cat binary classification.ipynb

Architecture: ESM2 embeddings + MLP classifier
Task: Binary classification (disease-reactive vs non-reactive)
Features:
- TRA/TRB CDR3 sequences (ESM2 embeddings: 2560D)
- TRAV/TRBV gene segments (learnable embeddings: 64D)
- MLP fusion classifier (512 hidden units)

2. Multi-class Pathology Classification

File: TCR_model_modeling - Patholoty classification.ipynb

Architecture: Frozen ESM2 + MLP classifier
Task: Multi-class pathology prediction
Features:
- Pre-trained ESM2 embeddings (frozen)
- Gene-specific encoders
- Multi-layer perceptron with dropout

3. Fine-tuned ESM2 Classification (Best Model) ⭐

File: TCR_model_modeling - Patholoty classification_finetuneESM.ipynb

Architecture: Fine-tuned ESM2 + Gene Encoder + MLP
Task: Multi-class pathology prediction with best performance
Innovation:
- End-to-end fine-tuning of ESM2 on TCR-pathology data
- Adaptive learning: ESM2 adapts to TCR-specific patterns
- Joint optimization: Sequence and gene embeddings learned together
Performance: Highest accuracy among all three models

Model Components:

Input: TRA_CDR3, TRB_CDR3, TRAV, TRBV
  │
  ├─→ SequenceEncoder (Fine-tuned ESM2)
  │   ├─→ TRA CDR3 → 1280D embedding
  │   ├─→ TRB CDR3 → 1280D embedding
  │   └─→ Concatenate: 2560D
  │
  ├─→ GeneEncoder (Learnable Embeddings)
  │   ├─→ TRAV → 32D
  │   ├─→ TRBV → 32D
  │   └─→ Concatenate: 64D
  │
  └─→ Fusion Classifier (MLP)
      └─→ 2624D → 512D → Pathology Classes
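The fusion step in the diagram above can be sketched in PyTorch as follows. This is a minimal illustration, not the code from the notebooks: the class name `FusionClassifier`, the ReLU activation, and the vocabulary and class counts are assumptions, and the CDR3 embeddings are passed in as precomputed 1280D vectors so the example runs without loading ESM2.

```python
import torch
import torch.nn as nn


class FusionClassifier(nn.Module):
    """Gene embeddings + precomputed CDR3 embeddings -> pathology logits."""

    def __init__(self, n_trav, n_trbv, n_classes,
                 seq_dim=1280, gene_dim=32, hidden=512, dropout=0.2):
        super().__init__()
        # One learnable 32D vector per V-gene name.
        self.trav_emb = nn.Embedding(n_trav, gene_dim)
        self.trbv_emb = nn.Embedding(n_trbv, gene_dim)
        fused_dim = 2 * seq_dim + 2 * gene_dim  # 2560 + 64 = 2624
        self.mlp = nn.Sequential(
            nn.Linear(fused_dim, hidden),
            nn.ReLU(),            # activation is an assumption
            nn.Dropout(dropout),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, tra_vec, trb_vec, trav_idx, trbv_idx):
        # Concatenate sequence and gene features along the feature axis.
        fused = torch.cat(
            [tra_vec, trb_vec, self.trav_emb(trav_idx), self.trbv_emb(trbv_idx)],
            dim=-1,
        )
        return self.mlp(fused)


# Vocabulary sizes and class count below are made up for the example.
model = FusionClassifier(n_trav=50, n_trbv=60, n_classes=10)
logits = model(torch.randn(4, 1280), torch.randn(4, 1280),
               torch.tensor([0, 1, 2, 3]), torch.tensor([3, 2, 1, 0]))
print(logits.shape)  # torch.Size([4, 10])
```

In the fine-tuned variant, the two 1280D inputs would come from the ESM2 encoder inside the same module, so gradients flow through both the encoder and the classifier.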

📊 Dataset

Source

McPAS-TCR Database: A manually curated database of pathology-associated TCR sequences

Database: McPAS-TCR
Citation: Tickotsky et al., 2017
Sequences: Paired α/β TCR sequences with pathology annotations
Processing: See data/README.md for details

Data Processing Pipeline

Exploratory Data Analysis: TCR_model_EDA.ipynb
- Data quality assessment
- Distribution analysis
- Feature exploration
Feature Engineering: TCR_model_Feature-engineering.ipynb
- Sequence validation
- Label encoding
- Train/validation/test split (70/15/15)
- Data cleaning and preprocessing
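As a rough illustration of the sequence validation and the 70/15/15 split listed above, here is a standard-library sketch. The helper names `is_valid_cdr3` and `stratified_split`, the validation rule (20 standard amino acids only), and the toy labels are assumptions; the notebook may apply different checks or use a library splitter.

```python
import random
from collections import defaultdict

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid_cdr3(seq):
    """True if seq is non-empty and uses only the 20 standard amino acids."""
    return bool(seq) and set(seq.upper()) <= AMINO_ACIDS


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


labels = ["melanoma"] * 20 + ["influenza"] * 40  # toy labels
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 42 9 9
print(is_valid_cdr3("CASSLAPGATNEKLFF"), is_valid_cdr3("CASSX1"))  # True False
```

Splitting per class keeps rare pathologies represented in all three subsets, which matters when class sizes are very uneven.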

📁 Repository Structure

TCR-Pathology-Classifier/
├── README.md                                      # This file
├── LICENSE                                        # MIT License
├── CITATION.md                                    # Citation information
├── requirements.txt                               # Python dependencies
│
├── TCR_model_EDA.ipynb                           # Exploratory data analysis
├── TCR_model_Feature-engineering.ipynb           # Feature engineering
├── TCR_model_modeling - Cat binary classification.ipynb        # Model 1
├── TCR_model_modeling - Patholoty classification.ipynb         # Model 2
├── TCR_model_modeling - Patholoty classification_finetuneESM.ipynb  # Model 3 (Best)
│
├── webapp/                                        # Flask web application
│   ├── app.py                                    # Main Flask app
│   ├── templates/                                # HTML templates
│   ├── static/                                   # CSS/JS/images
│   ├── config.py                                 # Configuration
│   ├── requirements.txt                          # Web app dependencies
│   ├── README.md                                 # Web app documentation
│   └── QUICK_START.md                           # Quick start guide
│
├── data/
│   ├── README.md                                 # Data documentation
│   └── sample_data.csv                           # Sample dataset for testing
│
├── models/                                        # Model artifacts
│   ├── README.md                                 # Model documentation
│   ├── Pathology_classification/                 # Model 2 artifacts
│   │   ├── encoders_*.json                      # Label encoders
│   │   └── model_summary_*.txt                  # Model metadata
│   └── Patholoty_classification_finetune_ESM2/  # Model 3 artifacts (Best)
│       ├── encoders_*.json                      # Label encoders
│       └── model_summary_*.txt                  # Model metadata
│       # Note: .pt files (2.5GB) hosted on Hugging Face
│
└── src/                                           # Utility scripts
    └── inference.py                              # Inference API

🚀 Quick Start

1. Installation

# Clone repository
git clone https://github.com/sys0507/TCR-Pathology-Classifier.git
cd TCR-Pathology-Classifier

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Download Models

Models are hosted on Hugging Face (2.5GB each):

# Install Hugging Face CLI
pip install huggingface_hub

# Download all model files
huggingface-cli download sys0507/tcr-pathology-classifier --local-dir ./models

Or download in Python:

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="sys0507/tcr-pathology-classifier",
    filename="tcr_pathology_classifier_finetune_best_20251003_213236.pt"
)

Direct link: 🤗 https://huggingface.co/sys0507/tcr-pathology-classifier

3. Run Web Application

cd webapp
python app.py

Then open http://localhost:5000 in a browser.

Features:

Input TCR sequences (TRA/TRB CDR3 + genes)
Get instant pathology predictions
View confidence scores and probability distributions
Interactive visualizations

See webapp/README.md for detailed instructions.

🔬 Usage Examples

Python API (using `src/inference.py`)

from src.inference import TCRPathologyPredictor

# Initialize predictor
predictor = TCRPathologyPredictor(
    model_path='models/Patholoty_classification_finetune_ESM2/tcr_pathology_classifier_finetune_best_20251003_213236.pt'
)

# Make prediction
result = predictor.predict(
    tra_cdr3='CAASRGGSYIPTF',
    trb_cdr3='CASSLAPGATNEKLFF',
    trav='TRAV1-2',
    trbv='TRBV27'
)

print(f"Predicted: {result['predicted_class']}")
print(f"Confidence: {result['confidence']:.2%}")
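For readers wondering how a `predicted_class` and `confidence` pair can be derived from raw model outputs, the sketch below applies a softmax and takes the top class. It is an assumption about how such a result could be built, not a copy of `src/inference.py`; the `summarize` helper, the `probabilities` key, the class names, and the logits are all invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(logits, class_names):
    """Build a result dict with the top class, its probability, and all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "predicted_class": class_names[best],
        "confidence": probs[best],
        "probabilities": dict(zip(class_names, probs)),
    }


# Class names and logits are invented for illustration.
result = summarize([2.0, 0.5, -1.0], ["Influenza", "Melanoma", "Celiac disease"])
print(f"Predicted: {result['predicted_class']}")
print(f"Confidence: {result['confidence']:.2%}")
```

A confidence computed this way is the model's relative preference among the known classes, so a high value does not by itself mean the TCR belongs to any of them.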

Jupyter Notebook

import torch
from transformers import AutoTokenizer

# Load model and tokenizer (map_location lets a GPU-saved checkpoint load on CPU-only machines)
model = torch.load('path/to/model.pt', map_location='cpu')
tokenizer = AutoTokenizer.from_pretrained('facebook/esm2_t33_650M_UR50D')

# Prepare input
# [See notebooks for complete examples]
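One step the snippet above leaves to the notebooks is turning ESM2's per-token output into a single 1280D vector per CDR3. A common way to do this is masked mean pooling, sketched below with a random tensor standing in for the ESM2 hidden states. Whether the notebooks use mean pooling, the first token, or something else is not stated here, so treat the pooling choice and the function name `masked_mean_pool` as assumptions.

```python
import torch


def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: (batch, tokens, dim); attention_mask: (batch, tokens) of 0/1.
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


# Random stand-in for ESM2 output: 2 sequences, 6 tokens, 1280 dims.
hidden = torch.randn(2, 6, 1280)
mask = torch.tensor([[1, 1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 1280])
```

With the real model, `hidden_states` would be the `last_hidden_state` returned by the Transformers ESM2 model and `attention_mask` would come from the tokenizer output when batching with `padding=True`. Note that this mask still includes the special start and end tokens.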

📈 Performance Metrics

Model	Architecture	Accuracy	F1-Score	Notes
Binary Classifier	ESM2 + MLP	XX.X%	X.XXX	Disease vs non-disease
Multi-class (Frozen)	ESM2 (frozen) + MLP	XX.X%	X.XXX	Multiple pathologies
Fine-tuned ESM2 ⭐	ESM2 (fine-tuned) + MLP	XX.X%	X.XXX	Best performance

See the individual notebooks for detailed performance analysis and confusion matrices.

Key Findings:

Fine-tuning ESM2 significantly improves performance vs frozen embeddings
Gene information (TRAV/TRBV) provides complementary signal to CDR3 sequences
Model generalizes well to unseen TCR sequences

🛠️ Technical Details

Requirements

Python 3.8+
PyTorch 2.0+
Transformers 4.35+ (for ESM2)
8GB+ RAM (16GB recommended)
GPU with 6GB+ VRAM (optional, for training/inference)

Model Specifications

ESM2 Model: facebook/esm2_t33_650M_UR50D (650M parameters)
Embedding Dimension: 1280 per chain (2560 total)
Gene Embedding: 32D per gene (64 total)
Hidden Layer: 512 units
Dropout: 0.2
Optimizer: AdamW
Learning Rate: 5e-5 (ESM2), 1e-3 (MLP)
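The two learning rates above map naturally onto AdamW parameter groups. The sketch below uses small placeholder modules in place of the real ESM2 encoder and classifier so it runs without downloading any weights; the variable names and the 10-class output are assumptions, and the learning rate used for the gene embeddings is not stated above, so they are left out.

```python
import torch
import torch.nn as nn

# Tiny stand-ins so the snippet runs without downloading ESM2.
encoder = nn.Linear(1280, 1280)  # placeholder for the fine-tuned ESM2 encoder
classifier = nn.Sequential(nn.Linear(2624, 512), nn.ReLU(),
                           nn.Dropout(0.2), nn.Linear(512, 10))

optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 5e-5},     # small steps for pre-trained weights
    {"params": classifier.parameters(), "lr": 1e-3},  # larger steps for the new head
])

for group in optimizer.param_groups:
    print(group["lr"])
```

Keeping the encoder's rate much lower than the head's is a common way to limit how far fine-tuning drifts from the pre-trained representation.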

Training Details

Hardware: NVIDIA GPU (CUDA)
Batch Size: 8
Epochs: 10-20 with early stopping
Loss Function: Cross-entropy (with class weights)
Validation: Stratified K-fold
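Two of the items above, class-weighted loss and early stopping, can be illustrated with a short standard-library sketch. The weighting formula (inverse class frequency), the `EarlyStopping` helper, and the patience value are assumptions chosen for the example; the notebooks may compute weights or stop training differently.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


class EarlyStopping:
    """Signal a stop after `patience` epochs without a lower validation loss."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


weights = balanced_class_weights(["A"] * 80 + ["B"] * 20)
print(weights)  # {'A': 0.625, 'B': 2.5}

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate([0.9, 0.7, 0.72, 0.71, 0.69], start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at)  # 4
```

In PyTorch, weights like these can be passed as a tensor to the `weight` argument of `torch.nn.CrossEntropyLoss`, ordered by class index.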

📚 Documentation

Web App: See webapp/README.md for deployment and usage
Data Processing: See data/README.md for dataset details
Models: See models/README.md for architecture and download links
Notebooks: Each notebook contains detailed markdown documentation

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Areas for contribution:

Additional model architectures
Performance improvements
Extended datasets
Bug fixes and optimizations
Documentation enhancements

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Note: Model weights are available under the same MIT License. Please cite this work if you use it in your research.

🙏 Citation

If you use this work in your research, please cite:

@software{tcr_pathology_classifier2024,
  title = {TCR Pathology Classifier: Disease-Reactive TCR Discovery using ESM2},
  author = {[Your Name]},
  year = {2024},
  url = {https://github.com/sys0507/TCR-Pathology-Classifier},
  note = {GitHub repository}
}

See CITATION.md for more citation formats.

📧 Contact

For questions, issues, or collaborations:

GitHub Issues: https://github.com/sys0507/TCR-Pathology-Classifier/issues
Email: [Your email if you want to share]

🔗 Related Resources

McPAS-TCR Database: http://friedmanlab.weizmann.ac.il/McPAS-TCR/
ESM2 Paper: Lin et al., "Language models of protein sequences at the scale of evolution enable accurate structure prediction" (bioRxiv preprint, 2022)
Hugging Face Models: https://huggingface.co/sys0507/tcr-pathology-classifier

🌟 Acknowledgments

ESM2: Meta AI Research for the protein language model
McPAS-TCR: Tickotsky et al. for the curated TCR database
Transformers: Hugging Face for the excellent library
PyTorch: For the deep learning framework

Built with ❤️ for advancing TCR discovery in pharmaceutical research

Last updated: October 2024
