TCR Pathology Classifier - Disease-Reactive TCR Discovery

Python 3.8+ · PyTorch · License: MIT · Hugging Face

AI-powered T-cell receptor (TCR) pathology classification using ESM2 protein language models for identifying disease-reactive TCRs in pharmaceutical research and development.


🎯 Overview

This project leverages state-of-the-art protein language models (ESM2) to classify TCR sequences by their associated pathologies, enabling rapid identification of disease-reactive TCRs. The system combines deep learning on CDR3 sequences with gene-specific encoders to achieve high-accuracy pathology prediction.

Key Features:

  • 🧬 Protein Language Models: Fine-tuned ESM2 (650M parameters) for TCR sequence understanding
  • 🎯 Multi-task Learning: Binary classification and multi-class pathology prediction
  • 📊 Comprehensive Pipeline: Complete workflow from EDA to deployment
  • 🌐 Web Application: Flask-based inference interface for easy prediction
  • 📈 High Performance: State-of-the-art results on McPAS-TCR benchmark

πŸ—οΈ Model Architectures

We developed and compared three model architectures:

1. Binary Classification Model

File: TCR_model_modeling - Cat binary classification.ipynb

  • Architecture: ESM2 embeddings + MLP classifier
  • Task: Binary classification (disease-reactive vs non-reactive)
  • Features:
    • TRA/TRB CDR3 sequences (ESM2 embeddings: 2560D)
    • TRAV/TRBV gene segments (learnable embeddings: 64D)
    • MLP fusion classifier (512 hidden units)

2. Multi-class Pathology Classification

File: TCR_model_modeling - Patholoty classification.ipynb

  • Architecture: Frozen ESM2 + MLP classifier
  • Task: Multi-class pathology prediction
  • Features:
    • Pre-trained ESM2 embeddings (frozen; see the sketch after this list)
    • Gene-specific encoders
    • Multi-layer perceptron with dropout
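
The sketch below illustrates how frozen ESM2 embeddings can be extracted for a single CDR3 sequence. It is illustrative only: the model name comes from the specifications in this README, while the mean-pooling over residue tokens is an assumption; the notebook defines the actual extraction pipeline.

import torch
from transformers import AutoTokenizer, AutoModel

# Load ESM2 once and keep it frozen (no gradient updates)
tokenizer = AutoTokenizer.from_pretrained('facebook/esm2_t33_650M_UR50D')
esm2 = AutoModel.from_pretrained('facebook/esm2_t33_650M_UR50D')
esm2.eval()

# Embed one CDR3 sequence; mean-pool residue embeddings into a single 1280D vector
# (the pooling strategy is an assumption; see the notebook for the exact implementation)
with torch.no_grad():
    tokens = tokenizer('CASSLAPGATNEKLFF', return_tensors='pt')
    hidden = esm2(**tokens).last_hidden_state   # shape: (1, seq_len, 1280)
    embedding = hidden.mean(dim=1)              # shape: (1, 1280)
print(embedding.shape)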

3. Fine-tuned ESM2 Classification (Best Model) ⭐

File: TCR_model_modeling - Patholoty classification_finetuneESM.ipynb

  • Architecture: Fine-tuned ESM2 + Gene Encoder + MLP
  • Task: Multi-class pathology prediction with best performance
  • Innovation:
    • End-to-end fine-tuning of ESM2 on TCR-pathology data
    • Adaptive learning: ESM2 adapts to TCR-specific patterns
    • Joint optimization: Sequence and gene embeddings learned together
  • Performance: Highest accuracy among all three models

Model Components:

Input: TRA_CDR3, TRB_CDR3, TRAV, TRBV
  │
  ├─→ SequenceEncoder (Fine-tuned ESM2)
  │   ├─→ TRA CDR3 → 1280D embedding
  │   ├─→ TRB CDR3 → 1280D embedding
  │   └─→ Concatenate: 2560D
  │
  ├─→ GeneEncoder (Learnable Embeddings)
  │   ├─→ TRAV → 32D
  │   ├─→ TRBV → 32D
  │   └─→ Concatenate: 64D
  │
  └─→ Fusion Classifier (MLP)
      └─→ 2624D → 512D → Pathology Classes
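
The diagram above maps onto a small fusion module. The following PyTorch sketch is a hedged approximation built from the dimensions listed in this README (2560D sequence features + 64D gene features → 512 hidden units → pathology classes, dropout 0.2); the class and attribute names are placeholders, not the repository's actual code.

import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Illustrative fusion head: ESM2 CDR3 features + learnable V-gene embeddings -> MLP."""
    def __init__(self, num_trav, num_trbv, num_classes, gene_dim=32, hidden=512, dropout=0.2):
        super().__init__()
        # Learnable gene embeddings (32D per gene, 64D concatenated)
        self.trav_emb = nn.Embedding(num_trav, gene_dim)
        self.trbv_emb = nn.Embedding(num_trbv, gene_dim)
        # Fusion MLP: 2560 (TRA + TRB CDR3 embeddings) + 64 (genes) = 2624 -> 512 -> classes
        self.mlp = nn.Sequential(
            nn.Linear(2560 + 2 * gene_dim, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, tra_emb, trb_emb, trav_idx, trbv_idx):
        seq = torch.cat([tra_emb, trb_emb], dim=-1)                                      # (B, 2560)
        genes = torch.cat([self.trav_emb(trav_idx), self.trbv_emb(trbv_idx)], dim=-1)    # (B, 64)
        return self.mlp(torch.cat([seq, genes], dim=-1))                                 # (B, num_classes)

# Example with random tensors standing in for the ESM2 CDR3 embeddings
model = FusionClassifier(num_trav=50, num_trbv=60, num_classes=10)
logits = model(torch.randn(4, 1280), torch.randn(4, 1280),
               torch.randint(0, 50, (4,)), torch.randint(0, 60, (4,)))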

📊 Dataset

Source

McPAS-TCR Database: A manually curated database of pathology-associated TCR sequences

  • Database: McPAS-TCR
  • Citation: Tickotsky et al., 2017
  • Sequences: Paired α/β TCR sequences with pathology annotations
  • Processing: See data/README.md for details

Data Processing Pipeline

  1. Exploratory Data Analysis: TCR_model_EDA.ipynb

    • Data quality assessment
    • Distribution analysis
    • Feature exploration
  2. Feature Engineering: TCR_model_Feature-engineering.ipynb

    • Sequence validation
    • Label encoding
    • Train/validation/test split (70/15/15; see the sketch after this list)
    • Data cleaning and preprocessing
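
As a rough illustration of the 70/15/15 stratified split and label encoding step, the sketch below uses the bundled sample dataset. The 'Pathology' column name is an assumption; the exact processing lives in TCR_model_Feature-engineering.ipynb.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('data/sample_data.csv')            # sample dataset shipped with the repo

# Encode pathology labels as integers ('Pathology' column name is an assumption)
label_encoder = LabelEncoder()
df['label'] = label_encoder.fit_transform(df['Pathology'])

# 70/15/15 stratified split: first carve out 30%, then split that half-and-half
train_df, temp_df = train_test_split(df, test_size=0.30, stratify=df['label'], random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.50, stratify=temp_df['label'], random_state=42)
print(len(train_df), len(val_df), len(test_df))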

πŸ“ Repository Structure

TCR-Pathology-Classifier/
├── README.md                                      # This file
├── LICENSE                                        # MIT License
├── CITATION.md                                    # Citation information
├── requirements.txt                               # Python dependencies
│
├── TCR_model_EDA.ipynb                            # Exploratory data analysis
├── TCR_model_Feature-engineering.ipynb            # Feature engineering
├── TCR_model_modeling - Cat binary classification.ipynb             # Model 1
├── TCR_model_modeling - Patholoty classification.ipynb              # Model 2
├── TCR_model_modeling - Patholoty classification_finetuneESM.ipynb  # Model 3 (Best)
│
├── webapp/                                        # Flask web application
│   ├── app.py                                     # Main Flask app
│   ├── templates/                                 # HTML templates
│   ├── static/                                    # CSS/JS/images
│   ├── config.py                                  # Configuration
│   ├── requirements.txt                           # Web app dependencies
│   ├── README.md                                  # Web app documentation
│   └── QUICK_START.md                             # Quick start guide
│
├── data/
│   ├── README.md                                  # Data documentation
│   └── sample_data.csv                            # Sample dataset for testing
│
├── models/                                        # Model artifacts
│   ├── README.md                                  # Model documentation
│   ├── Pathology_classification/                  # Model 2 artifacts
│   │   ├── encoders_*.json                        # Label encoders
│   │   └── model_summary_*.txt                    # Model metadata
│   └── Patholoty_classification_finetune_ESM2/    # Model 3 artifacts (Best)
│       ├── encoders_*.json                        # Label encoders
│       └── model_summary_*.txt                    # Model metadata
│       # Note: .pt files (2.5GB) hosted on Hugging Face
│
└── src/                                           # Utility scripts
    └── inference.py                               # Inference API

🚀 Quick Start

1. Installation

# Clone repository
git clone https://github.com/sys0507/TCR-Pathology-Classifier.git
cd TCR-Pathology-Classifier

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Download Models

Models are hosted on Hugging Face (2.5GB each):

# Install Hugging Face CLI
pip install huggingface_hub

# Download all model files
huggingface-cli download sys0507/tcr-pathology-classifier --local-dir ./models

Or download in Python:

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="sys0507/tcr-pathology-classifier",
    filename="tcr_pathology_classifier_finetune_best_20251003_213236.pt"
)

Direct link: 🤗 View on Hugging Face

3. Run Web Application

cd webapp
python app.py

Open browser to http://localhost:5000

Features:

  • Input TCR sequences (TRA/TRB CDR3 + genes)
  • Get instant pathology predictions
  • View confidence scores and probability distributions
  • Interactive visualizations

See webapp/README.md for detailed instructions.


🔬 Usage Examples

Python API (using src/inference.py)

from src.inference import TCRPathologyPredictor

# Initialize predictor
predictor = TCRPathologyPredictor(
    model_path='models/Patholoty_classification_finetune_ESM2/tcr_pathology_classifier_finetune_best_20251003_213236.pt'
)

# Make prediction
result = predictor.predict(
    tra_cdr3='CAASRGGSYIPTF',
    trb_cdr3='CASSLAPGATNEKLFF',
    trav='TRAV1-2',
    trbv='TRBV27'
)

print(f"Predicted: {result['predicted_class']}")
print(f"Confidence: {result['confidence']:.2%}")

Jupyter Notebook

import torch
from transformers import AutoTokenizer

# Load the fine-tuned checkpoint and the matching ESM2 tokenizer
model = torch.load('path/to/model.pt', map_location='cpu')
model.eval()
tokenizer = AutoTokenizer.from_pretrained('facebook/esm2_t33_650M_UR50D')

# Tokenize a CDR3 sequence for ESM2
inputs = tokenizer('CASSLAPGATNEKLFF', return_tensors='pt')
# [See notebooks for complete end-to-end examples]

📈 Performance Metrics

Model                  Architecture               Accuracy   F1-Score   Notes
Binary Classifier      ESM2 + MLP                 XX.X%      X.XXX      Disease vs non-disease
Multi-class (Frozen)   ESM2 (frozen) + MLP        XX.X%      X.XXX      Multiple pathologies
Fine-tuned ESM2 ⭐     ESM2 (fine-tuned) + MLP    XX.X%      X.XXX      Best performance

See the individual notebooks for detailed performance analysis and confusion matrices.

Key Findings:

  • Fine-tuning ESM2 significantly improves performance vs frozen embeddings
  • Gene information (TRAV/TRBV) provides complementary signal to CDR3 sequences
  • Model generalizes well to unseen TCR sequences

πŸ› οΈ Technical Details

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • Transformers 4.35+ (for ESM2)
  • 8GB+ RAM (16GB recommended)
  • GPU with 6GB+ VRAM (optional, for training/inference)

Model Specifications

  • ESM2 Model: facebook/esm2_t33_650M_UR50D (650M parameters)
  • Embedding Dimension: 1280 per chain (2560 total)
  • Gene Embedding: 32D per gene (64 total)
  • Hidden Layer: 512 units
  • Dropout: 0.2
  • Optimizer: AdamW
  • Learning Rate: 5e-5 (ESM2), 1e-3 (MLP); see the optimizer sketch after this list
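
One common way to apply the two learning rates above is AdamW with parameter groups. The sketch below is a minimal illustration; the _Demo module and its esm2/mlp attributes are placeholders standing in for the repository's actual model.

import torch
import torch.nn as nn

# Placeholder modules standing in for the fine-tuned ESM2 backbone and the MLP head
class _Demo(nn.Module):
    def __init__(self):
        super().__init__()
        self.esm2 = nn.Linear(8, 8)   # stands in for the ESM2 backbone
        self.mlp = nn.Linear(8, 3)    # stands in for the fusion MLP head

model = _Demo()

# AdamW with per-group learning rates: 5e-5 for ESM2, 1e-3 for the MLP head
optimizer = torch.optim.AdamW([
    {'params': model.esm2.parameters(), 'lr': 5e-5},
    {'params': model.mlp.parameters(),  'lr': 1e-3},
])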

Training Details

  • Hardware: NVIDIA GPU (CUDA)
  • Batch Size: 8
  • Epochs: 10-20 with early stopping
  • Loss Function: Cross-entropy (with class weights; see the loss sketch after this list)
  • Validation: Stratified K-fold
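
A hedged sketch of the class-weighted cross-entropy loss. The inverse-frequency weighting shown here is an assumption for illustration; the notebooks define the exact weights used in training.

import torch
import torch.nn as nn

# Toy label counts per pathology class; real counts come from the training split
class_counts = torch.tensor([500., 120., 30.])

# Inverse-frequency weights so rare pathologies contribute more to the loss
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 3)                # model outputs for a batch of 4
targets = torch.tensor([0, 2, 1, 0])      # ground-truth class indices
loss = criterion(logits, targets)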

📚 Documentation

  • Web App: See webapp/README.md for deployment and usage
  • Data Processing: See data/README.md for dataset details
  • Models: See models/README.md for architecture and download links
  • Notebooks: Each notebook contains detailed markdown documentation

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Areas for contribution:

  • Additional model architectures
  • Performance improvements
  • Extended datasets
  • Bug fixes and optimizations
  • Documentation enhancements

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Note: Model weights are available under the same MIT License. Please cite this work if you use it in your research.


πŸ™ Citation

If you use this work in your research, please cite:

@software{tcr_pathology_classifier2024,
  title = {TCR Pathology Classifier: Disease-Reactive TCR Discovery using ESM2},
  author = {[Your Name]},
  year = {2024},
  url = {https://github.com/sys0507/TCR-Pathology-Classifier},
  note = {GitHub repository}
}

See CITATION.md for more citation formats.


📧 Contact

For questions, issues, or collaborations:

  • GitHub Issues: Report here
  • Email: [Your email if you want to share]

🔗 Related Resources


🌟 Acknowledgments

  • ESM2: Meta AI Research for the protein language model
  • McPAS-TCR: Tickotsky et al. for the curated TCR database
  • Transformers: Hugging Face for the excellent library
  • PyTorch: For the deep learning framework

Built with ❤️ for advancing TCR discovery in pharmaceutical research

Last updated: October 2024