FastPLMs is an open-source initiative dedicated to making protein language models (pLMs) efficient and easy to use. By replacing native, often suboptimal attention implementations with Flash Attention or Flex Attention, we provide high-performance alternatives that are fully compatible with the HuggingFace transformers ecosystem and can be loaded through `AutoModel` with no extra code.
- Introduction
- Documentation
- Supported Models
- Attention Backends
- Embedding & Pooling
- Concrete Examples
- Testing & Benchmarking
- Installation & Docker
Detailed documentation is available in the docs/ folder:
- Architecture Overview - How FastPLMs wraps official models, the attention backend system, Docker layout
- Per-Model Guides - Loading, configuration, and special handling for each model family
- Attention Backends - SDPA, Flash, Flex, Auto: how they work, when to use each, numerical properties
- Embedding & Pooling API - Pooler strategies, `embed_dataset()` parameters, SQLite/pth storage
- Fine-Tuning Guide - LoRA, Trainer patterns, dataset classes, metrics
- Testing & Benchmarking - Docker commands, pytest markers, compliance architecture, throughput benchmarks
- Contributing - Code style, adding new models, required tests
Protein Language Models are transformer-based architectures trained on massive datasets of protein sequences (such as UniProt). These models learn the "grammar" of proteins, capturing evolutionary information, structural constraints, and functional motifs. They are used for:
- Representation Learning: Generating high-dimensional embeddings for downstream tasks (e.g., stability, function prediction).
- Protein Generation: Designing novel sequences with specific properties.
- Structure Prediction: Mapping sequences to their 3D folds (e.g., Boltz2).
FastPLMs provides optimized versions of these models. Our focus is on:
- Speed: Drastically faster inference through optimized attention kernels.
- Memory Efficiency: Lower VRAM usage, enabling larger batch sizes or longer sequences.
- Seamless Integration: Use `AutoModel.from_pretrained(..., trust_remote_code=True)` to load our optimized weights directly from HuggingFace.
We maintain a comprehensive HuggingFace Collection of optimized models. Below is a summary of the supported families and their origins.
| Model Family | Organization | Official Implementation | FastPLMs Optimization | Checkpoints |
|---|---|---|---|---|
| E1 | Profluent Bio | Profluent-Bio/E1 | Flex Attention, Block-Causal | 150M, 300M, 600M |
| ESM2 | Meta AI | facebookresearch/esm | Flash (SDPA) / Flex Attention | 8M, 35M, 150M, 650M, 3B |
| ESM++ | EvolutionaryScale | EvolutionaryScale/esm | Optimized SDPA / Flex | Small (300M), Large (600M) |
| DPLM | ByteDance | bytedance/dplm | Diffusion Optimized Attention | 150M, 650M, 3B |
| DPLM2 | ByteDance | bytedance/dplm | Multimodal Diffusion | 150M, 650M, 3B |
| ANKH | Elnaggar Lab | ElnaggarLab/ankh | T5 RPE via Flex score_mod | Base, Large, ANKH2-L, ANKH3-L, ANKH3-XL |
| ESMFold | Meta AI | facebookresearch/esm | ProteinTTT + Fast ESM2 backbone | Standard |
| Boltz2 | MIT / Various | jwohlwend/boltz | Optimized Structure Prediction | Standard |
All FastPLMs models share a common set of attention backends, controlled via `config.attn_backend`. The default is `"sdpa"`, which is safe on all hardware and numerically equivalent to standard attention.
| Backend | Key | Speed | Numerical Equivalence | Availability |
|---|---|---|---|---|
| PyTorch SDPA | `"sdpa"` | Fast | Exact | Any PyTorch ≥ 2.0 |
| Flash Attention | `"kernels_flash"` | Fastest | Approximate | Requires `pip install kernels` (pre-built) |
| Flex Attention | `"flex"` | Very fast | ~Exact | Requires PyTorch ≥ 2.5 (FA4 backend on Hopper/Blackwell requires ≥ 2.11) |
| Auto | `"auto"` | — | — | Always (selects best available) |
PyTorch's `scaled_dot_product_attention` dispatches to a fused CUDA kernel (cuDNN or efficient attention) that is faster and more memory-efficient than naive attention while being mathematically identical to it. This is the recommended default for reproducibility and general use. It is also the only backend where `output_attentions=True` is handled natively; with other backends, attentions are computed via a separate naive matrix multiplication when requested.
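As a quick sanity check of that equivalence, the snippet below compares the fused kernel against a hand-written naive attention. This is a minimal sketch; the shapes and tolerance are illustrative, not taken from the FastPLMs test suite.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# (batch, heads, seq_len, head_dim)
q = torch.randn(1, 4, 16, 32)
k = torch.randn(1, 4, 16, 32)
v = torch.randn(1, 4, 16, 32)

# Naive attention: softmax(Q K^T / sqrt(d)) V
scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
naive = scores.softmax(dim=-1) @ v

# Fused SDPA kernel: the same computation, dispatched to an optimized backend
fused = F.scaled_dot_product_attention(q, k, v)

# Identical up to floating-point rounding
assert torch.allclose(naive, fused, atol=1e-5)
```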
Flash Attention 2 and 3 are typically the fastest options on Ampere (A100) and Hopper (H100) GPUs, often 2–4× faster than SDPA at long sequence lengths. Flash Attention achieves this by tiling the computation and applying an online softmax, which means results are not bitwise identical to SDPA or naive attention. Differences are on the order of floating-point rounding and are usually inconsequential for standard inference, but they are not guaranteed to be: they can compound across layers, interact with low-precision dtypes (fp16/bf16), or affect sensitive downstream tasks. Flash Attention is standard practice in large model training and the trade-off is well understood, but it should not be treated as a drop-in numerical equivalent of SDPA. If exact reproducibility or numerical sensitivity is a concern, use `"sdpa"` instead.
No compilation required. FastPLMs uses the HuggingFace `kernels` package to load pre-built Flash Attention 2/3 binaries at runtime: no C++ compiler, no CUDA toolkit version pinning, no waiting.

```bash
pip install kernels
```
Building `flash-attn` from source is notoriously painful. The Ninja build system parallelizes aggressively across all available CPU cores, and each NVCC/CICC compiler process it spawns can consume 5–8 GB of RAM on its own. On a 64-core machine this can push peak RAM usage to ~300 GB, and even a throttled single-threaded build (`MAX_JOBS=1 NVCC_THREADS=1`) still takes many hours while grinding through paging. Pre-built community wheels cover 384+ version/GPU/CUDA/platform combinations and still routinely fail to match a user's exact environment. This is the point where most people give up and go without Flash Attention entirely. The `kernels` package sidesteps all of this by fetching a pre-compiled binary matched to your GPU architecture (SM80 for Ampere, SM90 for Hopper). If no compatible binary exists for your hardware, it gracefully falls back to `flex` or `sdpa` rather than erroring.
PyTorch's `flex_attention` (PyTorch ≥ 2.5) generates a fused Triton kernel customized to the mask pattern at hand. It is numerically very close to SDPA, typically within floating-point rounding of the naive computation. The primary advantage is that it can apply a block mask that skips padding tokens entirely, providing a meaningful speedup on batches with variable-length sequences (no compute wasted on padding). E1 uses a block-causal variant of this mask.
The first forward pass triggers JIT compilation via Triton, which can take 30–120 seconds. All subsequent calls are fast. Combining with torch.compile yields the best sustained throughput.
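To make the padding-skip idea concrete, here is a sketch of the kind of mask predicate flex-style attention consumes: a function of (batch, head, query index, key index) that returns whether the pair may attend. The helper names are illustrative, not part of the FastPLMs or PyTorch API.

```python
# Sketch: mask predicates in the style flex attention consumes.
# A predicate maps (batch, head, q_idx, kv_idx) -> bool; pairs where it
# returns False are skipped entirely. Helper names are illustrative only.

def make_padding_mask(lengths):
    """Both query and key must be real (non-padding) tokens."""
    def mask_mod(b, h, q_idx, kv_idx):
        return q_idx < lengths[b] and kv_idx < lengths[b]
    return mask_mod

def make_block_causal_mask(lengths, block_size):
    """Block-causal variant (in the spirit of E1's mask): a token attends
    to its own block and all earlier blocks, but never to padding."""
    padding = make_padding_mask(lengths)
    def mask_mod(b, h, q_idx, kv_idx):
        same_or_earlier_block = (kv_idx // block_size) <= (q_idx // block_size)
        return padding(b, h, q_idx, kv_idx) and same_or_earlier_block
    return mask_mod

# Batch of two sequences with real lengths 4 and 2
mask = make_block_causal_mask(lengths=[4, 2], block_size=2)
assert mask(0, 0, 3, 1)       # later block sees earlier block
assert not mask(0, 0, 1, 3)   # earlier block cannot see a later block
assert not mask(1, 0, 0, 3)   # padding positions are skipped
```

Flex attention compiles such a predicate into a block mask, so fully-masked tiles are never computed at all.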
Automatically selects the best available backend in order of preference: kernels_flash → flex → sdpa. Useful when you want maximum speed without configuring the environment manually, and you accept that the resolved backend may differ across machines.
At load time (all models):
```python
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("Synthyra/ESM2-150M", trust_remote_code=True)
config.attn_backend = "flex"  # "sdpa", "kernels_flash", "flex", or "auto"
model = AutoModel.from_pretrained("Synthyra/ESM2-150M", config=config, trust_remote_code=True)
```

After load time (DPLM and DPLM2 only):
DPLM and DPLM2 expose an attn_backend property on the model that propagates the change to all attention layers immediately:
```python
model = AutoModel.from_pretrained("Synthyra/DPLM-150M", trust_remote_code=True)
model.attn_backend = "flex"  # updates every attention layer in-place
```

For ESM2, E1, and ESM++, the backend must be set on the config before calling `from_pretrained`.
All backends support `output_attentions=True`. For the optimized backends (SDPA, Flash Attention, Flex), attention weights are computed via a separate naive matrix multiplication and appended to the output, so enabling this negates the memory savings of those backends. Use it only for inspection or contact prediction, not during high-throughput inference.
The EmbeddingMixin (shared across all models) provides a standardized way to extract representations from proteins.
The Pooler class aggregates per-residue representations into a single fixed-size sequence-level vector. Supported strategies include:
- `mean`: Mask-aware average of all residues.
- `cls`: The first token's representation (standard for classification).
- `max`: Element-wise maximum across the sequence.
- `var` / `std`: Variance or standard deviation of representations.
- `norm`: L2 normalization.
- `median`: Element-wise median.
- `parti`: Experimental PageRank-based attention pooling.
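For intuition, here is a minimal standalone sketch of how a few of these strategies differ and how concatenation grows the output vector. It uses raw tensors, not the actual Pooler implementation.

```python
import torch

torch.manual_seed(0)
hidden = torch.randn(2, 5, 8)              # (batch, seq_len, hidden_size)
mask = torch.tensor([[1, 1, 1, 0, 0],      # sequence 1 has 2 padding tokens
                     [1, 1, 1, 1, 1]], dtype=torch.float32)

# mean: mask-aware average over real residues only
mean_pool = (hidden * mask[..., None]).sum(1) / mask.sum(1, keepdim=True)

# cls: representation of the first token
cls_pool = hidden[:, 0]

# max: element-wise maximum, with padding pushed to -inf so it never wins
masked = hidden.masked_fill(mask[..., None] == 0, float("-inf"))
max_pool = masked.max(dim=1).values

# Concatenating strategies (as pooling_types=['mean', 'cls', 'max'] would)
combined = torch.cat([mean_pool, cls_pool, max_pool], dim=-1)
print(combined.shape)  # torch.Size([2, 24]) = 3 * hidden_size
```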
Ideal for embedding millions of sequences when you need to stream results to disk rather than hold them all in RAM.
```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("Synthyra/ESM2-150M", trust_remote_code=True).cuda()
sequences = ["MALWMRLLPLLALLALWGPDPAAA", "MKTIIALSYIFCLVFA", ...]

# Embed and store in SQLite
model.embed_dataset(
    sequences=sequences,
    batch_size=64,
    pooling_types=['mean', 'cls'],  # Concatenates both
    sql=True,
    sql_db_path='large_protein_db.db',
    embed_dtype=torch.float32
)
```

Pass a FASTA file path directly; no manual parsing required. Multi-line sequences are handled automatically. You can combine `fasta_path` with an explicit `sequences` list, and the two sources are merged before embedding.
```python
# Embed all sequences in a FASTA file and save to SQLite
model.embed_dataset(
    fasta_path='my_proteins.fasta',
    batch_size=64,
    pooling_types=['mean'],
    sql=True,
    sql_db_path='my_proteins.db',
)
```
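Under the hood, multi-line FASTA handling amounts to joining wrapped sequence lines under each header. A minimal standalone sketch of that idea (not the actual parser used by `embed_dataset`):

```python
def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records

fasta = """>sp|P01308|INS_HUMAN
MALWMRLLPL
LALLALWGPD
>toy
MKTIIALSYIFCLVFA"""
print(read_fasta(fasta)["sp|P01308|INS_HUMAN"])  # MALWMRLLPLLALLALWGPD
```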
```python
# Mix a FASTA file with an explicit list
model.embed_dataset(
    sequences=["MKTIIALSYIFCLVFA"],
    fasta_path='additional_proteins.fasta',
    batch_size=32,
    save=True,
    save_path='combined_embeddings.pth',
)
```

Perfect for medium-sized datasets that fit in memory.
```python
# Embed and return as a dictionary
embeddings = model.embed_dataset(
    sequences=sequences,
    batch_size=128,
    pooling_types=['mean'],
    save=True,
    save_path='my_embeddings.pth'
)

# Access embedding
seq_vector = embeddings["MALWMRLLPLLALLALWGPDPAAA"]  # torch.Tensor
```

Concatenate multiple mathematical representations for richer downstream features.
```python
# Use a variety of pooling types
embeddings = model.embed_dataset(
    sequences=sequences,
    pooling_types=['mean', 'max', 'std', 'var'],  # All 4 concatenated
    batch_size=32,
    full_embeddings=False
)

# Resulting vector size: 4 * hidden_size
print(embeddings[sequences[0]].shape)
```

FastPLMs includes a pytest-based test suite under `testing/` covering correctness, compliance, and performance. All GPU tests run inside Docker. See `docs/testing.md` for the full guide.
| Test | What it checks | Marker |
|---|---|---|
| AutoModel loading | Every model loads via `AutoModelForMaskedLM.from_pretrained(..., trust_remote_code=True)` and produces valid outputs | `gpu` |
| Backend consistency | SDPA, Flex, and Flash backends produce equivalent predictions (≥ 95% agreement) | `gpu` |
| Weight compliance | FastPLM weights are bit-exact with the original implementations (ESM2, ESMC, E1, DPLM) | `slow`, `gpu` |
| Forward compliance | Forward pass logits/predictions match the originals within tolerance | `slow`, `gpu` |
| NaN stability | Batched inference with padding produces no NaN in real-token embeddings | `gpu` |
| Batch-single match | Batch and single-item embedding produce identical results | `gpu` |
| Full model suite | All of the above across every checkpoint (8M through 3B) | `gpu`, `large` |
| Throughput benchmark | Tokens/sec across models, backends, batch sizes, and sequence lengths | `slow`, `gpu` |
| Structure models | Boltz2 and ESMFold loading + forward pass | `structure`, `slow`, `gpu` |
```bash
# Build the image
docker build -t fastplms .

# Fast tests (small models, no compliance, no structure)
docker run --gpus all fastplms python -m pytest /app/testing/ -m "gpu and not slow and not large and not structure" -v

# All sequence model tests except 3B
docker run --gpus all fastplms python -m pytest /app/testing/ -m "not large and not structure" -v

# Full suite including 3B models (requires 40+ GB VRAM)
docker run --gpus all fastplms python -m pytest /app/testing/ -m "not structure" -v

# Structure models only (Boltz2, ESMFold)
docker run --gpus all fastplms python -m pytest /app/testing/ -m "structure" -v

# Everything
docker run --gpus all fastplms python -m pytest /app/testing/ -v

# Single model family
docker run --gpus all fastplms python -m pytest /app/testing/ -k esm2 -v
```

On Windows (PowerShell), replace `$(pwd)` with `${PWD}`.
Weight and forward compliance tests compare FastPLM outputs against the original model implementations. These require additional packages:
| Dependency | Purpose | Install |
|---|---|---|
| `esm` | EvolutionaryScale's ESMC reference models | `pip install esm` |
| `E1` | Profluent-Bio's E1 reference models (Python ≥ 3.12) | `pip install "E1 @ git+https://github.com/Profluent-AI/E1.git"` |
ESM2 and DPLM compliance use HuggingFace transformers directly (no extra packages). If a compliance dependency is not installed, those tests are skipped. All compliance dependencies are pre-installed in the Docker image.
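The skip behaviour follows the usual pytest pattern of gating on importability. A sketch of the idea, not the exact FastPLMs test code; the test name below is hypothetical.

```python
import importlib.util

import pytest

# Gate compliance tests on whether the reference package can be imported;
# this mirrors the common pytest skipif pattern, not the exact suite code.
esm_available = importlib.util.find_spec("esm") is not None

@pytest.mark.skipif(not esm_available, reason="`esm` reference package not installed")
def test_esmc_weight_compliance():
    ...  # compare FastPLM weights against the ESMC reference here
```

When `esm` is absent the test is reported as skipped rather than failed, which is why a bare `pip install` environment can still run the rest of the suite.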
Throughput can be measured via the pytest test (saves structured JSON/CSV/PNG results) or the standalone script (more configurable).
```bash
# Pytest (benchmarks ESM2-8M, ESMplusplus_small, DPLM-150M, DPLM2-150M across all backends)
docker run --gpus all -v $(pwd):/workspace fastplms python -m pytest /app/testing/test_throughput.py -v -s
# Output: throughput_results.json, throughput_results.csv, throughput_comparison.png

# Standalone (fully configurable)
docker run --gpus all -v $(pwd):/workspace fastplms \
    python -m testing.throughput \
    --model_paths Synthyra/ESM2-8M Synthyra/ESMplusplus_small \
    --backends sdpa flex kernels_flash \
    --batch_sizes 2 4 8 \
    --sequence_lengths 64 128 256 512 1024 2048 \
    --output_path /workspace/throughput_comparison.png
```

```bash
git clone --recurse-submodules https://github.com/Synthyra/FastPLMs.git
cd FastPLMs
pip install -r requirements.txt
```

If you already cloned without `--recurse-submodules`, initialize submodules separately:

```bash
git submodule update --init --recursive
```

The Dockerfile includes CUDA 12.8, all Python dependencies, and official reference repos (E1, DPLM) installed from `official/` submodules for compliance testing.
```bash
# Initialize submodules (required before building Docker)
git submodule update --init --recursive

# Build the image
docker build -t fastplms .

# Run all tests
docker run --gpus all fastplms python -m pytest /app/testing/ -v

# Interactive shell
docker run --gpus all -v $(pwd):/workspace -it fastplms bash
```

On Windows (PowerShell), replace `$(pwd)` with `${PWD}`.
Found a bug or have a feature request? Please open a GitHub Issue. We are actively looking for contributions to optimize more pLM architectures!
If you use FastPLMs, please cite the following along with the relevant model paper(s).
```bibtex
@misc{FastPLMs,
  author={Hallee, Logan and Bichara, David and Gleghorn, Jason P.},
  title={FastPLMs: Fast, efficient, protein language model inference from Huggingface AutoModel.},
  year={2024},
  url={https://huggingface.co/Synthyra/ESMplusplus_small},
  DOI={10.57967/hf/3726},
  publisher={Hugging Face}
}

@article{dong2024flexattention,
  title={Flex Attention: A Programming Model for Generating Optimized Attention Kernels},
  author={Dong, Juechu and Feng, Boyuan and Guessous, Driss and Liang, Yanbo and He, Horace},
  journal={arXiv preprint arXiv:2412.05496},
  year={2024}
}

@inproceedings{paszke2019pytorch,
  title={PyTorch: An Imperative Style, High-Performance Deep Learning Library},
  author={Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and Desmaison, Alban and K{\"o}pf, Andreas and Yang, Edward and DeVito, Zach and Raison, Martin and Tejani, Alykhan and Chilamkurthy, Sasank and Steiner, Benoit and Fang, Lu and Bai, Junjie and Chintala, Soumith},
  booktitle={Advances in Neural Information Processing Systems 32},
  year={2019}
}

@article{lin2023esm2,
  title={Evolutionary-scale prediction of atomic-level protein structure with a language model},
  author={Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smestad, Nikita and Verkuil, Robert and Kabeli, Ori and Shmueli, Yaniv and dos Santos Costa, Allan and Fazel-Zarandi, Maryam and Sercu, Tom and Candido, Salvatore and Rives, Alexander},
  journal={Science},
  volume={379},
  number={6637},
  pages={1123--1130},
  year={2023},
  DOI={10.1126/science.ade2574}
}

@article{hayes2024simulating,
  title={Simulating 500 million years of evolution with a language model},
  author={Hayes, Thomas and Rao, Roshan and Akin, Halil and Sofber, Nicholas J and Achour, Divya and Moez, Irfan and Garg, Rhitu and Angelova, Rami and Babu, Manan and Alcaide, Eric and others},
  journal={bioRxiv},
  year={2024}
}

@article{jain2025e1,
  title={E1: Retrieval-Augmented Protein Encoder Models},
  author={Jain, Sarthak and Beazer, Joel and Ruffolo, Jeffrey A and Bhatnagar, Aadyot and Madani, Ali},
  journal={bioRxiv},
  DOI={10.1101/2025.11.12.688125},
  year={2025}
}

@inproceedings{wang2024dplm,
  title={Diffusion Language Models Are Versatile Protein Learners},
  author={Wang, Xinyou and Ye, Zaixiang and Huang, Fei and Cao, Dongyan and Liang, Shujian and Huang, Liang},
  booktitle={Proceedings of the 41st International Conference on Machine Learning},
  year={2024}
}

@article{wang2024dplm2,
  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
  author={Wang, Xinyou and Ye, Zaixiang and Huang, Fei and Cao, Dongyan and Liang, Shujian and Huang, Liang},
  journal={arXiv preprint arXiv:2410.13782},
  year={2024}
}

@article{elnaggar2023ankh,
  title={Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling},
  author={Elnaggar, Ahmed and Essam, Hazem and Salah-Eldin, Wafaa and Moustafa, Walid and Elkerdawy, Mohamed and Rochereau, Charlotte and Rost, Burkhard},
  journal={arXiv preprint arXiv:2301.06568},
  year={2023}
}

@article{alsamkary2025ankh3,
  title={Ankh3: Multi-Task Pretraining with Sequence Denoising and Completion Enhances Protein Representations},
  author={Alsamkary, Hazem and Elshaffei, Mohamed and Elkerdawy, Mohamed and Elnaggar, Ahmed},
  journal={arXiv preprint arXiv:2505.20052},
  year={2025}
}

@article{passaro2025boltz2,
  title={Boltz-2: Exploring the Frontiers of Biomolecular Prediction},
  author={Passaro, Saro and Corso, Gabriele and Wohlwend, Jeremy and Reveiz, Mateo and Bordes, Florian and Wicky, Basile and Dayan, Peter and Jing, Bowen},
  journal={bioRxiv},
  year={2025}
}

@article{wohlwend2024boltz1,
  title={Boltz-1: Democratizing Biomolecular Interaction Modeling},
  author={Wohlwend, Jeremy and Corso, Gabriele and Passaro, Saro and Reveiz, Mateo and Leidal, Ken and Swanson, Wojtek and Kher, Gilmer and Lember, Tommi and Jaakkola, Tommi},
  journal={bioRxiv},
  year={2024}
}

@misc{bushuiev2026proteinneed,
  title={One protein is all you need},
  author={Anton Bushuiev and Roman Bushuiev and Olga Pimenova and Nikola Zadorozhny and Raman Samusevich and Elisabet Manaskova and Rachel Seongeun Kim and Hannes St\"ark and Jiri Sedlar and Martin Steinegger and Tom\'a\v{s} Pluskal and Josef Sivic},
  year={2026},
  eprint={2411.02109},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2411.02109}
}
```