Evaluating Vision and Pathology Foundation Models for Computational Pathology: A Comprehensive Benchmark Study
Rohan Bareja1, Francisco Carrillo-Perez1, Yuanning Zheng1, Marija Pizurica1
Tarak Nath Nandi2, Jeanne Shen3, Ravi Madduri2, Olivier Gevaert1
1Stanford Center for Biomedical Informatics Research (BMIR), Stanford University, School of Medicine
2Data Science and Learning Division, Argonne National Laboratory
3Department of Pathology, Stanford University, School of Medicine
To advance precision medicine in pathology, robust AI-driven foundation models are increasingly needed to uncover complex patterns in large-scale pathology datasets, enabling more accurate disease detection, classification, and prognostic insights. However, despite substantial progress in deep learning and computer vision, the comparative performance and generalizability of these pathology foundation models across diverse histopathological datasets and tasks remain largely unexamined. In this study, we conduct a comprehensive benchmark of 31 AI foundation models for computational pathology, including general vision models (VM), general vision-language models (VLM), pathology-specific vision models (Path-VM), and pathology-specific vision-language models (Path-VLM), evaluated on 41 tasks sourced from TCGA, CPTAC, external benchmarking datasets, and out-of-domain datasets. Across TCGA, CPTAC, and external benchmarks, Virchow2 consistently performed at the top, alongside Prov-GigaPath, H-optimus-0, and UNI, all of which ranked among the leading models. Pairwise comparisons revealed no statistically significant differences among these top models, highlighting their comparable performance and robustness across diverse histopathological tasks. We also show that Path-VM outperformed both Path-VLM and VM, securing top rankings across tasks despite lacking a statistically significant edge over vision models. Our findings reveal that model size and data size did not consistently correlate with improved performance in pathology foundation models, challenging assumptions about scaling in histopathological applications. Lastly, our study demonstrates that a fusion model integrating top-performing foundation models achieved superior generalization across external tasks and diverse tissues in histopathological analysis. These findings emphasize the need for further research to understand the underlying factors influencing model performance and to develop strategies that enhance the generalizability and robustness of pathology-specific vision foundation models across different tissue types and datasets.
This repository accompanies our paper:
Evaluating Vision and Pathology Foundation Models for Computational Pathology: A Comprehensive Benchmark Study
medRxiv preprint, May 2025
We benchmark 31 foundation models across 41 computational pathology tasks, including:
- General-purpose Vision Models (VM)
- Vision-Language Models (VLM)
- Pathology-specific Vision Models (Path-VM)
- Pathology-specific Vision-Language Models (Path-VLM)
We evaluate performance across data from TCGA, CPTAC, and several external out-of-domain datasets. Tasks include tumor classification, molecular subtyping, tumor stage, and pathway prediction.
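The core protocol behind these benchmarks is linear evaluation: patch embeddings are extracted with a frozen foundation-model encoder, and a lightweight classifier is trained on top (see the linear evaluation scripts in scripts/). The snippet below is only a minimal illustrative sketch of that idea, using scikit-learn on pre-computed embeddings; the .npy file names are placeholders, and the actual scripts additionally handle patch sampling, bagging, and distributed training.

```python
# Minimal, illustrative linear-probe sketch (not the repository's evaluation code).
# Assumes patch embeddings were already extracted with a frozen encoder and saved
# as NumPy arrays; the .npy file names below are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# Pre-computed embeddings and labels for the training and test splits.
X_train, y_train = np.load("train_embeddings.npy"), np.load("train_labels.npy")
X_test, y_test = np.load("test_embeddings.npy"), np.load("test_labels.npy")

# Fit a linear classifier on top of the frozen features.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Report a metric that is robust to class imbalance.
print("Balanced accuracy:", balanced_accuracy_score(y_test, clf.predict(X_test)))
```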
You can explore the complete benchmark results interactively via our web portal.

Key findings:
- Virchow2 achieved the highest performance across TCGA, CPTAC, and external datasets.
- Path-VM models outperformed both VLMs and general-purpose VMs on average.
- Model size and dataset size were not reliably associated with better performance.
- A fusion model combining top-performing encoders generalized best across tissue types and institutions.
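The fusion model combines representations from the top-performing encoders. As a rough illustration of one common way to do this, the sketch below fuses embeddings by concatenation before fitting a linear head; this is an assumed, simplified scheme that may differ from the exact fusion architecture used in the paper, and all file names are placeholders.

```python
# Illustrative late-fusion sketch: concatenate embeddings from several frozen
# encoders and train one linear head on the fused representation. This is an
# assumed, simplified scheme, not necessarily the paper's exact fusion model.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-computed embeddings from three encoders for the same patches.
emb_virchow2 = np.load("virchow2_train.npy")   # shape: (n_patches, d1)
emb_gigapath = np.load("gigapath_train.npy")   # shape: (n_patches, d2)
emb_uni = np.load("uni_train.npy")             # shape: (n_patches, d3)
labels = np.load("train_labels.npy")

# Late fusion by feature concatenation, followed by a linear probe.
fused = np.concatenate([emb_virchow2, emb_gigapath, emb_uni], axis=1)
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
```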
.
├── dashboard/ # Dashboard code (e.g., Streamlit app)
├── data/ # Data used for the dashboard (summaries, plots, results)
├── environments/ # Conda environment YAML files
│ └── linear_eval.yml # Recommended environment for model evaluation
├── models/ # Vision transformer model code
├── scripts/ # Linear evaluation scripts for benchmarking
└── README.md # Project overview and setup instructions
- Operating system(s) tested: Linux (tested on SUSE Linux Enterprise Server 15 SP6; expected to run on other modern Linux distributions such as Ubuntu 22.04 or CentOS 7)
- Dependencies: fully specified in environments/linear_eval.yml
- Hardware: standard x86_64 CPU; GPU recommended for faster model evaluation
- Typical install time for conda environment on a "normal" desktop computer: ~10–15 minutes
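Once the environment below is installed, you can optionally confirm that PyTorch detects a GPU before launching any evaluation; this quick check is not part of the repository's scripts.

```python
# Optional sanity check (not part of the repository scripts): confirm that
# PyTorch detects a CUDA-capable GPU before launching an evaluation run.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```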
- Clone the repository
git clone https://github.com/gevaertlab/benchmarking-path-models.git
cd benchmarking-path-models
- Set up the Conda environment
We recommend using the provided Conda environment for reproducibility:
conda env create -f environments/linear_eval.yml
conda activate linear_eval
- Patch extraction
To extract patches from whole-slide images (WSIs), please use the script src/patch_gen_hdf5.py. An example script to run the patch extraction: src/submit_patch_gen_hdf5.sh (a simplified patch-extraction sketch is also included after the example command below).
- Example: Run evaluation script for UNI
python -m torch.distributed.launch \
--master_port $RANDOM \
--nproc_per_node=4 \
/home/rbareja/dino/eval_linear_uni.py \
--patch_data_path _Patches256x256_hdf5/ \
--train_csv_path ../tcga_cancer_metadata/brain_meta/tcga_ref_brain_IDHmut_train_fold0.csv \
--val_csv_path ../tcga_cancer_metadata/brain_meta/tcga_ref_brain_IDHmut_val_fold0.csv \
--test_csv_path ../tcga_cancer_metadata/brain_meta/tcga_ref_brain_IDHmut_test.csv \
--no_aug \
--img_size=256 \
--max_patches_total=500 \
--bag_size=50 \
--test_max_patches_total=500 \
--test_bag_size=500 \
--output_dir ../eval_brain/IDHmut_classification/"$out_dir"/ \
--train_from_scratch no \
--num_workers=2 \
--batch_size_per_gpu 16 \
--test_batch_size_per_gpu 2 \
--num_labels 2 \
--arch "$arch" \
--patch_size="$p_size" \
--epochs 30 \
--evaluate \
--pretrained_weights "$p_weights" \
> ../eval_brain/IDHmut_classification/"$out_dir"/logtesdata.txt
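As referenced in the patch-extraction step above, the following is a simplified sketch of tiling a WSI into fixed-size patches and writing them to HDF5, in the spirit of src/patch_gen_hdf5.py. It is not the repository script itself: tissue masking, magnification selection, and the script's actual arguments are omitted, and the slide and output paths are placeholders.

```python
# Simplified WSI patch-extraction sketch (not src/patch_gen_hdf5.py itself).
# Requires openslide-python and h5py; the slide and output paths are placeholders.
import numpy as np
import openslide
import h5py

PATCH_SIZE = 256
slide = openslide.OpenSlide("example_slide.svs")
width, height = slide.dimensions  # level-0 dimensions

with h5py.File("example_slide_patches.h5", "w") as f:
    patches = f.create_dataset(
        "patches",
        shape=(0, PATCH_SIZE, PATCH_SIZE, 3),
        maxshape=(None, PATCH_SIZE, PATCH_SIZE, 3),
        dtype="uint8",
    )
    for y in range(0, height - PATCH_SIZE + 1, PATCH_SIZE):
        for x in range(0, width - PATCH_SIZE + 1, PATCH_SIZE):
            # Read one tile at level 0 and drop the alpha channel.
            tile = np.array(slide.read_region((x, y), 0, (PATCH_SIZE, PATCH_SIZE)))[:, :, :3]
            # Skip mostly-background (very bright) tiles with a crude intensity threshold.
            if tile.mean() > 230:
                continue
            patches.resize(patches.shape[0] + 1, axis=0)
            patches[-1] = tile
```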
- Model inference time depends on the model, task, and dataset size.
- Typically, evaluating a single model takes a few hours on a workstation with 4 GPUs or a moderately sized CPU cluster.
- Scripts support parallel evaluation via PyTorch Distributed for multi-GPU setups.
- Exact runtime may vary depending on hardware, batch size, and data preprocessing.
If you use this work in your research, please cite our preprint:
Bareja R, Carrillo-Perez F, Zheng Y, Pizurica M, Nandi TN, Shen J, Madduri R, Gevaert O. Evaluating Vision and Pathology Foundation Models for Computational Pathology: A Comprehensive Benchmark Study. medRxiv, 2025. https://doi.org/10.1101/2025.05.08.25327250
This project is licensed under the MIT License.