Whole Slide Image (WSI) preprocessing toolkit for the Dr. Robert Gillies Machine Learning Workshop at Moffitt Cancer Center. This project demonstrates how to extract tissue patches from pathology slides and generate foundation model embeddings using the honeybee-ml library.
## Table of Contents

- Overview
- Installation
- Quick Start
- Tutorial: Extracting Tissue Patches
- Tutorial: Generating Embeddings
- Output Format
- Complete Example
- Troubleshooting
- References
## Overview

This repository contains tools for preprocessing whole slide images (WSIs) from cancer pathology datasets. The workflow includes:
- Loading WSI slides - Read large pathology images (e.g., `.tif` files)
- Extracting tissue patches - Divide slides into manageable tiles
- Tissue detection - Identify high-quality tissue regions (classical + deep learning methods)
- Embedding generation - Extract 1024-dimensional feature vectors using the UNI foundation model
- Saving outputs - Store embeddings with metadata for downstream analysis
Use Cases:
- Cancer survival prediction
- Tumor classification
- Biomarker discovery
- Multi-modal learning (pathology + clinical data)
## Installation

Prerequisites:

- Python: 3.12+ (recommended)
- CUDA: 11.3+ (for GPU acceleration)
- Storage: ~2GB for models + space for slide data
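To verify these prerequisites, a quick check may help. This is a minimal sketch; it assumes PyTorch is available in your environment (pulled in by the honeybee-ml toolchain):

```python
import sys

import torch

print(f"Python: {sys.version.split()[0]}")             # expect 3.12+
print(f"CUDA available: {torch.cuda.is_available()}")  # expect True for GPU acceleration
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")       # expect 11.3+
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```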
1. Clone the repository:

   ```bash
   git clone <repo-url>
   cd gillies-workshop-hackathon-2025
   ```

2. Run the setup script:

   ```bash
   bash setup.sh
   ```

   This installs:

   - `honeybee-ml` - Medical imaging toolkit
   - `huggingface_hub` - Model downloading
   - `numpy`, `matplotlib` - Core scientific libraries
   - RAPIDS packages for GPU-accelerated data science

3. Download the UNI model (done automatically on first run):

   ```python
   from huggingface_hub import hf_hub_download

   uni_model_path = hf_hub_download(
       repo_id='MahmoodLab/UNI',
       filename='pytorch_model.bin',
       cache_dir='./weights/uni_cache'
   )
   ```
## Quick Start

```python
from pathlib import Path

import numpy as np
from honeybee.processors import PathologyProcessor
from huggingface_hub import hf_hub_download

# 1. Initialize processor with UNI model
uni_model_path = hf_hub_download(
    repo_id='MahmoodLab/UNI',
    filename='pytorch_model.bin',
    cache_dir='./weights/uni_cache'
)
processor = PathologyProcessor(model="uni", model_path=uni_model_path)

# 2. Load a WSI slide
wsi_path = Path("path/to/slide.tif")
wsi = processor.load_wsi(wsi_path, tile_size=128, max_patches=500)

# 3. Extract patches
patches = []
for addr in list(wsi.iterateTiles()):
    tile = wsi.getTile(addr, writeToNumpy=True)
    if tile is not None:
        patches.append(tile[:, :, :3])  # RGB only

# 4. Generate embeddings
patches_array = np.stack(patches, axis=0)
embeddings = processor.generate_embeddings(patches_array, batch_size=32)
print(f"Generated {embeddings.shape[0]} embeddings of dimension {embeddings.shape[1]}")
# Output: Generated 500 embeddings of dimension 1024
```

## Tutorial: Extracting Tissue Patches

The `PathologyProcessor` class handles loading large WSI files and dividing them into tiles.
```python
from pathlib import Path
from honeybee.processors import PathologyProcessor

# Initialize processor
processor = PathologyProcessor(model="uni", model_path="path/to/uni_model.bin")

# Load WSI with tiling parameters
wsi_path = Path("../hackathon-data/train/images/P0202.tif")
wsi = processor.load_wsi(
    wsi_path,
    tile_size=128,    # Each tile is 128x128 pixels
    max_patches=500,  # Limit number of patches (optional)
    verbose=False
)

# Inspect the loaded slide
print(f"Slide dimensions: {wsi.slide.width}x{wsi.slide.height}")
print(f"Tile size: {wsi.tileSize}x{wsi.tileSize}")
print(f"Tile grid: {wsi.numTilesInX}x{wsi.numTilesInY} = {wsi.numTilesInX * wsi.numTilesInY} tiles")
```

Output:

```
Slide dimensions: 6000x7669
Tile size: 128x128
Tile grid: 46x59 = 2714 tiles
```

The grid size is the floor of each dimension divided by the tile size: `6000 // 128 = 46` and `7669 // 128 = 59`.
Iterate through the tile grid and extract patch images:
```python
import numpy as np

# Pre-load all tiles
tiles_cache = {}
for addr in list(wsi.iterateTiles()):
    tile = wsi.getTile(addr, writeToNumpy=True)
    # Validate tile
    if tile is not None and len(tile.shape) >= 3 and tile.shape[2] >= 3:
        tile_rgb = tile[:, :, :3]  # Extract RGB channels (drop alpha if present)
        tiles_cache[addr] = tile_rgb

print(f"Extracted {len(tiles_cache)} patches")
print(f"Patch shape: {list(tiles_cache.values())[0].shape}")
```

Output:

```
Extracted 2714 patches
Patch shape: (128, 128, 3)
```
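Before moving on, it can be worth eyeballing a few patches. A minimal sketch with matplotlib (installed by the setup script), assuming the `tiles_cache` dict from the snippet above:

```python
import matplotlib.pyplot as plt

# Show a 4x4 grid of extracted patches, labeled by tile coordinates
sample_addrs = list(tiles_cache.keys())[:16]
fig, axes = plt.subplots(4, 4, figsize=(8, 8))
for ax, addr in zip(axes.flat, sample_addrs):
    ax.imshow(tiles_cache[addr])
    ax.set_title(str(addr), fontsize=8)
    ax.axis("off")
plt.tight_layout()
plt.show()
```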
Key Points:
- `addr` is a tuple `(x, y)` representing tile coordinates in the grid
- `getTile()` returns a numpy array with shape `(H, W, C)`
- Some formats (e.g., TIFF) may have 4 channels (RGBA), so extract RGB with `[:, :, :3]`
- Caching tiles improves performance if you need them multiple times
Honeybee provides classical computer vision methods for tissue detection:
```python
import numpy as np
from PIL import Image

# Get a thumbnail of the slide
thumbnail = np.asarray(wsi.slide)
if thumbnail.shape[-1] == 4:
    thumbnail = thumbnail[:, :, :3]

# Method 1: Otsu thresholding
tissue_mask_otsu = processor.detect_tissue(thumbnail, method="otsu")

# Method 2: HSV color-based detection
tissue_mask_hsv = processor.detect_tissue(thumbnail, method="hsv")

# Method 3: Combined approach
tissue_mask_combined = processor.detect_tissue(thumbnail, method="otsu_hsv")

# Calculate tissue percentage
tissue_pct = np.sum(tissue_mask_otsu) / tissue_mask_otsu.size * 100
print(f"Tissue coverage: {tissue_pct:.1f}%")
```

Available Methods:

- `"otsu"` - Automatic thresholding based on the grayscale histogram
- `"hsv"` - Color-based detection in HSV space (targets pinkish tissue)
- `"otsu_hsv"` - Intersection of both methods for higher precision
For more accurate tissue segmentation, use the deep learning tissue detector:
```python
from honeybee.models.TissueDetector.tissue_detector import TissueDetector

# Load pre-trained DenseNet121 tissue detector
tissue_detector = TissueDetector(
    model_path="./weights/deep-tissue-detector_densenet_state-dict.pt"
)

# Attach to WSI and run detection
wsi.tissue_detector = tissue_detector
wsi.detectTissue()

print(f"Analyzed {len(wsi.tileDictionary)} tiles")
```

Understanding the Output:
The deep learning detector classifies each tile into 3 categories:
```python
for address, tile_info in wsi.tileDictionary.items():
    artifact_level = tile_info['artifactLevel']      # Probability of artifact (0-1)
    background_level = tile_info['backgroundLevel']  # Probability of background (0-1)
    tissue_level = tile_info['tissueLevel']          # Probability of tissue (0-1)
    print(f"Tile {address}: Tissue={tissue_level:.3f}, Background={background_level:.3f}, Artifact={artifact_level:.3f}")
```

Example Output:

```
Tile (0, 0): Tissue=0.023, Background=0.971, Artifact=0.006
Tile (5, 3): Tissue=0.892, Background=0.098, Artifact=0.010
Tile (10, 7): Tissue=0.754, Background=0.231, Artifact=0.015
```
Select only patches with high tissue probability:
```python
TISSUE_THRESHOLD = 0.5  # Adjust based on your quality requirements

tissue_patches = []
tissue_coordinates = []
tissue_scores = []

for address, tile_info in wsi.tileDictionary.items():
    if tile_info.get('tissueLevel', 0) > TISSUE_THRESHOLD:
        # Get cached tile
        if address in tiles_cache:
            tissue_patches.append(tiles_cache[address])
            tissue_coordinates.append(address)
            tissue_scores.append(tile_info['tissueLevel'])

print(f"High-quality tissue patches: {len(tissue_patches)}")
print(f"Mean tissue score: {np.mean(tissue_scores):.3f}")
```

Output:

```
High-quality tissue patches: 1090
Mean tissue score: 0.742
```
Threshold Selection Tips:
- `> 0.3` - Permissive (includes some background)
- `> 0.5` - Balanced (recommended for most use cases)
- `> 0.7` - Strict (only high-confidence tissue)
- `> 0.9` - Very strict (excludes edge regions)
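Before committing to a threshold, it helps to see how many patches each cutoff keeps. A minimal sketch, assuming `wsi.tileDictionary` has been populated by `detectTissue()` as above:

```python
import numpy as np

# Tabulate surviving patch counts at several candidate thresholds
tissue_levels = np.array(
    [info.get('tissueLevel', 0) for info in wsi.tileDictionary.values()]
)
for threshold in (0.3, 0.5, 0.7, 0.9):
    kept = int((tissue_levels > threshold).sum())
    pct = kept / len(tissue_levels) * 100
    print(f"> {threshold}: {kept} patches kept ({pct:.1f}% of {len(tissue_levels)} tiles)")
```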
## Tutorial: Generating Embeddings

Foundation models like UNI extract rich feature representations from pathology images. These embeddings can be used for downstream tasks like survival prediction, tumor classification, and biomarker discovery.
UNI (Universal Pathology Foundation Model) generates 1024-dimensional embeddings:
```python
from honeybee.processors import PathologyProcessor
from huggingface_hub import hf_hub_download

# Download UNI model from HuggingFace (cached after first download)
uni_model_path = hf_hub_download(
    repo_id='MahmoodLab/UNI',
    filename='pytorch_model.bin',
    cache_dir='./weights/uni_cache'
)
print(f"UNI model ready: {uni_model_path}")

# Initialize processor with model path
processor = PathologyProcessor(model="uni", model_path=uni_model_path)
```

Supported Models:

- `"uni"` - UNI (1024-dim embeddings) - Requires `model_path`
- `"virchow2"` - Virchow2 (varies) - Requires `model_path`
- `"remedis"` - REMEDIS (varies) - Requires `model_path`
Batch processing is much faster than processing patches one-by-one:
```python
import numpy as np

# Stack patches into a single numpy array
patches_array = np.stack(tissue_patches, axis=0)  # Shape: (N, 128, 128, 3)
print(f"Patches array shape: {patches_array.shape}")

# Generate embeddings in batches
embeddings = processor.generate_embeddings(
    patches_array,
    batch_size=32  # Process 32 patches at a time (adjust based on GPU memory)
)

print(f"Embeddings shape: {embeddings.shape}")
print(f"Embedding stats - Mean: {embeddings.mean():.4f}, Std: {embeddings.std():.4f}")
```

Output:

```
Patches array shape: (1090, 128, 128, 3)
Embeddings shape: (1090, 1024)
Embedding stats - Mean: 0.0073, Std: 1.2373
```
Batch Size Guidelines:
- GPU 8GB: batch_size=16-32
- GPU 16GB: batch_size=32-64
- GPU 24GB+: batch_size=64-128
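If you prefer to pick the batch size programmatically, you can key off the detected GPU memory. A rough sketch mirroring the tiers above (`choose_batch_size` is illustrative, not part of honeybee-ml):

```python
import torch

def choose_batch_size(default: int = 16) -> int:
    """Map available GPU memory to a batch-size tier (illustrative heuristic)."""
    if not torch.cuda.is_available():
        return default  # CPU fallback; embedding generation will be slow
    gpu_mem_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if gpu_mem_gb >= 24:
        return 128
    if gpu_mem_gb >= 16:
        return 64
    return 32

embeddings = processor.generate_embeddings(patches_array, batch_size=choose_batch_size())
```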
If you only need to process one or a few patches:
```python
# Single patch
single_patch = tissue_patches[0]  # Shape: (128, 128, 3)

# Add batch dimension
single_patch_batch = np.expand_dims(single_patch, axis=0)  # Shape: (1, 128, 128, 3)

# Generate embedding
embedding = processor.generate_embeddings(single_patch_batch, batch_size=1)
print(f"Single embedding shape: {embedding.shape}")  # (1, 1024)
```

UNI embeddings are designed to capture:
- Morphological features - Cell shapes, tissue architecture
- Staining patterns - H&E color distributions
- Spatial relationships - Cell density, organization
- Diagnostic patterns - Cancer vs. normal, tumor subtypes
```python
# Analyze embedding statistics
print(f"Embedding dimension: {embeddings.shape[1]}")
print(f"Non-zero features: {np.count_nonzero(embeddings[0])}")
print(f"Value range: [{embeddings.min():.3f}, {embeddings.max():.3f}]")

# Check for NaN/inf values
print(f"Contains NaN: {np.any(np.isnan(embeddings))}")
print(f"Contains Inf: {np.any(np.isinf(embeddings))}")
```

Output:

```
Embedding dimension: 1024
Non-zero features: 1024
Value range: [-5.234, 6.891]
Contains NaN: False
Contains Inf: False
```
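One quick sanity check that the embeddings encode morphology rather than noise is pairwise similarity: patches of similar tissue should score higher cosine similarity than dissimilar ones. A minimal sketch in plain numpy:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare the first patch embedding against a few others
for i in (1, 10, 100):
    sim = cosine_similarity(embeddings[0], embeddings[i])
    print(f"Patch 0 vs patch {i}: cosine similarity = {sim:.3f}")
```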
## Output Format

Save embeddings with metadata for downstream analysis:
```python
from pathlib import Path

import numpy as np

# Prepare metadata
metadata = {
    'patient_id': 'P0202',
    'model': 'uni',
    'num_patches': len(tissue_patches),
    'tile_size': 128,
    'tissue_threshold': 0.5,
    'slide_width': wsi.slide.width,
    'slide_height': wsi.slide.height,
    'embedding_dim': embeddings.shape[1]
}

# Save to compressed NPZ file
output_path = Path("./data/P0202_embeddings.npz")
np.savez_compressed(
    output_path,
    embeddings=embeddings,                     # (N, 1024) - Feature vectors
    coordinates=np.array(tissue_coordinates),  # (N, 2) - Tile positions
    tissue_levels=np.array(tissue_scores),     # (N,) - Quality scores
    metadata=np.array(metadata, dtype=object)  # Dict with metadata
)

print(f"Saved: {output_path} ({output_path.stat().st_size / 1024**2:.2f} MB)")
```

To load the embeddings back:

```python
import numpy as np
# Load NPZ file
data = np.load('./data/P0202_embeddings.npz', allow_pickle=True)
# Access individual arrays
embeddings = data['embeddings'] # (1090, 1024)
coordinates = data['coordinates'] # (1090, 2)
tissue_levels = data['tissue_levels'] # (1090,)
metadata = data['metadata'].item() # dict
# Print info
print(f"Patient: {metadata['patient_id']}")
print(f"Model: {metadata['model']}")
print(f"Patches: {len(embeddings)}")
print(f"Embedding dimension: {embeddings.shape[1]}")
# Verify data integrity
assert embeddings.shape[0] == coordinates.shape[0] == len(tissue_levels)
assert embeddings.shape[1] == 1024
assert np.all(tissue_levels > 0.5)
print("Data integrity verified")NPZ File Structure:
P0202_embeddings.npz (4.0 MB)
  embeddings:    (1090, 1024) float32 - UNI feature vectors
  coordinates:   (1090, 2)    int64   - (x, y) tile grid positions
  tissue_levels: (1090,)      float32 - Tissue probability scores
  metadata:      dict                 - Slide and processing metadata
```
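For downstream tasks like survival prediction or tumor classification, patch embeddings are often aggregated into a single slide-level vector. A minimal sketch using mean pooling (one common strategy; attention-based pooling is a frequent alternative):

```python
import numpy as np

# Load patch embeddings and pool them into one vector per slide
data = np.load('./data/P0202_embeddings.npz', allow_pickle=True)
patch_embeddings = data['embeddings']            # (N, 1024)

slide_embedding = patch_embeddings.mean(axis=0)  # (1024,)
print(f"Slide-level embedding shape: {slide_embedding.shape}")

# Optionally weight each patch by its tissue probability score
weights = data['tissue_levels'] / data['tissue_levels'].sum()
weighted_embedding = (patch_embeddings * weights[:, None]).sum(axis=0)
print(f"Weighted embedding shape: {weighted_embedding.shape}")
```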