Skip to content

tsenoner/protspace

Repository files navigation

ProtSpace

PyPI version Python 3.10+ License: GPL v3 Downloads DOI

ProtSpace is a visualization tool for exploring protein embeddings or similarity matrices along their 3D protein structures. It allows users to interactively visualize high-dimensional protein language model data in 2D or 3D space, color-code proteins based on various features, and view protein structures when available.

🌐 Try Online

Web Interface: https://protspace.rostlab.org/

New JavaScript Frontend (in development): https://tsenoner.github.io/protspace_web -> Drag & drop .parquetbundle files

πŸš€ Quick Start with Google Colab

Note: Use Chrome or Firefox for best experience.

  1. Explore Pre-computed Visualizations: Open Explorer In Colab

  2. Generate Protein Embeddings: Open Embeddings In Colab

  3. Full Pipeline Demo: Open Pipeline In Colab

  4. Pfam & Clan Explorer: Open Pfam Explorer In Colab

πŸ“¦ Installation

# Basic installation (backend - dimensionality reduction only)
pip install protspace

# Full installation (backend + frontend - including visualization interface)
pip install "protspace[frontend]"

🎯 Quick Start

1. Query UniProt directly

# Retrieve and analyze proteins from UniProt using sequence similarity (mmmseqs2)
protspace-query -q "(ft_domain:phosphatase) AND (reviewed:true)" -o output_dir -m pca2,pca3,umap2 -f "protein_families,fragment,kingdom,superfamily" --n_neighbors 30 --min_dist 0.4

2. Process local data

# Analyse and vizualise your locally stored embeddings
protspace-local -i embeddings.h5 -o output_dir -m pca2,umap2

3. Launch visualization

protspace output_dir

Access at http://localhost:8050

πŸ“Š Example Outputs

2D Scatter Plot

2D Example

3D Interactive Plot

View 3D Example

✨ Features

  • Multiple projections: PCA, UMAP, t-SNE, MDS, PaCMAP in 2D/3D
  • Automatic feature extraction: Use -f to color-code proteins by UniProt, InterPro, or Taxonomy features
  • 3D structure viewer: Integrated protein structure visualization
  • Export: SVG (2D) and HTML (3D) formats

Available Features (use with -f)

UniProt: annotation_score, cc_subcellular_location, fragment, length_fixed, length_quantile, protein_existence, protein_families, reviewed, xref_pdb

InterPro: cath, pfam, signal_peptide, superfamily

Taxonomy: root, domain, kingdom, phylum, class, order, family, genus, species

Examples:

# Extract Pfam domains and subcellular location
protspace-local -i data.h5 -f pfam,cath,cc_subcellular_location

# Extract reviewed status, length, and taxonomy
protspace-query -q "..." -f reviewed,length_quantile,kingdom

πŸ”§ Advanced Usage

Command Options

protspace-local (Local data):

  • -i, --input: HDF5 embeddings or CSV similarity matrix (required)
  • -o, --output: Output file or directory (optional, default: derived from input filename)
  • -f, --features: Features to extract (comma-separated) or CSV metadata file path
  • -m, --methods: Reduction methods (e.g., pca2,umap3,tsne2)
  • --non-binary: Use legacy JSON format
  • --keep-tmp: Cache intermediate files for reuse
  • --bundled: Bundle output files (true/false, default: true)

protspace-query (UniProt search):

  • -q, --query: UniProt search query (required)
  • -o, --output: Output file or directory (optional, default: protspace.parquetbundle)
  • -f, --features: Features to extract (comma-separated)
  • -m, --methods: Reduction methods (e.g., pca2,umap3,tsne2)
  • --non-binary: Use legacy JSON format
  • --keep-tmp: Cache intermediate files for reuse
  • --bundled: Bundle output files (true/false, default: true)

Method Default Parameters

Followng the default parameters for each method. Override these to fine-tune dimensionality reduction:

  • UMAP: --n_neighbors 15 --min_dist 0.1
  • t-SNE: --perplexity 30 --learning_rate 200
  • PaCMAP: --mn_ratio 0.5 --fp_ratio 2.0
  • MDS: --n_init 4 --max_iter 300 --eps 1e-3

Custom Styling

protspace-feature-colors input.json output.json --feature_styles '{
  "feature_name": {
    "colors": {"value1": "#FF0000", "value2": "#00FF00"},
    "shapes": {"value1": "circle", "value2": "square"}
  }
}'

Available shapes: circle, circle-open, cross, diamond, diamond-open, square, square-open, x

πŸ“ File Formats

Input

  • UniProt queries: Text queries using UniProt syntax
  • Embeddings: HDF5 files (.h5, .hdf5)
  • Similarity matrices: CSV files with symmetric matrices
  • Metadata: CSV with 'identifier' column + feature columns
  • Structures: ZIP files containing PDB/CIF files

Output

  • Default: Parquet files (projections_data.parquet, projections_metadata.parquet, selected_features.parquet)
  • Legacy: JSON format with --non-binary flag
  • Temporary files: FASTA sequences, similarity matrices, all features (with --keep-tmp)

πŸ“ Citation

@article{SENONER2025168940,
title = {ProtSpace: A Tool for Visualizing Protein Space},
journal = {Journal of Molecular Biology},
pages = {168940},
year = {2025},
issn = {0022-2836},
doi = {https://doi.org/10.1016/j.jmb.2025.168940},
url = {https://www.sciencedirect.com/science/article/pii/S0022283625000063},
author = {Tobias Senoner and Tobias Olenyi and Michael Heinzinger and Anton Spannagl and George Bouras and Burkhard Rost and Ivan Koludarov}
}

About

ProtSpace is program that helps in visualizing and investigating our protein embedding space.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors 6

Languages