wallnerlab/pathfinder

Pathfinder: Protein Structure Ensemble Clustering and Representative Selection

License: MIT Python 3.8+

Protocol Fig. 1: Overview of the Pathfinder clustering pipeline, from PDB input to ranked representatives.

Pathfinder is a tool for clustering protein structure ensembles (e.g., from AlphaFold predictions) and selecting representative conformations. Each structure is described by residue distance maps or by a TM-score distance matrix, optionally reduced with PCA, and then clustered with K-means, hierarchical clustering, DBSCAN, spectral clustering, or Gaussian mixture models (GMM). If reference structures are supplied, representatives can also be ranked by their TM-align scores against those references.

The pipeline processes PDB files in ensembles, extracts features, clusters them, and outputs ranked representatives with metrics.
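The extract, cluster, and select flow described above can be illustrated with a minimal scikit-learn sketch. This is not Pathfinder's implementation: the function name `cluster_and_pick`, the synthetic feature matrix, and the "closest to centroid" selection rule are all assumptions made for illustration, and the confidence weighting mentioned under Features is not modelled here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def cluster_and_pick(features, n_clusters=3, n_pca_components=2, seed=0):
    """Reduce features with PCA, cluster with K-means, and return
    (labels, representatives), where representatives[k] is the index of
    the sample closest to the centroid of cluster k (an assumed rule)."""
    reduced = PCA(n_components=n_pca_components, random_state=seed).fit_transform(features)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(reduced)
    representatives = []
    for k in range(n_clusters):
        members = np.flatnonzero(km.labels_ == k)
        dists = np.linalg.norm(reduced[members] - km.cluster_centers_[k], axis=1)
        representatives.append(int(members[np.argmin(dists)]))
    return km.labels_, representatives


# Synthetic stand-in for per-structure feature vectors: three well-separated groups.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 5)) for c in (0.0, 5.0, 10.0)])
labels, reps = cluster_and_pick(features)
print(np.bincount(labels), reps)
```

In a real run the rows of `features` would come from the structures in the ensemble rather than from a random generator.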

Features

  • Feature Extraction: Residue distance maps or TM-score distance matrices of the given ensemble(s).
  • Clustering: Multiple algorithms, with or without PCA dimensionality reduction.
  • Ranking: Confidence-weighted selection and ranking; optional TM-score comparison to references.
  • Parallelism: Multi-process support for efficiency.
  • Batch Processing: Handle a single protein, multiple proteins, or whole directories via a wrapper script.
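As an illustration of the residue distance map feature listed above, the sketch below turns one set of coordinates into a flat feature vector. It assumes one 3D point per residue (for example the C-alpha atom) and keeps only the upper triangle of the symmetric distance matrix; the function name `distance_map_features` and the toy coordinates are hypothetical, and Pathfinder's own feature extraction may differ.

```python
import numpy as np


def distance_map_features(coords):
    """Flatten the upper triangle of the pairwise distance matrix.

    coords: array-like of shape (n_residues, 3).
    Returns a vector of n_residues * (n_residues - 1) / 2 distances.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]   # all pairwise difference vectors
    dmap = np.sqrt((diff ** 2).sum(axis=-1))          # full symmetric distance map
    iu = np.triu_indices(len(coords), k=1)            # index pairs i < j, row-major order
    return dmap[iu]


# Toy "structure": three points forming a 3-4-5 right triangle.
toy = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [3.0, 4.0, 0.0]]
print(distance_map_features(toy))  # distances for pairs (0,1), (0,2), (1,2)
```

Because the features are internal distances, they do not change when a structure is rotated or translated, so no superposition is needed before clustering.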

Prerequisites

  • Python 3.8+ with NumPy, Pandas, scikit-learn, and SciPy.
  • External tools (installed with conda):
  • A conda environment (example provided in scripts).

Installation

  1. Clone the repository:
    git clone https://github.com/wallnerlab/pathfinder.git
    cd pathfinder
  2. Create and activate a mamba environment:
    mamba create -n pathfinder -f environment.yml 
    mamba activate pathfinder
    

Quick Start

Activate your environment and ensure src/ and scripts are in your PATH or current directory.

Run the test script to see an example:

chmod +x run_test.sh
./run_test.sh

Or call the pipeline directly, adjusting the paths and parameters to your data:

python src/main.py \
    --ensemble_dir /path/to/ensemble/dir \
    --output_dir /path/to/output/dir \
    --cluster_method kmeans \
    --n_clusters 10 \
    --n_pca_components 10 \
    --transformer tmscore \
    --alpha 1.0 \
    --n_processes 32 \
    --ref_list_txt /path/to/refs.txt  # Optional
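To run the command above over several ensembles, the argument list can be assembled in Python. The sketch below is not the wrapper script mentioned under Features; the helper `build_command` and the `ensembles/` and `results/` directory layout are hypothetical. It only reuses the flags shown in the command above and prints the resulting commands instead of executing them.

```python
import shlex
from pathlib import Path


def build_command(ensemble_dir, output_dir, n_clusters=10, ref_list_txt=None):
    """Assemble the src/main.py argument list for one ensemble directory."""
    cmd = [
        "python", "src/main.py",
        "--ensemble_dir", str(ensemble_dir),
        "--output_dir", str(output_dir),
        "--cluster_method", "kmeans",
        "--n_clusters", str(n_clusters),
        "--n_pca_components", "10",
        "--transformer", "tmscore",
        "--alpha", "1.0",
        "--n_processes", "32",
    ]
    if ref_list_txt is not None:  # the reference list is optional
        cmd += ["--ref_list_txt", str(ref_list_txt)]
    return cmd


# Hypothetical layout: one sub-directory per protein under ensembles/.
for name in ["proteinA", "proteinB"]:
    cmd = build_command(Path("ensembles") / name, Path("results") / name)
    print(shlex.join(cmd))  # to execute, pass cmd to subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces.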

Interactive ensemble analysis and state identification

cd dashapp
python app.py

Protocol Fig. 2: Overview of the interactive utility for ensemble analysis and state identification.

Cite

To be added

About

General purpose resource for analysing protein ensembles
