Fig. 1: Overview of the Pathfinder clustering pipeline, from PDB input to ranked representatives.
Pathfinder is a tool for clustering protein structure ensembles (e.g., from AlphaFold predictions) and selecting representative conformations. It represents each structure as a residue distance map or as a row of a TM-score distance matrix, optionally reduces the features with PCA, and clusters them with K-means, hierarchical clustering, DBSCAN, spectral clustering, or a Gaussian mixture model (GMM). When reference structures are supplied, representatives can also be ranked by their TM-align scores against those references.
The pipeline processes PDB files in ensembles, extracts features, clusters them, and outputs ranked representatives with metrics.
- Feature Extraction: Residue distance maps or TM-score distance matrices of given ensemble(s).
- Clustering: Multiple algorithms with/without PCA dimensionality reduction.
- Ranking: Confidence-weighted selection and ranking; optional TM-score comparison to references.
- Parallelism: Multi-process support for efficiency.
- Batch Processing: Handle a single protein, multiple proteins, or whole directories via a wrapper script.
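The feature-extraction and PCA steps listed above can be illustrated with a small NumPy sketch. This is not Pathfinder's own code: the function names `distance_map_features` and `pca_project` are hypothetical, the coordinates are random toy data rather than parsed PDB files, and PCA is done with a plain SVD for self-containment. It only shows how a residue distance map might be flattened into a feature vector and reduced before clustering.

```python
import numpy as np

def distance_map_features(coords):
    """Flatten the upper triangle of the pairwise residue distance map.

    coords: array of shape (n_residues, 3), e.g. C-alpha positions.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dmap = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep each residue pair once (i < j); the map is symmetric.
    iu = np.triu_indices(len(coords), k=1)
    return dmap[iu]

def pca_project(features, n_components):
    """Project the rows of `features` onto their top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal axes of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Toy ensemble: 6 hypothetical conformations of a 5-residue chain.
rng = np.random.default_rng(0)
ensemble = rng.normal(size=(6, 5, 3))
features = np.stack([distance_map_features(c) for c in ensemble])
reduced = pca_project(features, n_components=2)
print(features.shape, reduced.shape)  # (6, 10) (6, 2)
```

The reduced matrix has one row per conformation and could then be passed to any of the clustering algorithms named above, for example a scikit-learn estimator.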
- Python 3.8+ with NumPy, Pandas, scikit-learn, and SciPy.
- External tools (installed with conda):
- A conda environment (example provided in scripts).
- Clone the repository:
git clone https://github.com/yourusername/pathfinder.git
cd pathfinder
- Create and activate a mamba environment:
mamba create -n pathfinder -f environment.yml
mamba activate pathfinder
Activate your environment and ensure src/ and scripts are in your PATH or current directory.
Run the test script to see an example:
chmod +x run_test.sh
./run_test.sh
To run the pipeline on your own ensemble:
python src/main.py \
--ensemble_dir /path/to/ensemble/dir \
--output_dir /path/to/output/dir \
--cluster_method kmeans \
--n_clusters 10 \
--n_pca_components 10 \
--transformer tmscore \
--alpha 1.0 \
--n_processes 32 \
--ref_list_txt /path/to/refs.txt # Optional
To launch the interactive analysis utility:
cd dashapp
python app.py
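For several ensembles, the `src/main.py` invocation shown earlier can be scripted. The sketch below is not the wrapper script shipped with the repository; `build_command` and `run_batch` are hypothetical helpers, and the one-subdirectory-per-protein layout is an assumption. It reuses only the flags and values from the example command and defaults to a dry run that prints each command instead of executing it.

```python
import shlex
import subprocess
from pathlib import Path

def build_command(ensemble_dir, output_dir, n_clusters=10, ref_list_txt=None):
    """Assemble the src/main.py call for one ensemble directory."""
    cmd = [
        "python", "src/main.py",
        "--ensemble_dir", str(ensemble_dir),
        "--output_dir", str(output_dir),
        "--cluster_method", "kmeans",
        "--n_clusters", str(n_clusters),
        "--n_pca_components", "10",
        "--transformer", "tmscore",
        "--alpha", "1.0",
        "--n_processes", "32",
    ]
    # The reference list is optional, as in the example command.
    if ref_list_txt is not None:
        cmd += ["--ref_list_txt", str(ref_list_txt)]
    return cmd

def run_batch(ensemble_root, output_root, dry_run=True):
    """Build (and optionally run) one command per subdirectory of ensemble_root."""
    commands = []
    for sub in sorted(Path(ensemble_root).iterdir()):
        if not sub.is_dir():
            continue  # skip stray files; assumed layout is one directory per protein
        cmd = build_command(sub, Path(output_root) / sub.name)
        commands.append(cmd)
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_batch("/path/to/ensembles", "/path/to/output")` prints the commands for review; passing `dry_run=False` executes them one after another and stops at the first failure.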
Fig. 2: Overview of the interactive utility for ensemble analysis and state identification.
To be added