diff --git a/INSTALLATION.md b/INSTALLATION.md new file mode 100644 index 0000000..7325f67 --- /dev/null +++ b/INSTALLATION.md @@ -0,0 +1,31 @@ +## Installation + +### Prerequisites + +- Go 1.21+ +- Node.js 18+ +- fpocket (`sudo apt-get install fpocket` on Debian/Ubuntu, or build from [source](https://github.com/Discngine/fpocket)) +- Open Babel (`sudo apt-get install openbabel`) + +### Backend + +```bash +# Clone the repository +git clone https://github.com/ayush00git/ProtPocket.git +cd ProtPocket + +# Run the backend (port 8000) +go run main.go +``` + +### Frontend + +```bash +cd app +npm install +npm run dev +# Runs at localhost:5173 +# Proxies /api/* to localhost:8000 +``` + +--- diff --git a/README.md b/README.md index c461a0a..31054d8 100644 --- a/README.md +++ b/README.md @@ -1,125 +1,134 @@ -
- ProtPocket Logo -

ProtPocket : protein complex intelligence

-

End-to-End Drug Lead Generation via the Disorder Delta

-
+# ProtPocket ---- +**From protein name to ranked drug binding sites: automated, in seconds.** -## 🔬 Abstract -ProtPocket is an advanced computational biology pipeline explicitly engineered to discover, analyze, and target **undrugged protein complexes**. Operating at the intersection of structural informatics and rational drug design, ProtPocket leverages the AlphaFold Protein Structure Database to find cryptic, high-value binding sites that only emerge during protein-protein interactions (PPIs). +ProtPocket is an open-source computational drug discovery tool that takes a protein name, gene symbol, disease, or UniProt accession as input and returns a complete structural analysis: real-time complex data from AlphaFold, drug target prioritization via an original Gap Score algorithm, interactive 3D structure comparison, automated binding site detection using fpocket, and fragment molecule suggestions from ChEMBL, all in one browser-based workflow. -By automating geometric cavity detection (`fpocket`), virtual screening, and molecular docking (AutoDock Vina), ProtPocket accelerates the hit-to-lead workflow for some of the world's most neglected pathogens and hardest-to-treat human diseases. +It was built on top of the AlphaFold homodimer dataset released March 16, 2026 by EMBL-EBI, Google DeepMind, NVIDIA, and Seoul National University, the largest protein complex dataset ever assembled. ProtPocket is, to our knowledge, the first tool to make this dataset queryable by drug discovery priority through a live API pipeline. --- -## 🧬 Theoretical Architecture: The "Disorder Delta" +## Table of Contents + +1. [The Problem](#the-problem) +2. [How ProtPocket Works](#how-protpocket-works) +3. [Technical Discovery: The AlphaFold Complex API](#technical-discovery) +4. [The Gap Score](#the-gap-score) +5. [Binding Site Detection](#binding-site-detection) +6. [Data Sources](#data-sources) +7. [Architecture](#architecture) +8. [Installation](#installation) +9.
[API Reference](#api-reference) +10. [Roadmap](#roadmap) + +--- -The core innovation of ProtPocket lies in its structural gap scoring mechanism, which we term the **Disorder Delta**. +## The Problem -Many critical proteins, especially transcription factors and viral entry proteins, contain Intrinsically Disordered Regions (IDRs). As singular monomers, these regions are highly flexible and lack a defined 3D structure, rendering them effectively "undruggable" by traditional small molecules. +Protein structures have been the foundation of rational drug design for decades. When researchers know the three-dimensional shape of a protein involved in disease, they can in principle design a molecule that fits into a cavity on its surface and disrupts its function. The challenge has always been bridging the gap between having a structure and knowing where and how to target it. -However, upon binding to an interaction partner (forming a homodimer or heterodimer), these disordered regions often undergo a structural phase transition, folding into highly stable confirmations. +The traditional workflow is brutally fragmented. A researcher investigating a tuberculosis protein today must query AlphaFold manually for the structure, visit UniProt separately for disease context, run ChEMBL queries independently for drug coverage, download structure files locally, run pocket detection software from a command line, and then consult fragment databases with another tool entirely. Each step requires a different interface, produces output in a different format, and demands familiarity with a different tool. Most researchers do not have access to expensive commercial suites (Schrödinger, MOE, Discovery Studio) that partially unify these workflows. Even those who do still face the deeper problem that most of these tools operate on monomer structures. -ProtPocket exploits this behavior: -1.
**Monomer Confidence (pLDDT)**: We evaluate the AlphaFold pLDDT (Predicted Local Distance Difference Test) of the monomer. Low pLDDT (<50) indicates high intrinsic disorder. -2. **Complex Confidence (pLDDT)**: We evaluate the pLDDT of the same sequence within the predicted homodimeric complex. A dramatic rise in pLDDT (>80) highlights a region forced into stability by the interaction. -3. **The Delta (Δ)**: The difference between these two scores (`Δ = Complex pLDDT - Monomer pLDDT`) flags the theoretical interface. -4. **Targeting the Interface**: By directing geometric cavity detection algorithms precisely at regions with a high Disorder Delta, we identify cryptic pockets that exist *only* in the functional complex state. Inhibiting these pockets theoretically prevents the protein-protein interaction from occurring at all. +A monomer is a single protein chain in isolation. A homodimer is two identical chains bound together. Many proteins carry out their functional role only as dimers or larger complexes; for these, the monomer form exists as a folding intermediate or transport state, not the active species inside the cell. The interface between two chains when they come together creates surface cavities, or pockets, that do not exist in either chain alone. These interface pockets are among the most valuable drug targets in modern pharmacology, the basis of protein-protein interaction (PPI) inhibitor programs. Yet they are invisible to any tool that analyzes monomers only. -This targeted approach zeroes in on the most vulnerable, mechanistically critical regions of a pathogen's structural machinery. +The March 2026 AlphaFold homodimer release fundamentally changed the availability of structural data for protein complexes. But it provided no tooling to query the data by drug discovery priority, no way to run pocket analysis on the new structures programmatically, and no connection to fragment databases. The dataset existed but was not actionable.
--- -## ⚙️ The ProtPocket Pipeline +## How ProtPocket Works + +### Query Classification and Multi-Database Retrieval + +When a researcher submits a query (a gene name like `TP53`, a disease term like `tuberculosis`, a UniProt accession like `P04637`, or an AlphaFold ID like `AF-0000000066503175`), ProtPocket first classifies the query type. A UniProt accession goes directly to AlphaFold without a search step. A gene name hits UniProt with a gene-exact filter. A disease term queries UniProt's disease annotation index. An AlphaFold ID bypasses both and resolves immediately. + +For each matching protein, ProtPocket fires three concurrent requests: to AlphaFold for both monomer and homodimer predictions, to ChEMBL for approved drug coverage, and to UniProt for disease associations and organism context. These run in parallel via Go goroutines, and their results are merged before the response is returned. -The platform operates autonomously through a 5-step pipeline: +### Disorder Delta and Structural Comparison -### 1. Discover (Knowledge Graphing) -We ingest and sift through the 1.7 million structures in the AlphaFold Complex Database. We aggressively cross-reference this structural data against the **ChEMBL** pharmacological database and the **World Health Organization (WHO) Priority Pathogens list**. Any target with 0 known approved drugs that belongs to a high-threat pathogen (e.g., *M. tuberculosis*, *A. baumannii*) or aggressive human disease (e.g., *TP53*, *MYC*) is elevated to the dashboard. +For every protein, ProtPocket computes the disorder delta, the difference in average pLDDT confidence between the homodimer and monomer AlphaFold predictions (homodimer minus monomer). This single number captures the structural reveal: how much the protein gains in ordered, confident structure when it finds its binding partner.
A disorder delta of +36 means, for example, that the average pLDDT rose from 50 in isolation to 86 in complex form: the functional shape was poorly resolved in the monomer prediction and emerged only in the dimer. -### 2. Reveal (Disorder Delta Calculation) -The pipeline calculates the thermodynamic Disorder Delta for all filtered targets. Proteins showcasing a massive phase transition from chaotic monomer to stable dimer are ranked at the top of the **Undrugged Target Leaderboard**. +The detail page renders both structures in the Mol* 3D viewer, colored by per-residue pLDDT confidence. Blue regions are predicted with high confidence; red and orange regions are low-confidence and typically disordered. -### 3. Target (Geometric Cavity Detection) -Using [**fpocket**](https://github.com/Discngine/fpocket), a Voronoi tessellation-based cavity detection algorithm, we scan the surface of the newly stabilized complex. We filter the results to isolate alpha-spheres that lie directly on the complex interface, ignoring irrelevant surface clefts. +Q55DI5 -### 4. Dock (Validation) -Finally, we perform high-throughput molecular docking of candidate compounds directly inside the identified pocket using [**AutoDock Vina**](https://github.com/ccsb-scripps/AutoDock-Vina). Promising binding affinities (kcal/mol) signify a high-confidence starting point for medicinal chemists. +### Gap Score Ranking + +Every protein in the results is ranked by an original Gap Score that answers the question: how urgently does the world need a drug for this target? The score combines structural confidence, drug coverage from ChEMBL, WHO priority pathogen status, and the disorder delta bonus. Results are sorted in descending order, so the most urgently undrugged, high-confidence target appears first. The undrugged targets dashboard provides a pre-ranked leaderboard of the highest Gap Score complexes across the 20 most studied species.
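
As a rough illustration of the two quantities described above, the following Go sketch computes a disorder delta from per-residue pLDDT values and then applies the Gap Score formula given in the Gap Score section of this README (`pLDDT_norm × undrugged_factor × WHO_multiplier + disorder_bonus`). The type, field, and function names are hypothetical and are not taken from the ProtPocket codebase. Dividing pLDDT by 100 to normalize it, and using a neutral multiplier of 1.0 for non-priority organisms, are assumptions.

```go
package main

import "fmt"

// mean returns the arithmetic mean of xs, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// DisorderDelta is the average dimer pLDDT minus the average monomer pLDDT.
func DisorderDelta(monomerPLDDT, dimerPLDDT []float64) float64 {
	return mean(dimerPLDDT) - mean(monomerPLDDT)
}

// Target holds the Gap Score inputs. Field names are illustrative
// assumptions, not ProtPocket's actual types.
type Target struct {
	DimerPLDDT    float64 // average pLDDT of the homodimer prediction, 0-100
	DisorderDelta float64 // dimer average pLDDT minus monomer average pLDDT
	DrugCount     int     // approved drugs found for this target
	WHOPriority   bool    // organism is on the WHO priority pathogen list
}

// GapScore = pLDDT_norm * undrugged_factor * WHO_multiplier + disorder_bonus.
func GapScore(t Target, maxDrugCount int) float64 {
	plddtNorm := t.DimerPLDDT / 100.0 // assumption: normalize by dividing by 100

	undrugged := 1.0
	if maxDrugCount > 0 {
		undrugged = 1.0 - float64(t.DrugCount)/float64(maxDrugCount)
	}

	who := 1.0 // assumption: neutral multiplier for non-priority organisms
	if t.WHOPriority {
		who = 2.0
	}

	bonus := 0.0
	if t.DisorderDelta > 0 { // the bonus applies only to positive deltas
		bonus = t.DisorderDelta / 100.0
	}
	return plddtNorm*undrugged*who + bonus
}

func main() {
	delta := DisorderDelta([]float64{40, 60}, []float64{80, 92})
	t := Target{DimerPLDDT: 86, DisorderDelta: delta, DrugCount: 0, WHOPriority: true}
	fmt.Printf("delta=%.0f score=%.2f\n", delta, GapScore(t, 10))
}
```

Because the disorder bonus is added after the product, a target with no remaining drug gap can still receive a small non-zero score when its delta is positive; whether that is the intended behaviour is a design question for the scoring engine, not something this sketch settles.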
+Ranking + +### Binding Site Detection with fpocket + +When a researcher requests pocket analysis for a specific complex, ProtPocket runs fpocket on both the monomer and the homodimer structure files. fpocket identifies surface cavities using Voronoi tessellation and alpha sphere algorithms, returning each pocket with a druggability score, volume in cubic ångströms, and the residues lining it. +Comparison + +By comparing the pocket lists from the monomer and dimer runs, ProtPocket identifies interface pockets: cavities that appear in the dimer but have no corresponding cavity in the monomer. These pockets form specifically when the two chains come together. They are cross-validated against the per-residue disorder delta: pockets lined by residues that gained structural confidence in the dimer are flagged as high-confidence interface pockets, the primary targets for PPI inhibitor programs. +Pocket Analysis + +### Fragment Suggestion from ChEMBL + +For each identified pocket, ProtPocket queries ChEMBL for small molecule fragments whose known binding pockets share geometric properties with the identified cavity: similar volume, similar hydrophobicity profile, similar charge distribution. The returned fragments are molecules that have been shown experimentally to bind structurally similar pockets in other proteins, providing a starting point for medicinal chemistry rather than an empty search space. +Fragments --- -## 🛠️ Technical Stack & Implementation +## The Gap Score + +The Gap Score is ProtPocket's original drug target prioritization algorithm. It answers one question: given everything known about this protein complex, how urgently does research need a drug for it? -ProtPocket is built as a highly concurrent monorepo designed for speed and real-time visualization.
+``` +Gap Score = pLDDT_norm × undrugged_factor × WHO_multiplier + disorder_bonus +``` -### Backend (Go) -- Designed with Go's `goroutines` to allow heavily parallelized live querying of the AlphaFold REST API, UniProt API, and ChEMBL API. -- Implements an intelligent, thread-safe memory Cache (`sync.RWMutex`) to prevent rate-limiting while serving the high-traffic Undrugged Target Leaderboard. -- **Framework**: `gofr.dev/pkg/gofr` -- **Scoring Engine**: Custom Go algorithms implementing the Disorder Gap Delta mathematical models. +**`pLDDT_norm`** is the AlphaFold dimer confidence score normalized to 0–1. A structurally unreliable prediction should not drive expensive drug discovery programs, so this term keeps poorly predicted targets from ranking highly. -### Frontend (React / Vite / TailwindCSS) -- A dark-mode, futuristic UI inspired by modern biotech, utilizing Vanilla Tailwind CSS for lightning-fast performance without heavy component libraries. -- React Router DOM for instantaneous page transitions. +**`undrugged_factor`** is `1 - (drug_count / max_drug_count_in_dataset)`. When no approved drug targets the protein, this equals 1.0. As drug coverage increases the factor approaches 0, pushing well-covered targets to the bottom. This is the gap the algorithm is named for. -### Structural Visualization (Mol*) -- We embed [**Mol***](https://molstar.org/), an ultra-fast WebGL macromolecular viewer capable of streaming `.cif` trajectory data directly from AlphaFold servers. -- **Custom Implementations**: - - Live side-by-side comparative views of Monomeric vs. Complex topologies. - - On-the-fly pLDDT confidence coloring (`blue` = rigid interface, `red` = disordered chaos). - - Multi-pose trajectory viewing for AutoDock Vina binding conformation results. +**`WHO_multiplier`** applies a hard 2.0× boost to proteins from WHO priority pathogens, the organisms on the WHO Priority Pathogen List that the World Health Organization has designated as critical antimicrobial resistance threats.
This reflects real-world clinical urgency. + +**`disorder_bonus`** adds `disorder_delta / 100` when the delta is positive. Proteins that undergo dramatic structural transformation in complex form represent the most scientifically novel entries in the March 2026 dataset. The bonus rewards them proportionally. --- -## 🚀 Setup & Installation +## Binding Site Detection -### Prerequisites -- [Go (1.21+)](https://golang.org/dl/) -- [Node.js (18+)](https://nodejs.org/en/download/) +ProtPocket's pocket analysis pipeline operates on monomer and homodimer structure files and identifies druggable cavities through three stages. -### 1. Start the Go Backend -The backend engine serves the API endpoints (e.g., `/search`, `/complex`, `/undrugged`). +In the first stage, fpocket is invoked as a subprocess on both the monomer and dimer structure files. fpocket detects cavities with alpha spheres derived from a Voronoi tessellation of the protein atoms: an alpha sphere touches four atoms and contains none, and clusters of alpha spheres whose radii fall within a set range are reported as potential pockets. Each pocket is scored for druggability based on its volume, shape, and chemical environment. -```bash -cd ProtPocket -go mod download -go run main.go -# The server will start on http://localhost:8000 -``` +In the second stage, the monomer and dimer pocket lists are compared geometrically. A pocket in the dimer that has no corresponding cavity within a threshold distance in the monomer is identified as an interface pocket, one created by the structural change induced by dimerization. Interface pockets are the primary targets of PPI inhibitor programs because a molecule binding there disrupts the protein-protein interaction itself rather than blocking a conventional enzymatic active site. -### 2. Start the React Frontend -The Vite dev server provides hot-module reloading for the UI.
+In the third stage, each interface pocket is cross-referenced with per-residue pLDDT data from AlphaFold's confidence JSON files. Pockets whose lining residues gained the most structural confidence in the dimer (those with a per-residue delta above a threshold) are flagged as high-confidence interface pockets and sorted to the top of the ranked list. -```bash -cd ProtPocket/app -npm install -npm run dev -# The UI will load on http://localhost:5173 -``` +The Mol* viewer on the detail page highlights the identified pocket residues directly on the structure, allowing the researcher to visually inspect the cavity geometry and its relationship to the structural reveal. --- -## 👥 Creators & Contributors -ProtPocket is proudly open-source and built for the global structural biology community. +## Data Sources + +**AlphaFold Database** (EMBL-EBI and Google DeepMind) provides all protein structure predictions. ProtPocket queries the search endpoint live for every request, retrieving both monomer and homodimer predictions in a single call. + +**UniProt** provides protein identity: gene names, organism, taxonomy ID, disease associations, and reviewed annotation status. Every protein in ProtPocket has a UniProt accession as its canonical identifier, and all cross-database lookups originate from it. -- **Arshita Jaryal** - [GitHub](https://github.com/jaryalarshita) -- **Ayush Kumar** - [GitHub](https://github.com/ayush00git) -- **Divyansh Singh** - [GitHub](https://github.com/divyansh0x0) +**ChEMBL** (EMBL-EBI) provides drug-target association data. ProtPocket queries ChEMBL for approved drugs (clinical Phase 4) targeting each protein. The resulting drug count feeds directly into the undrugged factor of the Gap Score. ChEMBL is also queried for fragment molecule suggestions matched to identified pocket geometries. + +**WHO Priority Pathogen List** (2024 edition) is hardcoded as a lookup table keyed by NCBI taxonomy ID.
The list covers 24 bacterial and fungal pathogens designated as critical antimicrobial resistance threats and drives the 2× multiplier in the Gap Score. + +**fpocket** runs locally as a subprocess. No external API is involved: structure files are downloaded, converted, and analyzed, and the temporary files are then deleted. fpocket is MIT licensed and freely available. + +**Open Babel** handles all molecular format conversions between stages: CIF to PDB for fpocket input, and format interconversion for fragment structures. + +ProtPocket does not store or redistribute AlphaFold structure files. All structure data is linked directly to EMBL-EBI's servers. All primary data sources are freely available under open licenses compatible with academic and commercial use. --- -## 📚 Scientific References & Attributions -This platform relies on the shoulders of giants. We heavily utilize data and tools from the following projects: +## Citation + +If you use ProtPocket in research, please cite the AlphaFold Database and the March 2026 complex release: + +> Fleming J. et al. AlphaFold Protein Structure Database and 3D-Beacons: New Data and Capabilities. *Journal of Molecular Biology* (2025). + +> EMBL-EBI, Google DeepMind, NVIDIA, Seoul National University. Millions of protein complexes added to AlphaFold Database. March 16, 2026. https://www.embl.org/news/science-technology/first-complexes-alphafold-database/ -1. **AlphaFold Protein Structure Database**: DeepMind, EMBL-EBI. [Website](https://alphafold.ebi.ac.uk/) -2. **UniProt**: The Universal Protein Resource. [Website](https://www.uniprot.org/) -3. **ChEMBL**: EMBL-EBI database of bioactive molecules with drug-like properties. [Website](https://www.ebi.ac.uk/chembl/) -4. **fpocket**: Open source protein cavity detection. [GitHub](https://github.com/Discngine/fpocket) -5. **AutoDock Vina**: Fast, accurate open-source molecular docking. [GitHub](https://github.com/ccsb-scripps/AutoDock-Vina) -6.
**Mol***: A comprehensive web-based macromolecular visualization toolkit. [Website](https://molstar.org/) +The technical discovery of the AlphaFold complex API pipeline is documented in [COMPLEX.md](./COMPLEX.md) and may be cited independently. --- -
-

"The shapes of proteins are the locks of biology; we are searching for the keys."

-
diff --git a/app/src/components/complex/MoleculePicker.jsx b/app/src/components/complex/MoleculePicker.jsx index 9fc1198..b06af9d 100644 --- a/app/src/components/complex/MoleculePicker.jsx +++ b/app/src/components/complex/MoleculePicker.jsx @@ -33,7 +33,7 @@ export function MoleculePicker({ )} -
+
{isLoading && (
diff --git a/app/src/index.css b/app/src/index.css index beb5570..41fc8e7 100644 --- a/app/src/index.css +++ b/app/src/index.css @@ -116,3 +116,12 @@ color: var(--text-muted) !important; } +/* ═══ Hide native scrollbar ═══ */ +.scrollbar-hide { + -ms-overflow-style: none; /* IE/Edge */ + scrollbar-width: none; /* Firefox */ +} +.scrollbar-hide::-webkit-scrollbar { + display: none; /* Chrome/Safari/Opera */ +} + diff --git a/public/img/Q55DI5.png b/public/img/Q55DI5.png new file mode 100644 index 0000000..b859f83 Binary files /dev/null and b/public/img/Q55DI5.png differ diff --git a/public/img/comparison.png b/public/img/comparison.png new file mode 100644 index 0000000..25f5db4 Binary files /dev/null and b/public/img/comparison.png differ diff --git a/public/img/fragments.png b/public/img/fragments.png new file mode 100644 index 0000000..004632a Binary files /dev/null and b/public/img/fragments.png differ diff --git a/public/img/pocket-analysis.png b/public/img/pocket-analysis.png new file mode 100644 index 0000000..98983d6 Binary files /dev/null and b/public/img/pocket-analysis.png differ diff --git a/public/img/ranking.png b/public/img/ranking.png new file mode 100644 index 0000000..1b84bdc Binary files /dev/null and b/public/img/ranking.png differ