diff --git a/INSTALLATION.md b/INSTALLATION.md
new file mode 100644
index 0000000..7325f67
--- /dev/null
+++ b/INSTALLATION.md
@@ -0,0 +1,31 @@
+## Installation
+
+### Prerequisites
+
+- Go 1.21+
+- Node.js 18+
+- fpocket (`sudo apt-get install fpocket` on Debian/Ubuntu, or build from [source](https://github.com/Discngine/fpocket))
+- Open Babel (`sudo apt-get install openbabel`)
+
+### Backend
+
+```bash
+# Clone the repository
+git clone https://github.com/ayush00git/ProtPocket.git
+cd ProtPocket
+
+# Run the backend (port 8000)
+go run main.go
+```
+
+### Frontend
+
+```bash
+cd app
+npm install
+npm run dev
+# Runs at localhost:5173
+# Proxies /api/* to localhost:8000
+```
+
+---
diff --git a/README.md b/README.md
index c461a0a..31054d8 100644
--- a/README.md
+++ b/README.md
@@ -1,125 +1,134 @@
-
- End-to-End Drug Lead Generation via the Disorder Delta
-
-### 4. Dock (Validation)
-Finally, we perform high-throughput molecular docking of candidate compounds directly inside the identified pocket using [**AutoDock Vina**](https://github.com/ccsb-scripps/AutoDock-Vina). Promising binding affinities (kcal/mol) signify a high-confidence starting point for medicinal chemists.
+### Gap Score Ranking
+
+Every protein in the results is ranked by an original Gap Score that answers the question: how urgently does the world need a drug for this target? The score combines structural confidence, drug coverage from ChEMBL, WHO priority pathogen status, and the disorder delta bonus. Results are sorted in descending order, so the most urgently undrugged, high-confidence target appears first. The undrugged targets dashboard provides a pre-ranked leaderboard of the highest Gap Score complexes across the 20 most studied species.
+
+
+### Binding Site Detection with fpocket
+
+When a researcher requests pocket analysis for a specific complex, ProtPocket runs fpocket on both the monomer and the homodimer structure files. fpocket identifies surface cavities using Voronoi tessellation and alpha sphere algorithms, returning each pocket with a druggability score, volume in cubic ångströms, and the residues lining it.
+
+
+By comparing the pocket lists from the monomer and dimer runs, ProtPocket identifies interface pockets: cavities that appear in the dimer but have no corresponding cavity in the monomer. These pockets are formed specifically by two chains coming together. They are cross-validated against the per-residue disorder delta: pockets lined by residues that gained structural confidence in the dimer are flagged as high-confidence interface pockets, the primary targets for PPI inhibitor programs.
+
+
+### Fragment Suggestion from ChEMBL
+
+For each identified pocket, ProtPocket queries ChEMBL for small molecule fragments whose known binding pockets share geometric properties with the identified cavity: similar volume, similar hydrophobicity profile, and similar charge distribution. The returned fragments are molecules that have been shown experimentally to bind structurally similar pockets in other proteins, providing a starting point for medicinal chemistry rather than an empty search space.
+
---
-## Technical Stack & Implementation
+## The Gap Score
+
+The Gap Score is ProtPocket's original drug target prioritization algorithm. It answers one question: given everything known about this protein complex, how urgently does research need a drug for it?
-ProtPocket is built as a highly concurrent monorepo designed for speed and real-time visualization.
+```
+Gap Score = pLDDT_norm × undrugged_factor × WHO_multiplier + disorder_bonus
+```
-### Backend (Go)
-- Designed with Go's `goroutines` to allow heavily parallelized live querying of the AlphaFold REST API, UniProt API, and ChEMBL API.
-- Implements an intelligent, thread-safe memory Cache (`sync.RWMutex`) to prevent rate-limiting while serving the high-traffic Undrugged Target Leaderboard.
-- **Framework**: `gofr.dev/pkg/gofr`
-- **Scoring Engine**: Custom Go algorithms implementing the Disorder Gap Delta mathematical models.
+**`pLDDT_norm`** is the AlphaFold dimer confidence score normalized to the range 0 to 1. A structurally unreliable prediction should not drive expensive drug discovery programs; this term ensures only well-predicted targets rank highly.
-### Frontend (React / Vite / TailwindCSS)
-- A dark-mode, futuristic UI inspired by modern biotech, utilizing Vanilla Tailwind CSS for lightning-fast performance without heavy component libraries.
-- React Router DOM for instantaneous page transitions.
+**`undrugged_factor`** is `1 - (drug_count / max_drug_count_in_dataset)`. When no approved drug targets the protein, this equals 1.0. As drug coverage increases the factor approaches 0, pushing well-covered targets to the bottom. This is the gap the algorithm is named for.
-### Structural Visualization (Mol*)
-- We embed [**Mol***](https://molstar.org/), an ultra-fast WebGL macromolecular viewer capable of streaming `.cif` trajectory data directly from AlphaFold servers.
-- **Custom Implementations**:
- - Live side-by-side comparative views of Monomeric vs. Complex topologies.
- - On-the-fly pLDDT confidence coloring (`blue` = rigid interface, `red` = disordered chaos).
- - Multi-pose trajectory viewing for AutoDock Vina binding conformation results.
+**`WHO_multiplier`** applies a hard 2.0× boost to proteins from organisms on the WHO Priority Pathogen List, the pathogens the World Health Organization has designated as critical antimicrobial resistance threats. This reflects real-world clinical urgency.
+
+**`disorder_bonus`** adds `disorder_delta / 100` when the delta is positive. Proteins that undergo dramatic structural transformation in complex form represent the most scientifically novel entries in the March 2026 dataset. The bonus rewards them proportionally.
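The four terms above can be combined in a few lines of Go. This is a minimal sketch, not the project's scoring engine: the `Target` struct and its field names are hypothetical, `pLDDT_norm` is assumed to be the 0 to 100 pLDDT divided by 100, and the multiplier for non-WHO organisms is assumed to be 1.0.

```go
package main

import (
	"fmt"
	"sort"
)

// Target holds the inputs to the Gap Score. Field names are illustrative.
type Target struct {
	Name          string
	DimerPLDDT    float64 // AlphaFold dimer confidence on the 0-100 scale
	DrugCount     int     // approved drugs found for this target
	WHOPriority   bool    // organism is on the WHO priority pathogen list
	DisorderDelta float64 // confidence gained in the dimer vs. the monomer
}

// GapScore computes pLDDT_norm * undrugged_factor * WHO_multiplier + disorder_bonus.
func GapScore(t Target, maxDrugCount int) float64 {
	plddtNorm := t.DimerPLDDT / 100.0 // assumed normalization

	undrugged := 1.0
	if maxDrugCount > 0 {
		undrugged = 1.0 - float64(t.DrugCount)/float64(maxDrugCount)
	}

	who := 1.0 // assumed neutral multiplier for non-priority organisms
	if t.WHOPriority {
		who = 2.0
	}

	bonus := 0.0
	if t.DisorderDelta > 0 { // the bonus applies only to positive deltas
		bonus = t.DisorderDelta / 100.0
	}
	return plddtNorm*undrugged*who + bonus
}

func main() {
	targets := []Target{
		{Name: "well-drugged", DimerPLDDT: 90, DrugCount: 50, DisorderDelta: -5},
		{Name: "undrugged WHO target", DimerPLDDT: 80, WHOPriority: true, DisorderDelta: 15},
	}
	maxDrugs := 50

	// Sort descending so the most urgent target comes first.
	sort.Slice(targets, func(i, j int) bool {
		return GapScore(targets[i], maxDrugs) > GapScore(targets[j], maxDrugs)
	})
	for _, t := range targets {
		fmt.Printf("%-22s %.2f\n", t.Name, GapScore(t, maxDrugs))
	}
}
```

In this example the undrugged WHO target scores 0.8 × 1.0 × 2.0 + 0.15 = 1.75, while the fully covered target scores 0 because its undrugged factor is 0 and its negative delta earns no bonus.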
---
-## Setup & Installation
+## Binding Site Detection
-### Prerequisites
-- [Go (1.21+)](https://golang.org/dl/)
-- [Node.js (18+)](https://nodejs.org/en/download/)
+ProtPocket's pocket analysis pipeline operates on monomer and homodimer structure files and identifies druggable cavities through three stages.
-### 1. Start the Go Backend
-The backend engine serves the API endpoints (e.g., `/search`, `/complex`, `/undrugged`).
+In the first stage, fpocket is invoked as a subprocess on both the monomer and dimer CIF files. fpocket detects cavities using Voronoi tessellation and alpha spheres: an alpha sphere touches four atoms on its boundary and contains no atom inside, and clusters of alpha spheres with suitable radii mark potential pockets. Each pocket is scored for druggability based on its volume, shape, and chemical environment.
-```bash
-cd ProtPocket
-go mod download
-go run main.go
-# The server will start on http://localhost:8000
-```
+In the second stage, the monomer and dimer pocket lists are compared geometrically. A pocket in the dimer that has no corresponding cavity within threshold distance in the monomer is identified as an interface pocket, meaning it was created by the structural change induced by dimerization. Interface pockets are the primary targets of PPI inhibitor programs because a molecule binding there disrupts the protein-protein interaction itself rather than blocking a conventional enzymatic active site.
-### 2. Start the React Frontend
-The Vite dev server provides hot-module reloading for the UI.
+In the third stage, each interface pocket is cross-referenced with per-residue pLDDT data from AlphaFold's confidence JSON files. Pockets whose lining residues gained the most structural confidence in the dimer (those with per-residue delta above threshold) are flagged as high-confidence interface pockets and sorted to the top of the ranked list.
-```bash
-cd ProtPocket/app
-npm install
-npm run dev
-# The UI will load on http://localhost:5173
-```
+The Mol* viewer on the detail page highlights the identified pocket residues directly on the structure, allowing the researcher to visually inspect the cavity geometry and its relationship to the structural reveal.
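The second and third stages described above can be sketched as follows. This is an illustration under stated assumptions rather than ProtPocket's actual code: the `Pocket` type is hypothetical, pockets are matched by centroid distance in a shared coordinate frame, the per-pocket delta is taken as the mean over lining residues, and the 8 Å and 10-point thresholds in `main` are placeholder values.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Pocket is a hypothetical, simplified view of one fpocket result.
type Pocket struct {
	ID       int
	Center   [3]float64 // pocket centroid in ångströms (assumed superposed frame)
	Residues []int      // residue numbers lining the pocket
}

// distance returns the Euclidean distance between two points.
func distance(a, b [3]float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// InterfacePockets returns the dimer pockets that have no monomer pocket
// centroid within maxDist ångströms.
func InterfacePockets(monomer, dimer []Pocket, maxDist float64) []Pocket {
	var out []Pocket
	for _, dp := range dimer {
		matched := false
		for _, mp := range monomer {
			if distance(dp.Center, mp.Center) <= maxDist {
				matched = true
				break
			}
		}
		if !matched {
			out = append(out, dp)
		}
	}
	return out
}

// MeanDelta averages the per-residue pLDDT gain (dimer minus monomer) over
// the residues lining a pocket. Residues missing from either map are skipped.
func MeanDelta(p Pocket, mono, dimer map[int]float64) float64 {
	var sum float64
	var n int
	for _, r := range p.Residues {
		m, okM := mono[r]
		d, okD := dimer[r]
		if okM && okD {
			sum += d - m
			n++
		}
	}
	if n == 0 {
		return 0
	}
	return sum / float64(n)
}

// HighConfidence keeps pockets whose mean delta exceeds minDelta and sorts
// them by mean delta, largest first.
func HighConfidence(pockets []Pocket, mono, dimer map[int]float64, minDelta float64) []Pocket {
	var kept []Pocket
	for _, p := range pockets {
		if MeanDelta(p, mono, dimer) > minDelta {
			kept = append(kept, p)
		}
	}
	sort.SliceStable(kept, func(i, j int) bool {
		return MeanDelta(kept[i], mono, dimer) > MeanDelta(kept[j], mono, dimer)
	})
	return kept
}

func main() {
	monomer := []Pocket{{ID: 1, Center: [3]float64{0, 0, 0}, Residues: []int{10, 11}}}
	dimer := []Pocket{
		{ID: 1, Center: [3]float64{1, 0, 0}, Residues: []int{10, 11}},
		{ID: 2, Center: [3]float64{20, 0, 0}, Residues: []int{40, 41}},
	}
	monoPLDDT := map[int]float64{40: 50, 41: 60}
	dimerPLDDT := map[int]float64{40: 80, 41: 90}

	iface := InterfacePockets(monomer, dimer, 8.0)
	for _, p := range HighConfidence(iface, monoPLDDT, dimerPLDDT, 10.0) {
		fmt.Printf("pocket %d: mean pLDDT gain %.1f\n", p.ID, MeanDelta(p, monoPLDDT, dimerPLDDT))
	}
}
```

Centroid matching is the simplest possible criterion. Comparing the sets of lining residues would be a reasonable alternative when the monomer and dimer models are not superposed.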
---
-## Creators & Contributors
-ProtPocket is proudly open-source and built for the global structural biology community.
+## Data Sources
+
+**AlphaFold Database** (EMBL-EBI and Google DeepMind) provides all protein structure predictions. ProtPocket queries the search endpoint live for every request, recovering both monomer and homodimer predictions in a single call.
+
+**UniProt** provides protein identity: gene names, organism, taxonomy ID, disease associations, and reviewed annotation status. Every protein in ProtPocket has a UniProt accession as its canonical identifier, and all cross-database lookups originate from it.
-- **Arshita Jaryal** - [GitHub](https://github.com/jaryalarshita)
-- **Ayush Kumar** - [GitHub](https://github.com/ayush00git)
-- **Divyansh Singh** - [GitHub](https://github.com/divyansh0x0)
+**ChEMBL** (EMBL-EBI) provides drug-target association data. ProtPocket queries ChEMBL for approved drugs (clinical Phase 4, the highest phase ChEMBL records) targeting each protein. The resulting drug count feeds directly into the undrugged factor of the Gap Score. ChEMBL is also queried for fragment molecule suggestions matched to identified pocket geometries.
+
+**WHO Priority Pathogen List** (2024 edition) is hardcoded as a lookup table keyed by NCBI taxonomy ID. The list covers 24 bacterial and fungal pathogens designated as critical antimicrobial resistance threats and drives the 2× multiplier in the Gap Score.
+
+**fpocket** runs locally as a subprocess. No external API is involved: structure files are downloaded, converted, and analyzed, and the temporary files are then deleted. fpocket is MIT licensed and freely available.
+
+**Open Babel** handles all molecular format conversions between stages: CIF to PDB for fpocket input, and format interconversion for fragment structures.
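The local conversion and analysis steps described for Open Babel and fpocket could be wired together roughly as below. This is a sketch only: the helper names are hypothetical, the source does not show how ProtPocket actually invokes either tool, and passing `-immcif` to select Open Babel's macromolecular CIF reader is an assumption about what AlphaFold's `.cif` files need. The functions only build the commands; calling `Run()` on them requires both tools on `PATH`.

```go
package main

import (
	"fmt"
	"os/exec"
	"path/filepath"
	"strings"
)

// ConvertCmd builds an Open Babel command that converts an mmCIF file to PDB.
// Assumption: -immcif forces the macromolecular CIF reader; -O names the output.
func ConvertCmd(cifPath, pdbPath string) *exec.Cmd {
	return exec.Command("obabel", "-immcif", cifPath, "-O", pdbPath)
}

// FpocketCmd builds the fpocket command for a PDB file (-f selects the input).
func FpocketCmd(pdbPath string) *exec.Cmd {
	return exec.Command("fpocket", "-f", pdbPath)
}

// FpocketOutDir returns the directory fpocket writes its results to:
// "<input name without extension>_out", next to the input file.
func FpocketOutDir(pdbPath string) string {
	return strings.TrimSuffix(pdbPath, filepath.Ext(pdbPath)) + "_out"
}

func main() {
	cif := "/tmp/protpocket/dimer.cif" // hypothetical temporary path
	pdb := "/tmp/protpocket/dimer.pdb"

	for _, cmd := range []*exec.Cmd{ConvertCmd(cif, pdb), FpocketCmd(pdb)} {
		// Print instead of running, so the sketch works without the tools installed.
		fmt.Println(strings.Join(cmd.Args, " "))
	}
	fmt.Println("results in:", FpocketOutDir(pdb))
}
```

After a real run, removing the temporary directory (for example with `os.RemoveAll`) would match the cleanup step described above.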
+
+ProtPocket does not store or redistribute AlphaFold structure files. All structure data is linked directly to EMBL-EBI's servers. All primary data sources are freely available under open licenses compatible with academic and commercial use.
---
-## Scientific References & Attributions
-This platform relies on the shoulders of giants. We heavily utilize data and tools from the following projects:
+## Citation
+
+If you use ProtPocket in research, please cite the AlphaFold Database and the March 2026 complex release:
+
+> Fleming J. et al. AlphaFold Protein Structure Database and 3D-Beacons: New Data and Capabilities. *Journal of Molecular Biology* (2025).
+
+> EMBL-EBI, Google DeepMind, NVIDIA, Seoul National University. Millions of protein complexes added to AlphaFold Database. March 16, 2026. https://www.embl.org/news/science-technology/first-complexes-alphafold-database/
-1. **AlphaFold Protein Structure Database**: DeepMind, EMBL-EBI. [Website](https://alphafold.ebi.ac.uk/)
-2. **UniProt**: The Universal Protein Resource. [Website](https://www.uniprot.org/)
-3. **ChEMBL**: EMBL-EBI database of bioactive molecules with drug-like properties. [Website](https://www.ebi.ac.uk/chembl/)
-4. **fpocket**: Open source protein cavity detection. [GitHub](https://github.com/Discngine/fpocket)
-5. **AutoDock Vina**: Fast, accurate open-source molecular docking. [GitHub](https://github.com/ccsb-scripps/AutoDock-Vina)
-6. **Mol***: A comprehensive web-based macromolecular visualization toolkit. [Website](https://molstar.org/)
+The technical discovery of the AlphaFold complex API pipeline is documented in [COMPLEX.md](./COMPLEX.md) and may be cited independently.
---
-"The shapes of proteins are the locks of biology; we are searching for the keys."
-