This project presents the development of a profile Hidden Markov Model (HMM) for detecting the Kunitz-type serine protease inhibitor domain. Unlike traditional sequence-based models, this classifier is trained on a manually curated structure-based alignment to better capture conserved 3D features typical of the Kunitz fold. The model’s robustness was assessed via a two-fold cross-validation strategy and benchmarked using standard classification metrics, including the Matthews Correlation Coefficient (MCC).
The Kunitz domain (Pfam: PF00014) is a compact, cysteine-rich motif involved in the inhibition of serine proteases. Its characteristic fold, stabilized by three disulfide bridges, plays essential roles in blood coagulation, inflammation regulation, neuroprotection, and toxin function in various organisms. Due to high sequence variability, structural modeling provides an alternative route for more accurate domain detection.
This pipeline integrates a variety of command-line tools, Python packages, and web resources to build and evaluate a structural HMM for the Kunitz domain. Below is a complete list of software components and their respective purposes.
- HMMER 3.3.2: Suite for building and querying profile Hidden Markov Models (HMMs).
  - `hmmbuild`: Trains the HMM from a multiple sequence alignment.
  - `hmmsearch`: Queries protein sequences against the HMM.
  - Options `--max` and `-Z 1000`: `--max` turns off the heuristic filters to improve sensitivity; `-Z 1000` fixes the database size used for E-value calculation, so E-values are comparable across searches.
- PDBeFold: Used for structural alignment of PDB entries containing the Kunitz domain. The resulting `.ali` alignment file served as the input for HMM construction.
- CD-HIT: Reduces sequence redundancy by clustering protein sequences with ≥90% identity.
- BLAST+: Filters out sequences highly similar to those used in training, using `blastp` and `makeblastdb`.
- ChimeraX: Visualizes and structurally superimposes predicted false positives and false negatives against a reference structure (e.g., BPTI, PDB ID: 3TGI).
- WebLogo / Skylign: Generate sequence logos from the MSA used for HMM training.
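The two HMMER steps listed above can be sketched as command lines assembled in Python. This is a minimal illustration, not the exact invocation used in the project: the helper names `hmmbuild_cmd` and `hmmsearch_cmd` and the file names `sequences.fasta` and `hits.tbl` are hypothetical, while `--max`, `-Z`, and `--tblout` are standard `hmmsearch` options.

```python
import shlex


def hmmbuild_cmd(hmm_out: str, msa_in: str) -> list[str]:
    """Build the argument list for: hmmbuild <hmmfile_out> <msafile>."""
    return ["hmmbuild", hmm_out, msa_in]


def hmmsearch_cmd(hmm: str, fasta: str, tblout: str, z: int = 1000) -> list[str]:
    """Build the argument list for an hmmsearch run with --max and -Z.

    --tblout writes a per-sequence table that is easy to parse afterwards.
    """
    return ["hmmsearch", "--max", "-Z", str(z), "--tblout", tblout, hmm, fasta]


if __name__ == "__main__":
    # File names below are placeholders; adapt them to the local layout.
    commands = [
        hmmbuild_cmd("structural_model.hmm", "pdb_kunitz_rp_formatted.ali"),
        hmmsearch_cmd("structural_model.hmm", "sequences.fasta", "hits.tbl"),
    ]
    for cmd in commands:
        # Print the shell-quoted command; pass `cmd` to subprocess.run to execute it.
        print(shlex.join(cmd))
```

Returning argument lists rather than strings keeps the commands safe to hand to `subprocess.run(cmd, check=True)` without shell quoting issues.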
Used for data preprocessing, evaluation metrics, and visualizations:
| Library | Purpose |
|---|---|
| `pandas` | Manage tabular data and read/write CSVs. |
| `numpy` | Perform numerical computations and array operations. |
| `matplotlib` | Generate and save plots (e.g., ROC curves, score distributions). |
| `seaborn` | Visualize confusion matrices and classification metrics. |
| `scikit-learn` | Compute performance metrics like precision, recall, AUC, and MCC. |
| `biopython` | Parse FASTA/FASTQ files and manipulate sequences (via `Bio.SeqIO`). |
To install all Python requirements:

    pip install pandas numpy matplotlib seaborn scikit-learn biopython

- Pfam – PF00014: Reference for the Kunitz domain profile and seed alignments.
- InterPro: Used to confirm domain annotations and explore domain relationships.
- UniProt Downloads: Source of positive and negative sequences from Swiss-Prot in FASTA format.
- PDBeFold: Used again here for alignment-based domain extraction and comparison.
- Skylign / WebLogo: Visualization of conserved positions in the HMM alignment.
You can replicate the environment by running the following:

    # Create and activate a dedicated conda environment
    conda create -n hmm_kunitz python=3.10
    conda activate hmm_kunitz
    # Install core bioinformatics tools
    conda install -c bioconda cd-hit
    conda install -c bioconda hmmer
    # BLAST+ (provides blastp and makeblastdb)
    conda install -c bioconda blast
    # Optional but recommended: Biopython
    pip install biopython

For a full walkthrough of the methodology with code, see the interactive notebook: `pipeline.ipynb`
It contains data processing, model building, domain detection, and visualizations across validation folds.
Contains raw and processed input datasets, including positive and negative sequence sets.
- clusters/: clustering results and source PDB metadata
- filtered/: filtered PDB FASTA and metadata files
- ids/:
  - `cv_splits/`: IDs for cross-validation folds (e.g., `pos_1.ids`, `neg_2.ids`)
  - `filtering/`: IDs to retain or exclude (`to_keep.ids`, `to_remove.ids`)
  - `random_sets/`: random positive/negative subsets
  - `references/`: reference IDs (`all_kunitz.id`, `sp.id`)
  - `structural/`: IDs of PDB sequences used in model training
- sequences/:
  - `cross_validation/`: FASTA files for cross-validation
  - `negatives/`: negative sequence FASTA (`sp_negs.fasta`)
  - `positives/`: human and non-human Kunitz positives
  - `raw/`: unfiltered datasets from UniProt and Swiss-Prot
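As an illustration of how ID lists such as `pos_1.ids` and `pos_2.ids` could be produced for two-fold cross-validation, here is a minimal sketch. The function name `two_fold_split`, the seed value, and the example IDs are assumptions made for the example; the project's actual splitting procedure is described in the notebook.

```python
import random


def two_fold_split(ids, seed=42):
    """Shuffle IDs reproducibly and split them into two near-equal halves.

    Duplicates are removed and IDs are sorted first, so the result depends
    only on the set of IDs and the seed, not on the input order.
    """
    shuffled = sorted(set(ids))
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


if __name__ == "__main__":
    # Placeholder identifiers, not real dataset entries.
    positives = ["seq_a", "seq_b", "seq_c", "seq_d", "seq_e"]
    pos_1, pos_2 = two_fold_split(positives)
    print("fold 1:", pos_1)
    print("fold 2:", pos_2)
```

Applying the same function separately to the positive and negative ID lists keeps the class ratio similar in both folds.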
Stores intermediate files generated during the pipeline, in particular BLAST results. This folder represents key transitional steps that bridge initial data preprocessing and final evaluation.
- Alignment files: `pdb_kunitz_rp.ali`, `pdb_kunitz_rp_formatted.ali`
- Filtered datasets: `pdb_kunitz_rp.fasta`, `tmp_pdb_efold_ids.txt`
- Cross-validation results: `pos_1.out`, `neg_2.out`, `fn_pos2.txt`
- BLAST results: `pdb_kunitz_nr_23.blast` (tabular output with comment lines, `-outfmt 7`)
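A BLAST result in `-outfmt 7` is a tab-separated table with `#` comment lines, so the IDs of sequences too similar to the training set can be collected with a few lines of Python. The sketch below is an assumption about how such a filter might look: the function name `ids_to_remove`, the identity and alignment-length cutoffs, and the sample rows are all hypothetical, and whether the training sequences are the query or the subject depends on how the search was run.

```python
def ids_to_remove(blast_text, min_identity=95.0, min_align_len=50):
    """Collect subject IDs of hits that meet both cutoffs.

    Default outfmt 7 columns: query, subject, % identity, alignment length,
    mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score.
    """
    hits = set()
    for line in blast_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and BLAST comment lines
        fields = line.split("\t")
        subject = fields[1]
        identity = float(fields[2])
        align_len = int(fields[3])
        if identity >= min_identity and align_len >= min_align_len:
            hits.add(subject)
    return hits


if __name__ == "__main__":
    # Made-up rows for illustration only.
    rows = [
        ["query_1", "subj_A", "98.2", "58", "1", "0", "1", "58", "10", "67", "1e-35", "120"],
        ["query_1", "subj_B", "41.0", "55", "30", "2", "3", "57", "5", "59", "2e-08", "45.1"],
    ]
    sample = "# Query: query_1\n" + "\n".join("\t".join(r) for r in rows)
    print(sorted(ids_to_remove(sample)))
```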
Contains the structural HMM profile (.hmm) and related alignment files used for detection.
- `structural_model.hmm`: trained HMM built with `hmmbuild` from the structure-based alignment
Python scripts and notebooks used for sequence processing, evaluation, and plotting.
- `get_seq.py`: extracts sequences from ID lists
- `performance.py`: computes MCC, precision, recall, AUC
- `roc_curve.ipynb`: ROC curve visualization
- `mcc_vs_threshold_plot.ipynb`: plots MCC vs E-value
- `confusion_matrix.ipynb`: confusion matrix plotting
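The kind of evaluation these scripts perform can be sketched in plain Python. The example below is not the code of `performance.py`; it assumes that a sequence is predicted positive when its E-value is at or below a threshold, and all function names and sample values are hypothetical. AUC is left out for brevity, and scikit-learn's `matthews_corrcoef` would be the library equivalent of the `mcc` helper.

```python
import math


def confusion(evalues, labels, threshold):
    """Return (TP, FP, TN, FN); predicted positive if E-value <= threshold."""
    tp = fp = tn = fn = 0
    for evalue, label in zip(evalues, labels):
        predicted = evalue <= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def precision_recall(tp, fp, fn):
    """Precision and recall, each 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_threshold(evalues, labels, exponents=range(-1, -11, -1)):
    """Scan thresholds 1e-1 .. 1e-10 and return (threshold, MCC) with the
    highest MCC; ties keep the first (largest) threshold tested."""
    best_k = max(exponents, key=lambda k: mcc(*confusion(evalues, labels, 10.0 ** k)))
    t = 10.0 ** best_k
    return t, mcc(*confusion(evalues, labels, t))


if __name__ == "__main__":
    # Made-up E-values and labels (1 = Kunitz, 0 = non-Kunitz).
    evalues = [1e-20, 1e-8, 1e-3, 0.5, 5.0, 10.0]
    labels = [1, 1, 1, 0, 0, 0]
    tp, fp, tn, fn = confusion(evalues, labels, 1e-5)
    print("MCC at 1e-5:", round(mcc(tp, fp, tn, fn), 3))
    print("precision, recall:", precision_recall(tp, fp, fn))
    print("best threshold:", best_threshold(evalues, labels))
```

In a two-fold setup, the threshold selected on one fold would typically be applied unchanged to the other fold, and vice versa.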
Final output files such as performance tables, ROC curves, MCC plots, and confusion matrices.
- `classification/`
  - `fold1/`, `fold2/`: E-value prediction outputs for each fold (positive and negative sets).
  - `combined/`: Aggregated results across both folds.
- `evaluation/`
  - Contains performance summaries across thresholds (e.g., `performance_set1_thresholds.txt`, `performance_set2_thresholds.txt`).
- `final_output/`
  - Final visual and tabular outputs, such as:
    - `confusion_matrix_set_1.png`
    - `results_set_2.txt`
    - ROC and MCC plots, etc.
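Per-sequence E-values like those stored in the classification outputs can be extracted from an `hmmsearch --tblout` table. The sketch below is one possible approach, not the project's own parser: the function names, the sample row, and the choice to assign a default E-value of 10.0 to sequences that `hmmsearch` did not report are assumptions.

```python
def parse_tblout(text):
    """Map target name -> (full-sequence E-value, best-domain E-value).

    In the tblout format, column 0 is the target name, column 4 the
    full-sequence E-value, and column 7 the best single-domain E-value.
    """
    hits = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        fields = line.split()
        hits[fields[0]] = (float(fields[4]), float(fields[7]))
    return hits


def evalue_table(all_ids, hits, default=10.0):
    """Give every sequence an E-value pair; unreported sequences get `default`."""
    return {seq_id: hits.get(seq_id, (default, default)) for seq_id in all_ids}


if __name__ == "__main__":
    # Made-up tblout row for illustration only.
    sample = (
        "# target name  accession  query name  accession  E-value  score  bias ...\n"
        "seq_A  -  kunitz_model  -  1.2e-25  85.3  0.1  1.5e-25  85.0  0.1  "
        "1.0  1  0  0  1  1  1  1  example description\n"
    )
    hits = parse_tblout(sample)
    print(evalue_table(["seq_A", "seq_B"], hits))
```

Filling in a default for missing sequences matters for the negative set, since most true negatives never appear in the search output at all.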
Contains all visual outputs generated during the analysis, also included in the final report:
- Confusion matrices and ROC curves
- MCC vs E-value threshold plots
- Sequence logos and structural superpositions
Includes the final PDF report with full methodology and discussion.
REPORT_LAB1_NATALE_SOFIA.pdf: Final course project report
- `pipeline.ipynb`: notebook describing and executing the main steps
- `.gitignore`, `.gitattributes`: Git configuration files
- `README.md`: project description (you are here)
This project highlights the strength of structure-guided profile HMMs in domain detection, especially for compact and divergent protein families like Kunitz. The model generalizes well, achieves perfect or near-perfect MCC, and avoids false positives even under realistic class imbalance. The integration of structure-based features offers a reliable strategy for high-throughput domain annotation and could be extended to other fold families where sequence similarity fails.
- Full project report: `REPORT_LAB1_NATALE_SOFIA.pdf`
- GitHub repository: https://github.com/sofianatale/LAB1_project
See the full bibliography in the final report.
Author: Sofia Natale
Degree Program: MSc in Bioinformatics
University: University of Bologna
Course: Laboratory of Bioinformatics 1 - Module 2
Contact: sofia.natale@studio.unibo.it
This work was carried out as part of the LAB1 course project and is not intended for production use without further validation.