PhenoQC is a lightweight, efficient, and user-friendly toolkit designed to perform comprehensive quality control (QC) on phenotypic datasets. It ensures that data adheres to standardized formats, maintains consistency, and is harmonized with recognized ontologies—facilitating seamless integration with genomic data for advanced research.
- Comprehensive Data Validation: Checks format compliance, schema adherence, and data consistency against JSON schemas.
- Ontology Mapping: Maps phenotypic terms to standardized ontologies (HPO, DO, MPO) with synonym resolution and optional custom mappings.
- Missing Data Handling: Detects and optionally imputes missing data (e.g., mean, median, mode, KNN, MICE, SVD) or flags records for manual review.
- Batch Processing: Processes multiple files in parallel, streamlining large-scale data QC.
- User-Friendly Interfaces: Provides a command-line interface (CLI) for power users and a Streamlit-based GUI for interactive workflows.
- Reporting and Visualization: Generates detailed QC reports (PDF or Markdown) and produces visual summaries of data quality metrics.
- Extensibility: Modular design supports easy customization of validation rules, mapping expansions, or new ontologies.
- Class Distribution (Optional): Provide a label column to get a class-imbalance summary and a warning if the minority proportion falls below a threshold.
- Strategy-agnostic Imputation & Tuning: Configure a global strategy and parameters (mean, median, mode, knn, mice, svd, none), per-column overrides, and optional mask-and-score tuning.
- Installation
- Quick Start
- CLI
- GUI
- Reports
- Examples and Scripts
- Configuration
- Troubleshooting
- Contributing
PhenoQC requires Python 3.9+.
Install from PyPI:

    pip install phenoqc

Or install from source:

    git clone https://github.com/jorgeMFS/PhenoQC.git
    cd PhenoQC
    pip install -e .

For local development without installation, you can run:

    python -m phenoqc.cli

Dependencies are listed in `requirements.txt`.
    phenoqc --help

    # Minimal run
    phenoqc \
      --input examples/samples/sample_data.csv \
      --schema examples/schemas/pheno_schema.json \
      --config config.yaml \
      --unique_identifiers SampleID \
      --output ./reports/

Enable class distribution and imputation tuning:

    phenoqc \
      --input data.csv \
      --schema schema.json \
      --config config.yaml \
      --unique_identifiers SampleID \
      --label-column class --imbalance-threshold 0.10 \
      --impute-params '{"n_neighbors": 5}' --impute-tuning on \
      --output ./reports/

PhenoQC provides a flexible command-line interface suited for automation.
Process a single file:

    phenoqc \
      --input examples/samples/sample_data.json \
      --output ./reports/ \
      --schema examples/schemas/pheno_schema.json \
      --config config.yaml \
      --custom_mappings examples/mapping/custom_mappings.json \
      --impute mice \
      --unique_identifiers SampleID \
      --phenotype_columns '{"PrimaryPhenotype": ["HPO"], "DiseaseCode": ["DO"]}' \
      --ontologies HPO DO

Process multiple files in one run:

    phenoqc \
      --input examples/samples/sample_data.csv examples/samples/sample_data.json examples/samples/sample_data.tsv \
      --output ./reports/ \
      --schema examples/schemas/pheno_schema.json \
      --config config.yaml \
      --impute none \
      --unique_identifiers SampleID \
      --ontologies HPO DO MPO \
      --phenotype_columns '{"PrimaryPhenotype": ["HPO"], "DiseaseCode": ["DO"], "TertiaryPhenotype": ["MPO"]}'

Additional options:

- `--impute-params '{"n_neighbors": 5}'` (JSON)
- `--impute-tuning on|off`
- `--label-column class` and `--imbalance-threshold 0.10`
- `--quality-metrics imputation_bias redundancy` (or `all`) (alias: `--metrics`)
- Imputation-bias thresholds: `--bias-smd-threshold`, `--bias-var-low`, `--bias-var-high`, `--bias-ks-alpha`
- Categorical bias thresholds: `--bias-psi-threshold`, `--bias-cramer-threshold`
- Imputation stability diagnostics: `--impute-diagnostics on|off`, `--diag-repeats`, `--diag-mask-fraction`, `--diag-scoring`
- Stability fail threshold: `--stability-cv-fail-threshold` (fails the run if the average coefficient of variation, CV, exceeds the given value)
- Protected columns: `--protected-columns label outcome`
- Redundancy: `--redundancy-threshold`, `--redundancy-method {pearson,spearman}`
- Offline/caching: `--offline` forces cached/local ontologies only; no downloads. In online mode, ontology downloads use retry/backoff and are cached under `~/.phenoqc/ontologies`, respecting `cache_expiry_days` in the config.
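To make the numeric bias thresholds more concrete, here is a minimal sketch of how a standardized mean difference (SMD) and a variance ratio between observed and imputed values could be compared against such limits. This is an illustration only, not PhenoQC's implementation: the function names are hypothetical, the pooled-standard-deviation form of SMD is an assumption, and the default limits simply mirror the sample threshold values shown in this README's configuration.

```python
import math
import statistics


def standardized_mean_difference(observed, imputed):
    """Absolute mean difference divided by a pooled standard deviation (assumed formula)."""
    pooled_sd = math.sqrt(
        (statistics.variance(observed) + statistics.variance(imputed)) / 2.0
    )
    if pooled_sd == 0:
        return 0.0
    return abs(statistics.mean(observed) - statistics.mean(imputed)) / pooled_sd


def variance_ratio(observed, imputed):
    """Variance of imputed values relative to observed values."""
    var_observed = statistics.variance(observed)
    if var_observed == 0:
        return math.inf
    return statistics.variance(imputed) / var_observed


def bias_flags(observed, imputed, smd_threshold=0.10, var_low=0.5, var_high=2.0):
    """Report both metrics and whether each hypothetical rule is triggered."""
    smd = standardized_mean_difference(observed, imputed)
    ratio = variance_ratio(observed, imputed)
    return {
        "smd": smd,
        "var_ratio": ratio,
        "smd_triggered": smd > smd_threshold,
        "var_ratio_triggered": not (var_low <= ratio <= var_high),
    }


# Imputed values cluster tightly around the mean: no mean shift, collapsed variance.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
imputed = [3.0, 3.0, 3.1, 2.9, 3.0]
result = bias_flags(observed, imputed)
```

In this toy case the means agree, so the SMD rule stays quiet, while the variance ratio falls far below the lower bound. That is the typical signature of mean-style imputation shrinking a variable's spread.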
Reports generated under --output include a PDF with:
- Summary & scores
- Optional Class Distribution (when label column is set)
- Additional Quality Dimensions (only when computed)
- Missing data summary, mapping success, and visuals
- CLI flags override values in the YAML config for the run.
- If you set `--impute-params` or enable `--impute-tuning on`, these take precedence over `imputation.params` / `imputation.tuning` in `config.yaml`.
- The same precedence applies to diagnostics thresholds and redundancy settings.
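The precedence rule can be pictured as a merge in which values given on the command line replace the matching YAML values and everything else is left alone. The sketch below is a generic illustration under that assumption; `apply_cli_overrides` and the dotted-key convention are hypothetical and are not part of PhenoQC's API.

```python
import copy


def apply_cli_overrides(config, overrides):
    """Return a copy of config in which non-None override values replace YAML values.

    Keys in `overrides` use a dotted path (hypothetical convention),
    e.g. "imputation.params".
    """
    merged = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        if value is None:  # flag was not given on the command line
            continue
        *parents, leaf = dotted_key.split(".")
        node = merged
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return merged


# Values as they might be loaded from config.yaml.
yaml_config = {
    "imputation": {
        "strategy": "knn",
        "params": {"n_neighbors": 5},
        "tuning": {"enable": False},
    }
}

# Values as they might be parsed from CLI flags; None means "not provided".
cli_overrides = {
    "imputation.strategy": None,
    "imputation.params": {"n_neighbors": 7},
    "imputation.tuning.enable": True,
}

effective = apply_cli_overrides(yaml_config, cli_overrides)
```

Here the strategy stays `knn` because no override was supplied, while the parameters and the tuning switch come from the command line. The original configuration dictionary is not modified.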
An example `imputation` block in `config.yaml`:

    imputation:
      strategy: knn
      params:
        n_neighbors: 5
        weights: uniform
      per_column:
        Creatinine_mgdl:
          strategy: mice
          params: {max_iter: 15}
        Cholesterol_mgdl:
          strategy: svd
          params: {rank: 3}
      tuning:
        enable: true
        mask_fraction: 0.1
        scoring: MAE
        max_cells: 20000
        random_state: 42
        grid:
          n_neighbors: [3, 5, 7]

Launch the Streamlit interface:
    # Local
    python run_gui.py

    # Streamlit Community Cloud
    # In the deploy UI, set the entrypoint to `streamlit_app.py`

Workflow:
- Step 3: Optional label column and imbalance threshold
- Step 4: Default strategy, per‑column overrides, parameters, and tuning (strategy‑agnostic)
- Step 4 includes: Bias thresholds, Stability diagnostics (enable, repeats, mask fraction, scoring), Protected columns, and Redundancy settings
- Results: Class Distribution table/plot, Imputation Settings, Imputation Stability & Bias, Tuning Summary
- Bias includes numeric (SMD, variance ratio, KS) and categorical (PSI, Cramér’s V) metrics; report shows which rules triggered per variable
- Optional Multiple Imputation Uncertainty (MICE repeats) table
- Class Distribution: table and warning when minority proportion < threshold
- Imputation Settings: global strategy/params and tuning summary
- Imputation Stability & Bias: per-variable stability (repeatability) and bias diagnostics with thresholds and triggers
- Additional Quality: only displayed if metrics are computed
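The class-distribution warning comes down to comparing the smallest class proportion with a threshold. A minimal sketch of that check follows. The function name and the returned dictionary layout are hypothetical, and PhenoQC's own summary may be computed and structured differently.

```python
from collections import Counter


def class_distribution(labels, warn_threshold=0.10):
    """Summarize label counts and flag a minority class below the threshold."""
    counts = Counter(label for label in labels if label is not None)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no non-missing labels to summarize")
    proportions = {label: n / total for label, n in counts.items()}
    minority_label = min(proportions, key=proportions.get)
    minority_proportion = proportions[minority_label]
    return {
        "counts": dict(counts),
        "proportions": proportions,
        "minority_label": minority_label,
        "minority_proportion": minority_proportion,
        "warn": minority_proportion < warn_threshold,
    }


# 19 controls and 1 case: the minority class makes up 5% of the labels.
labels = ["control"] * 19 + ["case"]
summary = class_distribution(labels, warn_threshold=0.10)
```

With a threshold of 0.10, this example raises the warning because the `case` class accounts for only 5% of the records.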
- `scripts/e2e_small_quality_metrics_cli_test.py`: small demo focusing on quality metrics
- `scripts/e2e_medium_cli_test.py`: mid-sized end-to-end pipeline run
- `scripts/end_to_end_e2e_cli_test.py`: large end-to-end pipeline run
- `scripts/clinical_all_features_e2e.py`: comprehensive clinical example covering the full dataset, schema, config, and custom mappings, online and offline modes, and per-column imputation
- `scripts/imputation_params_cli_test.py`: imputation parameters and optional tuning
- `scripts/end_to_end_diagnostics_demo.sh`: end-to-end example enabling stability and bias diagnostics with CLI overrides
PhenoQC relies on a YAML config file (e.g., config.yaml) to define ontologies, fuzzy matching thresholds, caching, and imputation defaults.
Sample config.yaml:
    ontologies:
      HPO:
        name: Human Phenotype Ontology
        source: url
        url: http://purl.obolibrary.org/obo/hp.obo
        format: obo
      DO:
        name: Disease Ontology
        source: url
        url: http://purl.obolibrary.org/obo/doid.obo
        format: obo
      MPO:
        name: Mammalian Phenotype Ontology
        source: url
        url: http://purl.obolibrary.org/obo/mp.obo
        format: obo

    default_ontologies:
      - HPO
      - DO
      - MPO

    fuzzy_threshold: 80
    cache_expiry_days: 30
    # optional: offline (forces cache/local ontologies only for the run)
    # offline: true

    quality_metrics:
      redundancy: { enable: true }
      imputation_bias: { enable: true }
      imputation_stability: { enable: true, repeats: 5, mask_fraction: 0.10, scoring: MAE }

    class_distribution:
      label_column: class
      warn_threshold: 0.10

    imputation_bias:
      smd_threshold: 0.10
      var_ratio_low: 0.5
      var_ratio_high: 2.0
      ks_alpha: 0.05
      psi_threshold: 0.10
      cramer_threshold: 0.20

    imputation:
      strategy: knn
      params:
        n_neighbors: 5
        weights: uniform
      per_column:
        Creatinine_mgdl:
          strategy: mice
          params:
            max_iter: 15
        Cholesterol_mgdl:
          strategy: svd
          params:
            rank: 3
      tuning:
        enable: true
        mask_fraction: 0.1
        scoring: MAE
        max_cells: 20000
        random_state: 42
        grid:
          n_neighbors: [3, 5, 7]

Note: Labels are never modified and are excluded from the imputation matrix when a `label_column` is provided.
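The `tuning` settings (`mask_fraction`, `scoring: MAE`, `random_state`) suggest a mask-and-score procedure: hide a fraction of the known values, impute them with each candidate, and keep the candidate with the lowest error. The sketch below illustrates that idea on a single column, with constant-fill candidates (mean and median) standing in for a real parameter grid. It is a deliberately simplified assumption-based example, not PhenoQC's tuning code, and `mask_and_score` is a hypothetical name.

```python
import random
import statistics


def mask_and_score(values, candidates, mask_fraction=0.1, random_state=42):
    """Pick the candidate whose fill value best recovers deliberately hidden cells.

    values: list of floats, with None marking genuinely missing cells.
    candidates: mapping of name -> function(list of observed values) -> fill value.
    """
    rng = random.Random(random_state)
    observed = [i for i, v in enumerate(values) if v is not None]
    if len(observed) < 2:
        raise ValueError("need at least two observed values to mask and score")
    # Hide at least one cell, but always leave at least one to fit on.
    n_mask = min(len(observed) - 1, max(1, int(len(observed) * mask_fraction)))
    masked = sorted(rng.sample(observed, n_mask))
    hidden = set(masked)
    train = [values[i] for i in observed if i not in hidden]

    scores = {}
    for name, fit in candidates.items():
        fill = fit(train)
        # Mean absolute error over the hidden cells, whose true values are known.
        scores[name] = sum(abs(values[i] - fill) for i in masked) / len(masked)
    best = min(scores, key=scores.get)
    return best, scores


column = [4.1, 3.9, None, 4.0, 4.3, 12.5, 4.2, None, 3.8, 4.0, 4.1, 3.7]
candidates = {"mean": statistics.mean, "median": statistics.median}
best, scores = mask_and_score(column, candidates, mask_fraction=0.3)
```

Fixing the random seed makes the masked cells, and therefore the chosen candidate, reproducible between runs. A fuller version would score each entry of a parameter grid (for example, several `n_neighbors` values) in the same way.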
- Ontology Mapping Failures: Check that `config.yaml` points to valid ontology URLs or local files.
- Missing Required Columns: Ensure that the columns specified as unique identifiers or phenotype columns exist in the dataset.
- Imputation Errors: Some strategies (e.g., `mean`) apply only to numeric columns.
- Logs: Consult the `phenoqc_*.log` file for detailed error messages.
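Two of these problems can be checked before a run with a few lines of standard-library Python: whether the expected columns are present, and whether a column is numeric enough for a strategy such as `mean`. The helper names below are hypothetical, and the inline CSV is made-up sample data. This is a pre-flight sketch, not a PhenoQC feature.

```python
import csv
import io


def missing_columns(csv_text, required):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


def is_numeric_column(csv_text, column):
    """True if every non-empty cell in the column parses as a float.

    Assumes the column exists; empty cells are treated as missing values.
    """
    for row in csv.DictReader(io.StringIO(csv_text)):
        cell = (row.get(column) or "").strip()
        if cell == "":
            continue
        try:
            float(cell)
        except ValueError:
            return False
    return True


# Made-up sample data for illustration.
sample = (
    "SampleID,Height_cm,PrimaryPhenotype\n"
    "S1,170,HP:0001250\n"
    "S2,,HP:0004322\n"
)

absent = missing_columns(sample, ["SampleID", "DiseaseCode"])
height_is_numeric = is_numeric_column(sample, "Height_cm")
phenotype_is_numeric = is_numeric_column(sample, "PrimaryPhenotype")
```

In this sample, `DiseaseCode` is reported as missing, `Height_cm` qualifies for numeric strategies despite one empty cell, and `PrimaryPhenotype` does not.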
- Fork the repository on GitHub.
- Create a branch, implement changes, and add tests or documentation as appropriate.
- Open a Pull Request describing your contribution.
We welcome improvements that enhance PhenoQC's functionality or documentation.
Distributed under the MIT License.
Maintainer:
Jorge Miguel Ferreira da Silva
jorge(dot)miguel(dot)ferreira(dot)silva(at)ua(dot)pt
For more details, see the GitHub Wiki or open an issue on GitHub.
Last updated: August 10, 2025.