Skip to content

Latest commit

 

History

History
592 lines (457 loc) · 15.8 KB

File metadata and controls

592 lines (457 loc) · 15.8 KB

GenoTools AI Development Guide

AI-Assisted Development Guide for GenoTools

This guide provides patterns, conventions, and requirements for AI coding assistants working on the GenoTools codebase.


Quick Reference

Before You Start

  1. Read this entire document
  2. Understand the pipeline architecture (CLI → QC classes → PLINK execution)
  3. Know the standard return dictionary format
  4. Verify external tool availability (PLINK, PLINK2, KING)

Critical Rules

  • Always validate inputs at the start of every method
  • Always return the standard output dictionary format
  • Never break the pipeline chain - methods must produce valid PLINK2 pfiles
  • Log everything via concat_logs()
  • Test with real PLINK files - the pipeline depends on external executables

Project Overview

GenoTools is a command-line tool for genotype quality control (QC) and ancestry prediction in genetic studies. It wraps PLINK/PLINK2 commands in a Python pipeline with ML-based ancestry inference.

Key Components

Component Purpose
genotools/pipeline.py CLI argument parsing, pipeline orchestration
genotools/qc.py SampleQC and VariantQC classes
genotools/ancestry.py Ancestry class for ML predictions
genotools/gwas.py Assoc class for GWAS/PCA
genotools/utils.py Shell execution, file conversion helpers
genotools/dependencies.py External tool management

Entry Points

# Main CLI
genotoolsgenotools.__main__:handle_main

# Reference download
genotools-downloadgenotools.download_refs:handle_download

Architecture

Pipeline Flow

Input (bfile/pfile/vcf)
    ↓
Format Conversion → PLINK2 pfiles
    ↓
Ancestry Prediction (optional)
    ↓
Split by Ancestry
    ↓
┌─────────────────────────────────────┐
│ QC Pipeline (per ancestry group)    │
│   callrate → sex → het → related →  │
│   case_control → haplotype → hwe →  │
│   geno → ld → assoc                 │
└─────────────────────────────────────┘
    ↓
JSON Output + Cleaned Files

Class Responsibilities

SampleQC - Sample-level quality control

  • run_callrate_prune() - Remove samples with low call rates
  • run_sex_prune() - Remove samples with sex discrepancies
  • run_het_prune() - Remove samples with extreme heterozygosity
  • run_related_prune() - Handle related/duplicate samples
  • run_confirming_kinship() - Verify family relationships

VariantQC - Variant-level quality control

  • run_geno_prune() - Remove variants with high missingness
  • run_case_control_prune() - Remove variants with case/control differences
  • run_haplotype_prune() - Remove haplotype-inconsistent variants
  • run_hwe_prune() - Remove Hardy-Weinberg violating variants
  • run_ld_prune() - Prune variants in linkage disequilibrium

Ancestry - Ancestry prediction

  • PCA calculation and projection
  • UMAP + XGBoost classification
  • Container/cloud inference support
  • Admixture handling

Assoc - Association analysis

  • PCA preparation and execution
  • GWAS execution
  • Lambda/inflation calculation

Code Conventions

Standard Return Dictionary

Every QC method must return this structure:

{
    'pass': bool,           # True if step completed successfully
    'step': str,            # Step identifier (e.g., 'callrate_prune')
    'metrics': {
        'outlier_count': int,  # Number of samples/variants pruned
        # ... other step-specific metrics
    },
    'output': {
        'pruned_samples': str,  # Path to pruned sample IDs (or None)
        'plink_out': str,       # Path to output pfiles (without extension)
        # ... other output files
    }
}

Example:

out_dict = {
    'pass': process_complete,
    'step': step,
    'metrics': metrics_dict,
    'output': outfiles_dict
}
return out_dict

Input Validation Pattern

Every method must validate inputs before processing:

def run_some_prune(self, threshold=0.05):
    geno_path = self.geno_path
    out_path = self.out_path

    # 1. Check paths are set
    if geno_path is None or out_path is None:
        raise ValueError("Both geno_path and out_path must be set before calling this method.")

    # 2. Check input files exist
    if not os.path.exists(f'{geno_path}.pgen'):
        raise FileNotFoundError(f"{geno_path} does not exist.")

    # 3. Check parameter types
    if not isinstance(threshold, (int, float)):
        raise TypeError("threshold should be of type int or float.")

    # 4. Check parameter bounds
    if threshold < 0 or threshold > 1:
        raise ValueError("threshold should be between 0 and 1.")

    # ... proceed with implementation

Shell Command Execution

Use shell_do() for all external commands:

from genotools.utils import shell_do, concat_logs

# Execute PLINK command
plink_cmd = f"{plink2_exec} --pfile {geno_path} --mind {mind} --make-pgen psam-cols=fid,parents,sex,pheno1,phenos --out {out_path}"
shell_do(plink_cmd)

# Always log after execution
listOfFiles = [f'{out_path}.log']
concat_logs(step, out_path, listOfFiles)

File Path Conventions

# Input/output paths never include extensions
geno_path = '/path/to/data'      # Files: data.pgen, data.pvar, data.psam
out_path = '/path/to/output'     # Will create: output.pgen, output.pvar, output.psam

# Intermediate files use step suffix
step_output = f'{out_path}_{step}'  # e.g., output_callrate_prune

# Outlier files
outliers_out = f'{out_path}.outliers'

PLINK2 psam Column Standard

Always preserve sample metadata columns:

# Use this flag for all --make-pgen commands
--make-pgen psam-cols=fid,parents,sex,pheno1,phenos

Testing

Current State

GenoTools does not have a formal test suite. Test data exists in /data/ but no pytest infrastructure.

When Adding Tests

Recommended structure:

tests/
├── conftest.py           # Fixtures for test data paths
├── test_sample_qc.py     # SampleQC method tests
├── test_variant_qc.py    # VariantQC method tests
├── test_ancestry.py      # Ancestry prediction tests
├── test_pipeline.py      # Integration tests
└── test_utils.py         # Utility function tests

Test pattern:

import pytest
from genotools.qc import SampleQC

class TestSampleQC:
    """Tests for SampleQC class."""

    def test_callrate_prune_valid_input(self, sample_pfiles):
        """Test callrate pruning with valid input."""
        # Arrange
        qc = SampleQC(geno_path=sample_pfiles, out_path='/tmp/test_out')

        # Act
        result = qc.run_callrate_prune(mind=0.02)

        # Assert
        assert result['pass'] is True
        assert result['step'] == 'callrate_prune'
        assert 'outlier_count' in result['metrics']

    def test_callrate_prune_invalid_mind(self, sample_pfiles):
        """Test that invalid mind value raises ValueError."""
        qc = SampleQC(geno_path=sample_pfiles, out_path='/tmp/test_out')

        with pytest.raises(ValueError, match="mind should be between 0 and 1"):
            qc.run_callrate_prune(mind=1.5)

Common Pitfalls

1. Missing await/async

GenoTools is synchronous. Do not add async/await unless refactoring the entire pipeline.

2. Breaking the pfile chain

Every QC step must:

  • Read from {geno_path}.pgen/.pvar/.psam
  • Write to {out_path}.pgen/.pvar/.psam
# ❌ BAD - Breaks the chain
result = run_analysis(geno_path)
return result  # No pfiles created

# ✅ GOOD - Maintains the chain
plink_cmd = f"{plink2_exec} --pfile {geno_path} ... --make-pgen ... --out {out_path}"
shell_do(plink_cmd)
# Now out_path.pgen exists for next step

3. Forgetting to log

# ❌ BAD - No logging
shell_do(plink_cmd)
return out_dict

# ✅ GOOD - Always log
shell_do(plink_cmd)
listOfFiles = [f'{out_path}.log']
concat_logs(step, out_path, listOfFiles)
return out_dict

4. Incorrect outlier file format

# Outlier files must be tab-separated with #FID header
# ❌ BAD
df.to_csv(outliers_out, sep=',')

# ✅ GOOD
df = df.rename({'FID': '#FID'}, axis=1)
df.to_csv(outliers_out, sep='\t', header=True, index=False)

5. Platform-specific code without checks

# ❌ BAD - KING only works on Linux
result = run_king_analysis()

# ✅ GOOD - Check platform first
import platform
if platform.system() != 'Linux':
    print('This analysis can only run on Linux!')
    return None
result = run_king_analysis()

Adding New QC Steps

Template for Sample-Level QC

def run_new_sample_prune(self, param1=default1, param2=default2):
    """
    Execute new sample pruning on genotype data.

    Parameters:
    - param1 (type): Description. Default is default1.
    - param2 (type): Description. Default is default2.

    Returns:
    - dict: Standard output dictionary with 'pass', 'step', 'metrics', 'output'.
    """
    geno_path = self.geno_path
    out_path = self.out_path

    # Input validation
    if geno_path is None or out_path is None:
        raise ValueError("Both geno_path and out_path must be set.")
    if not os.path.exists(f'{geno_path}.pgen'):
        raise FileNotFoundError(f"{geno_path} does not exist.")
    # ... validate param1, param2

    step = "new_sample_prune"
    outliers_out = f'{out_path}.outliers'

    # Execute PLINK commands
    plink_cmd = f"{plink2_exec} --pfile {geno_path} ... --make-pgen psam-cols=fid,parents,sex,pheno1,phenos --out {out_path}"
    shell_do(plink_cmd)

    # Log
    listOfFiles = [f'{out_path}.log']
    concat_logs(step, out_path, listOfFiles)

    # Process results and count outliers
    if os.path.isfile(f'{out_path}.some_output'):
        outliers = pd.read_csv(f'{out_path}.some_output', sep='\s+')
        outliers = outliers.rename({'FID': '#FID'}, axis=1)
        outliers.to_csv(outliers_out, sep='\t', header=True, index=False)
        outlier_count = outliers.shape[0]
    else:
        outlier_count = 0

    process_complete = True

    # Build return dictionary
    outfiles_dict = {
        'pruned_samples': outliers_out,
        'plink_out': out_path,
    }
    metrics_dict = {
        'outlier_count': outlier_count
    }
    out_dict = {
        'pass': process_complete,
        'step': step,
        'metrics': metrics_dict,
        'output': outfiles_dict
    }

    return out_dict

Registering New Steps in Pipeline

After adding a new method, update pipeline.py:

# In handle_main() or execute_pipeline()

# 1. Add to appropriate step list
samp_steps = ['callrate', 'sex', 'het', 'related', 'kinship_check', 'new_step']

# 2. Add to steps_dict mapping
steps_dict = {
    # ... existing steps
    'new_step': samp_qc.run_new_sample_prune,
}

# 3. Add argument in gt_argparse()
parser.add_argument('--new_step', type=float, nargs='?', default=None, const=0.05,
                    help='Description of new step')

External Dependencies

Required Executables

  • PLINK 1.9 - Legacy format support, sex checks
  • PLINK2 - Primary tool for QC operations
  • KING - Relatedness analysis (Linux only)

Checking Dependencies

from genotools.dependencies import check_plink, check_plink2, check_king

plink_exec = check_plink()    # Returns path or downloads
plink2_exec = check_plink2()  # Returns path or downloads
king_exec = check_king()      # Returns path (Linux) or None

Python Dependencies

pandas, numpy              # Data manipulation
scikit-learn, xgboost     # ML models
umap-learn==0.5.3         # Dimensionality reduction (pinned version)
scipy, statsmodels        # Statistics
matplotlib, seaborn       # Visualization
google-cloud-aiplatform   # Cloud predictions

File Formats

Input Formats

  • bfile: PLINK 1.9 (.bed, .bim, .fam)
  • pfile: PLINK 2 (.pgen, .pvar, .psam)
  • VCF: Variant Call Format (.vcf)

Internal Format

All processing uses PLINK2 pfiles. Conversion happens automatically at pipeline start.

Required psam Columns

#FID    IID    SEX    PHENO1    [additional columns]
  • SEX: 0=unknown, 1=male, 2=female
  • PHENO1: -9=missing, 1=control, 2=case

Output JSON Structure

{
    "ancestry_counts": {"EUR": 100, "AFR": 50, ...},
    "ancestry_labels": [{"#FID": "...", "IID": "...", "label": "EUR"}, ...],
    "QC": [
        {"step": "callrate_prune", "pruned_count": 5, "metric": "outlier_count", ...}
    ],
    "GWAS": [
        {"value": 1.02, "metric": "lambda", "ancestry": "EUR"}
    ],
    "pruned_samples": [...],
    "related_samples": [...]
}

Ancestry Prediction

Model Architecture

  • Step 1: PCA on merged reference + input samples
  • Step 2: UMAP transformation
  • Step 3: XGBoost classification
  • Supported labels: AFR, SAS, EAS, EUR, AMR, AJ, CAS, MDE, FIN, AAC

Container Support

# Docker
genotools --pfile input --out output --ancestry --container

# Singularity
genotools --pfile input --out output --ancestry --singularity

# Google Cloud
genotools --pfile input --out output --ancestry --cloud

Development Workflow

Virtual Environment Setup

Use separate virtual environments to keep stable and development versions isolated:

Environment Install Command Purpose
.venv pip install -e . Active development, code changes reflected immediately
.venv-stable pip install . Frozen baseline snapshot for regression comparison
# Development environment (editable - use this for active work)
python -m venv .venv
source .venv/bin/activate
pip install -e .

# Stable baseline (frozen snapshot - use for comparison)
python -m venv .venv-stable
source .venv-stable/bin/activate
pip install .

Note: Both are installed from the local repo, not PyPI, since PyPI may be outdated.

1. Making Changes

# Install in development mode
pip install -e .

# Run linting (recommended)
flake8 genotools/
black genotools/

# Test manually
genotools --pfile test_data --out test_output --callrate

2. Testing Changes

# Basic pipeline test
genotools --pfile data/test --out output/test --all_sample --all_variant

# With ancestry
genotools --pfile data/test --out output/test --ancestry --ref_panel /path/to/ref --ref_labels /path/to/labels

3. Committing

# Version bump in setup.py if needed
# Update CITATION.cff if adding authors
git add .
git commit -m "Description of changes"

Error Handling Checklist

When AI generates code, verify:

  • All inputs validated at method start
  • FileNotFoundError raised for missing files
  • ValueError raised for invalid parameters
  • TypeError raised for wrong parameter types
  • Platform checks for OS-specific features
  • Proper error messages guide users to logs
  • --warn flag behavior respected (continue on failure)

Quick Debugging

Check PLINK availability

from genotools.dependencies import check_plink, check_plink2
print(check_plink())   # Should print path
print(check_plink2())  # Should print path

Inspect pipeline state

# After pipeline runs, check pass_fail dict
for step, status in out_dict['pass_fail'].items():
    print(f"{step}: {'PASS' if status['status'] else 'FAIL'}")

Read logs

cat output_all_logs.log      # Full PLINK output
cat output_cleaned_logs.log  # Formatted summary

Summary

Rule Reason
Validate all inputs Catch errors early with clear messages
Use standard return dict Pipeline depends on consistent structure
Always log with concat_logs() Debugging and audit trail
Maintain pfile chain Each step feeds the next
Check platform for KING Only works on Linux
Pin umap-learn==0.5.3 Model compatibility