Comparative Analysis of Plant Co-Expression Networks in R (RComPlEx) is a Nextflow pipeline that identifies conserved co-expressologs which are orthologous genes that maintain coordinated expression patterns across evolutionarily divergent species.
Unlike traditional comparative genomics approaches that focus on sequence conservation alone, RComPlEx discovers genes that are:
- Orthologous - Share common evolutionary ancestry
- Co-expressed - Show correlated expression patterns within each species
- Functionally conserved - Maintain similar regulatory relationships across species
- Module members - Part of complete functional networks (cliques), not just pairwise connections
RComPlEx finds multi-species gene cliques where all pairwise relationships are conserved:
- Traditional approach: "Gene A and Gene B are co-expressed in both species"
- RComPlEx cliques: "Genes A, B, C, and D are all mutually co-expressed across all species" (6 pairwise relationships for 4 genes, all significant)
This ensures discovery of complete functional modules with tight coordination, providing stronger evidence of conserved biological function than pairwise comparisons alone.
- METHOD.md - Detailed algorithm explanation, statistical methods, hypergeometric testing
- INPUT_FORMAT.md - Data requirements, file formats, and validation
- INSTALLATION.md - Setup instructions for local, HPC, and container environments
- PROCESS_FLOW.txt - Step-by-step execution flow with biological context (includes unsigned and polarity analysis)
Do genes that are co-expressed in one species maintain those co-expression relationships in related species?
Conserved co-expression suggests:
- Shared regulatory mechanisms (transcription factors, enhancers)
- Functional coordination (pathway components, protein complexes)
- Evolutionary constraint (selection maintains co-regulation)
- Biological importance (essential processes preserved across species)
This implementation focuses on comparing annual vs. perennial grasses:
- Annual plants: Complete life cycle in one growing season (rapid growth, high reproduction)
- Perennial plants: Live multiple years (resource storage, stress tolerance, dormancy)
By stratifying cliques by life habit, RComPlEx identifies:
- Annual-specific modules: Genes coordinated uniquely in annual species
- Perennial-specific modules: Genes coordinated uniquely in perennial species
- Shared modules: Core processes conserved across all plants
Analyzes root and leaf tissues independently:
- Root: Nutrient uptake, water transport, stress response, storage
- Leaf: Photosynthesis, gas exchange, defense, transpiration
Tissue-specific cliques reveal organ-specialized regulatory networks.
- Tissue-specific analysis: Separate processing for root and leaf tissues
- Life habit stratification: Identifies annual-specific, perennial-specific, and shared cliques
- Multi-copy ortholog handling: Clique detection naturally handles many-to-many orthology
- Modular design: Can be run via convenient CLI or SLURM array jobs
- Centralized configuration: All parameters in a single YAML file
- Polarity divergence analysis: Compares signed vs. unsigned support to flag potential regulatory polarity changes
# Test with 3 pairs per tissue (15-30 min - recommended first!)
nextflow run main.nf -profile slurm --test_mode true
# Full pipeline - all tissues (2-3 hours)
nextflow run main.nf -profile slurm
# Single tissue only
nextflow run main.nf -profile slurm --tissues root
# Resume from interruption
nextflow run main.nf -profile slurm -resumeDefault paths (edit nextflow.config to change):
- Working directory:
/mnt/users/martpali/AnnualPerennial/RComPlEx - Output directory:
/mnt/users/martpali/AnnualPerennial/RComPlEx/results - Nextflow work:
/mnt/users/martpali/AnnualPerennial/RComPlEx/work
Override at runtime:
nextflow run main.nf -profile slurm \
--workdir /custom/path \
--outdir /custom/results \
-w /custom/workEdit config/pipeline_config.yaml to customize:
- Species lists (annual vs perennial)
- Tissues to analyze
- RComPlEx parameters (correlation method, network density, p-value threshold)
Edit nextflow.config for:
- Resource allocation (CPUs, memory, time limits)
- Directory paths
- SLURM settings
Located in results/{tissue}/:
coexpressolog_cliques_{tissue}_all.tsv- All cliques with full metadatacoexpressolog_cliques_{tissue}_annual.tsv- Annual-specific cliquescoexpressolog_cliques_{tissue}_perennial.tsv- Perennial-specific cliquescoexpressolog_cliques_{tissue}_shared.tsv- Mixed annual/perennial cliquesgenes_{tissue}_annual.txt- Gene lists for downstream analysisgenes_{tissue}_perennial.txtgenes_{tissue}_mixed.txt
Located in results/{tissue}/:
coexpressolog_cliques_unsigned_{tissue}_all.tsvcoexpressolog_cliques_unsigned_{tissue}_annual.tsvcoexpressolog_cliques_unsigned_{tissue}_perennial.tsvcoexpressolog_cliques_unsigned_{tissue}_shared.tsvgenes_unsigned_{tissue}_annual.txtgenes_unsigned_{tissue}_perennial.txtgenes_unsigned_{tissue}_mixed.txt
Located in results/{tissue}/polarity/:
polarity_divergence_<pair_id>.tsv– Columns:tissue, pair_id, gene1, gene2, score_signed, score_unsigned, polarity_divergentpolarity_divergent = TRUEwhen sign differs and unsigned strength > 75th percentile
CliqueID: Unique identifier (HOG_CliqueNumber)HOG: Hierarchical ortholog groupCliqueSize: Number of genes in cliqueGenes: Comma-separated gene IDsSpecies: Species represented in cliqueLifeHabit: Annual, Perennial, or Mixedn_annual_species: Number of annual speciesn_perennial_species: Number of perennial speciesn_species: Total species countMean_pval: Average p-value across all gene pairsMedian_pval: Median p-valueMean_effect_size: Average effect size (enrichment)n_edges: Number of co-expressed gene pairs
RComPlEx/
├── main.nf # Nextflow workflow (7 processes)
├── nextflow.config # Nextflow execution config
├── config/
│ └── pipeline_config.yaml # Central analysis parameters
├── R/
│ └── config_parser.R # Config utilities
├── scripts/
│ ├── prepare_single_pair.R # Pair data preparation
│ ├── rcomplex_01_load_filter.R # Step 1: Load & filter data
│ ├── rcomplex_02_compute_networks.R # Step 2: Correlation networks
│ ├── rcomplex_03_network_comparison.R # Step 3: Network comparison (signed & unsigned)
│ ├── rcomplex_04_summary_stats.R # Step 4: Summary statistics
│ └── find_coexpressolog_cliques.R # Clique detection from comparisons
├── slurm/
│ └── run_nextflow.sh # SLURM submission script for Nextflow
├── rcomplex_data/ # Working directories
│ ├── root/ # Root tissue analysis
│ └── leaf/ # Leaf tissue analysis
└── results/ # Final pipeline outputs
├── root/ # Root cliques & gene lists
└── leaf/ # Leaf cliques & gene lists
Purpose: Extract tissue-specific expression data for each species pair
- Loads variance-stabilized expression matrix (vst_hog.RDS)
- Filters to focal tissue (root or leaf)
- Extracts ortholog mappings from hierarchical ortholog groups (N1_clean.RDS)
- Creates species pair working directories
- Output: Expression files + ortholog table per pair
Purpose: Validate data and filter to analyzable gene sets
- Loads expression matrices for both species
- Filters orthologs to genes present in BOTH species' expression data
- Removes genes with missing values or low expression
- Output:
01_filtered_data.RData(clean, aligned ortholog pairs)
Purpose: Build gene co-expression networks for each species
Method:
-
Correlation calculation:
- Spearman correlation (default) - robust to outliers, detects monotonic relationships
- Computes all pairwise gene correlations (gene × gene matrix)
- Gene universe is limited to genes that have an ortholog in any other species (ortholog-wide filter applied before correlations)
-
Mutual Rank normalization:
- Ranks correlations bidirectionally: gene_i → gene_j AND gene_j → gene_i
- Combines ranks: MR = √(rank_ij × rank_ji)
- Why? Reduces spurious correlations, makes strengths comparable across species
-
Density thresholding:
- Keep top 3% of correlations (configurable)
- Creates sparse network of strongest co-expression
- Reduces noise and computational complexity
Parallelization: 2 species computed simultaneously (24 cores total)
Output: 02_networks.RData (correlation matrices + thresholds)
Additional: 02_networks_unsigned.RData (MR on absolute correlations) for polarity analysis
Purpose: Test whether co-expression relationships are conserved between species
For each ortholog pair (geneA_sp1, geneB_sp2):
-
Extract co-expression neighborhoods:
- Sp1_neighbors: All genes significantly co-expressed with geneA in Species 1
- Sp2_neighbors: All genes significantly co-expressed with geneB in Species 2
-
Test conservation bidirectionally:
Direction 1 (Sp1 → Sp2):
- Question: "Do geneA's neighbors in Sp1 have orthologs that are geneB's neighbors in Sp2?"
- Count overlap of conserved co-expression relationships
- Hypergeometric test: P(overlap ≥ observed | random expectation)
Direction 2 (Sp2 → Sp1):
- Question: "Do geneB's neighbors in Sp2 have orthologs that are geneA's neighbors in Sp1?"
- Ensures bidirectional conservation (not just one-way)
- Use minimum p-value (most conservative estimate)
-
Statistical testing:
Hypergeometric test: P(X ≥ k) where: k = observed overlap of conserved co-expression m = neighborhood size in species 1 n = non-neighbors in species 1 k = neighborhood size in species 2- Tests if overlap is greater than random chance
- Accounts for network structure and gene set sizes
-
Multiple testing correction:
- Benjamini-Hochberg FDR correction
- Controls false discovery rate at 5% (adjustable)
- Testing ~10,000 ortholog pairs requires stringent correction
Parallelization: Processes ortholog pairs in parallel (24 cores)
Output: 03_comparison.RData (p-values, effect sizes, neighborhood statistics)
Additional: 03_<pair_id>_unsigned.RData (unsigned comparison artifacts)
Purpose: Aggregate conservation statistics and generate diagnostic plots
- Computes per-gene conservation metrics (mean/median p-values)
- Computes per-HOG statistics (family-level conservation)
- Generates diagnostic visualizations:
- P-value correlation plot (bidirectional agreement)
- Effect size distribution (conservation signal strength)
Output: 04_summary_statistics.tsv + PNG plots
Purpose: Identify multi-species gene cliques where ALL pairwise co-expression is conserved
Algorithm:
-
Aggregate all pairwise comparisons for tissue:
- Combine results from ALL species pair analyses
- Filter to significantly conserved pairs (p < 0.05 after FDR)
-
For each Hierarchical Ortholog Group (HOG):
a. Build graph:
- Nodes = genes from all species in HOG
- Edges = conserved co-expression relationships
b. Find maximal cliques using
igraph::max_cliques():- Clique = complete subgraph where all nodes are connected to all others
- Maximal = cannot add more genes without losing completeness
- Example: 4-gene clique requires 6 conserved edges (all pairs)
c. Why cliques?
- Handles many-to-many orthology (paralogs → separate cliques)
- Ensures COMPLETE coordination (not partial)
- Stronger functional evidence than pairwise connections
- Multiple cliques per HOG = different functional modules
-
Annotate cliques:
- Species composition (which species represented)
- Life habit classification:
- Annual: All genes from annual species only
- Perennial: All genes from perennial species only
- Mixed: Contains both annual and perennial genes
- Statistical properties (mean p-values, effect sizes, edge counts)
-
Export stratified results:
- Separate files for annual, perennial, and mixed cliques
- Gene lists for downstream enrichment analysis
Output:
coexpressolog_cliques_{tissue}_{habit}.tsv(4 files: all, annual, perennial, shared)genes_{tissue}_{habit}.txt(gene lists for GO/pathway enrichment)
Unsigned path mirrors these outputs with coexpressolog_cliques_unsigned_* and genes_unsigned_*.
Statistical Rigor:
- Hypergeometric test models drawing without replacement (appropriate for networks)
- Bidirectional testing ensures mutual conservation
- FDR correction controls false discoveries in large-scale testing
Biological Relevance:
- Cliques represent tightly coordinated gene modules
- Conservation across species indicates functional importance
- Life habit stratification reveals adaptation-specific networks
- Tissue specificity captures organ-specialized regulation
Computational Efficiency:
- Parallelization across species pairs and ortholog pairs
- Sparse networks reduce memory and computation
- Nextflow enables resumable, scalable execution
-
vst_hog.RDS (253 MB): Variance-stabilized expression data
- Columns: species, tissue, GeneID, sample_id, vst.count, life_cycle, HOG
- Tissues: "root" and "leaf"
- 13 grass species (5 annual, 8 perennial)
-
N1_clean.RDS (6 MB): Hierarchical ortholog groups
- Columns: HOG, OG, species, GeneID, life_cycle, is_core
- Format: One row per gene-HOG association
Annual (5 species):
- Brachypodium distachyon
- Hordeum vulgare
- Vulpia bromoides
- Briza maxima
- Poa annua
Perennial (8 species):
- Brachypodium sylvaticum
- Hordeum jubatum
- Poa supina
- Briza media
- Festuca pratensis
- Melica nutans
- Nassella pubiflora
- Oloptum miliaceum
- Test mode: 15-30 minutes (3 pairs per tissue)
- Full pipeline: 2-3 hours (both tissues, 78 pairs each)
- Parallel execution: ~10 jobs simultaneously
- Resources per job: 24 CPUs, 200-400 GB RAM (adaptive)
- Intermediate files: 10-50 GB
- Final results: 1-5 GB
- Nextflow work: 50-200 GB (can be cleaned after successful run)
# Watch SLURM queue
watch -n 10 'squeue -u $USER'
# View logs
tail -f .nextflow.log
# Check disk space
df -h /mnt/users/martpali/AnnualPerennial/No cliques found: Check p-value threshold in config/pipeline_config.yaml (default: 0.05)
Memory errors: Pipeline automatically retries with increased memory (200 GB → 400 GB)
Jobs stuck in queue: Check SLURM partition limits with scontrol show partition orion
If you use this pipeline, please cite:
- ComPlEx method: Netotea et al. (2014) BMC Genomics. doi:10.1186/1471-2164-15-106
- R implementation: Torgeir R. Hvidsten
- Nextflow pipeline & parallelization: Martin Paliocha
- Pipeline repository: paliocha/RComPlEx-NF