Skip to content

KonstantinChekanow/bacteria_analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

67 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Bacteria Diversity Analysis Pipeline

The pipeline performs:

  1. Taxonomy resolution by the pytaxonkit;
  2. Pseudo-rarefaction using a multivariate hypergeometric distribution
  3. Alpha diversity calculation (with bootstrap CIs)
  4. Beta diversity analysis using Bray–Curtis dissimilarity
  5. Time-series visualisation of diversity metrics

The primary goal is to compare microbial diversity across time and between experimental conditions (e.g. pre-fish vs post-fish).


Environment

This project includes an environment.yml file that defines all required software dependencies.

This ensures that all collaborators use the same:

  • Python version
  • Library versions
  • Statistical implementations

To create the environment:

Inputs

1. Abundance Tables (*.csv, each file is a sample)

Each file should contain the columns:

  • tax_id
  • reference_length
  • average_depth_full_reference Each file name is the the sample ID.

2. Metadata (*.csv)

The metadata file should contain the following variables (columns):

  • sample_id (matching CSV filenames without extension)
  • collection_date
  • pre_or_post_fish
  • site
  • total_bases (total sequencing bases - for pseudorarefaction)

Installation required packages

Create and activate a conda environment.

Run tests

Running the Pipeline

Rarefaction Options

automatic threshold (--auto) and user-defined threshold (--threshold XXXXXX) [last case: e.g. XXXXXX 1000000000, where XXXXXX is the number of bases] Rarefaction uses a multivariate hypergeometric distribution (sampling without replacement), which preserves finite sequencing depth and prevents artificial inflation of low-abundance taxa.

Outputs

The pipeline generates the following outputs in the output directory:

Alpha Diversity

  • species_alpha_diversity.csv
  • genus_alpha_diversity.csv

Each file contains:

  • Mean alpha diversity metric across rarefaction replicates
  • 95% confidence intervals
  • Metadata merged for downstream plotting

Beta Diversity

Beta diversity (ordination + temporal kinetics)

This project computes beta diversity using multiple distance metrics and produces:

  • NMDS ordination (stress + PERMANOVA)
  • PCoA ordination (% variance explained + PERMANOVA)
  • Temporal kinetics per metric:
    • distance-to-baseline over time (site 1 and site 2 as dashed lines; mean across sites as solid line)
    • distance-to-previous over time (same line styling)
  • Temporal trajectory analysis in PCoA space (arrows over time + summary statistics)

Supported beta metrics

  • Bray–Curtis (braycurtis)
  • Jaccard presence/absence (jaccard)
  • Canberra (canberra)
  • Euclidean after Hellinger transform (euclidean_hellinger)

Inputs

  • A wide abundance matrix CSV (rows = sample_id, columns = taxa; values = bases/counts), e.g.:
    • diversity_out/species_matrix_mean.csv
    • diversity_out/genus_matrix_mean.csv
  • A metadata CSV with at least:
    • sample_id, site, collection_date, pre_or_post_fish

Abundance Matrices

  • species_matrix_mean.csv
  • genus_matrix_mean.csv

Replicate-mean abundance matrices suitable for downstream analysis.

Plots

All visualisations are saved in: /plots/

Plots include:

  • Violin plots comparing pre- vs post-fish diversity
  • Time-series alpha diversity plots (mean ± 95% CI)

Alpha Diversity Metrics

The following metrics are calculated:

  • Shannon index (primary diversity metric)
  • Simpson index (dominance-sensitive)
  • Observed OTUs (richness) # actually it is the number of observed taxa (species or genera)
  • Pielou’s evenness
  • Fisher’s alpha

Confidence intervals are estimated across rarefaction replicates.


Beta Diversity

Beta diversity is calculated using:

  • Bray–Curtis dissimilarity

Time-series analysis quantifies divergence from baseline samples, allowing assessment of ecological drift or treatment-induced shifts over time.


Testing

Unit tests are implemented using pytest.

To run tests: pytest

Unifrac analysis

UniFrac distances are computed based on the GTDB bacterial reference tree (bac120, release r226).

Two variants are calculated:

Weighted UniFrac – abundance-weighted phylogenetic dissimilarity

Unweighted UniFrac – presence/absence phylogenetic dissimilarity

###Mapping strategy

Input data contain NCBI tax_ids. For UniFrac computation:

NCBI tax_ids are resolved to species-level names.

Species names are mapped to GTDB representative genomes using bac120_metadata_r226.tsv.gz.

Abundances are aggregated to GTDB representative genome IDs.

Only taxa that successfully match tips in the GTDB tree (bac120_r226.tree.gz) are retained.

Higher-level assignments (e.g., family-level taxa) and ambiguous “uncultured bacterium” entries are excluded from the UniFrac analysis. A mapping report is generated indicating the fraction of total bases retained after phylogenetic mapping.

###Outputs

For both weighted and unweighted UniFrac, the pipeline produces:

-Distance matrices

-PCoA ordination plots

-Distance-to-baseline time-series plots

-Distance-to-previous time-series plots

-Temporal trajectory analysis (path length, net displacement)

Necessary tools

  • Python 3.11
  • numpy
  • pandas
  • scipy
  • scikit-bio
  • seaborn
  • matplotlib
  • pytaxonkit
  • pytest

Python commands fo running of the codes

###data processing + alpha-diversity + beta-diversity python bacteria/run_pipeline.py --input_folder relative_abundance_tables --metadata_file sample_metadata.csv --outdir diversity_out --threshold 5000000 --n_reps 50 --seed 1 --run_beta --beta_metrics braycurtis jaccard canberra euclidean_hellinger --permutations 999 --label_max_points 250

Argument Description --input_folder Folder containing per-sample CSV files (e.g., 2.csv, 3.csv, etc.). Each file must contain tax_id and either bases or (reference_length and average_depth_full_reference). --metadata_file CSV file with columns: sample_id, site, collection_date, pre_or_post_fish. --outdir Output directory where all results will be written. --threshold One of (--threshold or --auto) Integer Rarefaction depth (number of bases per sample after subsampling). Must be ≤ smallest sample total bases. --auto One of (--threshold or --auto) Flag Automatically sets rarefaction depth to the minimum total bases across samples. --n_reps Integer (default=50) Number of rarefaction replicates per sample. Used to compute mean and 95% CI for alpha diversity. --seed Integer (default=1) Random seed for reproducibility (rarefaction + NMDS). --run_beta If set, runs the full beta diversity analysis on rarefied mean matrices. --beta_metrics Space-separated beta diversity metrics. Supported: braycurtis, jaccard, canberra, euclidean_hellinger. --permutations Integer (default=999) Number of permutations used in PERMANOVA tests. --label_max_points Integer (default=250) Maximum number of ordination points labeled with collection dates (prevents overcrowding).

###UniFrac analysis [used only at the level of species] Argument Description bacteria/gtdb_unifrac.py --matrix_csv Species or genus matrix --metadata_csv Sample metadata --outdir Output directory --metrics Space-separated list of metrics --make_nmds Produce NMDS plots --make_pcoa Produce PCoA plots --make_timeseries Produce distance-over-time plots --make_trajectory Perform trajectory analysis

Argument Description --species_matrix_csv Species-level abundance matrix --metadata_csv Metadata CSV --gtdb_metadata_tsv_gz GTDB bac120 metadata file --gtdb_tree GTDB bac120 tree (.tree or .tree.gz) --outdir Output directory --no_weighted Skip weighted UniFrac --no_unweighted Skip unweighted UniFrac --chunk_size Chunk size for reading metadata --make_pcoa Generate PCoA plots

###Project structure graph bacteria_analysis/ │ ├── bacteria/ │ ├── run_pipeline.py │ ├── beta_diversity_pipeline.py │ ├── gtdb_unifrac.py │ ├── rarefaction.py │ ├── alpha.py │ └── ... │ ├── gtdb_r226/ │ ├── bac120_metadata_r226.tsv.gz │ └── bac120_r226.tree.gz │ ├── sample_metadata.csv ├── relative_abundance_tables/ └── README.md

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages