GitHub - KonstantinChekanow/bacteria_analysis

Bacteria Diversity Analysis Pipeline

The pipeline performs:

Taxonomy resolution by the pytaxonkit;
Pseudo-rarefaction using a multivariate hypergeometric distribution
Alpha diversity calculation (with bootstrap CIs)
Beta diversity analysis using Bray–Curtis dissimilarity
Time-series visualisation of diversity metrics

The primary goal is to compare microbial diversity across time and between experimental conditions (e.g. pre-fish vs post-fish).

Environment

This project includes an environment.yml file that defines all required software dependencies.

This ensures that all collaborators use the same:

Python version
Library versions
Statistical implementations

To create the environment:

Inputs

1. Abundance Tables (*.csv, each file is a sample)

Each file should contain the columns:

tax_id
reference_length
average_depth_full_reference Each file name is the the sample ID.

2. Metadata (*.csv)

The metadata file should contain the following variables (columns):

sample_id (matching CSV filenames without extension)
collection_date
pre_or_post_fish
site
total_bases (total sequencing bases - for pseudorarefaction)

Installation required packages

Create and activate a conda environment.

Run tests

Running the Pipeline

Rarefaction Options

automatic threshold (--auto) and user-defined threshold (--threshold XXXXXX) [last case: e.g. XXXXXX 1000000000, where XXXXXX is the number of bases] Rarefaction uses a multivariate hypergeometric distribution (sampling without replacement), which preserves finite sequencing depth and prevents artificial inflation of low-abundance taxa.

Outputs

The pipeline generates the following outputs in the output directory:

Alpha Diversity

species_alpha_diversity.csv
genus_alpha_diversity.csv

Each file contains:

Mean alpha diversity metric across rarefaction replicates
95% confidence intervals
Metadata merged for downstream plotting

Beta Diversity

Beta diversity (ordination + temporal kinetics)

This project computes beta diversity using multiple distance metrics and produces:

NMDS ordination (stress + PERMANOVA)
PCoA ordination (% variance explained + PERMANOVA)
Temporal kinetics per metric:
- distance-to-baseline over time (site 1 and site 2 as dashed lines; mean across sites as solid line)
- distance-to-previous over time (same line styling)
Temporal trajectory analysis in PCoA space (arrows over time + summary statistics)

Supported beta metrics

Bray–Curtis (braycurtis)
Jaccard presence/absence (jaccard)
Canberra (canberra)
Euclidean after Hellinger transform (euclidean_hellinger)

Inputs

A wide abundance matrix CSV (rows = sample_id, columns = taxa; values = bases/counts), e.g.:
- diversity_out/species_matrix_mean.csv
- diversity_out/genus_matrix_mean.csv
A metadata CSV with at least:
- sample_id, site, collection_date, pre_or_post_fish

Abundance Matrices

species_matrix_mean.csv
genus_matrix_mean.csv

Replicate-mean abundance matrices suitable for downstream analysis.

Plots

All visualisations are saved in: /plots/

Plots include:

Violin plots comparing pre- vs post-fish diversity
Time-series alpha diversity plots (mean ± 95% CI)

Alpha Diversity Metrics

The following metrics are calculated:

Shannon index (primary diversity metric)
Simpson index (dominance-sensitive)
Observed OTUs (richness) # actually it is the number of observed taxa (species or genera)
Pielou’s evenness
Fisher’s alpha

Confidence intervals are estimated across rarefaction replicates.

Beta Diversity

Beta diversity is calculated using:

Bray–Curtis dissimilarity

Time-series analysis quantifies divergence from baseline samples, allowing assessment of ecological drift or treatment-induced shifts over time.

Testing

Unit tests are implemented using pytest.

To run tests: pytest

Unifrac analysis

UniFrac distances are computed based on the GTDB bacterial reference tree (bac120, release r226).

Two variants are calculated:

Weighted UniFrac – abundance-weighted phylogenetic dissimilarity

Unweighted UniFrac – presence/absence phylogenetic dissimilarity

###Mapping strategy

Input data contain NCBI tax_ids. For UniFrac computation:

NCBI tax_ids are resolved to species-level names.

Species names are mapped to GTDB representative genomes using bac120_metadata_r226.tsv.gz.

Abundances are aggregated to GTDB representative genome IDs.

Only taxa that successfully match tips in the GTDB tree (bac120_r226.tree.gz) are retained.

Higher-level assignments (e.g., family-level taxa) and ambiguous “uncultured bacterium” entries are excluded from the UniFrac analysis. A mapping report is generated indicating the fraction of total bases retained after phylogenetic mapping.

###Outputs

For both weighted and unweighted UniFrac, the pipeline produces:

-Distance matrices

-PCoA ordination plots

-Distance-to-baseline time-series plots

-Distance-to-previous time-series plots

-Temporal trajectory analysis (path length, net displacement)

Necessary tools

Python 3.11
numpy
pandas
scipy
scikit-bio
seaborn
matplotlib
pytaxonkit
pytest

Python commands fo running of the codes

###data processing + alpha-diversity + beta-diversity python bacteria/run_pipeline.py --input_folder relative_abundance_tables --metadata_file sample_metadata.csv --outdir diversity_out --threshold 5000000 --n_reps 50 --seed 1 --run_beta --beta_metrics braycurtis jaccard canberra euclidean_hellinger --permutations 999 --label_max_points 250

Argument Description --input_folder Folder containing per-sample CSV files (e.g., 2.csv, 3.csv, etc.). Each file must contain tax_id and either bases or (reference_length and average_depth_full_reference). --metadata_file CSV file with columns: sample_id, site, collection_date, pre_or_post_fish. --outdir Output directory where all results will be written. --threshold One of (--threshold or --auto) Integer Rarefaction depth (number of bases per sample after subsampling). Must be ≤ smallest sample total bases. --auto One of (--threshold or --auto) Flag Automatically sets rarefaction depth to the minimum total bases across samples. --n_reps Integer (default=50) Number of rarefaction replicates per sample. Used to compute mean and 95% CI for alpha diversity. --seed Integer (default=1) Random seed for reproducibility (rarefaction + NMDS). --run_beta If set, runs the full beta diversity analysis on rarefied mean matrices. --beta_metrics Space-separated beta diversity metrics. Supported: braycurtis, jaccard, canberra, euclidean_hellinger. --permutations Integer (default=999) Number of permutations used in PERMANOVA tests. --label_max_points Integer (default=250) Maximum number of ordination points labeled with collection dates (prevents overcrowding).

###UniFrac analysis [used only at the level of species] Argument Description bacteria/gtdb_unifrac.py --matrix_csv Species or genus matrix --metadata_csv Sample metadata --outdir Output directory --metrics Space-separated list of metrics --make_nmds Produce NMDS plots --make_pcoa Produce PCoA plots --make_timeseries Produce distance-over-time plots --make_trajectory Perform trajectory analysis

Argument Description --species_matrix_csv Species-level abundance matrix --metadata_csv Metadata CSV --gtdb_metadata_tsv_gz GTDB bac120 metadata file --gtdb_tree GTDB bac120 tree (.tree or .tree.gz) --outdir Output directory --no_weighted Skip weighted UniFrac --no_unweighted Skip unweighted UniFrac --chunk_size Chunk size for reading metadata --make_pcoa Generate PCoA plots

###Project structure graph bacteria_analysis/ │ ├── bacteria/ │ ├── run_pipeline.py │ ├── beta_diversity_pipeline.py │ ├── gtdb_unifrac.py │ ├── rarefaction.py │ ├── alpha.py │ └── ... │ ├── gtdb_r226/ │ ├── bac120_metadata_r226.tsv.gz │ └── bac120_r226.tree.gz │ ├── sample_metadata.csv ├── relative_abundance_tables/ └── README.md

Name		Name	Last commit message	Last commit date
Latest commit History 67 Commits
.github/workflows		.github/workflows
bacteria		bacteria
.gitignore		.gitignore
README.md		README.md
environment.yml		environment.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Environment

Inputs

1. Abundance Tables (*.csv, each file is a sample)

2. Metadata (*.csv)

Installation required packages

Run tests

Running the Pipeline

Rarefaction Options

Outputs

Alpha Diversity

Beta Diversity

Beta diversity (ordination + temporal kinetics)

Supported beta metrics

Inputs

Abundance Matrices

Plots

Alpha Diversity Metrics

Beta Diversity

Testing

Unifrac analysis

Necessary tools

Python commands fo running of the codes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Environment

Inputs

1. Abundance Tables (*.csv, each file is a sample)

2. Metadata (*.csv)

Installation required packages

Run tests

Running the Pipeline

Rarefaction Options

Outputs

Alpha Diversity

Beta Diversity

Beta diversity (ordination + temporal kinetics)

Supported beta metrics

Inputs

Abundance Matrices

Plots

Alpha Diversity Metrics

Beta Diversity

Testing

Unifrac analysis

Necessary tools

Python commands fo running of the codes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages