Bacteria Diversity Analysis Pipeline
The pipeline performs:
- Taxonomy resolution by the pytaxonkit;
- Pseudo-rarefaction using a multivariate hypergeometric distribution
- Alpha diversity calculation (with bootstrap CIs)
- Beta diversity analysis using Bray–Curtis dissimilarity
- Time-series visualisation of diversity metrics
The primary goal is to compare microbial diversity across time and between experimental conditions (e.g. pre-fish vs post-fish).
This project includes an environment.yml file that defines all required software dependencies.
This ensures that all collaborators use the same:
- Python version
- Library versions
- Statistical implementations
To create the environment:
Each file should contain the columns:
tax_idreference_lengthaverage_depth_full_referenceEach file name is the the sample ID.
The metadata file should contain the following variables (columns):
sample_id(matching CSV filenames without extension)collection_datepre_or_post_fishsitetotal_bases(total sequencing bases - for pseudorarefaction)
Create and activate a conda environment.
automatic threshold (--auto) and user-defined threshold (--threshold XXXXXX) [last case: e.g. XXXXXX 1000000000, where XXXXXX is the number of bases] Rarefaction uses a multivariate hypergeometric distribution (sampling without replacement), which preserves finite sequencing depth and prevents artificial inflation of low-abundance taxa.
The pipeline generates the following outputs in the output directory:
species_alpha_diversity.csvgenus_alpha_diversity.csv
Each file contains:
- Mean alpha diversity metric across rarefaction replicates
- 95% confidence intervals
- Metadata merged for downstream plotting
This project computes beta diversity using multiple distance metrics and produces:
- NMDS ordination (stress + PERMANOVA)
- PCoA ordination (% variance explained + PERMANOVA)
- Temporal kinetics per metric:
- distance-to-baseline over time (site 1 and site 2 as dashed lines; mean across sites as solid line)
- distance-to-previous over time (same line styling)
- Temporal trajectory analysis in PCoA space (arrows over time + summary statistics)
- Bray–Curtis (
braycurtis) - Jaccard presence/absence (
jaccard) - Canberra (
canberra) - Euclidean after Hellinger transform (
euclidean_hellinger)
- A wide abundance matrix CSV (rows = sample_id, columns = taxa; values = bases/counts), e.g.:
diversity_out/species_matrix_mean.csvdiversity_out/genus_matrix_mean.csv
- A metadata CSV with at least:
sample_id,site,collection_date,pre_or_post_fish
species_matrix_mean.csvgenus_matrix_mean.csv
Replicate-mean abundance matrices suitable for downstream analysis.
All visualisations are saved in: /plots/
Plots include:
- Violin plots comparing pre- vs post-fish diversity
- Time-series alpha diversity plots (mean ± 95% CI)
The following metrics are calculated:
- Shannon index (primary diversity metric)
- Simpson index (dominance-sensitive)
- Observed OTUs (richness) # actually it is the number of observed taxa (species or genera)
- Pielou’s evenness
- Fisher’s alpha
Confidence intervals are estimated across rarefaction replicates.
Beta diversity is calculated using:
- Bray–Curtis dissimilarity
Time-series analysis quantifies divergence from baseline samples, allowing assessment of ecological drift or treatment-induced shifts over time.
Unit tests are implemented using pytest.
To run tests: pytest
UniFrac distances are computed based on the GTDB bacterial reference tree (bac120, release r226).
Two variants are calculated:
Weighted UniFrac – abundance-weighted phylogenetic dissimilarity
Unweighted UniFrac – presence/absence phylogenetic dissimilarity
###Mapping strategy
Input data contain NCBI tax_ids. For UniFrac computation:
NCBI tax_ids are resolved to species-level names.
Species names are mapped to GTDB representative genomes using bac120_metadata_r226.tsv.gz.
Abundances are aggregated to GTDB representative genome IDs.
Only taxa that successfully match tips in the GTDB tree (bac120_r226.tree.gz) are retained.
Higher-level assignments (e.g., family-level taxa) and ambiguous “uncultured bacterium” entries are excluded from the UniFrac analysis. A mapping report is generated indicating the fraction of total bases retained after phylogenetic mapping.
###Outputs
For both weighted and unweighted UniFrac, the pipeline produces:
-Distance matrices
-PCoA ordination plots
-Distance-to-baseline time-series plots
-Distance-to-previous time-series plots
-Temporal trajectory analysis (path length, net displacement)
- Python 3.11
- numpy
- pandas
- scipy
- scikit-bio
- seaborn
- matplotlib
- pytaxonkit
- pytest
###data processing + alpha-diversity + beta-diversity python bacteria/run_pipeline.py --input_folder relative_abundance_tables --metadata_file sample_metadata.csv --outdir diversity_out --threshold 5000000 --n_reps 50 --seed 1 --run_beta --beta_metrics braycurtis jaccard canberra euclidean_hellinger --permutations 999 --label_max_points 250
Argument Description --input_folder Folder containing per-sample CSV files (e.g., 2.csv, 3.csv, etc.). Each file must contain tax_id and either bases or (reference_length and average_depth_full_reference). --metadata_file CSV file with columns: sample_id, site, collection_date, pre_or_post_fish. --outdir Output directory where all results will be written. --threshold One of (--threshold or --auto) Integer Rarefaction depth (number of bases per sample after subsampling). Must be ≤ smallest sample total bases. --auto One of (--threshold or --auto) Flag Automatically sets rarefaction depth to the minimum total bases across samples. --n_reps Integer (default=50) Number of rarefaction replicates per sample. Used to compute mean and 95% CI for alpha diversity. --seed Integer (default=1) Random seed for reproducibility (rarefaction + NMDS). --run_beta If set, runs the full beta diversity analysis on rarefied mean matrices. --beta_metrics Space-separated beta diversity metrics. Supported: braycurtis, jaccard, canberra, euclidean_hellinger. --permutations Integer (default=999) Number of permutations used in PERMANOVA tests. --label_max_points Integer (default=250) Maximum number of ordination points labeled with collection dates (prevents overcrowding).
###UniFrac analysis [used only at the level of species] Argument Description bacteria/gtdb_unifrac.py --matrix_csv Species or genus matrix --metadata_csv Sample metadata --outdir Output directory --metrics Space-separated list of metrics --make_nmds Produce NMDS plots --make_pcoa Produce PCoA plots --make_timeseries Produce distance-over-time plots --make_trajectory Perform trajectory analysis
Argument Description --species_matrix_csv Species-level abundance matrix --metadata_csv Metadata CSV --gtdb_metadata_tsv_gz GTDB bac120 metadata file --gtdb_tree GTDB bac120 tree (.tree or .tree.gz) --outdir Output directory --no_weighted Skip weighted UniFrac --no_unweighted Skip unweighted UniFrac --chunk_size Chunk size for reading metadata --make_pcoa Generate PCoA plots
###Project structure graph bacteria_analysis/ │ ├── bacteria/ │ ├── run_pipeline.py │ ├── beta_diversity_pipeline.py │ ├── gtdb_unifrac.py │ ├── rarefaction.py │ ├── alpha.py │ └── ... │ ├── gtdb_r226/ │ ├── bac120_metadata_r226.tsv.gz │ └── bac120_r226.tree.gz │ ├── sample_metadata.csv ├── relative_abundance_tables/ └── README.md