MPUSP/snakemake-assembly-postprocessing

snakemake-assembly-postprocessing

A Snakemake workflow for the post-processing of microbial genome assemblies.

Usage

The usage of this workflow is described in the Snakemake Workflow Catalog.

Detailed information about input data and workflow configuration can also be found in the config/README.md.

If you use this workflow in a paper, please give credit to the authors by citing the URL of this repository.

Workflow overview:

  1. Parse the samples.csv table containing the samples' metadata (python)
  2. Annotate assemblies using one of the following tools:
    1. NCBI's Prokaryotic Genome Annotation Pipeline (PGAP). Note: PGAP needs to be installed manually (see Installation, Step 4)
    2. prokka, a fast and light-weight prokaryotic annotation tool
    3. bakta, a fast, alignment-free annotation tool. Note: Bakta will automatically download its companion database from Zenodo (light: 1.5 GB, full: 40 GB)
  3. Create a QC report for the assemblies using QUAST
  4. Create a pangenome analysis (orthologs/homologs) using Panaroo
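
Step 1 of the overview can be illustrated with a short sketch. This is not the workflow's actual parser: it uses only Python's standard library, the function name parse_samples is hypothetical, and the column names sample and file are assumptions made for illustration. The real sample sheet layout is described in config/README.md.

```python
import csv
import io

# Assumed column names for illustration only; see config/README.md for the real ones.
REQUIRED_COLUMNS = ("sample", "file")


def parse_samples(handle):
    """Read a sample sheet and return a dict mapping sample name to its row."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"samples table is missing columns: {', '.join(missing)}")
    samples = {}
    for row in reader:
        name = row["sample"].strip()
        if not name:
            raise ValueError("samples table contains an empty sample name")
        if name in samples:
            raise ValueError(f"duplicate sample name: {name}")
        samples[name] = row
    return samples


# Illustrative in-memory sheet; in practice, pass an open file handle instead.
example = io.StringIO(
    "sample,file\n"
    "EC1,assemblies/EC1.fasta\n"
    "EC2,assemblies/EC2.fasta\n"
)
samples = parse_samples(example)
print(sorted(samples))
```

Failing early on missing columns or duplicate sample names keeps such errors out of the later annotation and pangenome steps, where they would be harder to trace.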

Installation

Step 1: Clone this repository

git clone https://github.com/MPUSP/snakemake-assembly-postprocessing.git
cd snakemake-assembly-postprocessing

Step 2: Install dependencies

It is recommended to install snakemake and run the workflow with conda or mamba. Miniforge is the preferred conda-forge installer and includes conda, mamba and their dependencies.

Step 3: Create snakemake environment

This step creates a new conda environment called snakemake-assembly-postprocessing.

mamba create -c conda-forge -c bioconda -n snakemake-assembly-postprocessing snakemake pandas
conda activate snakemake-assembly-postprocessing

Step 4: Install PGAP

  • If you want to use PGAP for annotation, it needs to be installed separately.
  • PGAP can be downloaded from https://github.com/ncbi/pgap. Please follow the installation instructions there.
  • Set the path to the pgap.py script (located in the scripts folder) in the config file (recommended: ./resources).
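
Before entering the path in the config file, it can help to confirm where pgap.py actually ended up. The following is a minimal sketch, assuming PGAP was placed somewhere below a directory such as ./resources; the helper name find_pgap_script is hypothetical and not part of the workflow or of PGAP.

```python
from pathlib import Path
import tempfile


def find_pgap_script(root):
    """Return the first pgap.py found below root, or None if there is none."""
    matches = sorted(Path(root).rglob("pgap.py"))
    return matches[0] if matches else None


# Demonstration with a throwaway directory standing in for ./resources.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "pgap" / "scripts" / "pgap.py"
    script.parent.mkdir(parents=True)
    script.touch()
    found = find_pgap_script(tmp)
    print(found.relative_to(tmp))
```

With a real installation, calling find_pgap_script("./resources") would return the path to put in the config file, or None if PGAP has not been installed there.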

Deployment options

To run the workflow from the command line, change into the repository directory:

cd snakemake-assembly-postprocessing

Adjust options in the default config file config/config.yml. Before running the complete workflow, you can perform a dry run using:

snakemake --cores 1 --dry-run

To run the workflow with test files using conda:

snakemake --cores 2 --sdm conda --directory .test

To run the workflow with test files using apptainer:

snakemake --cores 2 --sdm conda apptainer --directory .test
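
The three invocations above differ only in a few flags. As a sketch of how they relate, the hypothetical helper build_command below (not part of the workflow) assembles each command from the flags shown above: --cores, --sdm, --directory and --dry-run. It only builds and prints the commands; it does not run Snakemake.

```python
import shlex


def build_command(cores, deployment=(), directory=None, dry_run=False):
    """Assemble a snakemake command line as a list of arguments."""
    cmd = ["snakemake", "--cores", str(cores)]
    if deployment:
        # --sdm takes one or more deployment methods, e.g. "conda apptainer"
        cmd += ["--sdm", *deployment]
    if directory:
        cmd += ["--directory", directory]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


dry_run_cmd = build_command(1, dry_run=True)
conda_cmd = build_command(2, deployment=["conda"], directory=".test")
apptainer_cmd = build_command(2, deployment=["conda", "apptainer"], directory=".test")

for cmd in (dry_run_cmd, conda_cmd, apptainer_cmd):
    print(shlex.join(cmd))
```

To execute one of the commands from Python, the returned list could be passed to subprocess.run(cmd, check=True) inside the activated conda environment.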

Authors

References

Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014 Jul 15;30(14):2068-9. PMID: 24642063. https://doi.org/10.1093/bioinformatics/btu153.

Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A. Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microb Genom, 7(11):000685 2021. PMID: 34739369. https://doi.org/10.1099/mgen.0.000685.

Li W, O'Neill KR, Haft DH, DiCuccio M, Chetvernin V, Badretdin A, Coulouris G, Chitsaz F, Derbyshire MK, Durkin AS, Gonzales NR, Gwadz M, Lanczycki CJ, Song JS, Thanki N, Wang J, Yamashita RA, Yang M, Zheng C, Marchler-Bauer A, Thibaud-Nissen F. RefSeq: Expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation. Nucleic Acids Res, 2021 Jan 8;49(D1):D1020-D1028. https://doi.org/10.1093/nar/gkaa1105

Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 29(8):1072-5, 2013. PMID: 23422339. https://doi.org/10.1093/bioinformatics/btt086.

Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G, Lees JA, Gladstone RA, Lo S, Beaudoin C, Floto RA, Frost SDW, Corander J, Bentley SD, Parkhill J. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol. 21(1):180, 2020. PMID: 32698896. https://doi.org/10.1186/s13059-020-02090-4.

Köster, J., Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., Forster, J., Lee, S., Twardziok, S. O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., & Nahnsen, S. Sustainable data analysis with Snakemake. F1000Research, 10:33, 2021. https://doi.org/10.12688/f1000research.29032.2.