A Snakemake workflow for the post-processing of microbial genome assemblies.
The usage of this workflow is described in the Snakemake Workflow Catalog.
Detailed information about input data and workflow configuration can also be found in the config/README.md.
If you use this workflow in a paper, please credit the authors by citing the URL of this repository.
Workflow overview:
- Parse the samples.csv table containing the samples' metadata (python)
- Annotate assemblies using one of the following tools:
  - NCBI's Prokaryotic Genome Annotation Pipeline (PGAP). Note: needs to be installed manually
  - prokka, a fast and light-weight prokaryotic annotation tool
  - bakta, a fast, alignment-free annotation tool. Note: Bakta will automatically download its companion database from Zenodo (light: 1.5 GB, full: 40 GB)
- Create a QC report for the assemblies using Quast
- Create a pangenome analysis (orthologs/homologs) using Panaroo
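The first step in the overview, parsing the sample table, can be sketched roughly as follows. This is a minimal illustration using only Python's standard library, not the workflow's own parser. The column names `sample` and `file` and the function `parse_samples` are hypothetical; the actual table layout is described in config/README.md.

```python
import csv
import io

# Hypothetical required columns; see config/README.md for the real layout.
REQUIRED_COLUMNS = ("sample", "file")


def parse_samples(handle, required=REQUIRED_COLUMNS, key_column="sample"):
    """Read a sample table and return a dict of rows keyed by sample name."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in required if col not in header]
    if missing:
        raise ValueError(f"samples table is missing columns: {missing}")
    samples = {}
    for row in reader:
        name = (row[key_column] or "").strip()
        if name in samples:
            raise ValueError(f"duplicate sample name: {name}")
        # Strip whitespace; skip the overflow key csv adds for surplus fields.
        samples[name] = {
            key: (value or "").strip()
            for key, value in row.items()
            if key is not None
        }
    return samples


# In-memory example table so the sketch runs without any files on disk.
example = io.StringIO(
    "sample,file\n"
    "strain_A,data/strain_A.fasta\n"
    "strain_B,data/strain_B.fasta\n"
)
samples = parse_samples(example)
print(sorted(samples))
```

Failing early on missing columns or duplicate sample names keeps such problems from surfacing later as harder-to-read errors in the annotation steps.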
Step 1: Clone this repository
git clone https://github.com/MPUSP/snakemake-assembly-postprocessing.git
cd snakemake-assembly-postprocessing
Step 2: Install dependencies
It is recommended to install snakemake and run the workflow with conda or mamba. Miniforge is the preferred conda-forge installer and includes conda, mamba and their dependencies.
Step 3: Create snakemake environment
This step creates a new conda environment called snakemake-assembly-postprocessing.
mamba create -c conda-forge -c bioconda -n snakemake-assembly-postprocessing snakemake pandas
conda activate snakemake-assembly-postprocessing
Step 4: Install PGAP
- If you want to use PGAP for annotation, it needs to be installed separately.
- PGAP can be downloaded from https://github.com/ncbi/pgap. Please follow the installation instructions there.
- Define the path to the pgap.py script (located in the scripts folder) in the config file (recommended: ./resources)
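Before pointing the config file at a PGAP installation, it can help to confirm that the script is actually where you expect it. The following is a small sketch under stated assumptions: `resolve_pgap_script` is a hypothetical helper, it does not read the workflow's config file, and the demo uses a temporary directory in place of a real PGAP checkout.

```python
import tempfile
from pathlib import Path


def resolve_pgap_script(pgap_dir, script_name="pgap.py"):
    """Return the absolute path to the script in pgap_dir, or raise if absent."""
    script = Path(pgap_dir).expanduser() / script_name
    if not script.is_file():
        raise FileNotFoundError(f"{script_name} not found in {pgap_dir}")
    return script.resolve()


# Demo with a temporary directory standing in for the PGAP scripts folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pgap.py").touch()
    print(resolve_pgap_script(tmp).name)
```

In practice you would pass the folder that contains pgap.py in your own installation and copy the resulting path into the config file.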
To run the workflow from the command line, change to the repository directory.
cd snakemake-assembly-postprocessing
Adjust options in the default config file config/config.yml.
Before running the complete workflow, you can perform a dry run using:
snakemake --cores 1 --dry-run
To run the workflow with test files using conda:
snakemake --cores 2 --sdm conda --directory .test
To run the workflow with test files using apptainer:
snakemake --cores 2 --sdm conda apptainer --directory .test
- Dr. Rina Ahmed-Begrich
  - Affiliation: Max-Planck-Unit for the Science of Pathogens (MPUSP), Berlin, Germany
  - ORCID profile: https://orcid.org/0000-0002-0656-1795
- Dr. Michael Jahn
  - Affiliation: Max-Planck-Unit for the Science of Pathogens (MPUSP), Berlin, Germany
  - ORCID profile: https://orcid.org/0000-0002-3913-153X
  - GitHub page: https://github.com/m-jahn
Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014 Jul 15;30(14):2068-9. PMID: 24642063. https://doi.org/10.1093/bioinformatics/btu153.
Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A. Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microb Genom. 7(11):000685, 2021. PMID: 34739369. https://doi.org/10.1099/mgen.0.000685.
Li W, O'Neill KR, Haft DH, DiCuccio M, Chetvernin V, Badretdin A, Coulouris G, Chitsaz F, Derbyshire MK, Durkin AS, Gonzales NR, Gwadz M, Lanczycki CJ, Song JS, Thanki N, Wang J, Yamashita RA, Yang M, Zheng C, Marchler-Bauer A, Thibaud-Nissen F. RefSeq: Expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation. Nucleic Acids Res, 2021 Jan 8;49(D1):D1020-D1028. https://doi.org/10.1093/nar/gkaa1105
Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 29(8):1072-5, 2013. PMID: 23422339. https://doi.org/10.1093/bioinformatics/btt086.
Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G, Lees JA, Gladstone RA, Lo S, Beaudoin C, Floto RA, Frost SDW, Corander J, Bentley SD, Parkhill J. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol. 21(1):180, 2020. PMID: 32698896. https://doi.org/10.1186/s13059-020-02090-4.
Köster, J., Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., Forster, J., Lee, S., Twardziok, S. O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., & Nahnsen, S. Sustainable data analysis with Snakemake. F1000Research, 10:33, 2021. https://doi.org/10.12688/f1000research.29032.2.