2 changes: 1 addition & 1 deletion Metagenomics/README.md
## Select a specific pipeline for more info:

* [Estimating host reads](Estimate_host_reads_in_raw_data)
* [Removing host reads](Remove_host_reads)
* [Removing human or host reads](Remove_host_reads)
* [Illumina](Illumina)

<br>

---

**Date:** March X, 2026
**Revision:** B
**Document Number:** GL-DPPD-7105-B

**Submitted by:**
Jihan Yehia (GeneLab Data Processing Team)

**Approved by:**
Samrawit Gebre (OSDR Project Manager)
Danielle Lopez (OSDR Deputy Project Manager)
Jonathan Galazka (OSDR Project Scientist)
Amanda Saravia-Butler (OSDR Subject Matter Expert)
Barbara Novak (GeneLab Data Processing Lead)

## Updates from previous revision
* Updated kraken2 from version 2.1.1 to 2.1.6
* In [Step 1](#1-build-kraken2-database), replaced direct `kraken2-build` calls with kraken2's `k2` wrapper script, which provides a higher-level interface for database construction including HTTPS download support, as described [here](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#introducing-k2)

---


|Program|Version*|Relevant Links|
|:------|:-----:|------:|
|kraken2| 2.1.6 |[https://github.com/DerrickWood/kraken2/wiki](https://github.com/DerrickWood/kraken2/wiki)|

> \* Exact versions utilized for a given dataset are available along with the processing commands for each specific dataset (as the software may be updated regularly).

---

# General processing overview with example commands

> Output files listed in **bold** below are included with each Metagenomics dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).
> **Note:** the examples provided are for human read removal. For the more generic host read removal, please refer to the Workflow documentation.

## 1. Build kraken2 database
For human read removal, building the database relies on kraken2 downloading the sequences from NCBI,
which may include multiple versions of the genome (for example, both [GRCh38.p14](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.40)
and [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/GCF_009914755.1)). To use this pipeline for
read removal of hosts other than human, download the host fasta file and add it to the kraken2 database
using the `k2 add-to-library` function, as described in the [kraken2 documentation](https://github.com/DerrickWood/kraken2/wiki/Manual#add-to-library) or
in the NASA GeneLab [Estimate Host Reads pipeline](../../Estimate_host_reads_in_raw_data/Pipeline_GL-DPPD-7109_Versions/).

> **Note:** It is recommended to use NCBI genome files with kraken2 because sequences not downloaded from
NCBI may require explicit assignment of taxonomy information before they can be used to build the
database, as mentioned in the [Kraken2 Documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown).

```bash
# download human fasta sequences
k2 download-library --db kraken2-human-db/ --library human --threads 30 --no-masking

# Download NCBI taxonomic information
k2 download-taxonomy --db kraken2-human-db/

# Build the database
k2 build --db kraken2-human-db/ --kmer-len 35 --minimizer-len 31 --threads 30

# Clean up intermediate files
k2 clean --db kraken2-human-db/
```

**Parameter Definitions:**

* `download-library` - downloads the specified reference sequence library
* `--library` - specifies the library to download (here, the human reference genome)
* `--no-masking` - disables [masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences) of low-complexity sequences
* `--db` - specifies the name of the directory for the kraken2 database
* `--threads` - specifies the number of threads to use
* `download-taxonomy` - downloads NCBI taxonomic mapping information
* `build` - builds a kraken2-formatted database from the library files
* `--kmer-len` - k-mer length in bp (default: 35)
* `--minimizer-len` - minimizer length in bp (default: 31)
* `clean` - removes unneeded intermediate files

**Input data:**

* None

**Output data:**

* kraken2-human-db/ (Kraken2 human database directory, containing the hash.k2d, opts.k2d, and taxo.k2d files)
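As a quick sanity check (a sketch; the directory name follows this example, and a real database's `.k2d` files are several GB), one can verify that the three database files exist and are non-empty:

```shell
# check_k2_db: verify a kraken2 database directory contains the three
# required non-empty files (hash.k2d, opts.k2d, taxo.k2d)
check_k2_db() {
    db="$1"
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -s "${db}/${f}" ]; then
            echo "incomplete: missing ${db}/${f}"
            return 1
        fi
    done
    echo "complete"
}

# demo on a mock directory (a real check would point at kraken2-human-db/)
mkdir -p mock-db
printf 'x' > mock-db/hash.k2d
printf 'x' > mock-db/opts.k2d
printf 'x' > mock-db/taxo.k2d
check_k2_db mock-db
```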

---

## 2. Filter out human-classified reads

>**Note:** HRrm is the file suffix NASA GeneLab uses for all files with human reads removed. If host reads are removed, the suffix is updated to "HostRm" instead.

### Example if paired-end reads

```bash
kraken2 --db kraken2-human-db --gzip-compressed --threads 4 --use-names --paired \
--unclassified-out sample-1_R#.fastq sample-1-R1.fq.gz sample-1-R2.fq.gz

# renaming and gzipping output files
mv sample-1_R_1.fastq sample-1_R1_HRrm.fastq && gzip sample-1_R1_HRrm.fastq
mv sample-1_R_2.fastq sample-1_R2_HRrm.fastq && gzip sample-1_R2_HRrm.fastq
```

**Parameter Definitions:**

**Input data:**

* kraken2-human-db (Kraken2 human database directory, from [Step 1](#1-build-kraken2-database))
* sample-1-R1.fq.gz (gzipped forward-reads fastq file)
* sample-1-R2.fq.gz (gzipped reverse-reads fastq file)

**Output data:**

* sample-1-kraken2-output.txt (kraken2 read-based output file (one line per read))
* sample-1-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
* **sample-1_R1_HRrm.fastq.gz** (human-read removed, gzipped forward-reads fastq file)
* **sample-1_R2_HRrm.fastq.gz** (human-read removed, gzipped reverse-reads fastq file)
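A quick way to confirm how many read pairs were removed is to compare read counts of the input and output fastq files, at four lines per record (a sketch; the two tiny mock files below stand in for the real `sample-1-R1.fq.gz` input and `sample-1_R1_HRrm.fastq.gz` output):

```shell
# Mock fastq files: 3 reads in, 2 reads kept after removal.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nACGT\n+\nIIII\n' | gzip > in_R1.fq.gz
printf '@r1\nACGT\n+\nIIII\n@r3\nACGT\n+\nIIII\n' | gzip > out_R1_HRrm.fastq.gz

# fastq records are 4 lines each, so reads = lines / 4
reads_in=$(( $(zcat in_R1.fq.gz | wc -l) / 4 ))
reads_out=$(( $(zcat out_R1_HRrm.fastq.gz | wc -l) / 4 ))
echo "read pairs removed: $(( reads_in - reads_out ))"
```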

### Example if single-end reads

```bash
kraken2 --db kraken2-human-db --gzip-compressed --threads 4 --use-names \
--output sample-1-kraken2-output.txt --report sample-1-kraken2-report.tsv \
--unclassified-out sample-1_HRrm.fastq sample-1.fq.gz

# gzipping output file
gzip sample-1_HRrm.fastq
```

**Parameter Definitions:**

* sample-1-kraken2-output.txt (kraken2 read-based output file (one line per read))
* sample-1-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
* **sample-1_HRrm.fastq.gz** (human-read removed, gzipped reads fastq file)

---

## 3. Generate a kraken2 summary report
This step uses standard Unix command-line utilities.

>**Note:** In the case of host removal rather than human removal, update the script below by changing "human" to "host".

```bash
total_fragments=$(wc -l sample-1-kraken2-output.txt | sed 's/^ *//' | cut -f 1 -d " ")

fragments_retained=$(grep -w -m 1 "unclassified" sample-1-kraken2-report.tsv | cut -f 2)

perc_removed=$(printf "%.2f\n" $(echo "scale=4; 100 - ${fragments_retained} / ${total_fragments} * 100" | bc -l))

cat <( printf "Sample_ID\tTotal_fragments_before\tTotal_fragments_after\tPercent_human_reads_removed\n" ) \
<( printf "Sample-1\t${total_fragments}\t${fragments_retained}\t${perc_removed}\n" ) > Human-read-removal-summary.tsv
```


**Output data:**

* Human-read-removal-summary.tsv (a tab-separated file with 4 columns: "Sample_ID", "Total_fragments_before", "Total_fragments_after", "Percent_human_reads_removed")
* *Note: The percent human reads removed from each sample is provided in the assay table on the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).*

---

### 2. Download the Workflow Files

All files required for utilizing the NF_MGRemoveHostReads GeneLab workflow for removing human or host reads from metagenomics sequencing data are in the [workflow_code](workflow_code) directory. To get a copy of the latest *NF_MGRemoveHostReads* version on to your system, the code can be downloaded as a zip file from the release page then unzipped after downloading by running the following commands:

```bash
wget <insert URL here>
```

This workflow can be run by providing the path to a text file containing a single-column list of unique sample identifiers, an example of which is shown [here](workflow_code/unique-sample-IDs.txt), along with the path to input data (raw reads of samples).

It also requires setting the root directory where kraken2 reference databases are (or will be) stored. The workflow assumes databases follow the naming convention `kraken2-<host>-db` and determines how to obtain the database in the following order:

1. **Database already exists** in the provided root directory: the workflow reuses it directly, skipping the build step entirely.
2. **Pre-built database URL**: the workflow downloads and unpacks the database into the root directory. An example is available in the [reference database info page](https://github.com/nasa/GeneLab_Data_Processing/blob/master/Metagenomics/Remove_host_reads/Workflow_Documentation/NF_MGRemoveHostReads/reference-database-info.md), which describes how the human database was generated for a previous version of this workflow and how to obtain it for reuse.
3. **Custom FASTA file**: the workflow builds a database from scratch using the provided FASTA file (local path or URL).
4. **hosts.csv lookup**: the workflow parses the hosts table [here](workflow_code/assets/hosts.csv) to retrieve the FASTA URL and build the database from scratch.
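For option 4, each host is a single CSV row; the previous revision of this pipeline documented the columns as `name,species,refseq_ID,genome_build,FASTA_URL`, so a hypothetical mouse entry might look like the following (the URL is illustrative, following NCBI's standard assembly path layout, and should be checked against the current hosts.csv schema):

```
name,species,refseq_ID,genome_build,FASTA_URL
mouse,Mus musculus,GCF_000001635.27,GRCm39,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_genomic.fna.gz
```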

For options 2 and 3, it is recommended that the user provide the host genome assembly name and accession number as parameters for protocol documentation purposes. For option 4, if the host entry in hosts.csv has no FASTA URL specified, the workflow uses kraken2's `k2 download-library` command to download sequences directly from NCBI, automatically capturing assembly details, so no additional parameters are needed. If a FASTA URL is present in hosts.csv, the assembly name and accession are picked up from the same table entry for protocol documentation.

In all cases where the database is built from scratch, it is saved in the root directory using the `kraken2-<host>-db` naming convention for reuse in subsequent runs.
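As a concrete example of the pre-built route (a sketch; the archive below is mocked up with empty files, and `kraken2DBs` is the default root directory), the layout the workflow expects looks like this:

```shell
# Mock a pre-built database archive to show the expected layout
# (a real archive would contain multi-GB hash.k2d/opts.k2d/taxo.k2d files).
mkdir -p kraken2-human-db
touch kraken2-human-db/hash.k2d kraken2-human-db/opts.k2d kraken2-human-db/taxo.k2d
tar -czf kraken2-human-db.tar.gz kraken2-human-db

# Unpack into the database root so the workflow finds it by name.
mkdir -p kraken2DBs
tar -xzf kraken2-human-db.tar.gz -C kraken2DBs/
ls kraken2DBs/kraken2-human-db/
```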

> Note: Nextflow commands use both single hyphen arguments (e.g. -profile) that denote general Nextflow arguments and double hyphen arguments (e.g. --osd) that denote workflow specific parameters. Take care to use the proper number of hyphens for each argument.

```bash
nextflow run main.nf \
-resume \
-profile singularity \
--ref_dbs_dir <Path to kraken2 reference database> \
--sample_id_list unique_sample_ids.txt \
--reads_dir <Path to reads directory>
```
* `mamba` - instructs Nextflow to use conda environments via the mamba package manager
* Other option (can be combined with the software environment option above using a comma, e.g. `-profile slurm,singularity`):
* `slurm` - instructs Nextflow to use the [Slurm cluster management and job scheduling system](https://slurm.schedmd.com/overview.html) to schedule and run the jobs on a Slurm HPC cluster
* `--ref_dbs_dir` - Specifies the path to the directory where kraken2 databases are or will be stored (type: string, default: "./kraken2DBs")
* `--sample_id_list` - path to a single-column file with unique sample identifiers (type: string, default: null)
> *Note: An example of this file is provided in the [workflow_code](workflow_code) directory [here](workflow_code/unique-sample-IDs.txt).*
* `--reads_dir` - path to raw reads directory (type: string, default: null)
* `--single_suffix` - raw reads suffix that follows the unique part of sample names (type: string, default: "_raw.fastq.gz")
* `--R1_suffix` - raw forward reads suffix that follows the unique part of sample names (type: string, default: "_R1_raw.fastq.gz")
* `--R2_suffix` - raw reverse reads suffix that follows the unique part of sample names (type: string, default: "_R2_raw.fastq.gz")
* `--out_suffix` - human- or host-removed reads suffix that follows the unique part of sample names (type: string, default: "_HRrm" for human-removed, or "_HostRm" for host-removed)
* `--host` - host organism; should match an entry in the `name` column of [hosts.csv](workflow_code/assets/hosts.csv) (type: string, default: "human")
* `--db_url` - URL to download a pre-built Kraken2 database (type: string, default: ""). If provided, `--ref_fasta` is not required.
* `--ref_fasta` - local path or URL of a custom reference genome FASTA file to build a Kraken2 database from (type: string, default: ""). If provided, `--assembly_name` and `--assembly_acc` are also recommended.
* `--assembly_name` - assembly name for the host reference genome, e.g. `GRCh38.p14` (type: string, default: ""). Recommended when using `--db_url` or `--ref_fasta`, for protocol documentation.
* `--assembly_acc` - assembly accession number for the host reference genome, e.g. `GCF_000001405.40` (type: string, default: ""). Recommended when using `--db_url` or `--ref_fasta`, for protocol documentation.


These parameters and additional optional arguments for the NF_MGRemoveHostReads workflow, including options that may not be immediately useful for most users, can be viewed by running the following command:

```bash
nextflow run NF_MGRemoveHostReads_1.0.0/main.nf --help
```

See `nextflow run -h` and [Nextflow's CLI run command documentation](https://nextflow.io/docs/latest/cli.html#run) for more options and details common to all nextflow workflows.


<br>
