From 84180bee22972951ee2faf0cfb193829a74c80fd Mon Sep 17 00:00:00 2001
From: Rachel Colquhoun
Date: Tue, 13 May 2025 14:59:08 +0100
Subject: [PATCH 1/5] add logo

---
 README.md            |   4 +-
 docs/charon_logo.svg | 397 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 400 insertions(+), 1 deletion(-)
 create mode 100644 docs/charon_logo.svg

diff --git a/README.md b/README.md
index 8a20c8c..ac893d2 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,8 @@
 # Charon
 _Clean Host Associated Reads Out Nanopore_
 
+
+
 Probabilistically identify the host and microbial reads in a metagenomic dataset.
 
 UNDER ACTIVE DEVELOPMENT - this tool may not work out of the box but feel free to try it/make suggestions and watch the repo to be informed of releases!
@@ -132,4 +134,4 @@ Outputs:
 
 The probability score is the relative probability of seeing the number of unique hits against this category if the read is truly from the positive distribution, rather than the negative distribution.
 
-If `extract_file` and `--extract` specified, will output a file with a subset of input reads.
\ No newline at end of file
+If `extract_file` and `--extract` specified, will output a file with a subset of input reads.
diff --git a/docs/charon_logo.svg b/docs/charon_logo.svg
new file mode 100644
index 0000000..f0389ad
--- /dev/null
+++ b/docs/charon_logo.svg
@@ -0,0 +1,397 @@
+[397 lines of SVG markup not preserved in this copy]
\ No newline at end of file

From 840e66705526a45cce00ea998518daf5cf9efe9f Mon Sep 17 00:00:00 2001
From: Rachel Colquhoun
Date: Tue, 13 May 2025 15:00:51 +0100
Subject: [PATCH 2/5] update logo

---
 docs/charon_logo.svg | 398 +------------------------------------------
 1 file changed, 1 insertion(+), 397 deletions(-)

diff --git a/docs/charon_logo.svg b/docs/charon_logo.svg
index f0389ad..c07e13a 100644
--- a/docs/charon_logo.svg
+++ b/docs/charon_logo.svg
@@ -1,397 +1 @@
-[397 lines of SVG markup not preserved in this copy]
\ No newline at end of file
+
\ No newline at end of file

From 2e513a4e8d93aad366127f58667c0854e924b221 Mon Sep 17 00:00:00 2001
From: Rachel Colquhoun
Date: Tue, 13 May 2025 15:10:47 +0100
Subject: [PATCH 3/5] update logo

---
 docs/charon_logo.svg | 2 +-
 1 file changed, 1
insertion(+), 1 deletion(-)

diff --git a/docs/charon_logo.svg b/docs/charon_logo.svg
index c07e13a..fd55248 100644
--- a/docs/charon_logo.svg
+++ b/docs/charon_logo.svg
@@ -1 +1 @@
-
\ No newline at end of file
+
\ No newline at end of file

From 2f6ccb84139ef4fee0fc66c2be2a1c8d34b9a2a4 Mon Sep 17 00:00:00 2001
From: Rachel Colquhoun
Date: Wed, 14 May 2025 11:58:48 +0100
Subject: [PATCH 4/5] update README

---
 README.md | 39 ++++++++++++++++++++++-----------------
 1 file changed, 22 insertions(+), 17 deletions(-)

diff --git a/README.md b/README.md
index ac893d2..59dd602 100644
--- a/README.md
+++ b/README.md
@@ -29,6 +29,15 @@ scores which could be used to set an acceptable threshold for e.g. releasing met
 
 ## Quick Start
 
+### Download an Index
+
+Alternatively, a pre-built index for long reads is available via [Zenodo](https://zenodo.org/records/15398095).
+
+This index includes references from [HPRC](https://humanpangenome.org/) and representatives of Bacteria, Viruses, Archaea, Fungi and Sar with (mostly complete) genomes downloaded from NCBI RefSeq, including FDA-ARGOS.
+The category names are `[microbial, human]`.
+
+The compressed index has size approximately 39GB and needs to be decompressed with gzip before use.
+
 ### Build an Index
 
 This takes a tab separated file as input; the first column specifies the path to a reference file, the second column specified the name of the category which they belong to.
@@ -45,15 +54,6 @@ The index can then be built with:
 charon index -t 8
 ```
 
-### Download an Index
-
-Alternatively, a pre-built index is available if you know a free way to make a 4.1GB file publicly available.
-
-This index includes 10 references from [HPRC](https://humanpangenome.org/) and representatives of Bacteria, Viruses, Archaea and Fungi with complete genomes downloaded from NCBI RefSeq.
-The category names are `[microbial, human]`.
-
-The uncompressed index has size approximately 6GB and needs to be decompressed with bgzip before use.
-
 ### Dehost
 
 Classify `reads.fq.gz` using the categories in the index (one of which must be "host" or "human"):
@@ -65,13 +65,19 @@ charon dehost -t 8 --db
 
 Additionally extract the microbial fraction of the input dataset
 ```
-charon dehost -t 8 --db --extract microbial --extract_file
+charon dehost -t 8 --db --extract microbial --prefix
 ```
 
 ## Installation
 
-Currently available to build from source.
-It has been developed on MacOS and Unix.
+### Docker image
+A docker image is hosted on dockerhub and can be pulled using
+```
+docker pull rmcolq/charon
+```
+
+### Building from source
+Charon has been developed on MacOS and Unix.
 Requires a compiler for C++14 and cmake > 3.9.
 
 ```
@@ -99,9 +105,8 @@ Options:
  --db FILE [required] Prefix for the index.
- -e,--extract STRING Reads from this category in the index will be extracted to file.
- --extract_file FILE Fasta/q file for output
- --extract_file2 FILE Fasta/q file for output
+ -e,--extract STRING Reads from this category in the index will be extracted to file (options host, microbial, all).
+ --prefix PATH Prefix path for output extracted read files
  --chunk_size INT Read file is read in chunks of this size, to be processed in parallel within a chunk. [default: 100]
  --lo_hi_threshold FLOAT Threshold used during model fitting stage to decide if read should be used to train lo or hi distribution. [default: 0.15]
@@ -109,7 +114,7 @@ Options:
  -d,--dist STRING Probability distribution to use for modelling. [default: kde]
  --min_length INT Minimum read length to classify. [default: 140]
- --min_quality INT Minimum read quality to classify. [default: 10]
+ --min_quality INT Minimum read quality to classify. [default: 15]
  --min_compression FLOAT Minimum read gzip compression ratio to classify (a measure of how much information is in the read. [default: 0]
  --confidence INT Minimum difference between the top 2 unique hit counts.
[default: 2]
  --host_unique_prop_lo_threshold INT Require non-host reads to have unique host proportion below this threshold for classification. [default: 0.05]
@@ -134,4 +139,4 @@ Outputs:
 
 The probability score is the relative probability of seeing the number of unique hits against this category if the read is truly from the positive distribution, rather than the negative distribution.
 
-If `extract_file` and `--extract` specified, will output a file with a subset of input reads.
+If `extract_file` and `--prefix` specified, will output a file with a subset of input reads belonging to the names index category. Specifying `--extract all` will generate a file for both host and microbial reads (excludes unclassified).

From 7f54c2cfdbf994737581d419b62b88a84d37ecf9 Mon Sep 17 00:00:00 2001
From: Rachel Colquhoun
Date: Wed, 14 May 2025 12:02:34 +0100
Subject: [PATCH 5/5] update README

---
 README.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/README.md b/README.md
index 59dd602..a19b692 100644
--- a/README.md
+++ b/README.md
@@ -5,8 +5,6 @@ _Clean Host Associated Reads Out Nanopore_
 
 Probabilistically identify the host and microbial reads in a metagenomic dataset.
 
-UNDER ACTIVE DEVELOPMENT - this tool may not work out of the box but feel free to try it/make suggestions and watch the repo to be informed of releases!
-
 [TOC]: #
 
 # Table of Contents