Home
Genome mining tools generally benefit from larger sequence databases to search. However, large genomic databases such as NCBI RefSeq often contain a high level of redundancy. Popular BLAST-based genome mining tools such as cblaster and CAGECAT propagate this redundancy into their hit sets, which tends to overcomplicate downstream analyses such as visualisations with clinker. These redundant hit sets are often filtered manually or by gene cluster-level clustering with BiG-SCAPE. However, manual curation is often too crude and cuts away gene cluster diversity, while curation with BiG-SCAPE ignores horizontal gene transfer (HGT) and other evolutionary events in which the gene cluster evolves differently from the overall genome.
CAGEcleaner is a redundancy removal tool for gene clusters that dereplicates hits at the genome level. In addition, it can go the extra mile to preserve gene cluster diversity.
More specifically, it links contigs with hits to their hosting genome assemblies, either downloads these assemblies automatically or couples them back to a local database, and then performs a speedy ANI-based full genome dereplication. Finally, it couples the dereplicated genomes back to their corresponding gene cluster hits. Optionally, it can enrich the reduced hit sets again by recovering the more diverse hits, based on an assessment of gene cluster contents and homology scores.
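The coupling and recovery steps described above can be illustrated with a minimal Python sketch. This is not CAGEcleaner's actual implementation: the function name `dereplicate_hits`, the hit fields (`contig`, `assembly`, `n_genes`, `score`), the `representative_of` mapping and the `max_score_gap` threshold are all hypothetical names chosen for illustration. The sketch assumes the genome dereplication has already assigned every assembly to a representative assembly.

```python
def dereplicate_hits(hits, representative_of, max_score_gap=None):
    """Reduce a hit set to hits hosted by representative genomes.

    hits: list of dicts with hypothetical keys
          'contig', 'assembly', 'n_genes' and 'score'.
    representative_of: dict mapping each assembly to the representative
          assembly of its genome cluster (assumed to come from an
          ANI-based dereplication step run beforehand).
    max_score_gap: if given, also recover discarded hits that differ from
          every hit of their representative in gene count or in score by
          more than this value (an illustrative diversity criterion).
    """
    # Keep hits whose hosting assembly is its own representative.
    kept = [h for h in hits if representative_of[h["assembly"]] == h["assembly"]]
    if max_score_gap is None:
        return kept

    # Index the retained hits by representative assembly.
    by_rep = {}
    for h in kept:
        by_rep.setdefault(h["assembly"], []).append(h)

    # Recover discarded hits that look different from the retained ones.
    recovered = []
    for h in hits:
        rep = representative_of[h["assembly"]]
        if rep == h["assembly"]:
            continue  # already kept
        differs = all(
            h["n_genes"] != r["n_genes"]
            or abs(h["score"] - r["score"]) > max_score_gap
            for r in by_rep.get(rep, [])
        )
        if differs:
            recovered.append(h)
    return kept + recovered


# Toy example: assemblies B and C are redundant with representative A.
hits = [
    {"contig": "c1", "assembly": "A", "n_genes": 5, "score": 10.0},
    {"contig": "c2", "assembly": "B", "n_genes": 5, "score": 10.0},
    {"contig": "c3", "assembly": "C", "n_genes": 3, "score": 6.0},
]
representative_of = {"A": "A", "B": "A", "C": "A"}

strict = dereplicate_hits(hits, representative_of)
relaxed = dereplicate_hits(hits, representative_of, max_score_gap=1.0)
print([h["contig"] for h in strict])
print([h["contig"] for h in relaxed])
```

In this toy example, strict dereplication keeps only the hit on the representative genome, while the relaxed mode additionally recovers the hit on contig `c3` because its gene content and score differ from those of the representative's hit.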
CAGEcleaner currently does not come with its own genome QC module. Low-quality assemblies may be retained unnecessarily, resulting in a less efficient dereplication.
CAGEcleaner has primarily been developed as an auxiliary tool to be used together with cblaster and/or CAGECAT. As such, it expects and produces files that can be processed by these tools. If there is sufficient demand, support for additional file formats will be implemented.
Head over to the Installation page to get CAGEcleaner up and running. The How-to page will then show you how to clean up your hit sets.
If you found CAGEcleaner useful, please cite our manuscript:
De Vrieze, L., Biltjes, M., Lukashevich, S., Tsurumi, K., Masschelein, J. (2025) CAGEcleaner: reducing genomic redundancy in gene cluster mining. Bioinformatics, Volume 41, Issue 7, https://doi.org/10.1093/bioinformatics/btaf373
CAGEcleaner relies heavily on the skDER genome dereplication tool and its main dependency skani, so please give these proper credit as well.
Salamzade, R., & Kalan, L. R. (2023). skDER: microbial genome dereplication approaches for comparative and metagenomic applications. bioRxiv. https://doi.org/10.1101/2023.09.27.559801
Shaw, J., & Yu, Y. W. (2023). Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nature Methods, 20(11), 1661–1665. https://doi.org/10.1038/s41592-023-02018-3