IKIM-Essen/uniCARD

UniCARD

A robust and reproducible workflow integrating CARD with UniRef90 through MD5-based sequence merging into a deduplicated database for Antibiotic Resistance Gene (ARG) identification.


Authors

  • Josefa Welling (@josefawelling)

Requirements

  • Python 3.10+
  • Standard Unix command-line utilities (like gzip, sort, comm, cat)

Create a UniCARD environment with Biopython, DIAMOND and pandas installed with Conda:

conda create -n unicard -c conda-forge -c bioconda biopython diamond pandas
conda activate unicard

Please make sure that the given dependencies are met.
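The requirements above can be checked with a few lines of Python before a long run. This is only a minimal sketch: the helper name `missing_requirements` is made up for illustration and is not part of UniCARD, and the tool and module lists are copied from the Requirements section.

```python
import importlib.util
import shutil
import sys

# Names taken from the Requirements section; "Bio" is Biopython's import name.
REQUIRED_TOOLS = ("diamond", "gzip", "sort", "comm", "cat")
REQUIRED_MODULES = ("Bio", "pandas")


def missing_requirements(tools=REQUIRED_TOOLS, modules=REQUIRED_MODULES,
                         min_python=(3, 10)):
    """Return a list of problems; an empty list means everything was found."""
    problems = []
    if sys.version_info < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found "
            f"{sys.version_info.major}.{sys.version_info.minor}"
        )
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    for module in modules:
        # find_spec returns None when a top-level module cannot be imported.
        if importlib.util.find_spec(module) is None:
            problems.append(f"Python module '{module}' is not installed")
    return problems


problems = missing_requirements()
print("\n".join(problems) or "All requirements found.")
```

Run it inside the activated `unicard` environment; any line it prints names a dependency that still needs to be installed.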


Usage

Please obtain a copy of this workflow by cloning this repository.

    git clone https://github.com/IKIM-Essen/uniCARD.git

For basic usage you will need these two Python scripts:

  • uniCARD.py
  • uniCARD_filtering.py

If you would like to test UniCARD after cloning, follow the steps listed in the Test case section.

If you want to build your own UniCARD database and analyse your data right away, start at the UniCARD database creation section.

Test case

We have built a small test dataset, which you can find in the testdata/ folder. It contains the CARD v4.0.0 database, a very small subset (200 protein sequences) of the UniRef90 release ’2025 03’, and a reduced FASTA file of protein sequences from a wastewater shotgun metagenomic sample.

Please be aware that the UniCARD database resulting from this test is not suitable for ARG identification, since it does not cover the entire known protein space.

To test the functionality of UniCARD's scripts after you have cloned the repository, please run the following commands in your terminal:

  1. Check that you have created and activated an environment which matches all requirements (see Requirements section)

  2. Navigate into the uniCARD folder that was created during cloning:

    cd uniCARD
  3. Build UniCARD test database:

    python uniCARD.py --card testdata/card.json --uniref testdata/uniref90_reduced.fasta.gz --outdir testdata/unicard_test
  4. Make DIAMOND database for the UniCARD fasta file:

    diamond makedb --in testdata/unicard_test/uniCARD.fasta.gz -d testdata/unicard_test/uniCARD
  5. Run DIAMOND in blastp mode:

    diamond blastp -q testdata/WWDI2431_reduced.faa.gz -d testdata/unicard_test/uniCARD.dmnd -o testdata/WWDI2431_reduced_output.tsv
  6. Run UniCARD's filtering script for the final list of ARGs:

    python uniCARD_filtering.py --infile testdata/WWDI2431_reduced_output.tsv --card_hierarchy testdata/unicard_test/CARD_hierarchy_v4.0.0.json --outfile testdata/WWDI2431_reduced.csv

This was only a test run. Do not use the test database or its ARG results for your own data analysis!


UniCARD database creation

You can create a UniCARD database with your desired CARD and UniRef90 versions. Please download the original databases before creating UniCARD (see the Database Downloads section).

python uniCARD.py --card /path/to/card.json --uniref /path/to/uniref90.fasta.gz --outdir /output/folder/

Arguments

Argument | Description | Required
--card | Path to card.json | ✅
--uniref | Path to UniRef90 FASTA (can be .gz) | ✅
--outdir | Output directory for results | ✅
--cores | Number of CPU cores for parallel tasks | ❌ (default: 4)
--batch | Number of sequences processed in a batch | ❌ (default: 6000)
  • Adjust --cores and --batch based on your system's CPU and RAM.
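To illustrate why the batch size matters for RAM, here is a minimal sketch of batch-wise iteration. It is not taken from uniCARD.py; `iter_batches` is a hypothetical helper, and only the default of 6000 comes from the table above. At most one batch is held in memory at a time, so a smaller `--batch` lowers peak memory at the cost of more scheduling overhead.

```python
from itertools import islice


def iter_batches(records, batch_size=6000):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Demo with five placeholder items and a batch size of two.
demo_batches = list(iter_batches(range(5), batch_size=2))
print(demo_batches)
```

In a parallel setting each batch could be handed to one worker process, which is where the number of cores comes in.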

Output Files

All files are saved in --outdir:

File | Description
uniCARD.fasta.gz | Final deduplicated UniCARD FASTA file
CARD_hierarchy_v{card_version}.json | JSON file holding CARD's hierarchy
uniCARD_md5_annotation.tsv | MD5 hash → sequence description mapping
uniCARD_md5_seq.tsv | MD5 hash → sequence mapping
md5_removed.tsv | Sequences filtered out as duplicates
uniCARD.log | Checkpoint log to resume interrupted runs
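The MD5-based files listed above follow from the deduplication idea: identical protein sequences produce identical MD5 digests, so the digest can serve as a merge key. The sketch below shows that idea only and is not the code in uniCARD.py. The function names, the upper-casing of sequences and the rule that CARD entries win over UniRef90 entries are all assumptions made for illustration.

```python
import hashlib


def md5_of_sequence(seq):
    """MD5 hex digest of a protein sequence (assumed normalisation: strip, upper-case)."""
    data = seq.strip().upper().encode("ascii")
    return hashlib.md5(data, usedforsecurity=False).hexdigest()


def merge_by_md5(card_records, uniref_records):
    """Merge (description, sequence) pairs, keeping the first entry per digest.

    CARD records are processed first, so they take priority (an assumption).
    Returns (kept, removed): kept maps digest -> (description, sequence),
    removed lists (digest, description) for every duplicate that was dropped.
    """
    kept = {}
    removed = []
    for source in (card_records, uniref_records):
        for description, seq in source:
            digest = md5_of_sequence(seq)
            if digest in kept:
                removed.append((digest, description))
            else:
                kept[digest] = (description, seq)
    return kept, removed


# Made-up records, for illustration only.
card = [("ARO_demo geneA", "MKT")]
uniref = [("UniRef90_demo1", "mkt"), ("UniRef90_demo2", "MAA")]
kept, removed = merge_by_md5(card, uniref)
print(len(kept), "kept;", len(removed), "removed")
```

In this picture, `kept` corresponds to the two MD5 mapping files and `removed` to md5_removed.tsv.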

Checkpointing

The script automatically logs completed steps to uniCARD.log in your --outdir. If the pipeline is interrupted or killed, previously completed steps are skipped when you rerun it.
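A checkpoint log of this kind can be sketched in a few lines. The format used here (one completed step name per line) and the helper names `completed_steps` and `run_step` are assumptions for illustration; the real uniCARD.log may look different.

```python
import tempfile
from pathlib import Path


def completed_steps(log_path):
    """Return the set of step names already recorded in the log."""
    path = Path(log_path)
    if not path.exists():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def run_step(name, func, log_path):
    """Run func unless name is already logged; return True if it ran."""
    if name in completed_steps(log_path):
        return False
    func()
    # The step is recorded only after it finished without raising.
    with open(log_path, "a") as log:
        log.write(name + "\n")
    return True


# Demo in a throwaway directory: the second call is skipped.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "uniCARD.log"
    calls = []
    first = run_step("parse_card", lambda: calls.append("parse_card"), log_file)
    second = run_step("parse_card", lambda: calls.append("parse_card"), log_file)
print(first, second, calls)
```

Because a step is written to the log only after it completes, a run that is killed mid-step repeats that step on the next invocation.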


Application of UniCARD for ARG identification

Once created, the UniCARD database can be used with DIAMOND to annotate ARGs in protein sequences from isolate genomes or metagenomes. Follow these steps:

  1. Make a DIAMOND database for the UniCARD FASTA file (more details in DIAMOND's documentation):
    diamond makedb --in /path/to/uniCARD.fasta.gz -d /path/to/diamond/uniCARD

  2. Run DIAMOND in blastp mode (more details in DIAMOND's documentation):
    Please make sure that the output folder already exists.
    diamond blastp -q /path/to/genome/proteins.faa.gz -d /path/to/diamond/uniCARD.dmnd -o /path/to/output.tsv

  3. Run UniCARD's filtering script for the final list of ARGs:

    python uniCARD_filtering.py --infile /path/to/output.tsv --card_hierarchy /path/to/CARD_hierarchy_v{card_version}.json --outfile /path/to/output/file.csv
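The three steps above can also be driven from one Python script. The following is a sketch under assumptions: `build_commands` and `run_pipeline` are hypothetical helpers, the paths are the placeholders used in this README, and the DIAMOND and uniCARD_filtering.py flags are exactly the ones shown in the steps.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(fasta, db_prefix, proteins, hits, hierarchy, outfile):
    """Return the three command lines as argument lists, in execution order."""
    return [
        ["diamond", "makedb", "--in", str(fasta), "-d", str(db_prefix)],
        ["diamond", "blastp", "-q", str(proteins),
         "-d", f"{db_prefix}.dmnd", "-o", str(hits)],
        [sys.executable, "uniCARD_filtering.py", "--infile", str(hits),
         "--card_hierarchy", str(hierarchy), "--outfile", str(outfile)],
    ]


def run_pipeline(commands, hits):
    """Create the output folder for the DIAMOND hits, then run each command."""
    Path(hits).parent.mkdir(parents=True, exist_ok=True)
    for command in commands:
        # check=True stops the pipeline at the first failing step.
        subprocess.run(command, check=True)


commands = build_commands(
    fasta="/path/to/uniCARD.fasta.gz",
    db_prefix="/path/to/diamond/uniCARD",
    proteins="/path/to/genome/proteins.faa.gz",
    hits="/path/to/output.tsv",
    hierarchy="/path/to/CARD_hierarchy.json",  # placeholder file name
    outfile="/path/to/output/file.csv",
)
for command in commands:
    print(" ".join(command))
```

The snippet only prints the commands. Call `run_pipeline(commands, "/path/to/output.tsv")` with real paths to execute them.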

Arguments of uniCARD_filtering.py

Argument | Description | Required
--infile | Path to UniCARD DIAMOND output TSV (can be .gz) | ✅
--card_hierarchy | Path to CARD's hierarchy JSON file | ✅
--outfile | Output CSV file to store the final list of ARGs | ✅
--filter_window | Window used for filtering, given as a fraction (0.01 = 1%) | ❌ (default: 0.01)
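This README does not spell out how the filter window is applied. One plausible reading, shown below purely as an assumption, is that for each query only hits whose bitscore lies within the window of that query's best bitscore are kept. The function `filter_hits` and the demo hits are made up and are not taken from uniCARD_filtering.py.

```python
from collections import defaultdict


def filter_hits(hits, window=0.01):
    """Keep hits scoring within `window` (a fraction) of the best hit per query.

    hits: iterable of dicts with the keys 'query', 'target' and 'bitscore'.
    This is an assumed interpretation of --filter_window, not the script's logic.
    """
    hits = list(hits)
    best = defaultdict(float)
    for hit in hits:
        best[hit["query"]] = max(best[hit["query"]], hit["bitscore"])
    return [
        hit for hit in hits
        if hit["bitscore"] >= best[hit["query"]] * (1.0 - window)
    ]


# Made-up hits: q1 has two near-identical top hits and one weak hit.
demo_hits = [
    {"query": "q1", "target": "t1", "bitscore": 200.0},
    {"query": "q1", "target": "t2", "bitscore": 199.0},
    {"query": "q1", "target": "t3", "bitscore": 150.0},
    {"query": "q2", "target": "t4", "bitscore": 80.0},
]
kept_hits = filter_hits(demo_hits, window=0.01)
print([hit["target"] for hit in kept_hits])
```

Under this reading, a larger window retains more near-best hits per query, while a window of 0 keeps only hits tied for the top score.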

Output of uniCARD_filtering.py

The resulting CSV file of your ARG identification contains the following 16 fields, 12 of which come from the standard DIAMOND output:

query,target,ARO_ID,%identity,length,#mismatches,#gaps,query_start,query_end,target_start,target_end,evalue,bitscore,classes,antibiotics,ARO_name

Additional information from CARD's hierarchy:

  1. ARO_ID: ARO accession as CARD identifier

  2. classes: semicolon-separated list of all associated antibiotic drug classes, e.g.: aminocoumarin antibiotic; aminoglycoside antibiotic

  3. antibiotics: semicolon-separated list of associated antibiotics, e.g.: amikacin; gentamicin; neomycin

  4. ARO_name: ARO name of ARG
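Because `classes` and `antibiotics` hold semicolon-separated lists, they need to be split before counting or grouping. The sketch below tallies drug classes with the standard library. It assumes the CSV starts with the header row listed above; the two data rows are invented placeholders, not real results.

```python
import csv
import io
from collections import Counter

HEADER = ("query,target,ARO_ID,%identity,length,#mismatches,#gaps,query_start,"
          "query_end,target_start,target_end,evalue,bitscore,classes,"
          "antibiotics,ARO_name")


def split_multi(value):
    """Split a semicolon-separated field into a list of stripped items."""
    return [item.strip() for item in value.split(";") if item.strip()]


def count_classes(handle):
    """Count how often each antibiotic drug class occurs across all rows."""
    counts = Counter()
    for row in csv.DictReader(handle):
        counts.update(split_multi(row["classes"]))
    return counts


# Invented placeholder rows, only to make the example runnable.
demo_csv = "\n".join([
    HEADER,
    "contig_1,md5_demo_1,ARO:0000001,98.5,250,3,0,1,250,1,250,1e-150,480.0,"
    "aminocoumarin antibiotic; aminoglycoside antibiotic,amikacin; gentamicin,demo gene 1",
    "contig_2,md5_demo_2,ARO:0000002,91.0,180,16,1,5,184,1,180,2e-90,300.0,"
    "aminoglycoside antibiotic,neomycin,demo gene 2",
])
class_counts = count_classes(io.StringIO(demo_csv))
print(class_counts.most_common())
```

For a real result file, pass `open("/path/to/output/file.csv", newline="")` instead of the `io.StringIO` object.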


Database Downloads

CARD (Comprehensive Antibiotic Resistance Database)

  1. Download CARD database (https://card.mcmaster.ca/download)

    wget https://card.mcmaster.ca/latest/data
    tar -xvf data ./card.json
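If you prefer to unpack the archive from Python, the standard `tarfile` module can do the same job as the `tar` command above. This is a minimal sketch: `extract_card_json` is a hypothetical helper, and the demo builds a tiny stand-in archive rather than using the real CARD download.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def extract_card_json(archive, outdir="."):
    """Write card.json from a (possibly compressed) tar archive into outdir."""
    # mode="r:*" lets tarfile detect the compression on its own.
    with tarfile.open(archive, mode="r:*") as tar:
        member = next(
            (m for m in tar.getmembers()
             if m.isfile() and Path(m.name).name == "card.json"),
            None,
        )
        if member is None:
            raise FileNotFoundError(f"card.json not found in {archive}")
        # Read the bytes instead of trusting the path stored in the archive.
        data = tar.extractfile(member).read()
    target = Path(outdir) / "card.json"
    target.write_bytes(data)
    return target


# Demo: build a small stand-in archive named "data" and extract from it.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "data"
    payload = b'{"demo": true}'
    with tarfile.open(archive_path, "w:gz") as tar:
        info = tarfile.TarInfo("./card.json")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    extracted = extract_card_json(archive_path, tmp)
    extracted_bytes = extracted.read_bytes()
print(extracted.name, len(extracted_bytes), "bytes")
```

The resulting card.json is the file expected by the `--card` argument of uniCARD.py.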

UniRef90

  1. Download the uniref90.fasta.gz from UniProt:

    wget ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz
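The full UniRef90 FASTA is very large, so it is best read as a stream rather than loaded into memory. The sketch below is a small gzip-aware FASTA reader written with the standard library only. It is not the parser used by uniCARD.py, the function names are made up, and the demo records are invented. Biopython's `SeqIO.parse(handle, "fasta")` is an alternative in the `unicard` environment.

```python
import gzip
import tempfile
from pathlib import Path


def open_maybe_gzip(path):
    """Open a text file, transparently handling a .gz suffix."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def read_fasta(path):
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    with open_maybe_gzip(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)


# Demo with two invented records; the first sequence spans two lines.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.fasta.gz"
    with gzip.open(demo_path, "wt") as out:
        out.write(">UniRef90_DEMO1 demo protein\nMKTA\nYIAK\n")
        out.write(">UniRef90_DEMO2 demo protein\nMSSLLV\n")
    records = list(read_fasta(demo_path))
print(len(records), "records read")
```

Since `read_fasta` is a generator, only one record is held in memory at a time, regardless of file size.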

License

This project is licensed under the BSD 2-Clause License.


Contributions

Pull requests and feature suggestions are very welcome! Feel free to fork and submit improvements.


Citation

A paper is on its way! If you use UniCARD in your work, please credit the authors by citing the URL of this repository.
