UniCARD

A robust and reproducible workflow integrating CARD with UniRef90 through MD5-based sequence merging into a deduplicated database for Antibiotic Resistance Gene (ARG) identification.

Authors

Josefa Welling (@josefawelling)

Requirements

Python 3.10+
Standard Unix command-line utilities (like gzip, sort, comm, cat)

Create a UniCARD environment with Biopython, DIAMOND and pandas installed with Conda:

conda create -n unicard -c conda-forge -c bioconda biopython diamond pandas
conda activate unicard

Please make sure that the given dependencies are met.
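
The dependency check can also be scripted. The following is a minimal sketch, not part of the repository: the function name `check_requirements` is hypothetical, and it only tests whether the listed Python packages can be found and whether the listed tools are on `PATH`.

```python
import importlib.util
import shutil
import sys


def check_requirements(min_python=(3, 10)):
    """Return a dict mapping each requirement name to True or False."""
    status = {"python": sys.version_info >= min_python}
    # Python packages; Biopython is imported under the module name "Bio".
    for label, module in (("biopython", "Bio"), ("pandas", "pandas")):
        status[label] = importlib.util.find_spec(module) is not None
    # Command-line tools are looked up on PATH.
    for tool in ("diamond", "gzip", "sort", "comm", "cat"):
        status[tool] = shutil.which(tool) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name:10s} {'ok' if ok else 'MISSING'}")
```

Run it inside the activated `unicard` environment; any line marked `MISSING` points to a dependency that still has to be installed.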

Usage

Please obtain a copy of this workflow by cloning this repository.

    git clone https://github.com/IKIM-Essen/uniCARD.git

For basic usage you will need these two Python scripts:

uniCARD.py
uniCARD_filtering.py

If you would like to test the functionality of UniCARD after cloning, follow the steps listed in the Test case section.

To build your own UniCARD database and analyse your data right away, start at the UniCARD database creation section.

Test case

We have built a small test dataset, which you can find in the testdata/ folder. It contains the CARD v4.0.0 database and a very small subset (200 protein sequences) of UniRef90 release 2025_03, as well as a reduced FASTA file of protein sequences from a wastewater shotgun metagenomic sample.

Please be aware that the UniCARD database resulting from this test is not suitable for ARG identification, since it does not cover the entire known protein space.

To test the functionality of UniCARD's scripts after you have cloned the repository, please run the following commands in your terminal:

Check that you have created and activated an environment which matches all requirements (see Requirements section)
Navigate into the uniCARD folder that was created during cloning:
```
cd uniCARD
```

Build UniCARD test database:

python uniCARD.py --card testdata/card.json --uniref testdata/uniref90_reduced.fasta.gz --outdir testdata/unicard_test

Make DIAMOND database for the UniCARD fasta file:

diamond makedb --in testdata/unicard_test/uniCARD.fasta.gz -d testdata/unicard_test/uniCARD

Run DIAMOND in blastp mode:

diamond blastp -q testdata/WWDI2431_reduced.faa.gz -d testdata/unicard_test/uniCARD.dmnd -o testdata/WWDI2431_reduced_output.tsv

Run UniCARD's filtering script for the final list of ARGs:

python uniCARD_filtering.py --infile testdata/WWDI2431_reduced_output.tsv --card_hierarchy testdata/unicard_test/CARD_hierarchy_v4.0.0.json --outfile testdata/WWDI2431_reduced.csv

This was just a test run. Please do not use the resulting database or ARG results for your data analysis!

UniCARD database creation

You can create a UniCARD database with your desired CARD and UniRef90 versions. Please download the original databases before creating UniCARD (see the Database Downloads section).

python uniCARD.py --card /path/to/card.json --uniref /path/to/uniref90.fasta.gz --outdir /output/folder/

Arguments

| Argument | Description | Required |
|---|---|---|
| `--card` | Path to `card.json` | ✅ |
| `--uniref` | Path to UniRef90 FASTA (can be `.gz`) | ✅ |
| `--outdir` | Output directory for results | ✅ |
| `--cores` | Number of CPU cores for parallel tasks | ❌ (default: 4) |
| `--batch` | Number of sequences processed in a batch | ❌ (default: 6000) |

Adjust --cores and --batch based on your system's CPU and RAM.
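
To illustrate what a batch size controls, here is a minimal sketch of batching a stream of records. It is an assumption about the general technique, not code taken from `uniCARD.py`; the helper name `batched` is hypothetical.

```python
from itertools import islice


def batched(records, batch_size=6000):
    """Yield successive lists holding at most batch_size records each."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Seven records in batches of three: the last batch is shorter.
    for batch in batched(range(7), batch_size=3):
        print(batch)
```

Only one batch per worker has to be held in memory at a time, so a smaller batch size lowers peak RAM use while a larger one reduces per-batch overhead.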

Output Files

All files are saved in --outdir:

| File | Description |
|---|---|
| `uniCARD.fasta.gz` | Final deduplicated UniCARD FASTA file |
| `CARD_hierarchy_v{card_version}.json` | JSON file holding CARD's hierarchy |
| `uniCARD_md5_annotation.tsv` | MD5 hash → sequence description mapping |
| `uniCARD_md5_seq.tsv` | MD5 hash → sequence mapping |
| `md5_removed.tsv` | Sequences filtered out as duplicates |
| `uniCARD.log` | Checkpoint log to resume interrupted runs |
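
The idea behind the MD5-keyed files can be sketched as follows. This is an illustration under stated assumptions, not the implementation in `uniCARD.py`: the function names are hypothetical, sequences are normalised to upper case before hashing, and CARD entries are assumed to take precedence over identical UniRef90 sequences.

```python
import hashlib


def md5_of(sequence):
    """MD5 hex digest of a protein sequence (stripped and upper-cased)."""
    return hashlib.md5(sequence.strip().upper().encode("ascii")).hexdigest()


def merge_by_md5(card, uniref):
    """Merge two {description: sequence} dicts, processing CARD first.

    Returns (md5 -> sequence, md5 -> description, removed), where removed
    lists the (md5, description) pairs dropped as duplicates.
    """
    md5_seq, md5_annotation, removed = {}, {}, []
    for source in (card, uniref):
        for description, sequence in source.items():
            digest = md5_of(sequence)
            if digest in md5_seq:
                # Identical sequence already stored: keep the first one seen.
                removed.append((digest, description))
                continue
            md5_seq[digest] = sequence.strip().upper()
            md5_annotation[digest] = description
    return md5_seq, md5_annotation, removed
```

Because identical sequences always produce the same digest, the hash serves as a compact key that links the sequence table, the annotation table, and the list of removed duplicates.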

Checkpointing

The script automatically logs completed steps to uniCARD.log in your --outdir. If the pipeline is interrupted or killed, previously completed steps are skipped when you rerun it with the same --outdir.
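
A checkpoint log of this kind can be sketched in a few lines. The sketch below assumes one step name per log line and is not the actual format or logic of `uniCARD.log`; `run_with_checkpoints` is a hypothetical helper.

```python
from pathlib import Path


def run_with_checkpoints(steps, log_path):
    """Run (name, function) steps in order, skipping names already logged.

    Returns the names of the steps that were executed in this call.
    """
    log_path = Path(log_path)
    done = set(log_path.read_text().splitlines()) if log_path.exists() else set()
    executed = []
    for name, function in steps:
        if name in done:
            continue  # finished in an earlier run
        function()
        # Record the step only after it returned without raising.
        with log_path.open("a") as log:
            log.write(name + "\n")
        executed.append(name)
    return executed
```

Writing the log entry only after a step has finished means that a step interrupted halfway is repeated on the next run rather than silently skipped.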

Application of UniCARD for ARG identification

After creation, the UniCARD database can be used with DIAMOND to annotate ARGs in protein sequences from isolate genomes or metagenomes, following these steps:

Make DIAMOND database for the UniCARD fasta file (more details in DIAMOND's documentation):

diamond makedb --in /path/to/uniCARD.fasta.gz -d /path/to/diamond/uniCARD

Run DIAMOND in blastp mode (more details in DIAMOND's documentation). Please make sure that the output folder already exists:

diamond blastp -q /path/to/genome/proteins.faa.gz -d /path/to/diamond/uniCARD.dmnd -o /path/to/output.tsv

Run UniCARD's filtering script for the final list of ARGs:

python uniCARD_filtering.py --infile /path/to/output.tsv --card_hierarchy /path/to/CARD_hierarchy_v{card_version}.json --outfile /path/to/output/file.csv
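
The three steps above can be chained from Python. The sketch below only uses the flags shown in this README; the helper names and the intermediate file names `hits.tsv` and `args.csv` are hypothetical, and it assumes `diamond` and `python` are on `PATH` and that it is run from the repository folder.

```python
import shlex
import subprocess


def build_commands(fasta, proteins, hierarchy, workdir):
    """Return the three documented commands as argument lists."""
    database = f"{workdir}/uniCARD"
    hits = f"{workdir}/hits.tsv"    # hypothetical file name
    result = f"{workdir}/args.csv"  # hypothetical file name
    return [
        ["diamond", "makedb", "--in", fasta, "-d", database],
        ["diamond", "blastp", "-q", proteins, "-d", f"{database}.dmnd", "-o", hits],
        ["python", "uniCARD_filtering.py", "--infile", hits,
         "--card_hierarchy", hierarchy, "--outfile", result],
    ]


def run_all(commands):
    """Run each command in turn and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them.
    for command in build_commands("uniCARD.fasta.gz", "proteins.faa.gz",
                                  "CARD_hierarchy.json", "out"):
        print(shlex.join(command))
```

Passing argument lists to `subprocess.run` avoids shell quoting problems with paths that contain spaces, and `check=True` stops the chain as soon as one step fails.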

Arguments of `uniCARD_filtering.py`

| Argument | Description | Required |
|---|---|---|
| `--infile` | Path to UniCARD DIAMOND output tsv (can be `.gz`) | ✅ |
| `--card_hierarchy` | Path to CARD's hierarchy json file | ✅ |
| `--outfile` | Output csv file to store final list of ARGs | ✅ |
| `--filter_window` | Window used for filtering, given as a fraction (0.01 = 1%) | ❌ (default: 0.01) |
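
This README does not spell out which quantity the filter window is applied to. One common reading, assumed here purely for illustration, is to keep for each query every hit whose bitscore lies within the given fraction of that query's best bitscore. The function below is a hypothetical sketch of that reading and may differ from what `uniCARD_filtering.py` does.

```python
from collections import defaultdict


def filter_hits(hits, window=0.01):
    """Keep, per query, the hits scoring within `window` of the best bitscore.

    hits: iterable of (query, target, bitscore) tuples.
    """
    hits = list(hits)
    best = defaultdict(float)
    for query, _target, bitscore in hits:
        best[query] = max(best[query], bitscore)
    # A hit survives if it reaches at least (1 - window) of the best score.
    return [hit for hit in hits if hit[2] >= best[hit[0]] * (1.0 - window)]
```

Under this assumption, a window of 0.01 keeps near-ties with the top hit, while a window of 0 keeps only the best-scoring hit or hits per query.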

Output of `uniCARD_filtering.py`

The resulting csv file of your ARG identification contains the following 16 fields (12 from standard DIAMOND output):

query,target,ARO_ID,%identity,length,#mismatches,#gaps,query_start,query_end,target_start,target_end,evalue,bitscore,classes,antibiotics,ARO_name

Additional information from CARD's hierarchy:

ARO_ID: ARO accession as CARD identifier
classes: semicolon-separated list of all associated antibiotic drug classes, e.g.: aminocoumarin antibiotic; aminoglycoside antibiotic
antibiotics: semicolon-separated list of associated antibiotics, e.g.: amikacin; gentamicin; neomycin
ARO_name: ARO name of ARG
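
As a small example of working with this output, the sketch below counts hits per antibiotic drug class using only the standard library. The two sample rows are made up for illustration and follow the column order listed above; `count_drug_classes` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Made-up rows in the documented column order, for illustration only.
SAMPLE = """query,target,ARO_ID,%identity,length,#mismatches,#gaps,query_start,query_end,target_start,target_end,evalue,bitscore,classes,antibiotics,ARO_name
protein_1,md5_a,ARO_demo_1,98.5,200,3,0,1,200,1,200,1e-100,400.0,aminoglycoside antibiotic,amikacin; gentamicin,demo_gene_1
protein_2,md5_b,ARO_demo_2,91.0,150,13,1,1,150,1,150,1e-80,300.0,aminocoumarin antibiotic; aminoglycoside antibiotic,,demo_gene_2
"""


def count_drug_classes(handle):
    """Count how many hits are associated with each antibiotic drug class."""
    counts = Counter()
    for row in csv.DictReader(handle):
        # The classes field is a semicolon-separated list.
        for drug_class in row["classes"].split(";"):
            if drug_class.strip():
                counts[drug_class.strip()] += 1
    return counts


print(count_drug_classes(io.StringIO(SAMPLE)))
```

For a real result file, pass an open file handle instead, for example `open("/path/to/output/file.csv", newline="")`.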

Database Downloads

CARD (Comprehensive Antibiotic Resistance Database)

Download the CARD database (https://card.mcmaster.ca/download) and extract card.json from the archive:

wget https://card.mcmaster.ca/latest/data
tar -xvf data ./card.json

UniRef90

Download the uniref90.fasta.gz from UniProt:

wget ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz
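
Since the UniRef90 file is large, a quick sanity check after downloading can be useful. The following sketch counts FASTA header lines in a plain or gzipped file; `count_fasta_records` is a hypothetical helper, and on the full UniRef90 file a complete pass will take a while.

```python
import gzip


def count_fasta_records(path):
    """Count header lines (starting with '>') in a plain or gzipped FASTA."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

A truncated download typically raises an error from `gzip` before the end of the file is reached, so a completed count also indicates that the archive is intact.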

License

This project is licensed under the BSD 2-Clause License.

Contributions

Pull requests and feature suggestions are very welcome! Feel free to fork and submit improvements.

Citation

A paper is on its way! If you use UniCARD in your work, please give credit to the authors by citing the URL of this repository.
