A robust and reproducible workflow integrating CARD with UniRef90 through MD5-based sequence merging into a deduplicated database for Antibiotic Resistance Gene (ARG) identification.
- Josefa Welling (@josefawelling)
- Python 3.10+
- Standard Unix command-line utilities (such as gzip, sort, comm, cat)
Create a UniCARD environment with Biopython, DIAMOND and pandas installed with Conda:
conda create -n unicard -c conda-forge -c bioconda biopython diamond pandas
conda activate unicard
Please make sure that the given dependencies are met.
Please obtain a copy of this workflow by cloning this repository.
git clone https://github.com/IKIM-Essen/uniCARD.git
For basic usage you will need these two Python scripts:
- uniCARD.py
- uniCARD_filtering.py
If you would like to test the functionality of UniCARD after cloning, you can follow the steps listed in the Test case section.
If you want to build your own UniCARD database and analyse your data right away, start in the UniCARD database creation section.
We have built a small test dataset which you can find in the test/ folder. It contains the CARD v4.0.0 database and a very small subset (200 protein sequences) of the UniRef90 ’2025 03’ as well as a reduced FASTA file of protein sequences from a wastewater shotgun metagenomic sample.
Please be aware that the resulting UniCARD database of this test is not suitable for ARG identification, since it does not cover the entire known protein space.
To test the functionality of UniCARD's scripts after you have cloned the repository, please run the following commands in your terminal:
- Check that you have created and activated an environment which matches all requirements (see Requirements section).
- Navigate into the uniCARD folder that was created during cloning:
  cd uniCARD
- Build the UniCARD test database:
  python uniCARD.py --card testdata/card.json --uniref testdata/uniref90_reduced.fasta.gz --outdir testdata/unicard_test
- Make a DIAMOND database for the UniCARD FASTA file:
  diamond makedb --in testdata/unicard_test/uniCARD.fasta.gz -d testdata/unicard_test/uniCARD
- Run DIAMOND in blastp mode:
  diamond blastp -q testdata/WWDI2431_reduced.faa.gz -d testdata/unicard_test/uniCARD.dmnd -o testdata/WWDI2431_reduced_output.tsv
- Run UniCARD's filtering script for the final list of ARGs:
  python uniCARD_filtering.py --infile testdata/WWDI2431_reduced_output.tsv --card_hierarchy testdata/unicard_test/CARD_hierarchy_v4.0.0.json --outfile testdata/WWDI2431_reduced.csv
This was just a test run. Please do not use the resulting database or ARG results for your data analysis!
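
As a quick sanity check after the build step, you could count the sequences in the resulting FASTA file. The following is a minimal sketch using only the Python standard library; the helper name `count_fasta_records` is hypothetical and not part of UniCARD.

```python
import gzip
from pathlib import Path


def count_fasta_records(path):
    """Count header lines (starting with '>') in a plain or gzipped FASTA file."""
    path = Path(path)
    # Pick the gzip reader for .gz files, the built-in open otherwise.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


if __name__ == "__main__":
    # Path taken from the test commands above.
    fasta = Path("testdata/unicard_test/uniCARD.fasta.gz")
    if fasta.exists():
        print(f"{fasta}: {count_fasta_records(fasta)} sequences")
    else:
        print(f"{fasta} not found; run the build step first")
```

A count of zero, or a missing file, suggests that the build step did not finish.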
You can create a UniCARD database with your desired CARD and UniRef90 versions. Please download the original databases before creating UniCARD (see download database section).
python uniCARD.py --card /path/to/card.json --uniref /path/to/uniref90.fasta.gz --outdir /output/folder/

| Argument | Description | Required |
|---|---|---|
| --card | Path to card.json | ✅ |
| --uniref | Path to UniRef90 FASTA (can be .gz) | ✅ |
| --outdir | Output directory for results | ✅ |
| --cores | Number of CPU cores for parallel tasks | ❌ (default: 4) |
| --batch | Number of sequences processed in a batch | ❌ (default: 6000) |

- Adjust --cores and --batch based on your system's CPU and RAM.
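
To picture why the batch size affects memory use, here is a generic sketch of batch-wise processing. It is an illustration only: how uniCARD.py actually batches sequences is not described here, and the helper `batched` is hypothetical.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    iterator = iter(records)
    while True:
        # Only one batch is held in memory at a time.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    for batch in batched(range(10), 4):
        print(len(batch), "records in this batch")
```

Under this assumption, a larger batch means fewer but bigger chunks in memory, which is why the value should be matched to the available RAM.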
All files are saved in --outdir:
| File | Description |
|---|---|
| uniCARD.fasta.gz | Final deduplicated UniCARD FASTA file |
| CARD_hierarchy_v{card_version}.json | JSON file holding CARD's hierarchy |
| uniCARD_md5_annotation.tsv | MD5 hash → sequence description mapping |
| uniCARD_md5_seq.tsv | MD5 hash → sequence mapping |
| md5_removed.tsv | Sequences filtered out as duplicates |
| uniCARD.log | Checkpoint log to resume interrupted runs |

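
The MD5-based files above follow from the idea of identifying identical sequences by their hash. The sketch below illustrates that idea in general terms. It is not the code from uniCARD.py: the function names, the upper-casing of sequences, and the rule that the first record seen is kept are all assumptions made for illustration.

```python
import hashlib


def md5_of_sequence(sequence):
    """Return the hex MD5 digest of a sequence (whitespace removed, upper-cased)."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def deduplicate(records):
    """Keep the first record per sequence hash.

    records: iterable of (description, sequence) tuples.
    Returns (kept, removed): kept maps md5 -> (description, sequence),
    removed lists (md5, description) for every later duplicate.
    """
    kept = {}
    removed = []
    for description, sequence in records:
        digest = md5_of_sequence(sequence)
        if digest in kept:
            removed.append((digest, description))
        else:
            kept[digest] = (description, sequence)
    return kept, removed


if __name__ == "__main__":
    # Made-up toy records, not real CARD or UniRef90 entries.
    toy = [("card_entry", "MKV"), ("uniref_entry", "MKV"), ("other", "MKL")]
    kept, removed = deduplicate(toy)
    print(len(kept), "unique sequences,", len(removed), "duplicates removed")
```

In this sketch, passing the CARD records before the UniRef90 records would make the CARD description win whenever both databases contain the same sequence.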
The script automatically logs completed steps to uniCARD.log in your --outdir.
If the pipeline is interrupted or killed, previously completed steps are skipped when you rerun it.
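
The resume behaviour can be pictured with a small checkpoint pattern. This is a sketch under assumptions: the real format of uniCARD.log is not documented here, so the one-step-name-per-line layout, the step name, and both helper functions are hypothetical.

```python
from pathlib import Path


def completed_steps(log_path):
    """Return the set of step names already recorded in the log file."""
    log_path = Path(log_path)
    if not log_path.exists():
        return set()
    return {line.strip() for line in log_path.read_text().splitlines() if line.strip()}


def run_step(name, func, log_path):
    """Run func() unless name is already logged. Return True if the step ran."""
    if name in completed_steps(log_path):
        return False
    func()
    # Record the step only after it finished without raising an exception.
    with open(log_path, "a") as handle:
        handle.write(name + "\n")
    return True


if __name__ == "__main__":
    ran = run_step("example_step", lambda: print("working..."), "example.log")
    print("step ran" if ran else "step skipped")
```

Because a step is logged only after it succeeds, a run that is killed midway repeats the unfinished step on the next start.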
After creation, the UniCARD database can be used to annotate ARGs in protein sequences from isolate genomes or metagenomes with DIAMOND by following these steps:
- Make a DIAMOND database for the UniCARD FASTA file (more details in DIAMOND's documentation):
  diamond makedb --in /path/to/uniCARD.fasta.gz -d /path/to/diamond/uniCARD
- Run DIAMOND in blastp mode (more details in DIAMOND's documentation). Please make sure that the output folder already exists.
  diamond blastp -q /path/to/genome/proteins.faa.gz -d /path/to/diamond/uniCARD.dmnd -o /path/to/output.tsv
- Run UniCARD's filtering script for the final list of ARGs:
  python uniCARD_filtering.py --infile /path/to/diamond/output.tsv --card_hierarchy /path/to/CARD_hierarchy_v{card_version}.json --outfile /path/to/output/file.csv
| Argument | Description | Required |
|---|---|---|
| --infile | Path to UniCARD DIAMOND output tsv (can be .gz) | ✅ |
| --card_hierarchy | Path to CARD's hierarchy json file | ✅ |
| --outfile | Output csv file to store final list of ARGs | ✅ |
| --filter_window | Percent window used for filtering, default 1% | ❌ (default: 0.01) |

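
The exact rule behind --filter_window is not spelled out above. One plausible reading, shown here purely as an assumption, is that for each query only hits whose bitscore lies within the given fraction of the best bitscore are kept. The function `filter_hits` and the dictionary keys are hypothetical and may not match uniCARD_filtering.py.

```python
from collections import defaultdict


def filter_hits(hits, window=0.01):
    """Keep hits scoring within `window` (a fraction) of the best hit per query.

    hits: iterable of dicts with at least 'query' and 'bitscore' keys.
    ASSUMPTION: this interpretation of the window is illustrative only.
    """
    hits = list(hits)
    best = defaultdict(float)
    for hit in hits:
        best[hit["query"]] = max(best[hit["query"]], hit["bitscore"])
    return [h for h in hits if h["bitscore"] >= (1.0 - window) * best[h["query"]]]


if __name__ == "__main__":
    # Made-up toy hits for illustration.
    toy = [
        {"query": "q1", "target": "A", "bitscore": 200.0},
        {"query": "q1", "target": "B", "bitscore": 199.0},
        {"query": "q1", "target": "C", "bitscore": 150.0},
    ]
    for hit in filter_hits(toy):
        print(hit["query"], hit["target"], hit["bitscore"])
```

Under this reading, a larger window keeps more near-best hits per query, while a window of 0 keeps only the hits tied for the top score.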
The resulting csv file of your ARG identification contains the following 16 fields (12 from standard DIAMOND output):
query,target,ARO_ID,%identity,length,#mismatches,#gaps,query_start,query_end,target_start,target_end,evalue,bitscore,classes,antibiotics,ARO_name
Additional information from CARD's hierarchy:
- ARO_ID: ARO accession as CARD identifier
- classes: semicolon-separated list of all associated antibiotic drug classes, e.g.:
  aminocoumarin antibiotic; aminoglycoside antibiotic
- antibiotics: semicolon-separated list of associated antibiotics, e.g.:
  amikacin; gentamicin; neomycin
- ARO_name: ARO name of the ARG
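
Because classes and antibiotics are semicolon-separated, they need to be split before counting. The sketch below tallies drug classes with the standard library. It assumes the csv file is comma-separated with a header row that uses the field names listed above; the function name is hypothetical.

```python
import csv
from collections import Counter


def count_drug_classes(csv_handle):
    """Count how many ARG hits are associated with each drug class."""
    counts = Counter()
    for row in csv.DictReader(csv_handle):
        # One hit can belong to several classes, separated by ';'.
        for drug_class in row["classes"].split(";"):
            drug_class = drug_class.strip()
            if drug_class:
                counts[drug_class] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_classes.py /path/to/output/file.csv
    if len(sys.argv) > 1:
        with open(sys.argv[1], newline="") as handle:
            for name, n in count_drug_classes(handle).most_common():
                print(f"{n}\t{name}")
```

The same splitting approach works for the antibiotics column.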
- Download the CARD database (https://card.mcmaster.ca/download):
  wget https://card.mcmaster.ca/latest/data
  tar -xvf data ./card.json
- Download uniref90.fasta.gz from UniProt:
  wget ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz
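
If you prefer to unpack card.json from Python rather than with tar, the extraction step could look like the sketch below. It assumes the downloaded file (named data above) is a tar archive that contains a member called card.json; the helper `extract_card_json` is hypothetical.

```python
import tarfile
from pathlib import Path


def extract_card_json(archive_path, out_dir="."):
    """Extract only card.json from a tar archive and return the extracted path."""
    out_dir = Path(out_dir)
    # Mode "r:*" lets tarfile detect the compression (e.g. bzip2 or gzip).
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive.getmembers():
            if member.isfile() and Path(member.name).name == "card.json":
                member.name = "card.json"  # drop any leading "./" or folder
                archive.extract(member, path=out_dir)
                return out_dir / "card.json"
    raise FileNotFoundError(f"card.json not found in {archive_path}")


if __name__ == "__main__":
    if Path("data").exists():
        print("Extracted:", extract_card_json("data"))
    else:
        print("Archive 'data' not found; download it first")
```

The other files in the archive are left untouched, which mirrors the tar command above.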
This project is licensed under the BSD 2-Clause License.
Pull requests and feature suggestions are very welcome! Feel free to fork and submit improvements.
A paper is on its way! If you use UniCARD in your work, please give credit to the authors by citing the URL of this repository.