CRISPRloci provides an automated and comprehensive in silico characterization of CRISPR-Cas systems in bacterial and archaeal genomes. It is a full suite for CRISPR locus characterization that includes CRISPR array orientation, detection of conserved leaders, cas gene annotation and subtype classification.
The web server interface of CRISPRloci is freely available at: rna.informatik.uni-freiburg.de/trunk/CRISPRloci
If you use CRISPRloci, please cite our papers:
- CRISPRidentify: identification of CRISPR arrays using machine learning approach. Alexander Mitrofanov, Omer S. Alkhnbashi, Sergey A. Shmakov, Kira S. Makarova, Eugene V. Koonin, Rolf Backofen. Nucleic Acids Research, DOI: https://doi.org/10.1093/nar/gkaa1158
- Casboundary: Automated definition of integral Cas cassettes. Victor A. Padilha, Omer S. Alkhnbashi, Van Dinh Tran, Shiraz A. Shah, André C. P. L. F. de Carvalho, Rolf Backofen. Bioinformatics, 2020, DOI: 10.1093/bioinformatics/btaa984.
- CRISPRCasIdentifier: Machine learning for accurate identification and classification of CRISPR-Cas systems. Victor A. Padilha, Omer S. Alkhnbashi, Shiraz A. Shah, André C. P. L. F. de Carvalho, Rolf Backofen. GigaScience, 2020, DOI: 10.1093/gigascience/giaa062.
CRISPRloci_standalone.py has been tested with Python 3.7. To run it, we recommend installing the same library versions we used. Since we exported our classifiers following the model persistence guideline from scikit-learn, they are not guaranteed to work properly if loaded with other Python and/or library versions. For this reason, we recommend using our docker image or a conda virtual environment. Both make it easy to install the correct Python and library dependencies without affecting the whole operating system (see below).
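Because the persisted models are sensitive to the interpreter version, it can help to check the running Python before launching the tool. The following is a minimal sketch, not part of CRISPRloci itself; the function name `matches_tested_python` and the constant `TESTED_PYTHON` are assumptions introduced for illustration.

```python
import sys

# Version the README states the tool was tested with.
TESTED_PYTHON = (3, 7)


def matches_tested_python(version_info=None, tested=TESTED_PYTHON):
    """Return True when the interpreter's major.minor equals the tested one."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(tested)


if __name__ == "__main__":
    if not matches_tested_python():
        # Only a warning: the tool may still run, but model loading is not guaranteed.
        print(
            "Warning: running Python %d.%d, but the tool was tested with %d.%d; "
            "persisted scikit-learn models may not load correctly."
            % (sys.version_info[0], sys.version_info[1], *TESTED_PYTHON)
        )
```

The check compares only the major and minor version, since the README names Python 3.7 without a patch level.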
wget https://github.com/BackofenLab/CRISPRloci/archive/1.0.0.tar.gz
tar -xzf 1.0.0.tar.gz
Second step: download the Hidden Markov (HMM) and Machine Learning (ML) models
Due to GitHub's file size constraints, we made our HMM and ML models available in Google Drive. You can download them here and here. Save both tar.gz files inside CRISPRcasIdentifier's directory. It is not necessary to extract them, since the tool will do that the first time it is run.
Third step: download the Hidden Markov (HMM) and Machine Learning (ML) models
We made our HMM and ML models available in Google Drive. You can download them from the following links:
Save all tar.gz files inside Casboundary's folder. It is not necessary to extract them, since the tool will do that the first time it is run.
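Since the tool expects the downloaded archives to sit in the tool folders before the first run, a small pre-flight check can catch a missing file early. This is a sketch under assumptions: the helper `missing_archives` is hypothetical, and the archive file names must be supplied by you, because the actual names are given only by the download links.

```python
from pathlib import Path


def missing_archives(tool_dir, expected_names):
    """Return the expected archive names that are not present in tool_dir."""
    tool_dir = Path(tool_dir)
    return [name for name in expected_names if not (tool_dir / name).is_file()]


if __name__ == "__main__":
    # Placeholder names: replace them with the names of the files you downloaded.
    wanted = ["first_models.tar.gz", "second_models.tar.gz"]
    absent = missing_archives(".", wanted)
    if absent:
        print("Missing model archives:", ", ".join(absent))
```

The function only checks for presence; it does not extract anything, which matches the note that the tool unpacks the archives itself on first run.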
First, install Miniconda for Python 3. Miniconda can be downloaded from here: miniconda.
Install Miniconda.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
Create and activate environment for CRISPRloci.
conda env create -f CRISPRloci-env.yml -n CRISPRloci-env
conda activate CRISPRloci-env
After using CRISPRloci_standalone.py you can deactivate the environment.
conda deactivate
To test the DNA mode, run the following command:
python3.7 CRISPRloci_standalone.py -f Example/NC_005230.fasta -st dna
To test the protein mode, run the following command:
python3.7 CRISPRloci_standalone.py -f Example/NC_005230_proteins.fasta -st protein
To test the virus mode, run the following command:
python3.7 CRISPRloci_standalone.py -f Example/Input3.fa -st virus
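The three test commands differ only in the input file and the value passed to -st, so they can be driven from a small wrapper. The sketch below is an illustration, not part of the distribution: `build_command` and `run_mode` are hypothetical helper names, and only the -f and -st flags and the example paths shown in the commands above are used.

```python
import subprocess

SCRIPT = "CRISPRloci_standalone.py"
MODES = ("dna", "protein", "virus")

# Example inputs taken from the test commands in this README.
EXAMPLES = {
    "dna": "Example/NC_005230.fasta",
    "protein": "Example/NC_005230_proteins.fasta",
    "virus": "Example/Input3.fa",
}


def build_command(fasta_path, mode, python="python3.7"):
    """Build the argument list for one CRISPRloci run."""
    if mode not in MODES:
        raise ValueError("mode must be one of %s, got %r" % (MODES, mode))
    return [python, SCRIPT, "-f", str(fasta_path), "-st", mode]


def run_mode(mode):
    """Run the bundled example for the given mode; raises if the tool fails."""
    subprocess.run(build_command(EXAMPLES[mode], mode), check=True)
```

Calling `run_mode("dna")` from the repository root would execute the same command as the DNA test above; passing the arguments as a list avoids shell quoting problems with unusual file names.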
- -f: input DNA fasta file path.
- -output: folder where results will be stored
- -cpu: number of CPUs to use

- -r: list of regressors to use. Available options: CART, ERT or SVM (default: ERT).
- -c: list of classifiers to use. Available options: CART, ERT or SVM (default: ERT).
- -s: list of HMM models to use. Available options: HMM1 to HMM5 and HMM2019 (default: HMM2019). The models HMM1 to HMM5 are the ones that were originally used in our paper. HMM2019 consists of the HMM models that were obtained from the most recent dataset by Makarova (2019). Setting this parameter is enough for the tool to know which ML models should be used.
- -sc: sequence completeness (used only when -st is set to dna). Available options: complete or partial (default: complete).
- -m: run mode. Available options: classification, regression or combined (default: combined).
- -cg: maximum number of contiguous gaps allowed in a cassette (default: 1)
- -cm: which ML models to use. Available options: ERT or DNN (default: ERT).

- --model: model for the CRISPR array classification. Takes values: 8, 9, 10, ALL. The default value is ALL.
- --strand: specifies if the array orientation should be predicted. Available options: True/False. The default value is True.
- --is_element: specifies if IS-Elements should be predicted. Available options: True/False. The default value is False.
- --fast_run: option to skip the candidate enhancement. Available options: True/False. The default value is False.
- --degenerated: allows search for degenerated repeat candidates on both ends of the CRISPR array candidate. Available options: True/False. The default value is True.
- --min_len_rep: specifies the minimum length of repeats in a CRISPR array. The default value: 21
- --max_len_rep: specifies the maximum length of repeats in a CRISPR array. The default value: 55
- --min_len_spacer: specifies the minimum average length of spacers in a CRISPR array. The default value: 18
- --max_len_spacer: specifies the maximum average length of spacers in a CRISPR array. The default value: 78
- --min_repeats: specifies the minimum number of repeats in a CRISPR array. The default value: 3
- --enhancement_max_min: specifies if the filter approximation based on the max. and min. elements should be built. The default value is True.
- --enhancement_start_end: specifies if the filter approximation based on the max. and min. elements should be built. The default value is True.
- --max_identical_spacers: specifies the maximum number of identical spacers in a CRISPR array. The default value: 4
- --max_identical_cluster_spacers: specifies the maximum number of identical consecutive spacers in a CRISPR array. The default value: 3
- --margin_degenerated: specifies the maximum length difference between a new spacer sequence (obtained with the search of degenerated repeats) and the average value of spacer length in the array. The default value: 30
- --max_edit_distance_enhanced: specifies the number of editing operations for candidate enhancement. The default value: 6
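
To make the length and count thresholds above concrete, the sketch below applies the documented defaults to a single array candidate. It is an illustration of what the parameters mean, not the tool's actual filtering code; the function `passes_length_filters` and the `DEFAULTS` dictionary are assumptions introduced here.

```python
# Default values as listed in the parameter descriptions above.
DEFAULTS = {
    "min_len_rep": 21,
    "max_len_rep": 55,
    "min_len_spacer": 18,
    "max_len_spacer": 78,
    "min_repeats": 3,
}


def passes_length_filters(repeats, spacers, params=DEFAULTS):
    """Check one candidate (lists of repeat and spacer strings) against the thresholds."""
    # Too few repeats: not considered an array.
    if len(repeats) < params["min_repeats"]:
        return False
    # Every repeat must fall inside the allowed repeat length range.
    for repeat in repeats:
        if not params["min_len_rep"] <= len(repeat) <= params["max_len_rep"]:
            return False
    # Spacer limits apply to the average spacer length, not to each spacer.
    if spacers:
        average = sum(len(s) for s in spacers) / len(spacers)
        if not params["min_len_spacer"] <= average <= params["max_len_spacer"]:
            return False
    return True
```

For example, a candidate with four 30 nt repeats and three 32 nt spacers passes with the defaults, while the same candidate with 10 nt spacers is rejected because the average spacer length falls below 18.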

- -f: input proteins fasta file path.
- -output: folder where results will be stored
- -cpu: number of CPUs to use

- -r: list of regressors to use. Available options: CART, ERT or SVM (default: ERT).
- -c: list of classifiers to use. Available options: CART, ERT or SVM (default: ERT).
- -s: list of HMM models to use. Available options: HMM1 to HMM5 and HMM2019 (default: HMM2019). The models HMM1 to HMM5 are the ones that were originally used in our paper. HMM2019 consists of the HMM models that were obtained from the most recent dataset by Makarova (2019). Setting this parameter is enough for the tool to know which ML models should be used.
- -sc: sequence completeness (used only when -st is set to dna). Available options: complete or partial (default: complete).
- -m: run mode. Available options: classification, regression or combined (default: combined).
- -cg: maximum number of contiguous gaps allowed in a cassette (default: 1)
- -cm: which ML models to use. Available options: ERT or DNN (default: ERT).

- -f: input proteins fasta file path.
- -output: folder where results will be stored
- -cpu: number of CPUs to use
- evalue_s: the number of expected hits with spacer database. The default value: 1e-7
