A C++ library for indexing genome sequencing datasets using a colored de Bruijn graph, hash functions, and a Bloom filter. The implementation is based on this library by Diego Diaz Dominguez et al.
This tool requires:
- Ubuntu 18.04
First, download the library and move to the library's root directory.
git clone git@github.com:cBioLab/hash_cdbg.git
cd hash_cdbg
Then, prepare for compilation.
mkdir build && cd build
cmake ..
If you want to install the library into a specific directory, use:
cmake .. -DCMAKE_INSTALL_PREFIX={your_install_path}/hash_cdbg
Finally, compile and install the library.
make && make install
To get started quickly, look in the util directory. build_cdbg.cpp builds an index; its contents are as follows:
#include <iostream>
#include <hash_cdbg/boss.hpp>

int main(int argc, char* argv[]) {
    std::string input_file = "data/example.fastq";
    size_t kmer_size = 30;
    size_t n_threads = 1;
    dbg_boss dbg_index(input_file, kmer_size, n_threads);
    store_to_file(dbg_index, "example.cdbg");
    return 0;
}
To compile and execute this code, do the following:
cd hash_cdbg
g++ -o build_cdbg.out ./util/build_cdbg.cpp -I {your_install_path}/include -L {your_install_path}/lib -lhash_cdbg -lsdsl -ldivsufsort -ldivsufsort64 -lpthread -lz -std=c++17 -O3
./build_cdbg.out
The resulting example.cdbg is the index file.
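For readers new to de Bruijn graphs, the kmer_size parameter above is the length k of the substrings (k-mers) the graph is built from. The following is a minimal sketch of k-mer enumeration only. It is not taken from hash_cdbg and says nothing about how the library stores or hashes k-mers; the helper name distinct_kmers is hypothetical.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper, not part of hash_cdbg: collect the distinct
// substrings of length k (k-mers) that occur in a single read.
std::set<std::string> distinct_kmers(const std::string& read, std::size_t k) {
    std::set<std::string> kmers;
    if (k == 0 || read.size() < k) {
        return kmers;  // no k-mer fits in this read
    }
    // Slide a window of width k over the read, one base at a time.
    for (std::size_t i = 0; i + k <= read.size(); ++i) {
        kmers.insert(read.substr(i, k));
    }
    return kmers;
}
```

A read of length n contributes at most n - k + 1 k-mers, so reads shorter than kmer_size contribute none in this sketch.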
To rebuild the original sequences from this index, do the following using build_fm_index.cpp and rebuild_seqs.cpp:
g++ -o build_fm_index.out ./util/build_fm_index.cpp -I {your_install_path}/include -L {your_install_path}/lib -lhash_cdbg -lsdsl -ldivsufsort -ldivsufsort64 -lpthread -lz
./build_fm_index.out data/example.fastq example
g++ -o rebuild_seqs.out ./util/rebuild_seqs.cpp -I {your_install_path}/include -L {your_install_path}/lib -lhash_cdbg -lsdsl -ldivsufsort -ldivsufsort64 -lpthread -lz -std=c++17 -O3
./rebuild_seqs.out example.cdbg example.fm_index 1 example.re
The resulting example.re.fasta is a FASTA file that contains the sequences of example.fastq and their reverse complements, rebuilt from the index.
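Because the rebuilt file contains reverse complements as well as the original reads, it can help to compute a reverse complement yourself when comparing the output against example.fastq. The sketch below is a generic illustration, not code from this library; the function names complement and reverse_complement are hypothetical, and it assumes uppercase A, C, G, T input.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper, not part of hash_cdbg: map a base to its complement.
char complement(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return base;  // assumption: leave other symbols unchanged
    }
}

// Reverse the sequence, then complement every base.
std::string reverse_complement(const std::string& seq) {
    std::string rc(seq.rbegin(), seq.rend());
    std::transform(rc.begin(), rc.end(), rc.begin(), complement);
    return rc;
}
```

Applying reverse_complement twice returns the original sequence, which makes for a quick sanity check.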
If you want to reproduce our experiments, see the experiments README.
This tool does not support reads containing N bases. As a preprocessing step, compile and run remove_n_read.cpp to remove such reads:
g++ -o remove_n_read.out ./util/remove_n_read.cpp -lpthread -std=c++17 -O3
./remove_n_read.out {your_fastq_file} {output_fastq_file}
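To make the preprocessing step concrete, here is a minimal sketch of this kind of filter. It is not the actual remove_n_read.cpp, and it assumes a simple FASTQ layout with exactly four lines per record (header, sequence, separator, quality) and an uppercase N. The names filter_n_reads and filter_n_reads_text are hypothetical.

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical sketch, not the library's remove_n_read.cpp.
// Copies every 4-line FASTQ record whose sequence has no 'N' from `in`
// to `out` and returns the number of records kept.
std::size_t filter_n_reads(std::istream& in, std::ostream& out) {
    std::string header, seq, plus, qual;
    std::size_t kept = 0;
    while (std::getline(in, header) && std::getline(in, seq) &&
           std::getline(in, plus) && std::getline(in, qual)) {
        if (seq.find('N') != std::string::npos) {
            continue;  // drop reads that contain an N base
        }
        out << header << '\n' << seq << '\n' << plus << '\n' << qual << '\n';
        ++kept;
    }
    return kept;
}

// Convenience wrapper for in-memory text, handy for small checks.
std::string filter_n_reads_text(const std::string& fastq_text) {
    std::istringstream in(fastq_text);
    std::ostringstream out;
    filter_n_reads(in, out);
    return out.str();
}
```

For real files, the same function can be called with a std::ifstream and a std::ofstream in place of the string streams.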