3D genome builder (3DGB) is a workflow to build 3D models of genomes from raw HiC data and to integrate omics data onto the produced models for further visual exploration. 3DGB bundles HiC-Pro, PASTIS and custom Python scripts into a unified Snakemake workflow with limited inputs (see Preparing Required Files). 3DGB produces annotated 3D models of genomes in PDB and G3D formats.
git clone https://github.com/data-fun/3d-genome-builder.git
cd 3d-genome-builder
Download the latest version here
Install Singularity:
sudo apt install -y ./singularity-container_3.8.7_amd64.deb
Verify version:
$ singularity --version
singularity version 3.8.7
Install conda.
Install mamba:
conda install mamba -n base -c conda-forge
Create conda environment and install dependencies:
mamba env create -f binder/environment.yml
Load conda environment:
conda activate 3DGB
Download the HiC-Pro Singularity image:
wget --ciphers=DEFAULT:@SECLEVEL=1 https://zerkalo.curie.fr/partage/HiC-Pro/hicpro_3.1.0_ubuntu.img -P images
If this command fails, try the alternate download link:
wget https://zenodo.org/record/8376626/files/hicpro_3.1.0_ubuntu.img -P images
Check the integrity of the image:
$ md5sum images/hicpro_3.1.0_ubuntu.img
d480e636397c14e187608e50309eb9af images/hicpro_3.1.0_ubuntu.img
Verify HiC-Pro version with:
$ singularity exec images/hicpro_3.1.0_ubuntu.img HiC-Pro --version
[...]
HiC-Pro version 3.1.0
and bowtie2 version:
$ singularity exec images/hicpro_3.1.0_ubuntu.img bowtie2 --version 2>/dev/null | head -n 1
/usr/local/conda/envs/hicpro/bin/bowtie2-align-s version 2.4.4
Create and edit a configuration file in YAML format. See for instance the template config_template.yml.
The reference genome fasta file must be located in WORKING_DIR/genome.fasta where WORKING_DIR is the name of the working directory as specified in your config file.
If you already have fastq files stored locally, or if some fastq files are not available on GEO or SRA, you can use these files provided they are organized in the following directory structure:
WORKING_DIR/
├── fastq_files
│   ├── ID1
│   │   ├── ID1_R1.fastq.gz
│   │   └── ID1_R2.fastq.gz
│   ├── ID2
│   │   ├── ID2_R1.fastq.gz
│   │   └── ID2_R2.fastq.gz
│   ├── ID3
│   │   ├── ID3_R1.fastq.gz
│   │   └── ID3_R2.fastq.gz
│   └── ID4
│       ├── ID4_R1.fastq.gz
│       └── ID4_R2.fastq.gz
└── genome.fasta
- WORKING_DIR is the name of the working directory as specified in your config file.
- Paired-end fastq files are in the directory WORKING_DIR/fastq_files/IDx, with IDx the identifier of the paired fastq files. Fastq identifiers are reported in the config file. Please note that fastq files have to follow the pattern <sample ID>_R<1 or 2>.fastq.gz.
Note
Please strictly follow this file organization as it is required by the 3DGB workflow.
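Because the workflow depends on this layout, it can help to check it before launching a run. The following is a minimal sketch, not part of 3DGB: the function name `check_layout` is hypothetical, and the working directory and sample identifiers are placeholders to be replaced with the values from your config file.

```python
from pathlib import Path


def check_layout(working_dir, sample_ids):
    """Return a list of problems found in the expected input layout."""
    root = Path(working_dir)
    problems = []
    # The reference genome must be WORKING_DIR/genome.fasta.
    if not (root / "genome.fasta").is_file():
        problems.append("missing genome.fasta")
    # Each sample needs fastq_files/ID/ID_R1.fastq.gz and ID_R2.fastq.gz.
    for sample_id in sample_ids:
        sample_dir = root / "fastq_files" / sample_id
        for mate in (1, 2):
            fastq = sample_dir / f"{sample_id}_R{mate}.fastq.gz"
            if not fastq.is_file():
                problems.append(f"missing {fastq.relative_to(root).as_posix()}")
    return problems


if __name__ == "__main__":
    # Placeholders: use the working directory and fastq IDs from your config file.
    for problem in check_layout("WORKING_DIR", ["ID1", "ID2"]):
        print(problem)
```

An empty list means that the genome file and both mates of every listed sample were found at the expected paths.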
Run 3DGB:
snakemake --profile smk_profile -j 4 --configfile YOUR-CONFIG.yml
Note
- Adapt YOUR-CONFIG.yml to the exact name of the config file you created.
- Option -j 4 tells Snakemake to use up to 4 cores. If you have more cores available, you can increase this value (e.g. -j 16).
Or with debugging options:
snakemake --profile smk_profile_debug -j 4 --configfile YOUR-CONFIG.yml --verbose
Depending on the number and size of fastq files, the 3D construction will take a couple of hours to run.
For troubleshooting, have a look at the log files in WORKING_DIR/logs, where WORKING_DIR is the name of the working directory as specified in your config file.
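When many log files accumulate, a small helper can show the most recent ones first. This is only a sketch under assumptions: the names `newest_logs` and `tail` are hypothetical, and the log files are assumed to be plain text stored anywhere below WORKING_DIR/logs.

```python
from pathlib import Path


def newest_logs(working_dir, limit=5):
    """Return up to `limit` files found under WORKING_DIR/logs, newest first."""
    log_dir = Path(working_dir) / "logs"
    files = [path for path in log_dir.rglob("*") if path.is_file()]
    files.sort(key=lambda path: path.stat().st_mtime, reverse=True)
    return files[:limit]


def tail(path, n=20):
    """Return the last `n` lines of a text file."""
    with open(path, errors="replace") as handle:
        return handle.read().splitlines()[-n:]


if __name__ == "__main__":
    # Placeholder: replace "WORKING_DIR" with the directory from your config file.
    for log_file in newest_logs("WORKING_DIR"):
        print(f"== {log_file}")
        print("\n".join(tail(log_file)))
```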
To map quantitative values on the model run:
python ./scripts/map_parameter.py --pdb path/to/structure.pdb --bedgraph path/to/annotation.bedgraph --output path/to/output.pdb
Quantitative values should be formatted in a 4-column bedgraph file (chromosome/start/stop/value):
chr1 0 50000 116.959
chr1 50000 100000 48.4495
chr1 100000 150000 22.8726
chr1 150000 200000 84.3106
chr1 200000 250000 113.109
Each bead of the model will be assigned a quantitative value. The resolution in the bedgraph file should match the resolution used to build the model.
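A quick pre-check of the bedgraph file can catch formatting or resolution mismatches before mapping. The sketch below is independent of `map_parameter.py` and does not reproduce its logic; the function names `read_bedgraph` and `bins_off_resolution` are hypothetical, and the check assumes that every bin starts on a multiple of the resolution and spans exactly one resolution step.

```python
def read_bedgraph(lines):
    """Parse 4-column bedgraph lines into (chromosome, start, stop, value) tuples."""
    records = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        # Skip blank lines, comments and optional track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"line {number}: expected 4 columns, found {len(fields)}")
        chrom, start, stop, value = fields
        records.append((chrom, int(start), int(stop), float(value)))
    return records


def bins_off_resolution(records, resolution):
    """Return the records whose bin does not match the given resolution."""
    return [
        record
        for record in records
        if record[1] % resolution != 0 or record[2] - record[1] != resolution
    ]


if __name__ == "__main__":
    example = ["chr1 0 50000 116.959", "chr1 50000 100000 48.4495"]
    print(bins_off_resolution(read_bedgraph(example), 50000))
```

Flagged records deserve a manual look rather than automatic rejection: the last bin of a chromosome, for instance, may legitimately be shorter than the resolution.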
Upon completion, the WORKING_DIR should look like this:
WORKING_DIR/
├── contact_maps
├── dense_matrix
├── fastq_files
├── HiC-Pro
├── logs
├── pastis
├── sequence
└── structure
The following paths contain the most interesting results:
- WORKING_DIR/contact_maps/*.png: contact maps.
- WORKING_DIR/HiC-Pro/output/hic_results/pic/*/*.pdf: graphical summaries of read alignments produced by HiC-Pro.
- WORKING_DIR/pastis/structure_RESOLUTION.pdb: raw 3D models (in PDB format) produced by PASTIS.
- WORKING_DIR/structure/RESOLUTION/structure_cleaned.*: final (annotated) 3D models in PDB and G3D formats.
Note
- WORKING_DIR is the name of the working directory as specified in your config file.
- RESOLUTION is the resolution of the Hi-C data specified in the config file.
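The result paths listed above can be gathered with a few glob patterns. This is a minimal sketch: the function name `collect_results` and the dictionary keys are hypothetical labels, while the patterns are taken directly from the paths above.

```python
from pathlib import Path


def collect_results(working_dir, resolution):
    """Map each result category to the matching files under the working directory."""
    root = Path(working_dir)
    patterns = {
        "contact_maps": "contact_maps/*.png",
        "hicpro_summaries": "HiC-Pro/output/hic_results/pic/*/*.pdf",
        "raw_models": f"pastis/structure_{resolution}.pdb",
        "final_models": f"structure/{resolution}/structure_cleaned.*",
    }
    return {name: sorted(root.glob(pattern)) for name, pattern in patterns.items()}


if __name__ == "__main__":
    # Placeholders: use the working directory and resolution from your config file.
    for name, files in collect_results("WORKING_DIR", 50000).items():
        print(name, len(files))
```

A category with an empty list points to a step that produced no output, which is a good cue to inspect the corresponding log files.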
To visualize 3D model structures (.pdb and .g3d files), follow this quick tutorial.
For visualization purposes, you can build the graph of all computational steps involved in the 3D construction of the genome:
snakemake --profile smk_profile --configfile YOUR-CONFIG.yml --rulegraph | dot -Tpdf > rules.pdf
where YOUR-CONFIG.yml should be replaced by the name of the config file you created.
With wildcards:
snakemake --profile smk_profile --configfile YOUR-CONFIG.yml --dag | dot -Tpdf > dag.pdf
