dir struct

deena-b edited this page Jun 28, 2019 · 1 revision

Directory Structure for DeepCellLineage (DCL)

Parent directory

deepcelllineage (DCL) is an organization (not a repository). For consistency among DCL contributors, we recommend you make a directory called deepcelllineage/ and clone the DCL repos that you want to contribute to within it.
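As a rough sketch of that setup, the snippet below only builds the `git clone` commands that would place each repo inside a deepcelllineage/ parent directory; it does not run them. The function name `clone_commands` and the `https://example.org/deepcelllineage` URL are placeholders, not part of DCL; substitute the organization's real URL and the repos you want.

```python
from pathlib import PurePosixPath


def clone_commands(org_url, repos, parent="deepcelllineage"):
    """Return one `git clone <url> <target>` command per repo.

    org_url is a placeholder for the organization's base URL (an assumption;
    this page does not state it). Each repo is cloned into parent/<repo>.
    """
    base = org_url.rstrip("/")
    return [
        ["git", "clone", f"{base}/{repo}.git", (PurePosixPath(parent) / repo).as_posix()]
        for repo in repos
    ]


# Print the commands for review; pass each list to subprocess.run(...) to execute.
for cmd in clone_commands("https://example.org/deepcelllineage", ["mitolin"]):
    print(" ".join(cmd))
```

Keeping every clone under the same parent directory name means relative paths such as deepcelllineage/mitolin/data resolve the same way for every contributor.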

mitolin directory

Since mitolin is the flagship directory for DCL, we decided to keep all of our data in subdirectories of mitolin/data. To avoid committing large data files to version control, we added the appropriate file extensions (e.g. .fasta) to mitolin/.gitignore.
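To check quickly whether a data file would be kept out of version control, a minimal sketch along these lines may help. The pattern list is inferred from the files marked "(.gitignore)" in the file listing on this page, so the real mitolin/.gitignore may differ. The helper `is_ignored` is hypothetical, and it matches on the file name only, which is a simplification of git's own matching rules.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Assumed patterns, inferred from the "(.gitignore)" markers in the listing;
# check mitolin/.gitignore itself for the authoritative list.
IGNORED_PATTERNS = ["*.fasta", "*.fasta.fai", "*.fa", "*.fa.fai", "*.vcf", "*.vcf.idx"]


def is_ignored(path, patterns=IGNORED_PATTERNS):
    """Return True if the file name matches any of the extension patterns."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)


for candidate in [
    "data/ref/ucsc/bundles/hg19/ucsc.hg19.fasta",
    "data/ref/ucsc/bundles/hg19/ucsc.hg19.dict",
]:
    print(candidate, "ignored" if is_ignored(candidate) else "tracked")
```

For the authoritative answer inside a clone, `git check-ignore -v <path>` reports which .gitignore rule, if any, applies to a path.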

The mitolin/ file structure currently looks like this (last updated 27/6/2019):

deepcelllineage
└── mitolin
    ├── data
    │   ├── gen
    │   │   └── nguyen_nc_2018
    │   │       ├── 20190502-KB
    │   │       │   └── ind2
    │   │       │       ├── errout
    │   │       │       └── genomic
    │   │       │           └── Basal-1-2016-A10_CGAGGCTG-GCGTAAGA_L008_R1_001
    │   │       │               ├── *.vcf           (.gitignore)
    │   │       │               └── *.vcf.idx       (.gitignore)
    │   │       ├── 20190527-pairr1r2
    │   │       │   └── ind2
    │   │       │       ├── r1_list_pairs.txt
    │   │       │       ├── r1_list.txt
    │   │       │       ├── r2_list_pairs.txt
    │   │       │       └── r2_list.txt
    │   │       ├── 20190613-fastas
    │   │       │   ├── 1457-1sttry.dict
    │   │       │   ├── 1457-1sttry.fa              (.gitignore)
    │   │       │   └── 1457-1sttry.fa.fai          (.gitignore)  
    │   │       └── 20190627-vcf2fasta
    │   ├── raw
    │   │   ├── nguyen_nc_2018
    │   │   │   ├── ind1
    │   │   │   │   └── .keep
    │   │   │   ├── ind2
    │   │   │   │   └── .keep
    │   │   │   └── ind3
    │   │   │       └── .keep
    │   │   └── play
    │   │       └── 1eg33MBfastq
    │   │           └── ind1
    │   └── ref
    │       ├── broad
    │       │   └── bundles
    │       │       ├── b37
    │       │       └── Mutect2
    │       └── ucsc
    │           └── bundles
    │               └── hg19
    │                   ├── ucsc.hg19.dict
    │                   ├── ucsc.hg19.fasta         (.gitignore)
    │                   └── ucsc.hg19.fasta.fai     (.gitignore)
    ├── nb
    │   ├── 20190527_pair_r1r2.ipynb
    │   └── 20190617-farm-on-hpc.md
    └── src
        ├── 20190502-DB-gatk4-fastq2snv.sh
        └── vcf2fasta_v0.0.1.sh

What goes in these directories?

data/ref

  • ref is short for reference; in our case, we are interested in human reference sequences
  • relevant files can be downloaded from GATK, UCSC, Sanger, NCBI, etc.

data/raw

  • files that can be downloaded from an online repository (e.g. SRA)
  • generally these should be fastq files
  • see DCL tutorial to access LaBrock data

data/gen

  • files generated by code that has a date-matched note in the nb/ (notebook) directory or a script in the src/ directory
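One way to check the date matching is sketched below. It assumes the date is an eight-digit YYYYMMDD prefix followed by "-" or "_", as in the names on this page; the helper names `date_prefix` and `unmatched_gen_dirs` are made up for illustration.

```python
import re

# Assumed convention: names start with YYYYMMDD followed by "-" or "_".
DATE_PREFIX = re.compile(r"^(\d{8})[-_]")


def date_prefix(name):
    """Return the leading YYYYMMDD string, or None if the name is undated."""
    match = DATE_PREFIX.match(name)
    return match.group(1) if match else None


def unmatched_gen_dirs(gen_dirs, notes_and_scripts):
    """List generated-data directories with no same-dated note or script."""
    documented = {date_prefix(name) for name in notes_and_scripts} - {None}
    return [d for d in gen_dirs if date_prefix(d) not in documented]


# Names copied from the listing on this page.
gen_dirs = ["20190502-KB", "20190527-pairr1r2", "20190613-fastas", "20190627-vcf2fasta"]
sources = [
    "20190527_pair_r1r2.ipynb",
    "20190617-farm-on-hpc.md",
    "20190502-DB-gatk4-fastq2snv.sh",
    "vcf2fasta_v0.0.1.sh",
]
print(unmatched_gen_dirs(gen_dirs, sources))
# → ['20190613-fastas', '20190627-vcf2fasta']
```

An undated script such as vcf2fasta_v0.0.1.sh cannot be matched by this check, which is one reason to keep the date in every file name.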

Standards

So that we can run each other's code with minimal adjustments, please put a date in the names of your markdown, Jupyter Notebook, and source files. Use the same date for the corresponding generated-file directory, and add a .keep file to your gen/nguyen_nc_2018/DATE-shortdescriptor/ind#/ directory if all of its generated files are on the .gitignore list (git does not track empty directories, so the .keep file preserves the directory in the repository).

Note the file structure gen/nguyen_nc_2018/DATE-shortdescriptor/ind#/

  • For now, we are using ind# to refer to individuals 1, 2, and 3 (rather than their SRA numbers). This is why the generated data goes in a nguyen_nc_2018/ parent directory. When we add data from other individuals, outside of this publication, we will reorganize.
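The gen/nguyen_nc_2018/DATE-shortdescriptor/ind#/ convention can be sketched as a small helper. This is an illustration, not an existing DCL script: the function name `make_gen_dir` and its parameters are assumptions, and it always writes a .keep file, which is harmless even when some generated files are tracked.

```python
import tempfile
from pathlib import Path


def make_gen_dir(data_root, date, descriptor, individual, dataset="nguyen_nc_2018"):
    """Create gen/<dataset>/<date>-<descriptor>/ind<individual>/ with a .keep file."""
    # Assumed date format: YYYYMMDD, as in the directory names on this page.
    if not (len(date) == 8 and date.isdigit()):
        raise ValueError("date should look like YYYYMMDD, e.g. 20190627")
    target = Path(data_root) / "gen" / dataset / f"{date}-{descriptor}" / f"ind{individual}"
    target.mkdir(parents=True, exist_ok=True)
    # git does not track empty directories; .keep makes the directory committable.
    (target / ".keep").touch()
    return target


# Demonstrate in a throwaway directory; in practice data_root would be mitolin/data.
with tempfile.TemporaryDirectory() as tmp:
    created = make_gen_dir(tmp, "20190627", "vcf2fasta", 2)
    print(created.relative_to(tmp).as_posix())
    # → gen/nguyen_nc_2018/20190627-vcf2fasta/ind2
```

The `dataset` argument defaults to nguyen_nc_2018, so a later reorganization for other datasets would only need a different value there.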

Ideas are welcome

If you have worked on an open source data science project in which contributors shared their work in a single repo, or you have ideas from other experience on how to organize this, please get in touch by any of the methods in the Communication section of the Overview README.

Please use this layout for one PR before offering improvements.