dir struct
deepcelllineage (DCL) is an organization (not a repository). For consistency among DCL contributors, we recommend you make a directory called deepcelllineage/ and clone the DCL repos that you want to contribute to within it.
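As a minimal sketch of that setup, the shell snippet below creates the parent directory and clones one repo into it. The GitHub URL pattern, the `DCL_ROOT` variable, and the helper names `dcl_repo_url` and `clone_dcl_repo` are assumptions for illustration and are not specified on this page; adjust them to match the actual organization URL.

```shell
#!/usr/bin/env bash
# Sketch only: the URL pattern below is an assumption, not taken from this page.
DCL_ROOT="${DCL_ROOT:-$HOME/deepcelllineage}"   # hypothetical parent directory

# Build the (assumed) clone URL for a DCL repo name such as "mitolin".
dcl_repo_url() {
    printf 'https://github.com/deepcelllineage/%s.git\n' "$1"
}

# Create the parent directory if needed, then clone the repo inside it.
clone_dcl_repo() {
    mkdir -p "$DCL_ROOT" && git clone "$(dcl_repo_url "$1")" "$DCL_ROOT/$1"
}

# Usage (requires network access):
#   clone_dcl_repo mitolin
```

Keeping every DCL repo under one parent directory means relative paths such as deepcelllineage/mitolin/data resolve the same way for every contributor.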
Since mitolin is the flagship repository for DCL, we decided to keep all of our data in subdirectories of mitolin/data. To avoid committing large data files to version control, we added the appropriate file extensions (e.g. .fasta) to mitolin/.gitignore.
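To illustrate how such ignore rules behave, here is a small sketch that writes a few extension patterns to a .gitignore and asks git which files it would ignore. The patterns are inferred from the extensions marked (.gitignore) in the tree on this page and may not match the real mitolin/.gitignore; the scratch repository is hypothetical and exists only so the check can be run safely.

```shell
#!/usr/bin/env bash
# Sketch only: runs in a throwaway repo so nothing in a real clone is touched.
scratch="$(mktemp -d)"
cd "$scratch" && git init -q .

# Assumed patterns, inferred from the extensions flagged in the tree on this page.
printf '%s\n' '*.fasta' '*.fa' '*.fai' '*.vcf' '*.vcf.idx' > .gitignore

mkdir -p data/ref/ucsc/bundles/hg19
touch data/ref/ucsc/bundles/hg19/ucsc.hg19.fasta   # large file, should be ignored
touch data/ref/ucsc/bundles/hg19/ucsc.hg19.dict    # small file, should stay visible to git

# git check-ignore prints each given path that matches an ignore rule.
ignored="$(git check-ignore data/ref/ucsc/bundles/hg19/ucsc.hg19.fasta \
                            data/ref/ucsc/bundles/hg19/ucsc.hg19.dict)"
echo "$ignored"
```

Running `git check-ignore` on a path before committing is a quick way to confirm that a new large file type is covered by the ignore list.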
The mitolin/ file structure currently looks like this (last updated 27/6/2019):
deepcelllineage
└── mitolin
    ├── data
    │   ├── gen
    │   │   └── nguyen_nc_2018
    │   │       ├── 20190502-KB
    │   │       │   └── ind2
    │   │       │       ├── errout
    │   │       │       └── genomic
    │   │       │           └── Basal-1-2016-A10_CGAGGCTG-GCGTAAGA_L008_R1_001
    │   │       │               ├── *.vcf (.gitignore)
    │   │       │               └── *.vcf.idx (.gitignore)
    │   │       ├── 20190527-pairr1r2
    │   │       │   └── ind2
    │   │       │       ├── r1_list_pairs.txt
    │   │       │       ├── r1_list.txt
    │   │       │       ├── r2_list_pairs.txt
    │   │       │       └── r2_list.txt
    │   │       ├── 20190613-fastas
    │   │       │   ├── 1457-1sttry.dict
    │   │       │   ├── 1457-1sttry.fa (.gitignore)
    │   │       │   └── 1457-1sttry.fa.fai (.gitignore)
    │   │       └── 20190627-vcf2fasta
    │   ├── raw
    │   │   ├── nguyen_nc_2018
    │   │   │   ├── ind1
    │   │   │   │   ├── .keep
    │   │   │   ├── ind2
    │   │   │   │   ├── .keep
    │   │   │   └── ind3
    │   │   │       └── .keep
    │   │   └── play
    │   │       └── 1eg33MBfastq
    │   │           └── ind1
    │   └── ref
    │       ├── broad
    │       │   └── bundles
    │       │       ├── b37
    │       │       └── Mutect2
    │       └── ucsc
    │           └── bundles
    │               └── hg19
    │                   ├── ucsc.hg19.dict
    │                   ├── ucsc.hg19.fasta (.gitignore)
    │                   └── ucsc.hg19.fasta.fai (.gitignore)
    ├── nb
    │   ├── 20190527_pair_r1r2.ipynb
    │   └── 20190617-farm-on-hpc.md
    └── src
        ├── 20190502-DB-gatk4-fastq2snv.sh
        └── vcf2fasta_v0.0.1.sh
data/ref
- ref is short for reference; in our case, these are human reference sequences
- relevant files can be downloaded from GATK, UCSC, Sanger, NCBI, etc.
data/raw
- files that can be downloaded from an online repository (e.g. SRA)
- generally these should be fastq files
- see DCL tutorial to access LaBrock data
data/gen
- files that are generated using code that has a date-matched note in the notebook nb/ directory or a script in the src/ directory
So that we can run each other's code without too many adjustments, please date your markdown, Jupyter Notebook, and source files. Please use corresponding dates for your generated file directories, and add a .keep file to your gen/nguyen_nc_2018/DATE-shortdescriptor/ind#/ directory if all of its generated files are ignored via .gitignore.
Note the file structure gen/nguyen_nc_2018/DATE-shortdescriptor/ind#/
- For now, we are using ind# to refer to individuals 1, 2, & 3 (rather than their SRA numbers). This is why the generated data is going in a nguyen_nc_2018/ parent directory. When we add data from other individuals, outside of this publication, we will reorganize.
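The naming convention above can be sketched as a small shell helper that builds the DATE-shortdescriptor/ind# path and places a .keep file in it. The function name `make_gen_dir`, the `MITOLIN_ROOT` variable, and the example date and descriptor are hypothetical; the default root is a scratch directory so the sketch can be tried without touching a real clone.

```shell
#!/usr/bin/env bash
# Sketch only: MITOLIN_ROOT and make_gen_dir are illustrative names.
MITOLIN_ROOT="${MITOLIN_ROOT:-$(mktemp -d)}"

# make_gen_dir DATE DESCRIPTOR IND
#   DATE        date shared with the matching notebook or script, e.g. 20190627
#   DESCRIPTOR  short description of the analysis, e.g. vcf2fasta
#   IND         individual number (1, 2, or 3)
make_gen_dir() {
    local dir="$MITOLIN_ROOT/data/gen/nguyen_nc_2018/$1-$2/ind$3"
    # Create the directory, add the placeholder file, and print the path.
    mkdir -p "$dir" && touch "$dir/.keep" && printf '%s\n' "$dir"
}

gen_dir="$(make_gen_dir 20190627 vcf2fasta 2)"
echo "$gen_dir"
```

Git does not track empty directories, so the .keep file is what allows the directory itself to be committed when every generated file inside it is ignored.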
If you have worked on an open source data science project, where you shared your work in a single repo, or have good ideas from other previous experiences on how to organize this, please get in touch by any of the methods in the Communication section of the Overview README.
Please use this layout for one PR before offering improvements.