Usage
The inputs to the GCSC software are:
- TWAS summary statistics
- Co-regulation score matrix
- Gene set of interest
- GWAS sample size
The GCSC software will then calculate the per-gene heritability explained by the predicted expression of all genes, and the additional heritability explained by genes in the gene set of interest. It will also return an enrichment value. GCSC is implemented in Python 3 and accepts the following command-line flags:
- --TWASdir: path to the directory of TWAS summary statistics
- --coreg: path to the directory of co-regulation score matrices
- --geneSets: path to the file containing gene set membership information
- --N: sample size of the GWAS
- --tissues: optional; restricts GCSC to a subset of tissues
- --out: directory for the output file
- --joint: run a joint regression of all gene sets in the geneSets file instead of one regression per gene set
- TWAS summary statistics: These must be in the FUSION output format, which must include the columns "FILE", "CHR", "P0", and "TWAS.Z". FUSION can only be run on one chromosome and one tissue at a time. As a result, there will be 23 files for each tissue (22 chromosomes plus 1 file containing chromosome 6 together with the MHC). These files will end with the ".dat" extension.
GCSC assumes that the results for each tissue are in a separate directory whose path contains the tissue name. For example, /n/user/TWAS_Results/Whole_Blood/Height might be the path to the TWAS results for Height using Whole_Blood. When you pass the path to --TWASdir, replace the actual tissue name with the literal word "tissue", for example /n/user/TWAS_Results/tissue/Height. GCSC then substitutes each tissue in use for the word "tissue" and so finds the TWAS results for multiple tissues. GCSC ignores the FUSION file containing the MHC region.
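The path convention above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not GCSC's actual code: the function names are hypothetical, the MHC file is assumed to be recognisable by "MHC" in its file name, and the ".dat" files are assumed to be tab-delimited with a header row.

```python
from pathlib import Path

# Columns the README says every FUSION output file must contain.
REQUIRED_COLUMNS = {"FILE", "CHR", "P0", "TWAS.Z"}


def expand_tissue_dirs(template, tissues):
    """Map each tissue to its directory by substituting it for the word 'tissue'.

    Note that every occurrence of 'tissue' in the template is replaced.
    """
    return {t: Path(template.replace("tissue", t)) for t in tissues}


def list_twas_files(directory):
    """Return the .dat files in a directory, skipping the MHC file.

    Assumption: the MHC file has 'MHC' somewhere in its name.
    """
    return sorted(
        p for p in Path(directory).glob("*.dat") if "MHC" not in p.name
    )


def missing_columns(dat_file):
    """Return the required columns that are absent from a file's header.

    Assumption: the file is tab-delimited and its first line is the header.
    """
    with open(dat_file) as handle:
        header = handle.readline().rstrip("\n").split("\t")
    return REQUIRED_COLUMNS - set(header)


if __name__ == "__main__":
    dirs = expand_tissue_dirs(
        "/n/user/TWAS_Results/tissue/Height", ["Liver", "Whole_Blood"]
    )
    for tissue, directory in dirs.items():
        print(tissue, directory)
```

Running a check like missing_columns over every file returned by list_twas_files before launching GCSC is a cheap way to catch a misplaced directory or a truncated FUSION run.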
- Co-regulation score matrices for each tissue: These are available pre-calculated for all GTEx v7 tissues here. Place all of these files in the same directory, and give the path to that directory to the --coreg flag.
- GWAS sample size: The sample size of the GWAS. Use the effective sample size if known.
- Gene set membership information: Gene set membership should be provided in a comma-separated table, where the columns are gene names (in Ensembl gene ID nomenclature, e.g. ENSG00000134014) and the rows correspond to gene sets. The index of each row should be the gene set name. For binary gene sets, each entry should be 1 if the gene is in the gene set and 0 if it is not. GCSC also allows continuous gene sets, in which case each entry should be the gene's value in the gene set. GCSC will perform a bivariable regression for each of the gene sets, with the two terms being "all genes" and "gene set genes". An example gene set membership file is available here. All genes in our gene universe (which consists of the protein-coding genes in GTEx v7) must be in the file. This list of genes is available here.
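As a rough illustration of the layout described above, the following sketch writes a binary membership table with one row per gene set and one column per gene. The function name, the gene set names, and the empty header cell above the row index are assumptions made for this example, and the three-gene universe is a stand-in for the full gene list; compare the result against the example membership file before relying on it.

```python
import csv
import io


def write_gene_set_table(gene_universe, gene_sets, handle):
    """Write a binary membership table: rows are gene sets, columns are genes.

    gene_universe: ordered list of Ensembl gene IDs (every gene gets a column).
    gene_sets: dict mapping a gene set name to the set of its member gene IDs.
    """
    writer = csv.writer(handle)
    # Assumption: the header cell above the row index is left empty.
    writer.writerow([""] + list(gene_universe))
    for name, members in gene_sets.items():
        row = [1 if gene in members else 0 for gene in gene_universe]
        writer.writerow([name] + row)


if __name__ == "__main__":
    # Tiny illustrative universe; the real one is the full gene list.
    universe = ["ENSG00000134014", "ENSG00000000003", "ENSG00000000005"]
    sets = {
        "example_set_A": {"ENSG00000134014"},
        "example_set_B": {"ENSG00000000003", "ENSG00000000005"},
    }
    buffer = io.StringIO()
    write_gene_set_table(universe, sets, buffer)
    print(buffer.getvalue())
```

Building each row from the full gene universe, rather than from the members alone, guarantees that every gene in the universe appears as a column even when it belongs to no gene set.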
- Tissues: By default, GCSC will use gene expression data from all tissues. To restrict the analysis to a certain set of tissues, use the --tissues flag, for example --tissues Liver Whole_Blood.
Example command:
python gcsc.py --geneSets /pathtoFile/SampleGeneSets.csv --TWASdir /n/user/TWAS_Results/tissue/Height --N 100000 --coreg PathToCoregFileDir --out .
gcsc.py outputs a four-column file containing, for each parameter, its name, its value, its standard error, and its two-sided p-value.
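This README does not state how the p-value is derived from the estimate. As a sanity check on an output row, one common convention is a z-test under a normal approximation, sketched below; whether GCSC uses exactly this test is an assumption, and the function name is hypothetical.

```python
import math


def two_sided_p(value, standard_error):
    """Two-sided p-value for value / standard_error under a standard normal.

    Uses the identity 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    Assumption: a normal approximation; GCSC's own test may differ.
    """
    z = value / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Hypothetical estimate and standard error, not real GCSC output.
    print(two_sided_p(0.03, 0.012))
```

If the reported p-value differs noticeably from this approximation, the software is likely using a different reference distribution or test statistic.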