Usage

Overview

The inputs to the GCSC software are:

TWAS summary statistics
Co-regulation score matrix
Gene set of interest
GWAS sample size

GCSC then calculates the per-gene heritability explained by the predicted expression of all genes, and the additional heritability explained by genes in the gene set of interest. It also returns an enrichment value. GCSC is implemented in Python 3.

Parameters

--TWASdir: path to the directory of TWAS summary statistics
--coreg: path to the directory of co-regulation score matrices
--geneSets: path to the file containing gene set membership information
--N: sample size of the GWAS
--tissues: optional; restricts GCSC to a subset of tissues
--out: directory for the output file
--joint: run a joint regression of all gene sets in the geneSets file instead of one regression per gene set

Explanation of parameters

TWAS summary statistics: These must be in the FUSION output format and must include the columns "FILE", "CHR", "P0", and "TWAS.Z". FUSION can only be run on one chromosome and one tissue at a time. As a result, there will be 23 files for each tissue (22 chromosomes + 1 file containing both chromosome 6 and the MHC). These files will end with the ".dat" extension.

GCSC assumes that the results for each tissue are in a separate directory whose path contains the tissue name. For example, /n/user/TWAS_Results/Whole_Blood/Height might be the path to the TWAS results for Height using Whole_Blood. GCSC automatically replaces the word "tissue" in the directory path with the name of each tissue being used. In the path you give to --TWASdir, you should therefore put the literal word tissue where the tissue name would go, for example /n/user/TWAS_Results/tissue/Height. This lets GCSC work out the paths to the TWAS results for multiple tissues. GCSC ignores the FUSION file containing the MHC region.
Co-regulation score matrices for each tissue: These are available pre-calculated for all GTExv7 tissues here. You should place these files all in the same directory, and give this path to the --coreg flag.
GWAS sample size: The sample size of the GWAS. Use the effective GWAS sample size if known.
Gene set membership information: Gene set membership should be provided in a comma-separated table in which the columns are genes (named by Ensembl gene ID, e.g. ENSG00000134014) and the rows are gene sets. The index of each row should be the gene set name. For binary gene sets, each cell should contain a 1 if that gene is in that gene set and a 0 if it is not. GCSC also allows continuous gene sets, in which case each cell should contain the gene's value in that gene set. GCSC will perform a bivariable regression for each of the gene sets, with the two terms being "all genes" and "gene set genes". An example gene set membership file is available here. All genes in our gene universe (which consists of the protein-coding genes in GTExv7) must be in the file. This list of genes is available here.
Tissues: By default, GCSC will use gene expression data from all tissues. If you want to restrict the analysis to a certain set of tissues, you can do so with the --tissues flag. For example, you could use --tissues Liver Whole_Blood
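
The gene set membership layout described above (rows are gene sets, columns are Ensembl gene IDs, cells are 0 or 1) can be sketched as follows. This is a minimal illustration and is not part of GCSC: the helper build_gene_set_table, the gene set names set_A and set_B, and the three-gene universe are hypothetical, and leaving the header cell above the row index empty is an assumption about what gcsc.py expects. A real file must contain every gene in the gene universe.

```python
import csv
import io


def build_gene_set_table(universe, gene_sets):
    """Return CSV text with one row per gene set and one column per gene.

    universe:  ordered list of Ensembl gene IDs (the columns).
    gene_sets: dict mapping gene set name -> iterable of member gene IDs.
    """
    universe_lookup = set(universe)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # Header row: empty index cell (assumed), then one column per gene.
    writer.writerow([""] + list(universe))
    for name, members in gene_sets.items():
        members = set(members)
        unknown = members - universe_lookup
        if unknown:
            raise ValueError(f"{name}: genes not in universe: {sorted(unknown)}")
        # Binary membership: 1 if the gene is in the set, 0 otherwise.
        writer.writerow([name] + [1 if g in members else 0 for g in universe])
    return buf.getvalue()


# Illustrative three-gene universe; the real one is the GTExv7 protein-coding list.
universe = ["ENSG00000134014", "ENSG00000000003", "ENSG00000000005"]
gene_sets = {
    "set_A": ["ENSG00000134014", "ENSG00000000005"],
    "set_B": ["ENSG00000000003"],
}
table = build_gene_set_table(universe, gene_sets)
print(table)
# ,ENSG00000134014,ENSG00000000003,ENSG00000000005
# set_A,1,0,1
# set_B,0,1,0
```

For continuous gene sets, the same layout would hold the gene's value in each cell instead of 0 or 1.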

Sample command

python gcsc.py --geneSets /pathtoFile/SampleGeneSets.csv --TWASdir /n/user/TWAS_Results/tissue/Height --N 100000 --coreg PathToCoregFileDir --out .
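
Note that the --TWASdir value in this command contains the literal word tissue. The substitution described under "Explanation of parameters" can be sketched as below. This is only an illustration of the path convention, not the code gcsc.py actually runs; the function name expand_tissue_paths is hypothetical.

```python
def expand_tissue_paths(twas_dir_template, tissues):
    """Map each tissue name to its TWAS results directory.

    twas_dir_template: path containing the literal placeholder word "tissue".
    tissues:           list of tissue names, e.g. ["Liver", "Whole_Blood"].
    """
    if "tissue" not in twas_dir_template:
        raise ValueError('path must contain the placeholder word "tissue"')
    # str.replace substitutes every occurrence of the placeholder.
    return {t: twas_dir_template.replace("tissue", t) for t in tissues}


paths = expand_tissue_paths(
    "/n/user/TWAS_Results/tissue/Height", ["Liver", "Whole_Blood"]
)
for tissue_name, path in paths.items():
    print(tissue_name, path)
# Liver /n/user/TWAS_Results/Liver/Height
# Whole_Blood /n/user/TWAS_Results/Whole_Blood/Height
```

Because a plain string replacement touches every occurrence, it is safest to keep the word tissue out of the rest of the path.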

Output

gcsc.py outputs a four-column file containing each parameter's name, its value, its standard error, and its two-sided p-value.
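
As a rough guide to how the last three columns relate, a two-sided p-value can be derived from an estimate and its standard error under a normal approximation. This page does not state which test GCSC uses, so the normal approximation here is an assumption, and the function name two_sided_p is hypothetical.

```python
import math


def two_sided_p(value, standard_error):
    """Two-sided p-value for value / standard_error under a standard normal."""
    if standard_error <= 0:
        raise ValueError("standard error must be positive")
    z = value / standard_error
    # For Z ~ N(0, 1): P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical estimate and standard error, giving z = 1.96.
p = two_sided_p(0.0196, 0.01)
print(p)
```

An estimate about 1.96 standard errors from zero gives a p-value close to 0.05 under this approximation.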
