Converts TSV files to two parquet files per study:
*_genetic_alterations.parquet - patterned after cgds.sql genetic_alteration
*_genetic_profile_samples.parquet - patterned after cgds.sql genetic_profile_samples

go run ./cmd/cli/main.go -mode convert -tsv-dir ./data -parquet-dir ./output

where:
-tsv-dir: Path to the root directory containing study data (/datahub/public)
-parquet-dir: Path to place individual CNA parquet files

Converts TSV files to three parquet files per study (includes derived denormalized format):
*_genetic_alterations.parquet - patterned after cgds.sql genetic_alteration
*_genetic_profile_samples.parquet - patterned after cgds.sql genetic_profile_samples
*_derived.parquet - derived record (gene, sample, measurement)

go run ./cmd/cli/main.go -mode convert-with-derived -tsv-dir ./data -parquet-dir ./output

The derived format contains one row per sample-gene combination with columns:
SAMPLE_ID - Sample identifier
CANCER_STUDY - Cancer study identifier
GENE_SYMBOL - Gene symbol
GENETIC_PROFILE - Genetic profile identifier
ALTERATION - CNA alteration value

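To make the "one row per sample-gene combination" shape concrete, here is a minimal Go sketch of how one gene's values across samples might be flattened into derived rows. The `DerivedRow` struct, the `Flatten` helper, the choice of `string` for the alteration value, and the sample data are all assumptions made for illustration; only the five column names come from the list above, and the converter's actual types may differ.

```go
package main

import "fmt"

// DerivedRow mirrors the five derived columns. The struct and its Go field
// names are hypothetical; only the column names in the comments are from the README.
type DerivedRow struct {
	SampleID       string // SAMPLE_ID
	CancerStudy    string // CANCER_STUDY
	GeneSymbol     string // GENE_SYMBOL
	GeneticProfile string // GENETIC_PROFILE
	Alteration     string // ALTERATION (kept as text here; the real type is an assumption)
}

// Flatten expands one gene's per-sample values into one row per
// sample-gene combination. sampleIDs and values must be the same length.
func Flatten(study, profile, gene string, sampleIDs, values []string) ([]DerivedRow, error) {
	if len(sampleIDs) != len(values) {
		return nil, fmt.Errorf("gene %s: %d samples but %d values", gene, len(sampleIDs), len(values))
	}
	rows := make([]DerivedRow, 0, len(sampleIDs))
	for i, id := range sampleIDs {
		rows = append(rows, DerivedRow{
			SampleID:       id,
			CancerStudy:    study,
			GeneSymbol:     gene,
			GeneticProfile: profile,
			Alteration:     values[i],
		})
	}
	return rows, nil
}

func main() {
	// Placeholder identifiers and values, not real study data.
	rows, err := Flatten("study_a", "study_a_cna", "TP53",
		[]string{"S1", "S2"}, []string{"-2", "0"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range rows {
		fmt.Printf("%+v\n", r)
	}
}
```

The denormalized layout repeats the study and profile identifiers on every row, which trades file size for the convenience of querying by gene or sample without a join.
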
Combines individual parquet files into two final parquet files:
combined-*_genetic_alterations.parquet
combined-*_genetic_profile_samples.parquet

go run ./cmd/cli/main.go -mode combine -parquet-dir ./output -output combined-all-cna.parquet

where:
-parquet-dir: Path where individual CNA parquet files reside
-output: Base filename for combined parquet files (will generate two files with different suffixes)

Combines individual parquet files into three final parquet files (includes derived):
combined-*_genetic_alterations.parquet
combined-*_genetic_profile_samples.parquet
combined-*_derived.parquet

go run ./cmd/cli/main.go -mode combine-with-derived -parquet-dir ./output -output combined-all-cna.parquet

where:
-parquet-dir: Path where individual CNA parquet files reside
-output: Base filename for combined parquet files (will generate three files with different suffixes)
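
As a rough illustration of how a base filename could expand into several suffixed outputs, the Go sketch below inserts a suffix before the file extension. The `OutputNames` helper and the exact placement of the suffix are assumptions; the README only states that the combine modes generate two or three files with different suffixes, so check the tool's actual output for the real names.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// OutputNames is a hypothetical helper: it derives one filename per output
// table from the -output base name by inserting a suffix before the extension.
// The real CLI may name its files differently.
func OutputNames(base string, withDerived bool) []string {
	ext := filepath.Ext(base)            // e.g. ".parquet"
	stem := strings.TrimSuffix(base, ext) // e.g. "combined-all-cna"

	suffixes := []string{"_genetic_alterations", "_genetic_profile_samples"}
	if withDerived {
		// combine-with-derived adds a third output.
		suffixes = append(suffixes, "_derived")
	}

	names := make([]string, 0, len(suffixes))
	for _, s := range suffixes {
		names = append(names, stem+s+ext)
	}
	return names
}

func main() {
	for _, name := range OutputNames("combined-all-cna.parquet", true) {
		fmt.Println(name)
	}
}
```
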