Finalizing main overhaul steps #103

rdmorin · 2025-02-01T16:31:56Z

In this PR I've added one more important S3 class named bed_data. I also fixed a few bugs that I had introduced in the previous PR. Here are a few (hopefully) inspiring examples that show off the new functionality:

Example 1: maf_data

get_coding_ssm_status will fail to run (by design) when given a MAF object that is from a different genome build than the one specified via the projection parameter:

# all lymphoma genes from bundled NHL gene list
coding_tabulated_df = get_coding_ssm_status()

# this example will fail because hg38 is not supported by this function (yet)
coding_tabulated_df = get_coding_ssm_status(
  maf_data = get_coding_ssm(projection = "hg38"))
# Error in get_coding_ssm_status(maf_data = get_coding_ssm(projection = "hg38")) : 
# Currently only grch37 projection (hg19 genome build) is supported.
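
To illustrate the kind of guard that produces an error like the one above, here is a minimal base-R sketch. It is not the GAMBLR implementation: the names `new_maf_sketch` and `check_projection_sketch`, and the choice to store the build in an attribute called `genome_build`, are assumptions made purely for illustration.

```r
# Hypothetical constructor: tags a data frame with a class and a genome build.
# The attribute name "genome_build" is an assumption made for this sketch.
new_maf_sketch <- function(df, genome_build) {
  stopifnot(is.data.frame(df),
            is.character(genome_build),
            length(genome_build) == 1)
  attr(df, "genome_build") <- genome_build
  class(df) <- c("maf_sketch", class(df))
  df
}

# Hypothetical guard: stop early when the object's build is not supported.
check_projection_sketch <- function(maf, supported = "grch37") {
  build <- attr(maf, "genome_build")
  if (is.null(build) || !build %in% supported) {
    stop("Currently only ", paste(supported, collapse = ", "),
         " projection is supported.", call. = FALSE)
  }
  invisible(maf)
}

# Toy single-row object labelled as hg38 (not real data).
maf_hg38 <- new_maf_sketch(
  data.frame(Hugo_Symbol = "MYC", Tumor_Sample_Barcode = "S1"),
  genome_build = "hg38"
)

# tryCatch turns the intentional failure into a message we can inspect.
result <- tryCatch(
  check_projection_sketch(maf_hg38),
  error = function(e) conditionMessage(e)
)
print(result)
```

Because the build travels with the object, the check needs no extra argument from the caller, which is what makes failing by design cheap to implement.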

Example 2: bed_data

We can now more consistently and easily make a bed_data object from bed-like data frames. The output will always have the first three columns with the same naming pattern and will ensure that the chromosome prefixing matches the genome build.

# basic usage, adding custom names from bundled ashm data frame
regions_bed = create_bed_data(GAMBLR.data::grch37_ashm_regions,
                              fix_names = "concat",
                              concat_cols = c("gene","region"),
                              sep = "-")
                          
# This example intentionally fails: regions_bed is grch37 but projection is hg38
ashm_maf = get_ssm_by_regions(regions_bed = regions_bed,
                              these_samples_metadata = my_meta,
                              projection = "hg38")
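
As a rough illustration of the column and prefix standardisation described in Example 2, the base-R sketch below renames the first three columns and makes the chromosome prefix follow the genome build. The helper name `standardize_bed_sketch` is hypothetical, the convention that grch37 is unprefixed while hg38 uses "chr" is an assumption of this sketch, and the coordinates are toy values rather than real regions.

```r
# Hypothetical helper: standardise a bed-like data frame.
# Assumes grch37 chromosomes are unprefixed ("8") and hg38 ones are
# prefixed ("chr8"); adjust if your data follow a different convention.
standardize_bed_sketch <- function(df, genome_build = c("grch37", "hg38")) {
  genome_build <- match.arg(genome_build)
  stopifnot(is.data.frame(df), ncol(df) >= 3)
  # Force a consistent naming pattern on the first three columns.
  names(df)[1:3] <- c("chrom", "start", "end")
  # Strip any existing prefix, then re-add it only where the build expects it.
  bare <- sub("^chr", "", as.character(df$chrom))
  df$chrom <- if (genome_build == "hg38") paste0("chr", bare) else bare
  df
}

# Toy input with mixed prefixing and non-standard column names.
bed_like <- data.frame(
  chr = c("chr8", "18"),
  from = c(100L, 200L),
  to = c(150L, 260L),
  gene = c("MYC", "BCL2")
)

bed_grch37 <- standardize_bed_sketch(bed_like, "grch37")
bed_hg38 <- standardize_bed_sketch(bed_like, "hg38")
print(bed_grch37)
print(bed_hg38)
```

Stripping the prefix before re-adding it keeps the helper idempotent, so running it twice on the same data frame gives the same result.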

Example 3: bed_data

The create_bed_data function has a few convenience features that minimize the risk that the user will do something wrong (e.g. it ensures unique names in the name column). This means downstream functions can rely on consistency for ease of programming. These objects also know which genome build they belong to. For objects in GAMBLR.data, this is inferred from the genome build string that is embedded in the variable name. This protects against accidentally mixing regions/coordinates from one genome build with functions operating on another. The example code below shows how this was incorporated into get_ssm_by_regions in preparation for cool_overlaps. Now duplicated "names", weirdly ordered columns, chr-prefix mismatches, or the lack of a column named "name" can become a thing of the past!

if(!missing(regions_bed) & "bed_data" %in% class(regions_bed)){
  regions_df = dplyr::select(regions_bed,1:4) %>%
    dplyr::rename(c("Chromosome"="chrom",
                    "Start_Position"="start",
                    "End_Position"="end",
                    "region"="name"))
}else{
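
Two of the convenience features mentioned in Example 3 (unique names and inferring the genome build from a variable name) can be sketched in base R as follows. Both helper names are hypothetical and this is not how create_bed_data is necessarily written; the toy data frame is made up for the demonstration.

```r
# Hypothetical helper: infer the genome build from a variable name such as
# "grch37_ashm_regions"; returns NA when zero or several builds match.
infer_build_sketch <- function(var_name, known = c("grch37", "hg38")) {
  hits <- known[vapply(known, grepl, logical(1), x = var_name, fixed = TRUE)]
  if (length(hits) == 1) hits else NA_character_
}

# Hypothetical helper: build a "name" column by pasting columns together,
# then de-duplicate any repeats with make.unique().
make_region_names_sketch <- function(df, concat_cols, sep = "-") {
  pasted <- do.call(paste, c(unname(as.list(df[concat_cols])), sep = sep))
  make.unique(pasted, sep = "_")
}

# Toy table in which the first two rows would otherwise share a name.
toy <- data.frame(
  gene = c("MYC", "MYC", "BCL2"),
  region = c("TSS", "TSS", "intron-1")
)

region_names <- make_region_names_sketch(toy, c("gene", "region"))
build <- infer_build_sketch("grch37_ashm_regions")
print(region_names)
print(build)
```

Downstream code that joins or groups on the name column can then assume one row per name, which is the consistency guarantee the paragraph above relies on.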

rdmorin added 8 commits February 1, 2025 06:16

bug fixes, bed_data support

6bc7a4c

stringr

fe39251

stop-gap for duplicate rows in metadata

e7522a0

extending bed_data support to other functions

08bdc1c

more compatability

60c1676

tweak

4f96e1f

remove stringr

91251eb

remove stringr for real this time

f620ee8

Kdreval approved these changes Feb 1, 2025


Kdreval merged commit 7609c4c into master Feb 1, 2025
1 check passed
