@rdmorin rdmorin commented Feb 1, 2025

In this PR I've added one more important S3 class, named bed_data. I also fixed a few bugs that I had introduced in the previous PR. Here are a few (hopefully) inspiring examples that show off the new functionality:

Example 1: maf_data

get_coding_ssm_status will fail to run (by design) when given a MAF object that is from a different genome build than the one specified via the projection parameter:

#all lymphoma genes from bundled NHL gene list
coding_tabulated_df = get_coding_ssm_status()

#this example will fail because hg38 is not supported by this function (yet)
coding_tabulated_df = get_coding_ssm_status(maf_data=
                        get_coding_ssm(projection = "hg38"))
# Error in get_coding_ssm_status(maf_data = get_coding_ssm(projection = "hg38")) : 
# Currently only grch37 projection (hg19 genome build) is supported.
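
For readers unfamiliar with the pattern, here is a minimal base-R sketch of how an S3 object can carry its genome build and how a function can refuse a mismatched projection. It is not the GAMBLR implementation: the class name `toy_maf_data`, the helper `check_projection`, and the `genome_build` attribute are all hypothetical names chosen for illustration, and the single data row is a placeholder.

```r
# Hypothetical constructor: stores the genome build as an attribute.
# The attribute name "genome_build" is an assumption, not GAMBLR's API.
new_toy_maf <- function(df, genome_build) {
  structure(df,
            class = c("toy_maf_data", class(df)),
            genome_build = genome_build)
}

# Hypothetical guard: stop early when the object and the requested
# projection disagree, rather than silently mixing coordinates.
check_projection <- function(maf, projection = "grch37") {
  build <- attr(maf, "genome_build")
  if (!identical(build, projection)) {
    stop(sprintf("Object is on %s but projection = \"%s\" was requested.",
                 build, projection))
  }
  invisible(TRUE)
}

# Placeholder row, for illustration only
maf <- new_toy_maf(
  data.frame(Hugo_Symbol = "MYC", Chromosome = "8", Start_Position = 1000L),
  genome_build = "grch37"
)

check_projection(maf, "grch37")   # passes silently

# Capture the error message instead of halting the script
result <- tryCatch(check_projection(maf, "hg38"),
                   error = function(e) conditionMessage(e))
```

Failing loudly at the top of the function is the design choice being illustrated: the mismatch is caught before any coordinates are compared.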

Example 2: bed_data

We can now more consistently and easily make a bed_data object from bed-like data frames. The output will always have the first three columns with the same naming pattern and will ensure that the chromosome prefixing matches the genome build.

#basic usage, adding custom names from bundled ashm data frame
regions_bed = create_bed_data( GAMBLR.data::grch37_ashm_regions,
                          fix_names = "concat",
                          concat_cols = c("gene","region"),
                          sep="-")
                          
# This example intentionally fails: regions_bed was built from grch37 regions
# but projection is set to "hg38"
ashm_maf = get_ssm_by_regions(regions_bed = regions_bed,
                              these_samples_metadata = my_meta,
                              projection = "hg38")
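
To make the "same first three columns, matching chromosome prefix" guarantee concrete, here is a rough base-R sketch of a constructor with that behaviour. It is not `create_bed_data` itself: `toy_bed_data` is a hypothetical name, the prefix convention (unprefixed for grch37, `chr`-prefixed for hg38) is an assumption stated in the code, and the coordinates are made up.

```r
# Hypothetical constructor illustrating column and prefix standardization.
toy_bed_data <- function(df, genome_build) {
  stopifnot(ncol(df) >= 3)
  # Always expose the first three columns under the same names
  names(df)[1:3] <- c("chrom", "start", "end")
  # Strip any existing prefix, then re-apply it according to the build.
  # Assumed convention: grch37 is unprefixed, hg38 uses "chr".
  bare <- sub("^chr", "", as.character(df$chrom))
  df$chrom <- if (identical(genome_build, "hg38")) paste0("chr", bare) else bare
  structure(df,
            class = c("toy_bed_data", class(df)),
            genome_build = genome_build)
}

# Made-up regions with inconsistent column names and a chr prefix
raw <- data.frame(chr  = c("chr3", "chr18"),
                  s    = c(100L, 200L),
                  e    = c(150L, 250L),
                  gene = c("BCL6", "BCL2"))

bed37 <- toy_bed_data(raw, "grch37")
bed38 <- toy_bed_data(raw, "hg38")

names(bed37)[1:3]   # "chrom" "start" "end"
bed37$chrom         # "3" "18"
bed38$chrom         # "chr3" "chr18"
```

Note that this sketch only relabels chromosomes; it does not convert coordinates between builds.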

Example 3: bed_data

The create_bed_data function has a few convenience features that minimize the risk of user error (e.g. it ensures that the values in the name column are unique). This means downstream functions can rely on a consistent structure, which simplifies programming. These objects also know which genome build they belong to. For objects in GAMBLR.data, this is inferred from the genome build string that is embedded in the variable name. This protects against accidentally mixing regions/coordinates from one genome build with functions that expect another. The code below shows how this was incorporated into get_ssm_by_regions in preparation for cool_overlaps. Now duplicated "names", weirdly ordered columns, chr-prefix mismatches, or the lack of a column named "name" can become a thing of the past!

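# && short-circuits, so class() is never evaluated when regions_bed is missing
if(!missing(regions_bed) && "bed_data" %in% class(regions_bed)){
      # rename the standardized bed_data columns to MAF-style column names
      regions_df = dplyr::select(regions_bed,1:4) %>%
        dplyr::rename(c("Chromosome"="chrom",
                        "Start_Position"="start",
                        "End_Position"="end",
                        "region"="name"))
    }else{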
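
Two of the convenience features described above, unique names and inferring the build from a variable name, can be sketched in a few lines of base R. This is an illustration under assumptions, not the logic inside create_bed_data: `infer_build` is a hypothetical helper, the list of recognized builds is assumed, and the gene/region values are placeholders.

```r
# Hypothetical helper: look for a known build string inside a variable name.
# Returns NA when zero or several builds match.
infer_build <- function(var_name, builds = c("grch37", "hg38")) {
  found <- vapply(builds,
                  function(b) grepl(b, var_name, fixed = TRUE),
                  logical(1))
  hit <- builds[found]
  if (length(hit) == 1) hit else NA_character_
}

infer_build("grch37_ashm_regions")   # "grch37"
infer_build("my_regions")            # NA

# Build names by concatenating two columns (as with fix_names = "concat"),
# then force uniqueness so downstream code can key on them safely.
gene   <- c("BCL6", "BCL6", "MYC")
region <- c("TSS", "TSS", "TSS")
raw_names    <- paste(gene, region, sep = "-")
unique_names <- make.unique(raw_names, sep = "_")
unique_names                         # "BCL6-TSS" "BCL6-TSS_1" "MYC-TSS"
```

Whether create_bed_data resolves duplicates by suffixing, as `make.unique` does here, or by some other rule is not stated in this PR.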

@Kdreval Kdreval merged commit 7609c4c into master Feb 1, 2025
1 check passed