Hello,
Thanks for such a great package!
In the genomic_distribution function, I understand that the expected number of mutations for a region of interest is calculated as
n_muts / surveyed_length * surveyed_region_length
However, does this provide an accurate estimate when dealing with INDELs? I would not think so, since for INDELs n_muts is not equal to the total number of mutated bases (as it is for SNVs).
Any thoughts on a better way to calculate the expected number of INDELs?
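To make the formula above concrete, here is a minimal Python sketch of the calculation as I understand it. The function name `expected_mutations` and the numbers are my own illustration, not the package's code or real data.

```python
def expected_mutations(n_muts, surveyed_length, surveyed_region_length):
    """Expected mutation count in a region under a uniform rate.

    n_muts counts mutation events, so an INDEL of any length
    contributes 1 here, exactly like an SNV.
    """
    if surveyed_length <= 0:
        raise ValueError("surveyed_length must be positive")
    # Genome-wide rate per surveyed base, scaled to the region size.
    return n_muts / surveyed_length * surveyed_region_length


# Illustrative (made-up) numbers: 200 events over 1 Mb surveyed,
# of which 50 kb falls inside the region of interest.
expected = expected_mutations(200, 1_000_000, 50_000)
print(f"expected events in region: {expected:.2f}")
```

Because the input is an event count, the result does not change if the same INDELs are longer or shorter, which is the part I am unsure about.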
One solution I have tried is to randomly shuffle the INDELs (accounting for sequence context) and then count how many land in the region of interest. When I do this, I get an observed/expected ratio of ~1, which is what I would expect. However, I am unsure how I would then test whether this is significant using the binomial_test function. Would it make sense to do something along these lines?
p = n_INDELs / surveyed_length
n = surveyed_region_length
x = observed_INDELs # number of INDELs observed to land in region of interest from the randomly shuffled files
binomial_test(p, n, x)
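To spell out what I mean by the call above, here is a rough Python sketch of a one-sided binomial test with the same p, n and x. The function `one_sided_binomial_pvalue` is a hypothetical stand-in I wrote for illustration; I am assuming, not asserting, that binomial_test does something similar, and the numbers are made up.

```python
import math


def binom_log_pmf(k, n, p):
    """log P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def binom_cdf(x, n, p):
    """P(X <= x), summed term by term (fine when x is small)."""
    if x < 0:
        return 0.0
    total = sum(math.exp(binom_log_pmf(k, n, p)) for k in range(min(x, n) + 1))
    return min(1.0, total)


def one_sided_binomial_pvalue(p, n, x):
    """Hypothetical stand-in: test depletion if x is below p * n, else enrichment."""
    if x < p * n:
        return "depletion", binom_cdf(x, n, p)
    # Upper tail P(X >= x) = 1 - P(X <= x - 1).
    return "enrichment", max(0.0, 1.0 - binom_cdf(x - 1, n, p))


# Made-up inputs mirroring the variables above.
n_indels, surveyed_length, surveyed_region_length = 200, 1_000_000, 50_000
p = n_indels / surveyed_length
n = surveyed_region_length
x = 18  # illustrative count of INDELs in the region of interest
effect, pval = one_sided_binomial_pvalue(p, n, x)
print(effect, pval)
```

Note that this treats every surveyed base in the region as an independent trial, which is the assumption I am least sure holds for INDELs.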
Any input would be wonderful, thanks!
Ronnie