-
Notifications
You must be signed in to change notification settings - Fork 3
Open
Description
I noticed using bedshift recently that the algorithm seems to produce regions that are invalid. Specifically, it shifted my bedfile to create a region with a start that occurred after the end (i.e. chr1 1400 900). bedtools was throwing errors because of this.
I saw this happen after I did repeated rounds of shifting on a single bedfile so I'm not sure if this is because of that.
To create the shifted bed files, this is what I ran (I can provide filesm GitHub doesn't let you upload bedfiles):
# %%
import os
from tqdm import tqdm
from bedshift import Bedshift
# %%
# params
ADD_RATE = 0.01
ADD_MEAN = 320.0
ADD_STDEV = 20.0
SHIFT_RATE = 0.01
SHIFT_MEAN = -10.0
SHIFT_STDEV = 120.0
CUT_RATE = 0.01
MERGE_RATE = 0.01
DROP_RATE = 0.03
N_SHIFTS = 100
OUT_DIR = "shifted"
# %%
# create bedshifter object on original bed file
bedshifter = Bedshift("pbmcs.bed", "hg38.chrom.sizes")
for shift_num in tqdm(range(N_SHIFTS), total=N_SHIFTS):
file_name = f"shifted_{shift_num}.bed"
file_path = os.path.join(OUT_DIR, file_name)
bedshifter.all_perturbations(
addrate=ADD_RATE, # the rate (as a proportion of the total number of regions) to add regions
addmean=ADD_MEAN, # the mean length of added regions
addstdev=ADD_STDEV, # the standard deviation of the length of added regions
shiftrate=SHIFT_RATE, # the rate to shift regions (both the start and end are shifted by the same amount)
shiftmean=SHIFT_MEAN, # the mean shift distance
shiftstdev=SHIFT_STDEV, # the standard deviation of the shift distance
cutrate=CUT_RATE, # the rate to cut regions into two separate regions
mergerate=MERGE_RATE, # the rate to merge two regions into one
droprate=DROP_RATE, # the rate to drop/remove regions
seed=42,
)
bedshifter.to_bed(file_path)
# start at the file we just created
bedshifter = Bedshift(file_path, "hg38.chrom.sizes")Then I analyzed overlaps with bedtools with (one directory up):
#!/bin/bash
mkdir -p results/
rm results/intersect.csv
touch results/intersect.csv
original_file="data/pbmcs.bed"
# add baseline to results
total=$(wc -l < "$original_file")
n_overlaps=$(bedtools intersect -wa -a "$original_file" -b "$original_file" | wc -l)
percentage=$(echo "scale=2; $n_overlaps / $total" | bc)
echo "pbmcs.bed,$percentage" >> results/intersect.csv
for file in data/shifted_clean/*.bed; do
filename=$(basename "$file")
# compute num intersections and store
total=$(wc -l < "$file")
n_overlaps=$(bedtools intersect -u -a "$file" -b "$original_file" | wc -l)
percentage=$(echo "scale=2; $n_overlaps / $total" | bc)
echo "Processing $file"
echo "$filename,$percentage" >> results/intersect.csv
doneMetadata
Metadata
Assignees
Labels
No labels