Bedshift seems to produce invalid regions

I noticed using `bedshift` recently that the algorithm seems to produce regions that are *invalid*. Specifically, it shifted my bedfile to create a region with a start that occurred after the end (i.e. `chr1    1400    900`). `bedtools` was throwing errors because of this.

I saw this happen after I did repeated rounds of shifting on a single bedfile so I'm not sure if this is because of that.

**To create the shifted bed files, this is what I ran (I can provide filesm GitHub doesn't let you upload bedfiles):**
```python
# %%
import os
from tqdm import tqdm
from bedshift import Bedshift

# %%
# params
ADD_RATE = 0.01
ADD_MEAN = 320.0
ADD_STDEV = 20.0
SHIFT_RATE = 0.01
SHIFT_MEAN = -10.0
SHIFT_STDEV = 120.0
CUT_RATE = 0.01
MERGE_RATE = 0.01
DROP_RATE = 0.03
N_SHIFTS = 100
OUT_DIR = "shifted"

# %%
# create bedshifter object on original bed file
bedshifter = Bedshift("pbmcs.bed", "hg38.chrom.sizes")

for shift_num in tqdm(range(N_SHIFTS), total=N_SHIFTS):
    file_name = f"shifted_{shift_num}.bed"
    file_path = os.path.join(OUT_DIR, file_name)

    bedshifter.all_perturbations(
        addrate=ADD_RATE,  # the rate (as a proportion of the total number of regions) to add regions
        addmean=ADD_MEAN,  # the mean length of added regions
        addstdev=ADD_STDEV,  # the standard deviation of the length of added regions
        shiftrate=SHIFT_RATE,  #  the rate to shift regions (both the start and end are shifted by the same amount)
        shiftmean=SHIFT_MEAN,  #  the mean shift distance
        shiftstdev=SHIFT_STDEV,  #  the standard deviation of the shift distance
        cutrate=CUT_RATE,  #  the rate to cut regions into two separate regions
        mergerate=MERGE_RATE,  #  the rate to merge two regions into one
        droprate=DROP_RATE,  # the rate to drop/remove regions
        seed=42,
    )

    bedshifter.to_bed(file_path)

    # start at the file we just created
    bedshifter = Bedshift(file_path, "hg38.chrom.sizes")

```

**Then I analyzed overlaps with `bedtools` with (one directory up):**

```sh
#!/bin/bash

mkdir -p results/
rm results/intersect.csv
touch results/intersect.csv

original_file="data/pbmcs.bed"

# add baseline to results
total=$(wc -l < "$original_file")
n_overlaps=$(bedtools intersect -wa -a "$original_file" -b "$original_file" | wc -l)
percentage=$(echo "scale=2; $n_overlaps / $total" | bc)

echo "pbmcs.bed,$percentage" >> results/intersect.csv

for file in data/shifted_clean/*.bed; do
    filename=$(basename "$file")

    # compute num intersections and store
    total=$(wc -l < "$file")
    n_overlaps=$(bedtools intersect -u -a "$file" -b "$original_file" | wc -l)
    percentage=$(echo "scale=2; $n_overlaps / $total" | bc)
    
    echo "Processing $file"
    echo "$filename,$percentage" >> results/intersect.csv

done
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bedshift seems to produce invalid regions #49

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Bedshift seems to produce invalid regions #49

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions