Question

Bedtools merge minimum overlap?

0

Entering edit mode

8 months ago

SJP • 0

I have a bedfile with multiple mapped sequences like below:

chr1    100     200     region1    +
chr1    180     300     region2    +
chr1    300     350     region3    +
chr1    320     400     region4    +
chr1    330     410     region5    +

I am using bedtools merge to merge sequences that overlap - however, I really need to set a minimum overlap cut-off. Bedtools merge will overlap bookended sequences, and sequences overlapping by just one basepair, and there is no way to change this.

E.g. regions 1 and 2 overlap by only 20bp, and in this case I would not like them merged. However, regions 4 and 5 mostly overlap, and I would like these sequences to be merged. Its also really important that I am able to retain the ID and strand information in the resulting merged bed entry, as well as enforcing strandedness when merging.

Does anyone know of a program or method to do this? I have also tried bedOPS with no success. I am new to Python programming so not quite sure how I would be able to code this myself!

bedtools genomics • 691 views

ADD COMMENT • link updated 8 months ago by Alex Reynolds 36k • written 8 months ago by SJP • 0

1

Entering edit mode

Not sure if you wanting to do like this-

cat your_input.bed 
chr1    100     200     region1    +
chr1    180     300     region2    +
chr1    300     350     region3    +
chr1    320     400     region4    +
chr1    330     410     region5    +

#BEDOPS bedmap to merge beds if there greater than 20% overlaps between beds:
bedmap --count --echo-map-range --fraction-both 0.2 --delim "\t" your_input.bed  |cut -f2- -|sort-bed -|uniq -
chr1    100 200
chr1    180 300
chr1    300 410

ADD REPLY • link 8 months ago by bk11 ★ 3.0k

0

Entering edit mode

Note that, in place of sort-bed - | uniq -, you can pass the --unique option to sort-bed.

The input is not strictly a BED6 file, but you can strand-separate it by way of awk:

awk -vFS="\t" -vOFS="\t" '($5=="+")' input.bed | sort-bed - > output.forward.bed

Repeat for reverse-stranded elements:

awk -vFS="\t" -vOFS="\t" '($5=="-")' input.bed | sort-bed - > output.reverse.bed

May as well sort, if the sort order of input.bed is unknown.

Then run BEDOPS bedmap on each file, as usual.

You can use bedops --everything to take the multiset union of the two bedmap results:

bedops --everything answer.forward.bed answer.reverse.bed > answer.bed

The file answer.bed will be sorted correctly.

To preserve IDs in the fourth column, you can use the --echo-map-id or --echo-map-id-uniq operator with bedmap.

ADD REPLY • link 8 months ago by Alex Reynolds 36k

score 0 · Answer 1 · 2024-05-02

Bedtools intersect allows you to specify the fraction of overlap between two BED (or BAM) files using the F/f/r flags. You could split your sort-ordered BED file into two files of alternating entries, intersect per your specifications to generate a list of intervals to be merged, perform Bedtools merge on that subset, add back the unmerged intervals, and sort.

Inelegant solution, but it does the job.