I have a bedfile with multiple mapped sequences like below:
chr1 100 200 region1 +
chr1 180 300 region2 +
chr1 300 350 region3 +
chr1 320 400 region4 +
chr1 330 410 region5 +
I am using bedtools merge to merge sequences that overlap - however, I really need to set a minimum overlap cut-off. Bedtools merge will overlap bookended sequences, and sequences overlapping by just one basepair, and there is no way to change this.
E.g. regions 1 and 2 overlap by only 20bp, and in this case I would not like them merged. However, regions 4 and 5 mostly overlap, and I would like these sequences to be merged. Its also really important that I am able to retain the ID and strand information in the resulting merged bed entry, as well as enforcing strandedness when merging.
Does anyone know of a program or method to do this? I have also tried bedOPS with no success. I am new to Python programming so not quite sure how I would be able to code this myself!
Not sure if you wanting to do like this-
Note that, in place of
sort-bed - | uniq -
, you can pass the--unique
option tosort-bed
.The input is not strictly a BED6 file, but you can strand-separate it by way of
awk
:Repeat for reverse-stranded elements:
May as well sort, if the sort order of
input.bed
is unknown.Then run BEDOPS
bedmap
on each file, as usual.You can use
bedops --everything
to take the multiset union of the twobedmap
results:The file
answer.bed
will be sorted correctly.To preserve IDs in the fourth column, you can use the
--echo-map-id
or--echo-map-id-uniq
operator withbedmap
.