Question

Create genome-unique intervals from overlapping records in a bed file

0

Entering edit mode

3.2 years ago

c0tton • 0

Let there exist a bed file, a, with 3 overlapping records:

a

chr1      1      20      s1      1      +      
chr1      5      20      s2      1      +      
chr1      10    20      s3      1      +

I want to write a function which would parse a and return a bed-file, bm containing genome-unique windows along with a list of bed-records which span those windows, i.e.

b

chr1      1      5      s1      1      +      
chr1      5      10      s1,s2,s3      1      +      
chr1      10      20      s1, s2, s3      1      +

The goal of the function is to produce genome-unique bed records (essentially an index), from which combinations of records can be used to produce each of the original bed records.

I wouldn't know where to start with something like this. Any suggestions are greatly appreciated.

bedtools ranges BED genomic • 1.3k views

ADD COMMENT • link 3.2 years ago by c0tton • 0

score 2 · Answer 1 · 2021-09-25

You could use BEDOPS bedops --partition to easily calculate disjoint elements from a set of overlapping elements.

Then you could pass, or pipe the result to bedmap --echo --echo-map-id to get the partitioned intervals and the IDs of elements in a.bed that overlap (or "map" to) the partitioned intervals:

$ bedops --partition a.bed | bedmap --echo --echo-map-id --echo-map --delim '\t' - a.bed | cut -f1-4,9,10 > answer.bed

It's not entirely clear from your example how you want to deal with cases where overlapping elements in a have differences in scores and/or strands, but the above is a simple and fast way to get an answer.

If you don't care about score and strand, the following suffices:

$ bedops --partition a.bed | bedmap --echo --echo-map-id --delim '\t' - a.bed > answer.bed

Your example input is sorted, but your real input might not be. If your starting intervals are unsorted or you don't know their sort order, you can sort with sort-bed to prepare input to use for fast set operations:

$ sort-bed a.unsorted.bed > a.bed

This generates output equivalent to sort -k1,1 -k2,2n -k3,3n, but sort-bed is faster for sorting BED files.

score 1 · Answer 2 · 2021-09-25

 awk '{B=int($2);E=int($3);while(B<E) {printf("%s\t%d\t%d\t%s\n",$1,B,B+1,$4);B++}}' input.bed |\
sort -t $'\t' -k1,1 -k2,2n |\
bedtools groupby -g 1,2,3 -c 4 -o distinct |\
awk '{C=$1;B=int($2);E=int($3);N=$4; if(NR>1 && (C!=PC || B!=PE || N!=PN)) {printf("%s\t%d\t%d\t%s\n",PC,PB,PE,PN); PC=C;PE=E;PB=B;PN=N;} else {PC=C;PE=E;PN=N;} } END {printf("%s\t%d\t%d\t%s\n",PC,PB,PE,PN);}' 

chr1    0   5   s1
chr1    5   10  s1,s2
chr1    10  20  s1,s2,s3

score 0 · Answer 3 · 2021-09-26

Thank you both for your replies.

Here's the final code I used in case anyone else wants to achieve something similar:

# split the input bed by strand 
cat $inputBed | awk '($6 == "+")' > $inputBed.plus
cat $inputBed | awk '($6 == "-")' > $inputBed.minus

# run Alex's code for each stand
time bedops --partition $inputBed.plus | bedmap --echo --echo-map-id --echo-map --delim '\t' - $inputBed.plus| cut -f1-3 | awk '{print $1, $2, $3, "node_+_"NR, "1", "+"}' OFS="\t"  > nodes.plus.bed
time bedops --partition $inputBed.minus | bedmap --echo --echo-map-id --echo-map --delim '\t' - $inputBed.plus| cut -f1-3 | awk '{print $1, $2, $3, "node_-_"NR, "1", "-"}' OFS="\t" > nodes.minus.bed

# filter out records where start is less than end 
cat nodes.plus.bed nodes.minus.bed | awk '($3 > $2){print $0}' | bedtools sort -i - > nodes.merged.bed

Thank you both so much for your help!