Question

Bedtools, report feature with highest overlap if overlaps two features

0

Entering edit mode

8.6 years ago

murphy.charlesj ▴ 100

I have two bed files, one containing non-overlapping windows of 10,000bp of the genome, and the other containing the capture regions for whole-exon sequencing. I want to know (for each 10,000bp window) which capture regions it overlaps. And in the case where a capture region overlaps two 10,000bp windows, just report the one with highest overlap.

I've been playing around with bedtools to do this via commands like,

bedtools closest -a a.bed -b b.bed

But this reports the capture region multiple times in the event it overlaps two 10,000bp windows. I've also tried using the -first and -last option, but they do not report the capture region with highest overlap. Here are the two example bed files and output:

a.bed

b.bed

Output (from the command above):

1   1   10  1   4   9
1   11  20  1   11  14
1   11  20  1   15  24
1   21  30  1   15  24
1   21  30  1   29  39
1   31  40  1   29  39

genome • 3.8k views

ADD COMMENT • link updated 8.6 years ago by Alex Reynolds 36k • written 8.6 years ago by murphy.charlesj ▴ 100

0

Entering edit mode

I think you should check intersectBed option instead of closestBed. Also check groupBy from BedTools to find the regions with highest overlap.

ADD REPLY • link 8.6 years ago by venu 7.1k

0

Entering edit mode

Thanks for the tip. I tried the following command and it seems to work:

bedtools intersect -a a.bed -b b.bed -wao | bedtools groupby -g 4,5,6 -c 7 -o max -full

The output:

1   1   10  1   4   9   5   5
1   11  20  1   11  14  3   3
1   11  20  1   15  24  5   5
1   21  30  1   29  39  1   8

EDIT: oops, actually this does not seem to work, let me tinker a bit more

ADD REPLY • link 8.6 years ago by murphy.charlesj ▴ 100

score 3 · Accepted Answer · 2016-12-14

You could do this with a couple bedmap operations:

$ bedmap --count --echo --echo-map 1.bed 2.bed \
    | awk -F'|' '($1==1)' \
    | cut -d'|' -f2- \
    | tr '|' '\t' \
    > single_overlaps.bed

$ bedmap --count --echo --echo-map --echo-overlap-size 1.bed 2.bed \
    | awk -F'|' '($1==2)' \
    | cut -d'|' -f2- \
    | awk -F'|' '{ 
        s = split($3, sizes, ";"); \
        o = split($2, overlaps, ";"); \
        x = 0; \
        y = 0; \ 
        for (i = 1; i <= s; i++) { \
            if (x <= sizes[i]) { \
                x = sizes[i]; \
                y = i; \
            } \
        } \
        print $1"\t"overlaps[y]; \
    }' > max_of_paired_overlaps.bed

$ bedops --everything single_overlaps.bed max_of_paired_overlaps.bed > answer.bed

Ties go to downstream elements.

score 1 · Accepted Answer · 2016-12-14

1

Entering edit mode

8.6 years ago

murphy.charlesj ▴ 100

Never mind, I'll just do this using the IRanges R package via the overlap() function. I was initially hesitant to use R for this out of fear that it would be slow, but the overlap() function is pretty fast.

ADD COMMENT • link 8.6 years ago by murphy.charlesj ▴ 100