Bedtools, report feature with highest overlap if overlaps two features
2
0
Entering edit mode
8.1 years ago

I have two bed files, one containing non-overlapping windows of 10,000bp of the genome, and the other containing the capture regions for whole-exon sequencing. I want to know (for each 10,000bp window) which capture regions it overlaps. And in the case where a capture region overlaps two 10,000bp windows, just report the one with highest overlap.

I've been playing around with bedtools to do this via commands like,

bedtools closest -a a.bed -b b.bed

But this reports the capture region multiple times in the event it overlaps two 10,000bp windows. I've also tried using the -first and -last option, but they do not report the capture region with highest overlap. Here are the two example bed files and output:

a.bed

1   1   10
1   11  20
1   21  30
1   31  40

b.bed

1   4   9
1   11  14
1   15  24
1   21  31

Output (from the command above):

1   1   10  1   4   9
1   11  20  1   11  14
1   11  20  1   15  24
1   21  30  1   15  24
1   21  30  1   29  39
1   31  40  1   29  39
genome • 3.5k views
ADD COMMENT
0
Entering edit mode

I think you should check intersectBed option instead of closestBed. Also check groupBy from BedTools to find the regions with highest overlap.

ADD REPLY
0
Entering edit mode

Thanks for the tip. I tried the following command and it seems to work:

bedtools intersect -a a.bed -b b.bed -wao | bedtools groupby -g 4,5,6 -c 7 -o max -full

The output:

1   1   10  1   4   9   5   5
1   11  20  1   11  14  3   3
1   11  20  1   15  24  5   5
1   21  30  1   29  39  1   8

EDIT: oops, actually this does not seem to work, let me tinker a bit more

ADD REPLY
3
Entering edit mode
8.1 years ago

You could do this with a couple bedmap operations:

$ bedmap --count --echo --echo-map 1.bed 2.bed \
    | awk -F'|' '($1==1)' \
    | cut -d'|' -f2- \
    | tr '|' '\t' \
    > single_overlaps.bed

$ bedmap --count --echo --echo-map --echo-overlap-size 1.bed 2.bed \
    | awk -F'|' '($1==2)' \
    | cut -d'|' -f2- \
    | awk -F'|' '{ 
        s = split($3, sizes, ";"); \
        o = split($2, overlaps, ";"); \
        x = 0; \
        y = 0; \ 
        for (i = 1; i <= s; i++) { \
            if (x <= sizes[i]) { \
                x = sizes[i]; \
                y = i; \
            } \
        } \
        print $1"\t"overlaps[y]; \
    }' > max_of_paired_overlaps.bed

$ bedops --everything single_overlaps.bed max_of_paired_overlaps.bed > answer.bed

Ties go to downstream elements.

ADD COMMENT
1
Entering edit mode

Thanks, this solution does work.

ADD REPLY
1
Entering edit mode
8.1 years ago

Never mind, I'll just do this using the IRanges R package via the overlap() function. I was initially hesitant to use R for this out of fear that it would be slow, but the overlap() function is pretty fast.

ADD COMMENT

Login before adding your answer.

Traffic: 3277 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6