How to Collapse Bedtools reciprocal overlap regions
3
0
Entering edit mode
9.1 years ago
hellbio ▴ 520

Dear all,

I am aiming to find 70 % reciprocal overlapping sites and collapse them into a single non-overlapping site list.

However, it seems there is a little more tweaking that needs to be done to collapse into single site list after finding the reciprocal overlap calls.

I have used bedtools intersect in the below way:

Example sites: cnv.bed

chr1    353    6405
chr1    355    6389
chr1    501    6401
chr1    549    6447
chr1    1812    28093
chr1    3286    6382
chr1    3694    6428
chr1    3695    6413
chr1    3729    6677
chr1    4084    6380

/bedtools2/bin/intersectBed -a cnv.bed -b cnv.bed -f 0.7 -r -wa -wb  | head

chr1    353    6405    chr1    353    6405
chr1    353    6405    chr1    355    6389
chr1    353    6405    chr1    501    6401
chr1    353    6405    chr1    549    6447
chr1    355    6389    chr1    353    6405
chr1    355    6389    chr1    355    6389
chr1    355    6389    chr1    501    6401
chr1    355    6389    chr1    549    6447
chr1    501    6401    chr1    353    6405
chr1    501    6401    chr1    355    6389
chr1    501    6401    chr1    501    6401
chr1    501    6401    chr1    549    6447

which lists the sites with 70% reciprocal overlap by comparing the sites with each possible pair.

From here we need to collapse those overlapping sites into a single site list. i.e. to remove the redundant regions keeping only one region which is representative of the overlapping regions. Would you suggest us how to achieve this.

bedtools • 8.3k views
ADD COMMENT
1
Entering edit mode
9.1 years ago

I think awk would be a good approach to this, as it is great for processing columnar data. I'm not on a computer with awk at the moment, so I can't verify that the syntax will be perfect, but... awk 'BEGIN{OFS="\t"};{if ($1==chrA && $2==startA && $3==endA){endB=$6} else {print chrA,startA,endA,chrB,startB,endB;chrA=$1;startA=$2;endA=$3;chrB=$4;startB=$5;endB=$6}}; END{print chrA,startA,endA,chrB,startB,endB}' boogens.txt.

This requires that the file be sorted by chromosome, then by start and end position (first for file A, then file B). basically, for each line, if columns 1-3 don't match the previous line's columns 1-3 (stored as chrA,startA,endA), then first print out the stored values (prints an empty line in the first iteration, could put an (if NR==1) check if that bothers you), then store the values. If the columns do match the previous line's columns 1-3 (so it is the same peak in the first file), then it only updates the end coordinate. I think this should work (barring a syntax error), with a little tweaking.

EDIT:Sorry, I forgot a pair of braces in the main statement (surrounding the if/else statement). More importantly, I changed "...&&$3==startB..." to "...&&$3==endA..." in the if statement, because that's what it needs to be. My bad.

ADD COMMENT
0
Entering edit mode

it looks like the code takes boogens.txt which is basically one input file. The text after that say sort first for fileA and fileB. I wonder what is the input to the command. Should the input be:

chr1    353    6405
chr1    355    6389
chr1    501    6401
chr1    549    6447
chr1    1812    28093
chr1    3286    6382
chr1    3694    6428
chr1    3695    6413
chr1    3729    6677
chr1    4084    6380
ADD REPLY
0
Entering edit mode

Sorry, didn't make that clear. boogens.txt would be your output from intersectBed, like you had as the second output file in your initial question, which has six columns:columns 1-3 are a peak from file 1 (possibly repeated over multiple lines, if multiple matches in file 2), and columns 4-6 are the matching peaks in file 2.

ADD REPLY
0
Entering edit mode

I used the output from intersectBed as input to the awk:

awk 'BEGIN{OFS="\t"}{if ($1==chrA && $2==startA && $3==startB){endB=$6};else {print chrA,startA,endA,chrB,startB,endB;chrA=$1;startA=$2;endA=$3;chrB=$4;startB=$5;endB=$6}};END{print chrA,startA,endA,chrB,startB,endB}' intersectBedoutput > output.txt

The output and input are similar. Did I use the code in a wrong way or did I tweaked it wrong?

ADD REPLY
0
Entering edit mode

My mistake, I made a syntax error and a logic error. $3==startB needs to be $3==endA. Sorry about that. It seemed to work for me, where I get three lines as output:

chr1    353     6405    chr1    353     6447
chr1    355     6389    chr1    353     6447
chr1    501     6401    chr1    353     6447

Is that the output you would expect?

ADD REPLY
0
Entering edit mode

Yes, it seems to have the desired output.

ADD REPLY
1
Entering edit mode
9.1 years ago

Given a sorted input.bed and BEDOPS, one could collapse mapped elements with a post-bedmap *merge step with *awk and bedops --merge. Finally, use paste to glue the original regions to the merged map result:

$ bedmap --echo-map --fraction-both 0.7 input.bed \
    | awk '{ \
          gsub(/;/,"\n"); \
          system("echo -e \"" $0 "\" | bedops --merge - "); \
      }' - \
    | paste input.bed -
chr1    353     6405     chr1    353     6447
chr1    355     6389     chr1    353     6447
chr1    501     6401     chr1    353     6447
chr1    549     6447     chr1    353     6447
chr1    1812    28093    chr1    1812    28093
chr1    3286    6382     chr1    3286    6677
chr1    3694    6428     chr1    3286    6677
chr1    3695    6413     chr1    3286    6677
chr1    3729    6677     chr1    3286    6677
chr1    4084    6380     chr1    3286    6677

It's all using standard I/O streams, so it should run pretty fast.

ADD COMMENT
0
Entering edit mode

It seems to work as expected. The idea is to find reciprocal overlapping regions with 70%. After having those reciprocal overlapping clusters, from each cluster we would need to generate boundaries (start and end) that represent the cluster. Therefore, I suppose from your output extracting unique regions based on column 4,5 and 6 would give the boundaries for each cluster. Please correct me if am wrong.

ADD REPLY
0
Entering edit mode

That's correct -- the bounds of mapped regions would be the merged elements that are output from a bedops --merge operation.

ADD REPLY
0
Entering edit mode

I just want to convey the above task in biological terms. The overall idea is to merge overlapping CNVs across individuals/within individual with 70% reciprocal overlap into a single site list. And i believe the above solution addresses the same.

ADD REPLY
0
Entering edit mode

I am trying to use this solution and there seems to be something wrong within awk; I am not getting the desired output as shown above for the same 10 sequences.

$ bedmap --echo-map --fraction-both 0.7 input.bed \ | awk '{ \ gsub(/;/,"\n"); \ system("echo -e \"" $0 "\" | bedops --merge - "); \ }' - \ | paste input.bed -

chr1 353 6405 -e 1 0

chr1 355 6389 hr1 353 6405

chr1 501 6401 chr1 355 6447

chr1 549 6447 -e 1 0

chr1 1812 28093 hr1 353 6405

chr1 3286 6382 chr1 355 6447

chr1 3694 6428 -e 1 0

chr1 3695 6413 hr1 353 6405

chr1 3729 6677 chr1 355 6447

chr1 4084 6380 -e 1 0

Can someone please help me resolve the issue?

ADD REPLY
0
Entering edit mode

You might try this, if you are using Mac OS X:

$ bedmap --echo-map --fraction-both 0.7 input.bed | gawk '{ gsub(/;/,"\n"); system("echo \"" $0 "\" | sort-bed - | bedops --merge - | sort-bed - "); }' | paste input.bed -
chr1    353 6405    chr1    353 6447
chr1    355 6389    chr1    353 6447
chr1    501 6401    chr1    353 6447
chr1    549 6447    chr1    353 6447
chr1    1812    28093   chr1    1812    28093
chr1    3286    6382    chr1    3286    6677
chr1    3694    6428    chr1    3286    6677
chr1    3695    6413    chr1    3286    6677
chr1    3729    6677    chr1    3286    6677
chr1    4084    6380    chr1    3286    6677

If you need to install GNU awk:

$ brew install gawk

Via the Homebrew package manager.

ADD REPLY
0
Entering edit mode

I am using it on a Linux system, should I use gawk instead of awk?

ADD REPLY
1
Entering edit mode

I used the updated command using gawk and it worked. Thank you!

ADD REPLY
0
Entering edit mode
9.1 years ago

What about mergeBed, in bedtools? Or am I missing something?

ADD COMMENT
0
Entering edit mode

Merge seems to be a pretty good way to go about this i'd say.

ADD REPLY
0
Entering edit mode

I checked the bedtools merge but could not find if it can do this. For instance in the above example,

chr1    353    6405
chr1    355    6389
chr1    501    6401
chr1    549    6447

These regions appear to have 70% reciprocal overlap as shown by the intersectbed output. Therefore, i would consider to merge these into a single site with the minimum of the start site as new start and max of end site as new end which would then be:

chr1      353       6447

I hope i made it clear now. Any suggestions?

ADD REPLY
0
Entering edit mode

I seem to recall that you want to specify the length required to merge using -d.

-d Maximum distance between features allowed for features to be merged. Default is 0. That is, overlapping and/or book-ended features are merged.

ADD REPLY
0
Entering edit mode

If I understand it correct, merge does not allow to see for fraction overlap -f and reciprocal overlap -r like in intersectbed. It has only -d argument to set the basepair overlap. The idea is to merge the calls in the cnv.bed file for those which has 70% reciprocal overlap with each other.

ADD REPLY
0
Entering edit mode

you are probably looking for:

bedtools merge -i

https://bedtools.readthedocs.io/en/latest/content/tools/merge.html

ADD REPLY

Login before adding your answer.

Traffic: 2580 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6