Stats on the overlapping length within two bed files
2
0
Entering edit mode
6.7 years ago
beanavarro85 ▴ 10

Hi all,

I have two bed files, and I have run bedtools intersect to calculate the overlapping regions.

bedtools intersect -a fileA.bed -b fileB.bed -wao  > overlaps.bed

I need to calculate the number of overlaps that are longer than 100 bp (i.e. value higher than 100 in the 8th column of the overlaps.bed file, if I am right) per each feature in fileA.bed, as well as the total length covered by those overlaps.

Any ideas?

Thanks

bedtools bedops genome • 3.5k views
ADD COMMENT
0
Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time. Formatting bar

ADD REPLY
0
Entering edit mode

Thanks! I will next time, sorry about that.

ADD REPLY
1
Entering edit mode
6.7 years ago

Via BEDOPS and awk:

$ bedmap --echo-overlap-size reference.bed map.bed | awk '{ s=0; t=0; n=split($0, a, ";"); for(i = 1; i <= n; i++) { if(a[i] > 100) { s += a[i]; t++; } } printf("%d %d\n", s, t);  }'

The bedmap command takes all overlaps between map.bed and an element in reference.map, and applies the --echo-overlap-size operator, which prints the sizes of all overlapping elements.

The awk command filters overlap sizes by lengths greater than 100, sums their sizes and prints the sum and the number of filtered elements.

ADD COMMENT
0
Entering edit mode

Awesome!

Thanks, it worked perfectly.

Since the order won't change, I have pasted this two columns to my reference.bed file like this:

$ bedmap --echo-overlap-size reference.bed map.bed | awk '{ s=0; t=0; n=split($0, a, ";"); for(i = 1; i <= n; i++) { if(a[i] > 100) { s += a[i]; t++; } } printf("%d %d\n", s, t);  }' > tmp
$ paste reference.bed tmp
ADD REPLY
0
Entering edit mode

Cool. Here's a modified version that takes out a step of making the intermediate file tmp, which would help speed it up some more and eliminate generating tmp:

$ bedmap --echo-overlap-size reference.bed map.bed | awk '{ s=0; t=0; n=split($0, a, ";"); for(i = 1; i <= n; i++) { if(a[i] > 100) { s += a[i]; t++; } } printf("%d %d\n", s, t);  }' - | paste reference.bed - > answer.bed

The hyphen - is a common placeholder for standard input from upstream processes. This placeholder can be used with BEDOPS tools like bedops, bedmap etc. as well as with common Unix utilities, like awk, paste, gunzip, etc.

Piping input from one process to the next in this way is a good way to skip making intermediate files, which can take time to generate and unnecessarily use up disk space.

ADD REPLY
0
Entering edit mode

I was sure there was a way to do it in one line, but didn't know how!

ADD REPLY
0
Entering edit mode
6.7 years ago

Here's another way that's a little simpler, perhaps:

$ bedmap --bp-ovr 100 --count --echo-overlap-size --delim "\t" reference.bed map.bed | awk '{ s=0; n=split($2, a, ";"); for(i=1; i<=n; i++) { if(a[i]>=100) { s+=a[i]; } print s"\t"$1; }'

The --bp-ovr 100 operator in bedmap requires 100 bases of overlap between elements in map.bed and the element in reference.bed. The --echo-overlap-size operator behaves as described in my previous answer.

The resulting overlap sizes are then passed to awk, which prints the count of qualifying overlaps and the sum of their sizes.

ADD COMMENT

Login before adding your answer.

Traffic: 1779 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6