Determine the percentage of the human genome covered by the BED interval file
1
0
Entering edit mode
8.1 years ago
SOHAIL ▴ 410

Hi Everyone,

I have BED genome interval files (format: chr startsite endsite) of masking regions for human genome. I want to determine like how much ( amount in percentage) of the human genome is masked by those intervals in BED file?? Is there any quick way to check this??

Please share your useful suggestions...!

Thank you!

ngs genome • 3.8k views
ADD COMMENT
3
Entering edit mode

Say you are working with hg19 data and you have two files:

  1. Chromosome bounds in a sorted file called hg19.extents.bed.
  2. Masked regions in a sorted file called hg19.masked.bed

The file hg19.extents.bed might look like:

$ cat hg19.extents.bed 
chr1    0   249250621
chr10   0   135534747
chr11   0   135006516
chr12   0   133851895
...

Whatever these files look like will depend on your genomic build and your data sources.

Given those two inputs, you can run the following one-liner with BEDOPS bedmap:

$ bedmap --echo --bases-uniq --delim '\t' hg19.extents.bed hg19.masked.bed | awk '{ print $0"\t"$4/($3 - $2); }' - > answer.bed

The file answer.bed will contain the extent interval for each chromosome, the number of bases of overlap by masked regions over that extent, and the fraction of extent covered by the masked region (some floating point value between 0 and 1).

If you want to turn fractions into percentages, multiply by 100:

$ bedmap --echo --bases-uniq --delim '\t' hg19.extents.bed hg19.masked.bed | awk '{ print $0"\t"100*$4/($3 - $2); }' - > answer.perc.bed

If you want the entire genome, not by chromosome, then you could extend the code in the awk block just a little:

$ bedmap --echo --bases-uniq --delim '\t' hg19.extents.bed hg19.masked.bed | awk 'BEGIN { genome_length = 0; masked_length = 0; } { genome_length += ($3 - $2); masked_length += $4; } END { print (masked_length / genome_length); }' - > answer.txt

The file answer.txt will contain the fraction calculation for the entire genome. Multiply by 100 to get a percentage value.

Note: A "sorted" BED file is a BED file sorted with BEDOPS sort-bed.

Note: Masked regions can be disjoint or overlapping, if all you want is the full extent to which masked intervals overlap the parent chromosome. They just need to be sorted. The --bases-uniq operator ensures counts are unique.

ADD REPLY
0
Entering edit mode

Your post is more an answer than a comment ;-)

Although your approach will result in detailed information, wouldn't it be sufficient to merge intervals (in the case overlaps exist), sum up the sizes and divide that by the known size of the genome?

ADD REPLY
2
Entering edit mode

Say you want percentage of some subset of the genome, instead of the entire genome. This approach is general enough to be used for different reference sets, by changing hg19.extents.bed to some other set of interest.

Generally, you can also merge the masked elements (or whatever) to create a disjoint map set that is piped into the bedmap step, using the non-unique --bases to do counting:

$ bedops --merge hg19.masked.bed | bedmap --echo --bases --delim '\t' hg19.extents.bed - | awk '{ print $0"\t"$4/($3 - $2); }' - > answer.bed

For the purposes of writing a pipeline, the performance of this second approach is perhaps a little better, but the first approach is perhaps more readable as a one-liner.

ADD REPLY
0
Entering edit mode

Thanks for clarifying!

ADD REPLY
1
Entering edit mode

Thank you very much @Alex Reynolds, and @ WouterDeCoster for your kind suggestions.. :) worked pretty well for me..

ADD REPLY
0
Entering edit mode
8.1 years ago

bedtools intersect may help. Check this link.

ADD COMMENT
0
Entering edit mode

Thank you very much @Alex Reynolds, and @ WouterDeCoster for your kind suggestions.. :) worked pretty well for me..

ADD REPLY

Login before adding your answer.

Traffic: 2246 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6