Entering edit mode
8.1 years ago
SOHAIL
▴
410
Hi Everyone,
I have BED genome interval files (format: chr startsite endsite) of masking regions for human genome. I want to determine like how much ( amount in percentage) of the human genome is masked by those intervals in BED file?? Is there any quick way to check this??
Please share your useful suggestions...!
Thank you!
Say you are working with
hg19
data and you have two files:hg19.extents.bed
.hg19.masked.bed
The file
hg19.extents.bed
might look like:Whatever these files look like will depend on your genomic build and your data sources.
Given those two inputs, you can run the following one-liner with BEDOPS bedmap:
The file
answer.bed
will contain the extent interval for each chromosome, the number of bases of overlap by masked regions over that extent, and the fraction of extent covered by the masked region (some floating point value between 0 and 1).If you want to turn fractions into percentages, multiply by 100:
If you want the entire genome, not by chromosome, then you could extend the code in the
awk
block just a little:The file
answer.txt
will contain the fraction calculation for the entire genome. Multiply by 100 to get a percentage value.Note: A "sorted" BED file is a BED file sorted with BEDOPS sort-bed.
Note: Masked regions can be disjoint or overlapping, if all you want is the full extent to which masked intervals overlap the parent chromosome. They just need to be sorted. The
--bases-uniq
operator ensures counts are unique.Your post is more an answer than a comment ;-)
Although your approach will result in detailed information, wouldn't it be sufficient to merge intervals (in the case overlaps exist), sum up the sizes and divide that by the known size of the genome?
Say you want percentage of some subset of the genome, instead of the entire genome. This approach is general enough to be used for different reference sets, by changing
hg19.extents.bed
to some other set of interest.Generally, you can also merge the masked elements (or whatever) to create a disjoint map set that is piped into the
bedmap
step, using the non-unique--bases
to do counting:For the purposes of writing a pipeline, the performance of this second approach is perhaps a little better, but the first approach is perhaps more readable as a one-liner.
Thanks for clarifying!
Thank you very much @Alex Reynolds, and @ WouterDeCoster for your kind suggestions.. :) worked pretty well for me..