Thanks for the reply! I want to simply calculate total number of bases that are within 500000 interval and calculate the percentage. Anything beyond that 500000 should be added to the subsequent interval.
ADD REPLY
• link
updated 3.3 years ago by
Ram
45k
•
written 10.5 years ago by
arnstrm
★
1.9k
Create a random BED file of 1,000,000 intervals each 100bp in size. As an example - it looks like you already have your files ready.
bedtools random -n 1000000 -l 100 -g genomes/human.hg19.genome > random.bed
Calculate the coverage of the random BED records on the 500Kb windows. The first 3 columns are the window itself, the 4th is the count of BED records that overlapped the window, the fifth is the total number of base pairs in the window with coverage, the 6th is the size of the window, and the 7th is what you are after: the fraction of bases in the window with coverage.
BEDOPS would work well for this, along with standard UNIX utilities.
Let's start with an extents file that defines the bounds of our chromosomes, say for hg19:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Next, we use BEDOPS bedops --partition to split your input regions (regions.bed) that span across bins. We can chain together a series of BEDOPS operations that calculate the number of overlapping bases across now-correctly partitioned regions:
The file final_answer.bed contains all the bins in BED format, with six columns: bin chromosome, bin start, bin stop, bin index, the number of bases from regions.bed that overlap that bin, and the percentage of overlap.
The chain of commands shown above might look a little overwhelming, but you can use UNIX tee in between steps so you can examine the output being passed to the next command. This is a great way to learn the command line and particularly standard input and output management, which I think you'll find is a critical part of doing bioinformatics.
BEDOPS is really useful for this task - you can pass in standard input and route standard output between BEDOPS tools like bedops and bedmap, and UNIX utilities like grep, cut and awk. After you use BEDOPS for a while, it almost becomes an expressive language for dealing with set operations of this kind.
Finally, the --partition option in bedops is a powerful piece of kit, which lets you quickly cut input regions that span bins, among other uses.
Just this morning I was thinking how less I use of awk, and you've just shown me the next step to take as well as the next tool to get conversant with (bedops). Thank you!
Actually I noticed this approach and decided that the way bins are split up here are more appropriate towards the heat maps I want to plot of distribution, did it on the whole mm10 scale and weirdly got: awk: (FILENAME=- FNR=1) fatal: division by zero attempted
One other thing I wanted to mention is that the hg19.extents.bed file shown above actually is in the format of the hg19.bed file when pulled with fetchchromsizes.
This of course assumed that each block has some data in the BED file. You might wanna address scenarios where an entire block is skipped. You can do that by adding a condition to check if array[1]> stop && array[1]< stop + block_size
I'd do it but I don't wanna end up writing the program for you :-)
have a look at bedtools.
I actually looked at it, but there are several of them (tools). Can you give me an overview/idea for how to use bedtools for this? Thanks!
How do you need to handle the case where an input range straddles bounds between two adjacent 500k intervals?
Thanks for the reply! I want to simply calculate total number of bases that are within 500000 interval and calculate the percentage. Anything beyond that 500000 should be added to the subsequent interval.