I want to calculate the number of reads mapping to regions of the genome, and I want to group these reads into bins of say 1kb. I have used bedtools genomecov to generate a bedgraph file that shows the coverage of reads across the genome.
bedtools genomecov -bg -ibam file.bam >Coverage.bedgraph
I then merged nearby regions of coverage using:
bedtools merge -d 1000 -c 4 -o sum -i Coverage.bedgraph > Coverage.merged.bedgraph
The problem is that the "sum" part adds the coverage from the different regions, so when merged, reads are counted more than once.
Is there a way to merge the regions without counting reads more than once? I think the necessary information from the bam file is probably lost at this point. Is there a way to do this with an option in bedtools genomecov i.e. can you make it report regions of coverage with a specified 'bin' size?
I think deeptools bamCoverage has the option "--binSize", which seems like what you want. But I was still questioning if it can solve the "counting reads several times" problem since if a read span two bins, it will still be counted twice....
It may still count it twice, but I guess the larger the bin, the smaller this issue becomes. Thanks, I'll give this a go.
This is great. The reads are occasionally still counted twice, but the bins are large enough that this isn't a big issue. Thanks!
Just take any other operation than
sum
that you find appropriate. Mean, median, min, max, choice is yours.I'm not sure what those operations do. I couldn't really understand from the bedtools documentation.
No magic here, min is the minimum, max the maximum, median the median and so on. So if you bin spans three intervals with coverage 3, 5 and 9 then median would report 5 for the bin, max 9 and min 3.