Hi,
I have read quite a few posts here about coverage already. But I still had a few questions. I have a BAM file I'm trying to find the coverage of it (typically like say 30X).
So, I decided to use genomeCoverageBed
for my analysis. And I used the following command:
genomeCoverageBed -ibam file.bam -g ~/refs/human_g1k_v37.fasta > coverage.txt
As many are aware, the output of the file looks something like this:
genome 0 26849578 100286070 0.26773
genome 1 30938928 100286070 0.308507
genome 2 21764479 100286070 0.217024
genome 3 11775917 100286070 0.117423
genome 4 5346208 100286070 0.0533096
genome 5 2135366 100286070 0.0212927
genome 6 785983 100286070 0.00783741
genome 7 281282 100286070 0.0028048
genome 8 106971 100286070 0.00106666
genome 9 47419 100286070 0.000472837
genome 10 27403 100286070 0.000273248
To find the coverage, I multiplied col2
(depth) with col3
(number of bases in genome with that depth) and then summed the entire column. Then, I divided it by genome length to get the coverage. In this case, col2 * col3
is:
0
30938928
43528958
35327751
21384832
10676830
4715898
1968974
855768
426771
274030
And the sum is: 150098740
. Since the genome length is 100286070
, the coverage is 150098740/100286070 = 1.5
. That is to say it is, 1.5X. I have only considered the first 10 depth from the file here, but you get the idea. So, is this the right way to get the physical coverage?
Note:
While the output file (coverage.txt) gives individual chromosome details, I only took the genome details i.e., col1 labeled as genome.
Does this logic apply to exomes too? Like say, human exome is 30MB I think. And I have 30 million mapped reads, with 32bp reads. Then, the average coverage would be 30X right?
yes, to get the idealized coverage you would divide with the total length of the DNA that is supposedly the source of the reads.