Getting The Average Coverage From The Coverage Counts At Each Depth
11.6 years ago
Jordan ★ 1.3k

Hi,

I have read quite a few posts here about coverage already, but I still have a few questions. I have a BAM file and I'm trying to find its average coverage (typically quoted as something like 30X).

So, I decided to use genomeCoverageBed for my analysis, and used the following command (with -ibam, the chromosome sizes come from the BAM header, so no -g genome file is needed):

genomeCoverageBed -ibam file.bam > coverage.txt

As many are aware, the output looks something like this:

genome    0     26849578    100286070    0.26773
genome    1     30938928    100286070    0.308507
genome    2     21764479    100286070    0.217024
genome    3     11775917    100286070    0.117423
genome    4     5346208     100286070    0.0533096
genome    5     2135366     100286070    0.0212927
genome    6     785983      100286070    0.00783741
genome    7     281282      100286070    0.0028048
genome    8     106971      100286070    0.00106666
genome    9     47419       100286070    0.000472837
genome    10    27403       100286070    0.000273248

To find the coverage, I multiplied col2 (depth) by col3 (number of bases in the genome with that depth), summed those products, and then divided by the genome length. In this case, col2 * col3 gives:

0
30938928
43528958
35327751
21384832
10676830
4715898
1968974
855768
426771
274030

And the sum is 150098740. Since the genome length is 100286070, the coverage is 150098740/100286070 ≈ 1.5, that is, 1.5X. I have only considered the first 10 depths from the file here, but you get the idea. So, is this the right way to get the average coverage?

Note: While the output file (coverage.txt) also gives per-chromosome details, I only used the whole-genome rows, i.e., those with col1 labeled "genome".
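The arithmetic above can be scripted directly from the histogram. A minimal Python sketch; the `mean_coverage` helper is my own, and the column layout is inferred from the output shown in the question:

```python
# Mean coverage from a genomeCoverageBed histogram.
# Assumed column layout: chrom  depth  bases_at_depth  chrom_size  fraction
def mean_coverage(histogram_lines, chrom="genome"):
    covered = 0  # sum over depths of depth * bases_at_depth
    size = 0     # genome length (col 4, identical on every "genome" row)
    for line in histogram_lines:
        fields = line.split()
        if not fields or fields[0] != chrom:
            continue
        depth, bases, size = int(fields[1]), int(fields[2]), int(fields[3])
        covered += depth * bases
    return covered / size

hist = """\
genome 0 26849578 100286070 0.26773
genome 1 30938928 100286070 0.308507
genome 2 21764479 100286070 0.217024
genome 3 11775917 100286070 0.117423
genome 4 5346208 100286070 0.0533096
genome 5 2135366 100286070 0.0212927
genome 6 785983 100286070 0.00783741
genome 7 281282 100286070 0.0028048
genome 8 106971 100286070 0.00106666
genome 9 47419 100286070 0.000472837
genome 10 27403 100286070 0.000273248
""".splitlines()

print(round(mean_coverage(hist), 2))  # prints 1.5
```

Note this truncates the histogram at depth 10 just as the hand calculation does; on a real file you would feed in every row.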

coverage depth-of-coverage bedtools
11.6 years ago

Picard's CalculateHsMetrics does the job perfectly, directly from the BAM file, and even lets you distinguish the target from the bait design when needed (not in the following example, though):

java -jar $HOME/bin/picard-tools/CalculateHsMetrics.jar \
    REFERENCE_SEQUENCE=hg19.fa \
    TARGET_INTERVALS=intervals.bed \
    BAIT_INTERVALS=intervals.bed \
    INPUT=input.bam \
    OUTPUT=hsmetrics.txt

All the output fields in hsmetrics.txt are described in the HsMetrics class documentation, which of course includes the MEAN_TARGET_COVERAGE you're after.
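Once the run finishes, that single number can be pulled out of hsmetrics.txt with a few lines of Python. This is a sketch assuming the usual Picard metrics layout (comment lines starting with '#', then a tab-separated header row followed by a data row); the `read_metric` helper is my own:

```python
# Pull one column out of a Picard metrics file (e.g. MEAN_TARGET_COVERAGE).
# Assumed layout: '#'-prefixed comment lines, blank lines, then a
# tab-separated header row immediately followed by the data row.
def read_metric(lines, column="MEAN_TARGET_COVERAGE"):
    rows = [l.rstrip("\n") for l in lines if l.strip() and not l.startswith("#")]
    header = rows[0].split("\t")
    values = rows[1].split("\t")
    return float(values[header.index(column)])

# Tiny synthetic example of such a file, for illustration only:
fake = [
    "## METRICS CLASS\tpicard.analysis.directed.HsMetrics\n",
    "BAIT_SET\tMEAN_TARGET_COVERAGE\n",
    "intervals\t31.7\n",
]
print(read_metric(fake))  # prints 31.7
```

With a real file you would call it as `read_metric(open("hsmetrics.txt"))`.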

11.6 years ago

You seem to be going the other way, from more specific answers to less specific ones. Typically the average coverage is very easy to get:


number of reads * length of each read / size of the genome

You already have the results counted by individual coverage, and while the formula you use seems right, note that the result you are getting is not that useful. Your average coverage comes out to 1.5X even though 26% of the bases have no coverage at all and 30% of the bases have a coverage of 1. Your distribution is heavily skewed, and for such distributions averages are not very meaningful quantities.
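The back-of-the-envelope formula above can be written out as a one-liner; the function name is mine, not from the thread:

```python
# Idealized average coverage: every mapped base spread evenly over the genome.
def expected_coverage(num_reads, read_length, genome_size):
    return num_reads * read_length / genome_size

# e.g. 30 million 32 bp reads against a 30 Mb target:
print(expected_coverage(30_000_000, 32, 30_000_000))  # prints 32.0
```

This is the idealized number; duplicates, unmapped reads, and uneven coverage all pull the real depth below it.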


Does this logic apply to exomes too? Say the human exome is about 30 Mb, and I have 30 million mapped reads of 32 bp each. Then the average coverage would be 30,000,000 × 32 / 30,000,000 = 32X, right?


Yes; to get the idealized coverage, you divide by the total length of the DNA that is supposedly the source of the reads.
