Entering edit mode
7.7 years ago
QVINTVS_FABIVS_MAXIMVS
★
2.6k
I'm using the equation
Num_Reads * Avg_Read_Length / Genome Size
To calculate coverage. Here are my questions
Should I only consider mapped bases in the Query to calculate the read length? Specifically,
M
,=
tags in the CIGAR string? For example, if a read has60M10S
for a CIGAR string, would the read length be 60 since 10 were not aligned properly?Given 1. why is it that my calculation of coverage differs from
samtools mpileup
? Wouldn't the average of column 4 in mpileup output equal the average coverage?Thanks for any help.
Edit: Fixed it :) Thanks for the response. I'm using Bamtools API for C++ (and learning C++). Here's my code
Here's the output
NREADS: 354 AVG_READLEN: 3073.54 RANGE: 127920 8.50557
Here's my samtools mpileup command
And when I take the average of
tmp
it's 9.52Initially before I included some of the flags for mpileup that you recommended I got a value of 4.5X. So this is closer to what mpileup reports
Not sure as to why I'm getting different results.
Edit:
samtools depth
I forgot about samtools depth. It's a strong tool.
The mean of tmp1 is
8.51
Which is the same as the C++ code. So solved! Thanks for helping me learn C++.
are you sure about this:
why don't you calculate the real coverage at each position of the interval ?
I edited the response above.
To answer your question, my goal is to calculate coverage for each chromosome without interrogating each position. As an exercise, I wanted to calculate coverage of an interval (eventually a chromosome) by iterating each read mapped to that region.
Would calculating the coverage for each position be faster than iterating through mapped reads?
Why don't you remove
vector<int> lens
and instead haveint total=0;
andtotal+=it->Length()
inside the loop andcov = total*1.0/(end-start+1)
?That's a great idea. I will try that out!