Question

Calculating Coverage Using CIGAR String

4

Entering edit mode

7.6 years ago

QVINTVS_FABIVS_MAXIMVS ★ 2.6k

I'm using the equation

Num_Reads * Avg_Read_Length / Genome Size

To calculate coverage. Here are my questions

Should I only consider mapped bases in the Query to calculate the read length? Specifically, M,= tags in the CIGAR string? For example, if a read has 60M10S for a CIGAR string, would the read length be 60 since 10 were not aligned properly?
Given 1. why is it that my calculation of coverage differs from samtools mpileup? Wouldn't the average of column 4 in mpileup output equal the average coverage?

Thanks for any help.

samtools bamtools coverage • 3.5k views

ADD COMMENT • link 7.6 years ago by QVINTVS_FABIVS_MAXIMVS ★ 2.6k

score 7 · Accepted Answer · 2017-05-03

7

Entering edit mode

7.6 years ago

Pierre Lindenbaum 164k

1) yes M , = and even X

2) yes , ignore the clip

3) did you consider some options like:

  -Q, --min-BQ INT        skip bases with baseQ/BAQ smaller than INT [13]
  --ff, --excl-flags STR|INT  filter flags: skip reads with mask bits set
                                            [UNMAP,SECONDARY,QCFAIL,DUP]

what's the output of samtools depth

ADD COMMENT • link 7.6 years ago by Pierre Lindenbaum 164k

2

Entering edit mode

Edit: Fixed it :) Thanks for the response. I'm using Bamtools API for C++ (and learning C++). Here's my code

#include "api/BamMultiReader.h"
#include "api/BamWriter.h"
#include <string>
#include <vector>
#include <iostream>
#include <numeric>
using namespace BamTools;
using namespace std;
int main(int argc, char *argv[])
{
    vector<string> inBAM;
    BamMultiReader reader;
    string ifh = string(argv[1]);
    inBAM.push_back(ifh);
    if (!reader.Open(inBAM)){
            cerr << "ERROR: " << ifh << " could not be opened!" << endl;
            return 1;
    }
    BamAlignment al;
    const int start=14990508; // This is the first position in my sample given mpileup (converted to 0 base)
    const int end=15118428; // This is the last position in my sample given mpileup
    double cov=0;
    int nreads=0;
    vector<int> lens;
    while (reader.GetNextAlignmentCore(al)){
            if(al.MapQuality < 10 || al.IsDuplicate()==true || al.IsFailedQC()==true || al.IsMapped()==false || al.IsPrimaryAlignment()==false)
            { continue; }
            int rlen=0;
            vector<CigarOp> cigar = al.CigarData;
            for(vector<CigarOp>::iterator it= cigar.begin(); it != cigar.end(); ++it)
            {
                    if (it->Type == 'M' || it->Type == '=' || it->Type== 'X') {
                            rlen+=it->Length;
                    }
            }
            if(rlen!=0){ ++nreads; }
            lens.push_back(rlen);
    }
    float avg = accumulate (lens.begin(),lens.end(), 0.0)/ lens.size();
    cov = (nreads*avg)/(end-start);
    cout << "NREADS: " << nreads << " AVG_READLEN: " << avg << " RANGE: " << end-start << endl;
    cout << cov << endl;
    return 0;
}

Here's the output

NREADS: 354 AVG_READLEN: 3073.54 RANGE: 127920 8.50557

Here's my samtools mpileup command

 samtools mpileup -A -B -q 10 -Q 0 test.bam | cut -f 4 >tmp

And when I take the average of tmp it's 9.52

Initially before I included some of the flags for mpileup that you recommended I got a value of 4.5X. So this is closer to what mpileup reports

Not sure as to why I'm getting different results.

Edit: samtools depth

I forgot about samtools depth. It's a strong tool.

samtools depth -a -r 1:14990509-15118429 -q 0 -Q 10 test.bam | cut -f 3 >tmp1

The mean of tmp1 is

8.51

Which is the same as the C++ code. So solved! Thanks for helping me learn C++.

ADD REPLY • link 7.6 years ago by QVINTVS_FABIVS_MAXIMVS ★ 2.6k

2

Entering edit mode

are you sure about this:

float avg = accumulate (lens.begin(),lens.end(), 0.0)/ lens.size();
cov = (nreads*avg)/(end-start);

why don't you calculate the real coverage at each position of the interval ?

ADD REPLY • link 7.6 years ago by Pierre Lindenbaum 164k

2

Entering edit mode

I edited the response above.

To answer your question, my goal is to calculate coverage for each chromosome without interrogating each position. As an exercise, I wanted to calculate coverage of an interval (eventually a chromosome) by iterating each read mapped to that region.

Would calculating the coverage for each position be faster than iterating through mapped reads?

ADD REPLY • link 7.6 years ago by QVINTVS_FABIVS_MAXIMVS ★ 2.6k

3

Entering edit mode

Why don't you remove vector<int> lens and instead have int total=0; and total+=it->Length() inside the loop and cov = total*1.0/(end-start+1)?