Question

Coverage of reference from the BAM file

0

Entering edit mode

7.2 years ago

brs1111 ▴ 10

I have a SAM/BAM file and I would like to calculate the percentage of the reference sequences are aligned with the reads. Is there a tool that can help to calculate the coverage?

bam sam alignment • 4.4k views

ADD COMMENT • link 7.2 years ago by brs1111 ▴ 10

0

Entering edit mode

what did you find so far ?

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

I tried samtools mpileup but it got aborted as I run without -r chr3:1,000-2,000 and single BAM file. Also I run bamtools coverage -in in.BAM -out output_reads_coverage.txt, but not sure how can I calculate the percentage of coverage from output file.

ADD REPLY • link 7.2 years ago by brs1111 ▴ 10

0

Entering edit mode

What do you mean by the 'percentage of coverage' exactly?

samtools flagstat will show the percentage of your reads that have aligned to the reference genome. BEDTools, on the other hand, has numerous functions that allow you to determine depth of coverage and read depth of these reads that have aligned [to the reference].

ADD REPLY • link 7.2 years ago by Kevin Blighe 88k

0

Entering edit mode

Hi Kevin, Thanks for the reply. I used samtools flagstat for other statistics. Suppose I have one contigs of length 100 and two reads of length 20 and 30 respectively. Let the first read is aligned from position 10 to 30 and the second read is aligned from position 20 to 50. So the percentage of coverage is 40% here as all bases from position 10 to 50 are aligned.

Does BEDTools generate this percentage?

ADD REPLY • link 7.2 years ago by brs1111 ▴ 10

1

Entering edit mode

Yes, that's what I thought that you meant. It's not a traditional metric that 's observed but I have to admit that it's interesting and I think that I'll calculate it in my analyses in the future. Thanks!

Anyway, I have been able to do this just now indirectly using BEDTools and looking over the hg38 reference genome:

Bases at 0 read depth:

bedtools genomecov -ibam BAM -g hg38.fasta -bga | awk '$4==0 {bpCountZero+=($3-$2)} {print bpCountZero}' | tail -1

Bases at >0 read depth:

bedtools genomecov -ibam BAM -g hg38.fasta -bga | awk '$4>0 {bpCountNonZero+=($3-$2)} {print bpCountNonZero}' | tail -1

The genomecov function of BEDTools normally computes coverage in your BAM file for each position in your supplied reference genome. The output looks like this:

chr1    0   11794364    0
chr1    11794364    11794488    3
chr1    11794488    11794490    5
chr1    11794490    11794514    6
chr1    11794514    11794639    3
chr1    11794639    14022492    0
chr1    14022492    14022535    1
chr1    14022535    97082336    0
chr1    97082336    97082442    1
chr1    97082442    97082486    2
chr1    97082486    97082593    1
chr1    97082593    97305303    0
chr1    97305303    97305377    2

I then use the awk commands to calculate total bases at 0 read depth and total bases >0 read depth. After that, the calculation for percentage can easily be done. For my sample (a small targeted sequence panel), I ended up with:

34312 non-zero bases
3209251793 zero bases
(34312 / (3209251793 + 34312 )) * 100 = 0.0011% genome coverage

It took less than a minute to run with my commands above.

If you want to automate this in a pipeline, the following code would work (note that in BASH decimal points need to be handled specially with the bc command):

zero=$(bedtools genomecov -ibam BAM -g hg38.fasta -bga | awk '$4==0 {bpCountZero+=($3-$2)} {print bpCountZero}' | tail -1)

nonzero=$(bedtools genomecov -ibam BAM -g hg38.fasta -bga | awk '$4>0 {bpCountNonZero+=($3-$2)} {print bpCountNonZero}' | tail -1)

percent=$(bc <<< "scale=6; ($nonzero / ($zero + $nonzero))*100")

echo $percent

Like I said, I think that this is a useful metric and you could perhaps turn this into a tutorial(?), if Pierre agreed too.

ADD REPLY • link 7.2 years ago by Kevin Blighe 88k

1

Entering edit mode

Thanks Kevin. It works really great. I have just one more question (could be a very basic question as I am new to this sort of analysis). Actually, I have a reference file that contains almost 60K sequences and only 50% of them are aligned with reads. I would like to find the coverage only for the sequences in reference that are aligned. The coverage percentage that I calculated is very low and it could be because of zero bases of all sequences.

ADD REPLY • link 7.2 years ago by brs1111 ▴ 10

0

Entering edit mode

Hi again friend,

If you just want the coverage of the aligned reads, then you just change the -bga parameter to -bg, and this will not then output the unaligned reads (with 0 read depth).

There are other options here using bedtools genomecov: http://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html

Let me know if this answers your question!

ADD REPLY • link 7.2 years ago by Kevin Blighe 88k