Entering edit mode
7.2 years ago
brs1111
▴
10
I have a SAM/BAM file and I would like to calculate the percentage of the reference sequences are aligned with the reads. Is there a tool that can help to calculate the coverage?
what did you find so far ?
I tried
samtools mpileup
but it got aborted as I run without-r chr3:1,000-2,000
and single BAM file. Also I runbamtools coverage -in in.BAM -out output_reads_coverage.txt
, but not sure how can I calculate the percentage of coverage from output file.What do you mean by the 'percentage of coverage' exactly?
samtools flagstat
will show the percentage of your reads that have aligned to the reference genome. BEDTools, on the other hand, has numerous functions that allow you to determine depth of coverage and read depth of these reads that have aligned [to the reference].Hi Kevin, Thanks for the reply. I used
samtools flagstat
for other statistics. Suppose I have one contigs of length 100 and two reads of length 20 and 30 respectively. Let the first read is aligned from position 10 to 30 and the second read is aligned from position 20 to 50. So the percentage of coverage is 40% here as all bases from position 10 to 50 are aligned.Does BEDTools generate this percentage?
Yes, that's what I thought that you meant. It's not a traditional metric that 's observed but I have to admit that it's interesting and I think that I'll calculate it in my analyses in the future. Thanks!
Anyway, I have been able to do this just now indirectly using BEDTools and looking over the hg38 reference genome:
Bases at 0 read depth:
Bases at >0 read depth:
The
genomecov
function ofBEDTools
normally computes coverage in your BAM file for each position in your supplied reference genome. The output looks like this:I then use the awk commands to calculate total bases at 0 read depth and total bases >0 read depth. After that, the calculation for percentage can easily be done. For my sample (a small targeted sequence panel), I ended up with:
It took less than a minute to run with my commands above.
If you want to automate this in a pipeline, the following code would work (note that in BASH decimal points need to be handled specially with the
bc
command):Like I said, I think that this is a useful metric and you could perhaps turn this into a tutorial(?), if Pierre agreed too.
Thanks Kevin. It works really great. I have just one more question (could be a very basic question as I am new to this sort of analysis). Actually, I have a reference file that contains almost 60K sequences and only 50% of them are aligned with reads. I would like to find the coverage only for the sequences in reference that are aligned. The coverage percentage that I calculated is very low and it could be because of zero bases of all sequences.
Hi again friend,
If you just want the coverage of the aligned reads, then you just change the
-bga
parameter to-bg
, and this will not then output the unaligned reads (with 0 read depth).There are other options here using
bedtools genomecov
: http://bedtools.readthedocs.io/en/latest/content/tools/genomecov.htmlLet me know if this answers your question!