I would like to calculate some very general summary statistics on bam files: average read depth and breadth of coverage at >= 1x and >10x.
I am using samtools
and sambamba
to calculate depth
and they seem to produce differing results as evidenced by the summary files below.
According to the documentation, these are the default filters:
samtools depth
filters reads that have any of the flags UNMAP, SECONDARY, QCFAIL, or DUP (-g option)sambamba depth
filters reads that have 'mapping_quality > 0 and not duplicate and not failed_quality_control' (-F option)
For the analysis I am doing, I want to be stringent an exclude any multimappers, low quality alignments, duplicates or secondary alignments. Therefore, I have specified the filter options in both commands as shown below.
Here is my samtools command:
samtools depth -a -Q 1 test.bam > test.genomeCoverage
This is the head of the output:
chr1 1 5
chr1 2 5
chr1 3 5
chr1 4 5
chr1 5 5
chr1 6 5
chr1 7 5
chr1 8 5
chr1 9 5
chr1 10 5
Here is my sambamba command:
sambamba depth base -c 0 -t 8 -F "mapping_quality > 0 and not (duplicate or failed_quality_control or unmapped or secondary_alignment)" -o test.genomeCoverage test.bam
This is the head of the resulting file:
REF POS COV A C G T DEL REFSKIP SAMPLE
chr1 0 5 0 5 0 0 0 0 *
chr1 1 5 0 0 0 5 0 0 *
chr1 2 5 5 0 0 0 0 0 *
chr1 3 5 5 0 0 0 0 0 *
chr1 4 5 0 5 0 0 0 0 *
chr1 5 5 0 5 0 0 0 0 *
chr1 6 5 0 5 0 0 0 0 *
chr1 7 5 0 0 0 5 0 0 *
chr1 8 5 5 0 0 0 0 0 *
I then produce summary statistics from that 'test.genomeCoverage' file with the following code:
echo -n -e "Average read depth across genome:\t" > ${o}_summaryStats
cat test.genomeCoverage | awk '{sum+=$3} END {print sum/NR}' >> ${o}_summaryStats
echo -n -e "% genome (breadth) covered >0x:\t" >> ${o}_summaryStats
cat test.genomeCoverage | awk '{c++; if($3>0) total+=1}END{print (total/c)*100}' >> ${o}_summaryStats
echo -n -e "% genome (breadth) covered >10x:\t" >> ${o}_summaryStats
cat test.genomeCoverage | awk '{c++; if($3>10) total+=1}END{print (total/c)*100}' >> ${o}_summaryStats
The output for the file produced with samtools is this:
Average read depth across genome: 84.1805
% genome (breadth) covered >0x: 97.7017
% genome (breadth) covered >10x: 94.7993
The output for the file produced with sambamba is this:
Average read depth across genome: 84.3173
% genome (breadth) covered >0x: 97.7089
% genome (breadth) covered >10x: 94.8096
Obviously, the difference is small but still I am confused and would like to understand where it is coming from. I would be grateful for any suggestions or hints as to how I could check where that difference is coming from.
Context: this is a small benchmarking project of Nanopore vs Illumina sequencing which is why I want to get these stats right. I would expect Nanopore to have a higher breadth because of the better coverage in repetitive regions. This is also the reason why I want a stringent filter which excludes any multimappers or reads that are in any way ambiguously aligned.
There is an option in samtools depth to control this, but you're correct on the default behaviour: