Question

samtools depth and sambamba depth produce different results

0

Entering edit mode

3.6 years ago

Johannes • 0

I would like to calculate some very general summary statistics on bam files: average read depth and breadth of coverage at >= 1x and >10x. I am using samtools and sambamba to calculate depth and they seem to produce differing results as evidenced by the summary files below.
According to the documentation, these are the default filters:

samtools depth filters reads that have any of the flags UNMAP, SECONDARY, QCFAIL, or DUP (-g option)
sambamba depth filters reads that have 'mapping_quality > 0 and not duplicate and not failed_quality_control' (-F option)

For the analysis I am doing, I want to be stringent an exclude any multimappers, low quality alignments, duplicates or secondary alignments. Therefore, I have specified the filter options in both commands as shown below.

Here is my samtools command:

samtools depth -a -Q 1 test.bam > test.genomeCoverage

This is the head of the output:

chr1    1   5
chr1    2   5
chr1    3   5
chr1    4   5
chr1    5   5
chr1    6   5
chr1    7   5
chr1    8   5
chr1    9   5
chr1    10  5

Here is my sambamba command:

sambamba depth base -c 0 -t 8 -F "mapping_quality > 0 and not (duplicate or failed_quality_control or unmapped or secondary_alignment)" -o test.genomeCoverage test.bam

This is the head of the resulting file:

REF POS COV A   C   G   T   DEL REFSKIP SAMPLE
chr1    0   5   0   5   0   0   0   0   *
chr1    1   5   0   0   0   5   0   0   *
chr1    2   5   5   0   0   0   0   0   *
chr1    3   5   5   0   0   0   0   0   *
chr1    4   5   0   5   0   0   0   0   *
chr1    5   5   0   5   0   0   0   0   *
chr1    6   5   0   5   0   0   0   0   *
chr1    7   5   0   0   0   5   0   0   *
chr1    8   5   5   0   0   0   0   0   *

I then produce summary statistics from that 'test.genomeCoverage' file with the following code:

echo -n -e "Average read depth across genome:\t"  > ${o}_summaryStats
cat test.genomeCoverage | awk '{sum+=$3} END {print sum/NR}' >> ${o}_summaryStats
echo -n -e "% genome (breadth) covered >0x:\t" >> ${o}_summaryStats
cat test.genomeCoverage | awk '{c++; if($3>0) total+=1}END{print (total/c)*100}' >> ${o}_summaryStats
echo -n -e "% genome (breadth) covered >10x:\t" >> ${o}_summaryStats
cat test.genomeCoverage | awk '{c++; if($3>10) total+=1}END{print (total/c)*100}' >> ${o}_summaryStats

The output for the file produced with samtools is this:

Average read depth across genome:   84.1805
% genome (breadth) covered >0x: 97.7017
% genome (breadth) covered >10x:    94.7993

The output for the file produced with sambamba is this:

Average read depth across genome:   84.3173
% genome (breadth) covered >0x: 97.7089
% genome (breadth) covered >10x:    94.8096

Obviously, the difference is small but still I am confused and would like to understand where it is coming from. I would be grateful for any suggestions or hints as to how I could check where that difference is coming from.

Context: this is a small benchmarking project of Nanopore vs Illumina sequencing which is why I want to get these stats right. I would expect Nanopore to have a higher breadth because of the better coverage in repetitive regions. This is also the reason why I want a stringent filter which excludes any multimappers or reads that are in any way ambiguously aligned.

sambamba samtools depth coverage • 3.6k views

ADD COMMENT • link updated 2.7 years ago by jkbonfield ★ 1.3k • written 3.6 years ago by Johannes • 0

score 2 · Answer 1 · 2022-02-06

2

Entering edit mode

2.8 years ago

Trevor ▴ 20

I think for sambamba, the sum of depth that corresponds to samtools's depth is SUM(COV) - SUM(DEL)

i.e.

cat test.genomeCoverage | awk '{sum+=$3-=$8} END {print sum/NR}' >> ${o}_summaryStats

You mind giving it a try?

ADD COMMENT • link 2.8 years ago by Trevor ▴ 20

0

Entering edit mode

There is an option in samtools depth to control this, but you're correct on the default behaviour:

  -J           Include reads with deletions in depth computation

ADD REPLY • link 2.7 years ago by jkbonfield ★ 1.3k