UPDATE: I'll leave this post up since I got a really thorough response from the tool's developer himself. This response might be of a great value later on for someone else. Thank you jkbonfield!
Hey fellow bioinformaticians!
At this point I'm really confused regarding the -B
option with the mpileup
function.
First to clarify what this option does, it "disables base alignment quality (BAQ) computation"
Now, in the documentation page of samtools mpileup:
BAQ is the Phred-scaled probability of a read base being misaligned. It greatly helps to reduce false SNPs caused by misalignments. BAQ is calculated using the probabilistic realignment method described in the paper “Improving SNP discovery by base alignment quality”
BUT, in the documentation page of bcftools mpileup, they say the exact opposite regarding BAQ:
-B, --no-BAQ
Disable probabilistic realignment for the computation of base alignment quality (BAQ). BAQ is the Phred-scaled probability of a read base being misaligned. Applying this option greatly helps to reduce false SNPs caused by misalignments.
So at this point I'm really confused. Samtools documentation says that computing BAQ helps reduce false SNPs discovery, while the bcftools documentation says that disabling BAQ improves SNPs discovery.
Also, I've tried bcftools mpileup
on the same set of data, once without -B
and once with it. I got significantly different results: without this option, I got some INDELs in some samples, while with the -B
argument, I got no INDELs at all in any of my 42 COVID-19 samples.
Am I missing something? Did I misunderstand the developers' wording?
EDIT: in the ivar
manual, the use of -B
with samtools mpileup
is recommended, but for a different reason:
Please use the
-B
options withsamtools mpileup
to call variants and generate consensus. When a reference sequence is supplied, the quality of the reference base is reduced to 0 (ASCII: !) in thempileup
output. Disabling BAQ with-B
seems to fix this. This was tested insamtools
1.7 and 1.8
I don't know what to understand from all of this.
Also for with-BAQ vs without-BAQ see this pic from the above github PR:
The blue bars at the top right is elapsed time, shoing how significant BAQ is to the running time of "bcftools mpileup".
This was an earlier version of the PR so the results have changed since then for "partial" mode, but basically it shows the general gist. Each dot there is a different quality filtering level in increments of 10 (GATK may have been bigger increments, I forget). So bottom right is all variants (QUAL>=0), somewhere closer to bottom left corner are around QUAL>=100 ish, and the top left are higher levels of QUAL. As expected increasing the QUAL filter reduces FP (higher precision) but increases FN (lower recall). There's a "sweet spot", but where it is depends on your goals.
It's clear that at 60x coverage BAQ has a significant hit to recall rates (higher Y) but also improves accuracy (lower X). However for most realistic QUAL filtering levels, you're better off disabling BAQ and just increasing your QUAL filter level a bit to compensate. Eg Q>=70 no-BAQ is lower FN and lower FP than Q>=30 full-BAQ. At very high filtering this isn't true, with the change over point around QUAL>=120. The partial BAQ mode (NB this has been revised since, so don't draw too much into that line) did better than either.
At lower depth, eg 15x, it's totally different and BAQ always wins out. This is fair enough as it's how BAQ was initially written. When Heng Li wrote it he was working on 1000 Genomes Project which was doing very shallow sequencing of lots of samples. Data has moved on since then, and the algorithms and parameters aren't well tuned.
As a final comment on tuning, I'd also recommend upping the -h option from 100 to maybe 500 for modern Illumina data sets, ie
bcftools mpileup -h 500
. This is a major benefit for indel calling. It's a homopolymer error bias, so indels in homopolymers get rejected if the homopolymer length is too long, reflecting the likelihood of sequencing artifact being higher than genuine variant. Again it's a product of the time. In 1000G era, that was true. On modern Illumina platforms it's rarely a major source of errors. While you're at it, also best to add-m 2
or-m 3
to filter the more rediculous indel false positives. Another product of the extreme low-depth origins of bcftools that doesn't apply so much now.