Question

samtools/bcftools mpileup -B option: to use it or not in order to reduce the occurrence of false SNP calls?

3

3.6 years ago

cg1440 ▴ 60

UPDATE: I'll leave this post up since I got a really thorough response from the tool's developer himself. It may be of great value to someone else later on. Thank you jkbonfield!

Hey fellow bioinformaticians!

At this point I'm really confused regarding the -B option with the mpileup function.

First, to clarify what this option does: it "disables base alignment quality (BAQ) computation".

Now, in the documentation page of samtools mpileup:

BAQ is the Phred-scaled probability of a read base being misaligned. It greatly helps to reduce false SNPs caused by misalignments. BAQ is calculated using the probabilistic realignment method described in the paper “Improving SNP discovery by base alignment quality”

BUT, in the documentation page of bcftools mpileup, they say the exact opposite regarding BAQ:

-B, --no-BAQ Disable probabilistic realignment for the computation of base alignment quality (BAQ). BAQ is the Phred-scaled probability of a read base being misaligned. Applying this option greatly helps to reduce false SNPs caused by misalignments.

So at this point I'm really confused. The samtools documentation says that computing BAQ helps reduce false SNP calls, while the bcftools documentation says that disabling BAQ is what reduces them.

Also, I've tried bcftools mpileup on the same set of data, once without -B and once with it. I got significantly different results: without this option, I got some INDELs in some samples, while with the -B argument, I got no INDELs at all in any of my 42 COVID-19 samples.

Am I missing something? Did I misunderstand the developers' wording?

EDIT: in the ivar manual, the use of -B with samtools mpileup is recommended, but for a different reason:

Please use the -B options with samtools mpileup to call variants and generate consensus. When a reference sequence is supplied, the quality of the reference base is reduced to 0 (ASCII: !) in the mpileup output. Disabling BAQ with -B seems to fix this. This was tested in samtools 1.7 and 1.8

I don't know what to make of all this.

covid-19 mpileup bcftools snp samtools • 4.4k views

0

3.6 years ago

swbarnes2 14k

My personal anecdote: omitting -B caused some Sanger-verified SNPs to be missed. So I would say do include -B.

score 8 · Accepted Answer · 2021-04-23

That looks like an error in the bcftools documentation. Good spot.

Generally though it's far more complicated than this.

BAQ assesses the per-position accuracy of an alignment. If the sequence is complex then it's likely there is a single clean 1:1 alignment path through the matrix and the Base Alignment Quality is high. For low-complexity indels, e.g. copy number variations in STRs, there can be multiple places where the bases could be added or removed with similar or even identical alignment scores. In these scenarios BAQ will give a low score, as it can't be sure the alignment produced by the read mapper will be the same for all reads, particularly close to the ends of reads.
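To make the STR point concrete, here is a small toy sketch in Python. It has nothing to do with the actual BAQ code; the sequences and the helper `delete_bases` are made up for illustration. It only shows that removing one repeat unit at different positions inside a tandem repeat gives the same read sequence, whereas the same deletions in a complex sequence give distinguishable reads.

```python
def delete_bases(ref: str, start: int, length: int) -> str:
    """Return ref with `length` bases removed, starting at 0-based `start`."""
    return ref[:start] + ref[start + length:]

# Toy references (made up): unique flanks around either a (CA)x4 repeat
# or a non-repetitive stretch of the same length.
str_ref = "GATTG" + "CA" * 4 + "TGGAC"
complex_ref = "GATTG" + "CATCGGAT" + "GGAC"

positions = [5, 7, 9, 11]  # four candidate placements of a 2 bp deletion

str_reads = {delete_bases(str_ref, p, 2) for p in positions}
complex_reads = {delete_bases(complex_ref, p, 2) for p in positions}

# In the repeat, every placement yields the same read, so a mapper
# cannot tell from the sequence alone where the deletion belongs.
print("STR:    ", len(positions), "placements ->", len(str_reads), "distinct read(s)")
print("complex:", len(positions), "placements ->", len(complex_reads), "distinct read(s)")
```

In the repeat case the placement of the gap is arbitrary, which is the kind of situation where a per-base alignment confidence would be expected to drop.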

It is clear that using BAQ does remove many false positives by reducing the base quality in places where reference bias is likely to creep in. However, by its very nature it also removes some true positives (or increases false negatives, if you prefer).
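A rough way to picture this trade-off, as a toy Python sketch rather than anything taken from samtools or bcftools. It assumes that BAQ acts as a cap on the base quality (effective quality = min of the two) and that a caller ignores bases below some threshold; the threshold `MIN_BQ`, the helper names and all the numbers are made up.

```python
MIN_BQ = 13  # hypothetical minimum base quality a caller counts as evidence

def effective_quality(base_q: int, baq: int) -> int:
    """Assumed BAQ behaviour: cap the base quality at the alignment quality."""
    return min(base_q, baq)

def alt_support(column, use_baq: bool) -> int:
    """Count ALT bases in one pileup column that pass the MIN_BQ filter."""
    n = 0
    for is_alt, base_q, baq in column:
        q = effective_quality(base_q, baq) if use_baq else base_q
        if is_alt and q >= MIN_BQ:
            n += 1
    return n

# Made-up pileup columns: (is_alt, base quality, BAQ).
# Misaligned read ends next to an indel: good base quality, poor BAQ.
false_site = [(True, 35, 5), (True, 32, 8), (False, 36, 40), (False, 30, 40)]
# Real SNP inside a low-complexity region: BAQ is low for some reads here too.
true_site = [(True, 36, 10), (True, 34, 40), (True, 33, 9), (False, 35, 40)]

for name, column in [("false_site", false_site), ("true_site", true_site)]:
    print(name,
          "ALT support without BAQ:", alt_support(column, use_baq=False),
          "with BAQ:", alt_support(column, use_baq=True))
```

With these invented numbers the spurious site loses all of its ALT support once BAQ is applied, but the genuine site drops from three supporting bases to one, which is how a true positive can end up below a caller's threshold.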

Last year I started looking to speed up bcftools by removing most BAQ calls, as it's the biggest CPU hog. The idea is to do rapid assessments to identify where there appear to be no problematic alignments in the pileup, meaning BAQ isn't necessary to get the correct answer. (The idea came from Crumble, which assesses multiple alignments to work out which regions need quality values retained and which don't.) Doing this cuts out a good 90% of our BAQ calls, and rather fortuitously it still removes most false positives while not having the same detrimental impact on false negatives. A rare win/win.

It then turned into a bit of a rabbit hole of improving calling in other ways, particularly indels. This work is mostly complete, barring some tidying up. Some examples of the impact are here: https://github.com/samtools/bcftools/pull/1363#issuecomment-802042033

Hopefully we can get this in the next bcftools release.

Edit: For Covid-19 sequencing, if using amplicon methods, you may also wish to apply this Htslib PR:

https://github.com/samtools/htslib/pull/1273

It makes the overlap removal method randomised between read 1 / read 2. This removes a significant source of strand bias from amplicon sequencing, which when coupled with BAQ can cause major problems for calling. The alternative is to use -x to disable overlap removal. Double counting is a problem on shallow data, but it doesn't really make much difference when we get to high depth.
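As a toy illustration of why randomising helps (this is not the htslib code, just a Python sketch): assume every amplicon read pair has read 1 on the forward strand and read 2 on the reverse strand, and that only one mate is kept for an overlapping base. The function `kept_strands` and the fixed "always keep read 1" rule are assumptions made purely for the example.

```python
import random

def kept_strands(n_pairs: int, randomise: bool, seed: int = 0) -> dict:
    """Count which strand survives overlap removal across n_pairs read pairs."""
    rng = random.Random(seed)
    # Toy assumption: read 1 is always forward, read 2 always reverse.
    strand_of = {1: "+", 2: "-"}
    counts = {"+": 0, "-": 0}
    for _ in range(n_pairs):
        # Either always keep read 1, or pick the surviving mate at random.
        keep = rng.choice((1, 2)) if randomise else 1
        counts[strand_of[keep]] += 1
    return counts

fixed = kept_strands(1000, randomise=False)
mixed = kept_strands(1000, randomise=True)
print("always keep read 1:", fixed)
print("random mate:       ", mixed)
```

Under the fixed rule every surviving base at the overlap comes from one strand, which looks like extreme strand bias to a caller; choosing the mate at random spreads the evidence over both strands.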