Question

Variant calling: Samtools QUAL from NGS data Vs. Sanger quality score

0

Entering edit mode

6.2 years ago

omer.k ▴ 110

So I've sequenced samples with MinION NGS platform, and analysed for SNPs via SAMtools/BCFtools. To corroborate or reject these SNVs I've sequenced the same samples via Sanger.

According to the VCF spec (V4.2):

Phred-scaled quality score for the assertion made in ALT. i.e. −10log10 prob(call in ALT is wrong). If ALT is ‘.’ (no variant) then this is −10log10 prob(variant), and if ALT is not ‘.’ this is −10log10 prob(no variant). If unknown, the missing value should be specified. (Numeric)

When examining the VCF file generated by the SAMtools/BCFtools pipeline, I find that the QUAL column indicates values from as low as 5.0... to as high as 196.0 for an alternative allele, with DP values on the order of 10^3 (which makes me happy, as it increases my confidence of this position being a SNP).

When I sequence by Sanger the same position, it may also support the alternative allele. But Sanger quality scores are maximum 60 in my data, and it seems from the The Sanger FASTQ file format for sequences with quality scores, that the Phred score indicated here are similar: -10*log10(Probability of base erroneously called).

Why are these quality scores (NGS, Sanger) so distinctly different than each other?
Is MinION NGS data not coded in Sanger+33 ASCII base, same as Sanger?
Is a comparison between the two quality scores a valid one, to some extent?
Edit: is there another parameter in the VCF format of my NGS data which correspond to the Sanger phred score, perhaps 'MQ' ("Root-mean-square mapping quality of covering reads")?

Sanger NGS • 2.3k views

ADD COMMENT • link 6.2 years ago by omer.k ▴ 110

score 1 · Answer 1 · 2018-09-02

1

Entering edit mode

6.2 years ago

Devon Ryan 104k

You're comparing apples and oranges. For the VCF file, it's the probability that the variant call was wrong. In the fastq file it's the probability that the single base was called incorrectly. The call in the VCF file is derived from potentially thousands of reads, the fastq file is a single read. As an aside, actual Sanger sequencing produces files in a different format, I assume you actually used NGS given that.
Yes, but that's not relevant to what's in the VCF file. The MinION reads are MUCH lower quality than that, it's only their combined effect that allows accurate variant calling.
Actual Sanger sequencing is absurdly more reliable than a MinION. Illumina sequencing is also more reliable, though much less so.
As indicated in 1, you're not looking at similar things. Have a look at the hdf5 from the Minion. It will contain base qualities that are comparable to the Sanger (or Illumina) scores. You'll note that they're lower.

Actual Sanger base scores can be >60. When you call variants you usually use more than 1 of them (ideally two in each direction), so the resulting variant quality score would again be higher (100+).

ADD COMMENT • link 6.2 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks for the elaborate answer Devon. True, MinION data is generally less accurate, but this is also why I've filtered them (average Q score > 7) prior to alignment, and obviously, as you've mentioned, there are tens, sometime 100's of thousands of reads which are aligned against the reference.

What is the general range of 'QUAL' which SAMtools mpileup produces, and how is it calculated?
On the same note, what would be a good lower cap on these scores to indicate a SNV trust-worthy?

BTW, Ryan, for context sake, I'm dealing with finding SNVs which by nature may be underexpressed in comparison to the reference allele. Which is why Sanger too, may not be the best validation tool as far as I known. Only Illumina, which is not qualitative, but quantitative, may produce an indication which is reliable to the best extent.

ADD REPLY • link 6.2 years ago by omer.k ▴ 110

1

Entering edit mode

I've never looked into the details of bcftools (samtools mpileup is deprecated) and suggest you search this site for thresdolding variant calls.

It sounds like you're calling variants from RNAseq data, in which case please try to note that in your questions, since it's fairly atypical.

ADD REPLY • link 6.2 years ago by Devon Ryan 104k