So I've sequenced samples on the MinION NGS platform and analysed them for SNPs via SAMtools/BCFtools. To corroborate or reject these SNVs, I've sequenced the same samples via Sanger.
According to the VCF spec (V4.2):
Phred-scaled quality score for the assertion made in ALT. i.e. −10log10 prob(call in ALT is wrong). If ALT is ‘.’ (no variant) then this is −10log10 prob(variant), and if ALT is not ‘.’ this is −10log10 prob(no variant). If unknown, the missing value should be specified. (Numeric)
When examining the VCF file generated by the SAMtools/BCFtools pipeline, I find that the QUAL column indicates values from as low as 5.0... to as high as 196.0 for an alternative allele, with DP values on the order of 10^3 (which makes me happy, as it increases my confidence of this position being a SNP).
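To make the scale of these QUAL values concrete, here is a minimal sketch that converts a Phred-scaled score into the probability it encodes, using only the definition quoted from the spec. The function names are my own, for illustration; they are not part of SAMtools or BCFtools.

```python
import math


def phred_to_prob(q: float) -> float:
    """Probability that the assertion is wrong, given a Phred-scaled score Q."""
    return 10 ** (-q / 10)


def prob_to_phred(p: float) -> float:
    """Phred-scaled score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# The lowest and highest QUAL values from my VCF, plus the Sanger maximum of 60.
for q in (5.0, 60.0, 196.0):
    print(f"Q = {q:5.1f}  ->  P(wrong) = {phred_to_prob(q):.3g}")
```

Under this definition, a score of 5 corresponds to roughly a 32% chance that the assertion is wrong, and a score of 60 to one in a million.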
When I sequence the same position by Sanger, it may also support the alternative allele. But Sanger quality scores reach a maximum of 60 in my data, and it seems from The Sanger FASTQ file format for sequences with quality scores that the Phred scores indicated there are similar: -10*log10(probability of base erroneously called).
- Why are these quality scores (NGS, Sanger) so distinctly different from each other?
- Is MinION NGS data not coded in Sanger+33 ASCII base, same as Sanger?
- Is a comparison between the two quality scores a valid one, to some extent?
- Edit: is there another parameter in the VCF format of my NGS data which corresponds to the Sanger Phred score, perhaps 'MQ' ("Root-mean-square mapping quality of covering reads")?
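For reference, this is how I understand the Sanger+33 encoding mentioned above: each character in a FASTQ quality string is the per-base Phred score plus an ASCII offset of 33. The sketch below assumes that offset, and the quality string is made up for illustration rather than taken from my reads.

```python
from typing import List


def decode_phred33(quality_string: str) -> List[int]:
    """Decode a FASTQ quality string, assuming the Sanger (Phred+33) ASCII offset."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probs(quality_string: str) -> List[float]:
    """Per-base error probabilities implied by the decoded Phred scores."""
    return [10 ** (-q / 10) for q in decode_phred33(quality_string)]


# Hypothetical quality string, not real data.
example = "!+5?I]"
print(decode_phred33(example))
print(error_probs(example))
```

Note that these are per-base scores for a single read, whereas the QUAL column in the VCF is a single score per site.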
Thanks for the elaborate answer, Devon. True, MinION data is generally less accurate, but this is also why I've filtered the reads (average Q score > 7) prior to alignment, and obviously, as you've mentioned, there are tens, sometimes hundreds, of thousands of reads aligned against the reference.
BTW, Ryan, for context's sake, I'm dealing with finding SNVs which by nature may be underexpressed in comparison to the reference allele. This is why Sanger, too, may not be the best validation tool, as far as I know. Only Illumina, which is quantitative rather than qualitative, may give the most reliable indication.
I've never looked into the details of bcftools (samtools mpileup is deprecated) and suggest you search this site for thresholding variant calls. It sounds like you're calling variants from RNAseq data, in which case please try to note that in your questions, since it's fairly atypical.