Hello,
As I understand it, QUAL represents the accuracy of genotyping. But what does a '.' in the QUAL column of a VCF file mean? I do not have a numeric Phred-scaled score for the assertion of the ALT allele anywhere in the column.
What does this mean for filtering low quality SNPs or genotypes?
Thank you.
EDIT: More information:
As I was looking at a filtered.recoded VCF file, I went back and checked the raw VCF file as well. The raw file had values throughout the QUAL and INFO fields. My service provider responded: 'The TASSEL-GBS pipeline does not calculate quality scores for any sites, but assigns an arbitrary, uniform value of 20 for each SNP in the VCF files.' In my raw VCF files, in all four cases, there is only one QUAL score (20) for all SNPs, which somehow appears as a '.' in the filtered recoded file. So, I should not use minQ for filtering SNPs, right?
Just so we eliminate a possible glitch, are you sure the . is in the QUAL field? If you are just eyeballing the file, it is quite possible that the header does not line up with the right field and that you are seeing the . from the FILTER field. Maybe try counting the values in that record, or use awk or cut to view them?
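If awk or cut feel unfamiliar, the same check can be sketched in a few lines of Python. This is only an illustration: it assumes the real file is tab-delimited (as VCF requires, even though forum posts tend to show spaces), and label_fields is a made-up helper name, not part of any VCF library. It pairs each of the nine fixed columns with its value so there is no need to count by eye.

```python
# The nine fixed VCF columns, in the order they appear before the sample columns.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]

def label_fields(record_line):
    """Pair each fixed VCF column name with its value in one tab-delimited record."""
    fields = record_line.rstrip("\n").split("\t")
    # zip stops at the shorter sequence, so the sample columns are simply ignored.
    return dict(zip(VCF_COLUMNS, fields))

# One record from the thread, rewritten with explicit tabs (assumed delimiter).
record = "\t".join([
    "ET_C6410100", "69", "S1_614033230", "A", "G", ".", "PASS", ".;DP=2833",
    "GT:AD:DP:GQ:PL", "0/1:19,19:38:100:255,0,255", "0/1:5,10:15:99:255,0,135",
])

for name, value in label_fields(record).items():
    print(f"{name}\t{value}")
```

For this record the QUAL line prints a dot and the FILTER line prints PASS, which is what you would want to confirm on your own file.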
Thanks for your reply, Ram, but I am sure I am looking at the QUAL column. An example:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1 10
ET_C6390828 41 S1_612208905 G T . PASS .;DP=119 GT:AD:DP:GQ:PL ./.:0,0:0 ./.:0,0:0
ET_C6410100 69 S1_614033230 A G . PASS .;DP=2833 GT:AD:DP:GQ:PL 0/1:19,19:38:100:255,0,255 0/1:5,10:15:99:255,0,135
I am a biologist and still learning bioinformatics, so I am afraid I may not be familiar with very technical terms.
Thank you for the link and your suggestions. As implied there, the first SNP is a no-call site because it has no QUAL and no genotype. I am still confused about the second one, though! The VCF header says nothing specific about QUAL:
##fileformat=VCFv4.0
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
It appears that your VCF is malformed. In addition to the missing QUAL scores (missing data are represented by '.'), the INFO field is missing some of the data specified by the header (e.g., NS and AF values). I recommend that you contact your service provider to obtain the correct VCF.
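To see which declared INFO keys a record actually carries, a rough Python sketch along these lines may help. The helper names declared_info_ids and present_info_keys are invented for this example, and the header lines and INFO value are copied from the thread. Note that it only reports which declared keys are absent; whether an absent key is a real problem depends on what the pipeline was supposed to write.

```python
import re

def declared_info_ids(header_lines):
    """Collect the ID of every ##INFO line in the header."""
    ids = set()
    for line in header_lines:
        match = re.match(r"##INFO=<ID=([^,>]+)", line)
        if match:
            ids.add(match.group(1))
    return ids

def present_info_keys(info_field):
    """Collect the keys found in one record's INFO column, skipping '.' entries."""
    keys = set()
    for item in info_field.split(";"):
        if item and item != ".":
            keys.add(item.split("=", 1)[0])
    return keys

header = [
    '##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    '##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">',
]

# INFO value from the second example record in the thread.
missing = declared_info_ids(header) - present_info_keys(".;DP=2833")
print(sorted(missing))  # ['AF', 'NS']
```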
"Unified Genotyper writes LowQual if the variant fails the calling threshold, but only writes a dot if it passes."
Edit: the OP is correct that this statement applies to the FILTER field. The complete text explains that 'PASS' in the FILTER field (as in the OP's example) indicates filtering after variant calling.
Thanks for the thread. But in the example posted there, isn't LowQual specific to the FILTER column rather than the QUAL column? In that case, if the variant passes the filtering threshold, the genotyper writes a dot; if it fails the threshold, the genotyper writes LowQual.
However, as explained in the GATK forum, in my case I see 'PASS' in the FILTER column for every record, because the VCF file was subsequently filtered for MAF and missing data per site by my service provider.
You're right, it is in the QUAL field.
For questions like this, read the spec first. In VCF, "." at QUAL means a missing value – i.e. the QUAL is unknown.
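In practice that means a QUAL of "." should be read as "no value", not as zero. Here is a small Python sketch of that idea, with made-up helper names (parse_qual, qual_summary, min_qual_is_informative) that do not come from any VCF tool. It also shows why a minimum-QUAL threshold cannot separate sites when every QUAL is either missing or the same number, as with the uniform 20 described in this thread.

```python
def parse_qual(value):
    """Return QUAL as a float, or None when the field is the missing marker '.'."""
    return None if value == "." else float(value)

def qual_summary(qual_values):
    """Return the set of distinct numeric QUAL values and the count of missing ones."""
    parsed = [parse_qual(v) for v in qual_values]
    distinct = {q for q in parsed if q is not None}
    n_missing = sum(q is None for q in parsed)
    return distinct, n_missing

def min_qual_is_informative(qual_values):
    """A QUAL threshold can only separate sites if at least two distinct values exist."""
    distinct, _ = qual_summary(qual_values)
    return len(distinct) > 1

raw_file_quals = ["20", "20", "20"]   # uniform placeholder value, as the provider described
filtered_file_quals = [".", ".", "."]  # what the filtered recoded file shows

print(min_qual_is_informative(raw_file_quals))       # False
print(min_qual_is_informative(filtered_file_quals))  # False
```

Either way, filtering on a minimum QUAL would keep or drop every site together, so per-genotype fields such as DP or GQ are the more useful things to look at here.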