I have a set of patient VCF files, and I am looking at the genotypes of each variant called. I observe the three expected categories:
0/0 : homozygous for the reference allele
0/1 : heterozygous (one ref allele, one alt allele)
1/1 : homozygous for the alternate allele
My question is: what is the point of calling a homozygous reference genotype? Is this only done when there is low confidence in the variant call? So as to say, "hey, this could be a different genotype, but here's what we got given the data quality"?
What would I lose if I filtered all rows in my VCFs out with the genotype listed as "0/0"?
How many samples per VCF file?
Just one sample per file.
I like to know the difference between "confident data, and we see hom-ref" and "no-data". Your single-sample VCF is only showing mutant sites, but I often query a VCF at a given set of coordinates, and I want to know which of the four situations applies: 0/0, 0/1, 1/1, or no-data.
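To make the four situations concrete, here is a minimal sketch in plain Python (no VCF library). The VCF text, the sample name `patientA`, and the helper names `load_genotypes` and `classify` are all made up for illustration. It treats a site that is absent from the file and a site with a `./.` genotype as the same "no-data" case, which is an assumption, not a rule.

```python
# Hypothetical example data: a tiny single-sample VCF held in a string.
EXAMPLE_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tpatientA
chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/0
chr1\t200\t.\tC\tT\t60\tPASS\t.\tGT\t0/1
chr1\t300\t.\tG\tA\t70\tPASS\t.\tGT\t1/1
chr1\t400\t.\tT\tC\t5\tPASS\t.\tGT\t./.
"""


def load_genotypes(vcf_text):
    """Map (chrom, pos) -> genotype string for a single-sample VCF."""
    genotypes = {}
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        chrom, pos = fields[0], int(fields[1])
        format_keys = fields[8].split(":")
        sample_values = fields[9].split(":")
        gt = sample_values[format_keys.index("GT")]
        # Treat phased (0|1) and unphased (0/1) genotypes alike.
        genotypes[(chrom, pos)] = gt.replace("|", "/")
    return genotypes


def classify(genotypes, chrom, pos):
    """Return 'hom-ref', 'het', 'hom-alt' or 'no-data' for one site."""
    gt = genotypes.get((chrom, pos))
    if gt is None or "." in gt:
        return "no-data"  # site absent from the file, or an explicit no-call
    alleles = set(gt.split("/"))
    if alleles == {"0"}:
        return "hom-ref"
    if len(alleles) == 1:
        return "hom-alt"
    return "het"  # includes 0/1 and multi-allelic cases such as 1/2


genotypes = load_genotypes(EXAMPLE_VCF)
for pos in (100, 200, 300, 400, 500):
    print(pos, classify(genotypes, "chr1", pos))
```

Position 500 is not in the file at all, so the lookup cannot tell whether the site was never covered or was hom-ref and simply not written out. That is exactly the ambiguity an explicit 0/0 row removes.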
Ah, so I think what both Pierre and Karl are saying is that "0/0" helps to rule OUT a disease: if you know the disease variant is found at location X, and patientA has a "0/0" genotype at location X, then patientA does not carry it.
However, if you are looking for patterns in an unsupervised fashion, and want to find variants that you can then attribute to a patient's phenotype, filtering out all "0/0" calls makes sense, because you want to prioritize the 0/1 and 1/1 calls.
Did I represent what you are saying correctly?
Also, I assume "no data" in a VCF is just the "absence" of information for a particular locus?
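For the filtering case, a rough sketch of dropping the "0/0" rows from a single-sample VCF could look like the following. The rows and the helper names `genotype_of` and `drop_hom_ref` are invented for illustration, and the code assumes exactly one sample column with GT as the first FORMAT key.

```python
from collections import Counter

# Made-up single-sample rows (tab-separated VCF body lines), for illustration only.
ROWS = [
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:30",
    "chr1\t200\t.\tC\tT\t60\tPASS\t.\tGT:DP\t0/1:28",
    "chr1\t300\t.\tG\tA\t70\tPASS\t.\tGT:DP\t1/1:35",
    "chr1\t400\t.\tT\tC\t40\tPASS\t.\tGT:DP\t0/0:22",
]


def genotype_of(row):
    """Return the GT value of the single sample column (column 10)."""
    sample = row.split("\t")[9]
    # Assumption: GT is the first FORMAT key, so it is the first sample value.
    return sample.split(":")[0].replace("|", "/")


def drop_hom_ref(rows):
    """Split rows into (kept, dropped), where dropped holds the 0/0 calls."""
    kept = [r for r in rows if genotype_of(r) != "0/0"]
    dropped = [r for r in rows if genotype_of(r) == "0/0"]
    return kept, dropped


kept, dropped = drop_hom_ref(ROWS)
counts = Counter(genotype_of(r) for r in ROWS)
print("genotype counts before filtering:", dict(counts))
print("rows kept:", len(kept), "rows dropped:", len(dropped))
```

After this filter, the sites at positions 100 and 400 look the same as sites that were never genotyped, so it is worth keeping the unfiltered file around if you may later need to say "this patient is confidently hom-ref here".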
And how did you generate the VCF?
I didn't generate the VCF. It was given to me by a collaborator. What decisions regarding generation would affect the answer to this question?
We used Atlas-SNP2 v1.4.3, with the following flags (removed the input and output default/required flags) and filters:
-F -y 6 -s --Illumina -f 3500
Tools like samtools are free to print either all positions or only the variant ones. Having all positions is useful later, for example when you want to merge some VCFs: if every position is present, there is no ambiguity between a HOM_REF and a NO_CALL for a missing value.
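A toy illustration of that merge ambiguity, in plain Python: this is not how any real merge tool is implemented, and the sample names, sites, and the `merge_calls` helper are hypothetical. A site missing from one sample's calls can only be filled in as a no-call (`./.`), whereas all-positions input carries an explicit 0/0.

```python
def merge_calls(sample_calls):
    """Merge per-sample {site: genotype} dicts into {site: {sample: genotype}}.

    A site missing from a sample is reported as './.' (no-call), because the
    merge cannot know whether the sample was hom-ref or had no data there.
    """
    all_sites = sorted(set().union(*[calls.keys() for calls in sample_calls.values()]))
    merged = {}
    for site in all_sites:
        merged[site] = {
            sample: calls.get(site, "./.") for sample, calls in sample_calls.items()
        }
    return merged


# Variant-only input: each sample lists only its non-reference sites.
variants_only = {
    "patientA": {("chr1", 200): "0/1"},
    "patientB": {("chr1", 300): "1/1"},
}

# All-positions input: hom-ref sites are written out explicitly.
all_positions = {
    "patientA": {("chr1", 200): "0/1", ("chr1", 300): "0/0"},
    "patientB": {("chr1", 200): "0/0", ("chr1", 300): "1/1"},
}

merged_sparse = merge_calls(variants_only)
merged_full = merge_calls(all_positions)

for site in merged_sparse:
    print(site, "variant-only:", merged_sparse[site], "all-positions:", merged_full[site])
```

With the variant-only input, patientB at chr1:200 comes out as `./.`; with the all-positions input, the same cell is a definite `0/0`.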