I'm expanding on a number of other related questions and seeking clarification.
I've read and understand these posts as well as the linked papers and VCF specification.
https://bioinformatics.stackexchange.com/questions/14356/how-is-the-gt-field-in-a-vcf-file-defined
What Does Genetype ("0/0", "0/1" Or "1/1") In *.Vcf File Represent?
GT and GL fields in VCF file
VCF Files: Help on 0/1 1/1 0/0 1/1 | vs / (phased & unphased
Meaning Of Genotype 2/0 And 2/1
Tell me if this is a good understanding of variant calling:
DNA is chopped up and turned into soup with both mother and father strands scattered about. Strands are sequenced and reported as a list of nucleotide calls along with quality scores. An aligner tries to guess where on a reference genome each strand should go. These pile up along the coordinates of the genome. A variant caller compares the alignment to the reference. If, at a given coordinate, high-quality calls differ from the reference, then the variant caller flags that coordinate as having a variant.
Possibilities include:
- No nucleotides are present with high confidence.
- The reference nucleotide is present with high confidence.
- One or more alternate nucleotides are present with high confidence.
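To make that mental model concrete, here is a minimal sketch of the three outcomes. It is only an illustration of my simplified description, not how any real caller works; the function name `classify_site` and the Phred cutoff of 20 are assumptions I made up for the example.

```python
MIN_QUAL = 20  # assumed Phred cutoff for a "high-quality" base call


def classify_site(ref_base, pileup, min_qual=MIN_QUAL):
    """Classify one coordinate from (base, phred_quality) pairs in the pileup."""
    # Keep only the distinct bases supported by at least one confident call.
    confident = {base for base, qual in pileup if qual >= min_qual}
    if not confident:
        return "no_call"
    alts = sorted(base for base in confident if base != ref_base)
    if not alts:
        return "reference"
    return "variant:" + ",".join(alts)


pileup = [("A", 35), ("G", 38), ("G", 12), ("A", 30), ("G", 33)]
print(classify_site("A", pileup))  # variant:G
```

Note that this toy version never reasons about two alleles at once, which is the gap the EDIT further down addresses.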
Here are the things I still want to clarify:
- Under what conditions would a variant caller give "./0" or "0/."? Does that mean both the reference and low-quality nucleotide calls?
- Under what conditions would a variant caller give "./1" or "1/."? Again, what is the purpose of the "."? The manual says "If a call cannot be made for a sample at a given locus, ‘.’ should be specified for each missing allele in the GT field." But I'm pretty sure I've seen calls like the ones above.
- Is there an accepted convention, or will some callers use both "1/0" and "0/1" for a het call? Likewise, "1/2" vs. "2/1" for a biallelic call. They mean the exact same thing for unphased data, right?
- Would the call "./0" be possible in a gVCF (where all coordinates are included whether or not there is a called variant)?
- Under what conditions would a caller report "0/2" or "2/0" for a call? If there are two alternates found, then why isn't the "first" alternate included in the GT? I have seen this and others have reported it as well.
For example, consider this:
    POS  REF  ALT  GT
    1    A    T,G  0/2
    2    A    A,G  0/2
In the first case, there are two alts, so shouldn't the GT have a "1" somewhere? In the second case, how is this different from ALT = "G"? Is it that both the REF ("A") and ALT ("G") were reported with high quality, as opposed to only "G"?
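For reference, this is how I am reading the GT indices, written as a small sketch. It relies only on what the VCF specification says about GT (0 is REF, 1 is the first ALT, 2 is the second ALT, "/" is unphased, "|" is phased, "." is a missing allele). The helper names `parse_gt`, `gt_to_bases` and `same_genotype` are hypothetical, not from any library.

```python
import re


def parse_gt(gt):
    """Split a GT string into (allele_indices, phased); '.' becomes None."""
    phased = "|" in gt
    indices = [None if tok == "." else int(tok) for tok in re.split(r"[/|]", gt)]
    return indices, phased


def gt_to_bases(ref, alt, gt):
    """Translate GT indices into bases: 0 is REF, k >= 1 is the k-th ALT."""
    alleles = [ref] + alt.split(",")
    indices, _ = parse_gt(gt)
    return [None if i is None else alleles[i] for i in indices]


def same_genotype(gt_a, gt_b):
    """Unphased calls compare as unordered sets of alleles; phased calls keep order."""
    (a, a_phased), (b, b_phased) = parse_gt(gt_a), parse_gt(gt_b)
    if a_phased or b_phased:
        return a_phased == b_phased and a == b
    order = lambda i: -1 if i is None else i  # sort missing alleles first
    return sorted(a, key=order) == sorted(b, key=order)


print(gt_to_bases("A", "T,G", "0/2"))  # ['A', 'G']
print(gt_to_bases("A", "A,G", "0/2"))  # ['A', 'G']
print(parse_gt("./0"))                 # ([None, 0], False)
print(same_genotype("0/1", "1/0"))     # True
print(same_genotype("0|1", "1|0"))     # False
```

Under this reading both rows decode to the same pair of bases, A and G, which is why I am unsure what the extra ALT entry is telling me.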
EDIT: My description of variant calling was too simple and missed an essential point. The variants are called in a diploid manner. Without phasing, you can't know which allele a variant is from. But you can assume _a priori_ that there _are_ two alleles. So the caller doesn't just look for "high-quality" single calls as I described. It actually evaluates the high-quality calls against every possible pair of alleles.
https://en.wikipedia.org/wiki/SNV_calling_from_NGS_data
Genotyping—Forms the possible diploid combinations of variant events from the candidate haplotypes and, for each combination, calculates the conditional probability of observing the entire read pileup. Calculations use the constituent probabilities of observing each read, given each haplotype from the pair HMM evaluation. These calculations feed into the Bayesian formula to calculate a likelihood that each genotype is the truth, given the entire read pileup observed. Genotypes with maximum likelihood are reported.
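To check my understanding of the diploid step, here is a heavily simplified sketch. It is not the pair-HMM, haplotype-based calculation described in the quote: it scores single bases at one site, assumes each read comes from either allele with probability 0.5, assumes a sequencing error is equally likely to produce any of the three other bases, and uses a flat prior so the maximum-likelihood genotype is reported directly. The function names are hypothetical.

```python
import math
from itertools import combinations_with_replacement


def base_likelihood(observed, allele, qual):
    """P(observed base | true allele), with error rate derived from Phred quality."""
    err = 10 ** (-qual / 10)
    return 1 - err if observed == allele else err / 3


def genotype_log_likelihoods(pileup, alleles):
    """log10 P(pileup | genotype) for every unordered diploid pair of allele indices."""
    result = {}
    for i, j in combinations_with_replacement(range(len(alleles)), 2):
        loglik = 0.0
        for base, qual in pileup:
            # Each read is assumed to be drawn from either allele with equal chance.
            p = 0.5 * base_likelihood(base, alleles[i], qual) \
                + 0.5 * base_likelihood(base, alleles[j], qual)
            loglik += math.log10(p)
        result[f"{i}/{j}"] = loglik
    return result


# REF = A, ALT = T,G as in the first example row; the reads show only A and G.
pileup = [("A", 30), ("G", 30), ("A", 30), ("G", 30), ("G", 30), ("A", 30)]
likelihoods = genotype_log_likelihoods(pileup, alleles=["A", "T", "G"])
best = max(likelihoods, key=likelihoods.get)
print(best)  # 0/2
```

In this toy model the allele list and the genotype are separate things: T is a candidate allele, but no genotype containing T explains these reads as well as 0/2 does.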