From:
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
SNPs and Small Indels
For example, suppose we are looking at a locus in the genome:
Ref: a t C g a // C is the reference base
1  : a t G g a // C base is a G in some individuals
2  : a t - g a // C base is deleted w.r.t. the reference
3  : a t CAg a // A base is inserted w.r.t. the reference sequence
In the above cases, what are the alleles and how would they be represented as a VCF record?
First is a SNP polymorphism of C/G → { C , G } → C is the reference allele
20 3 . C G . PASS DP=100
Second, 1 base deletion of C → { tC , t } → tC is the reference allele
20 2 . TC T . PASS DP=100
Third, 1 base insertion of A → { tC ; tCA } → tC is the reference allele
20 2 . TC TCA . PASS DP=100
OK, so if witnessed independently, this TC->TCA insertion would be written as C->CA at position 3, right?
20 3 . C CA . PASS DP=100
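To make that equivalence concrete, here is a minimal Python sketch that drops leading bases shared by REF and ALT while each allele keeps at least one base. The function name `trim_shared_prefix` is my own, not something from the spec, and this is only prefix trimming, not full left-alignment against a reference genome.

```python
def trim_shared_prefix(pos, ref, alt):
    """Drop leading bases shared by REF and ALT, keeping >= 1 base in each.

    pos is the 1-based position of the first REF base.
    Hypothetical helper for illustration; not part of the VCF spec.
    """
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1  # the record now starts one base further right
    return pos, ref, alt


# Insertion written against the TC anchor collapses to C -> CA at position 3.
print(trim_shared_prefix(2, "TC", "TCA"))
# The deletion cannot be trimmed: ALT would become empty.
print(trim_shared_prefix(2, "TC", "T"))
```

The deletion is the record that forces the T at position 2 into REF, since an allele may not be empty.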
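It seems that if the C at position 3 had not also been observed as deleted in this file, it could serve as the reference base on its own. Since it was deleted in some samples, the record has to start at position 2, and the other alleles have to be aggregated against that same TC reference.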
This is not in the spec, but I assume this is the correct aggregation:
20 2 . TC TG,T,TCA . PASS DP=100
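The rule I am assuming can be sketched in Python: take the smallest reference span that covers every overlapping record, then pad each ALT with the reference bases it is missing on either side. The name `merge_to_common_span` and the `(pos, ref, alt)` tuple layout are my own choices for illustration; the spec does not define this procedure.

```python
def merge_to_common_span(records):
    """Combine overlapping (pos, ref, alt) records into one multi-allelic record.

    pos is 1-based; all records are assumed to be on the same chromosome.
    Hypothetical helper for illustration; not defined by the VCF spec.
    """
    start = min(pos for pos, ref, _ in records)
    end = max(pos + len(ref) for pos, ref, _ in records)  # exclusive

    # Rebuild the reference over [start, end) from the records' own REF strings.
    span = [None] * (end - start)
    for pos, ref, _ in records:
        for i, base in enumerate(ref):
            idx = pos - start + i
            if span[idx] not in (None, base):
                raise ValueError("records disagree about the reference")
            span[idx] = base
    if None in span:
        raise ValueError("records do not cover the whole span")
    span_ref = "".join(span)

    # Pad every ALT with the flanking reference bases it lacks.
    alts = []
    for pos, ref, alt in records:
        offset = pos - start
        padded = span_ref[:offset] + alt + span_ref[offset + len(ref):]
        if padded not in alts:
            alts.append(padded)
    return start, span_ref, alts


records = [(3, "C", "G"), (2, "TC", "T"), (3, "C", "CA")]
print(merge_to_common_span(records))
```

For the three records in this question it returns position 2, REF TC, and ALTs TG, T, TCA, which matches the aggregated line above.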
What is the rule that is being applied here? Is there a spec that describes this more precisely? Is there a name for this strategy (other than VCF)?
I see what you mean. They could have aggregated the example as TC => TG,T,TCA. Maybe they meant to show three independent entries before aggregation and just slipped on that last one, writing TC>TCA instead of C>CA?
The spec is a bit incomplete here. I guess my issue is that there are good examples, but the rules governing the aggregation are vague.
You may want to search the archives of the vcftools-spec mailing list or, if that fails, ask the group. They field questions like this all the time: https://lists.sourceforge.net/lists/listinfo/vcftools-spec