I'm researching variant nomenclature, and I stumbled upon two conventions that seem to contradict each other. They differ somewhat in where they are applied, but I'd love to hear your thoughts on this.
This wiki describes a protocol to normalize variants in VCF files using left-alignment and parsimony. This makes sense for VCF, where REF and ALT are critical components, although it is not clear why they dismiss the possibility of an empty ALT. To summarize the post, they recommend picking the minimal REF+ALT combination at the left-most position of the alignment.
HGVS nomenclature, for deletions, recommends picking the most 3' of all possible descriptions a variant can be assigned. For example, if ATGCACACACATGG were to lose a CA, HGVS recommends it should be c.10_11delCA, and not c.4_5, c.6_7 or c.8_9, because there is at least one position further 3' where the exact same deletion will result in an identical mutant sequence.
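
To make the 3' rule concrete, here is a minimal Python sketch of shifting a deletion as far 3' as it will go. The function name `shift_deletion_3prime` is my own, and I'm treating the c. positions as plain 1-based indices into the example string, which is a simplification of real coding-DNA numbering.

```python
def shift_deletion_3prime(seq: str, start: int, end: int) -> tuple[int, int]:
    """Shift a deletion (1-based, inclusive coordinates) as far 3' as possible.

    Hypothetical helper for illustration; positions are plain string
    indices, not real transcript coordinates.
    """
    s, e = start - 1, end  # convert to a 0-based half-open interval [s, e)
    # Moving the deletion one base to the right gives the same mutant
    # sequence exactly when the first deleted base equals the base that
    # immediately follows the deletion.
    while e < len(seq) and seq[s] == seq[e]:
        s += 1
        e += 1
    return s + 1, e  # back to 1-based inclusive


seq = "ATGCACACACATGG"
for start in (4, 6, 8, 10):
    new_start, new_end = shift_deletion_3prime(seq, start, start + 1)
    print(f"c.{start}_{start + 1}del -> c.{new_start}_{new_end}del")
```

Under these assumptions, all four starting positions end up at the same 3'-most description.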
For the above case, normalization recommends REF be GCA and ALT be G, thus picking the most 5' position.
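
The VCF side can be sketched the same way. This is only a rough illustration of left-alignment with a single anchor base, not the wiki's actual algorithm; `left_align_deletion` is a hypothetical name, and positions are again plain 1-based indices into the example string.

```python
def left_align_deletion(seq: str, start: int, end: int) -> tuple[int, str, str]:
    """Left-align a deletion (1-based, inclusive) and return (POS, REF, ALT).

    Hypothetical helper for illustration. REF and ALT carry one anchor
    base before the deleted bases so that neither allele is empty.
    """
    s, e = start - 1, end  # convert to a 0-based half-open interval [s, e)
    # Moving the deletion one base to the left gives the same mutant
    # sequence exactly when the last deleted base equals the base that
    # immediately precedes the deletion.
    while s > 0 and seq[s - 1] == seq[e - 1]:
        s -= 1
        e -= 1
    if s == 0:
        # No preceding base is available as an anchor; not handled here.
        raise ValueError("deletion reaches the start of the sequence")
    pos = s  # 1-based position of the anchor base (0-based index s - 1)
    ref = seq[s - 1:e]  # anchor base plus deleted bases
    alt = seq[s - 1]    # anchor base only
    return pos, ref, alt


seq = "ATGCACACACATGG"
print(left_align_deletion(seq, 10, 11))
```

Starting from the HGVS-preferred c.10_11 deletion, this sketch walks back to the left-most placement and reports the GCA to G representation described above.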
I understand that VCF and HGVS nomenclature have different aims, but how do we address this as we canonicalize variant notation across the board?
Including a few people that I've seen working on variants and HGVS conventions: dandan, Jeremy Leipzig
I've only recently worked on variant analysis, so I'm ignorant, but interested. In the example, it seems that c.10_11delCA is a different variant than c.4_5delCA, c.6_7 or c.8_9. Hence the choice of which of these HGVS names to use should be determined by which of the variants actually occurred, rather than by the convention of selecting the most 3' option that leads to the same final mutated sequence. Which of the 4 possible HGVS variants occurred would be determined (or not) by the specificity of the assay, for example by using a probe that targets the c.10_11 deletion versus each of the others. In this line of thinking, there is no contradiction between the two nomenclatures if the position of the variant is fully specified. No?
In the linked wiki, to my understanding, they dismiss the possibility of an empty ALT because they say "The representation of variants in a VCF file requires that no alleles in the REF and ALT field are represented with an empty string".
The mol bio person in my lab said that it is not possible to determine the exact location of a deletion when repeats are involved, hence my assumption that the variants are equivalent. If it is indeed possible, then yes, such attribution makes sense.
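
For what it's worth, a quick Python check on the example string shows why the sequence alone cannot tell the four deletions apart. The `delete` helper is my own, with positions as plain 1-based indices.

```python
def delete(seq: str, start: int, end: int) -> str:
    """Remove bases start..end (1-based, inclusive) from seq."""
    return seq[:start - 1] + seq[end:]


seq = "ATGCACACACATGG"
# Apply each of the four candidate CA deletions to the reference string.
mutants = {f"c.{s}_{s + 1}del": delete(seq, s, s + 1) for s in (4, 6, 8, 10)}
for name, mutant in mutants.items():
    print(name, mutant)

# True when every candidate deletion yields the same mutant sequence.
print(len(set(mutants.values())) == 1)
```

All four names give the same mutant string, so at the level of the resulting sequence they describe one and the same change.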
And from what I've seen, VCF files use the deleted sequence as REF and a . as ALT for deletions, and the other way around for insertions.
If I may reply to this super old discussion simply because people are going to find this;