Forum: Variant normalization described on a wiki site directly contradicts HGVS conventions
9.4 years ago
Ram 44k

I'm researching variant nomenclature and came across two conventions that contradict each other. They are applied in somewhat different contexts, but I'd love to hear your thoughts on this.

This wiki describes a protocol for normalizing variants in VCF files using left-alignment and parsimony. This makes sense for VCF, where REF and ALT are critical components, although it is not clear why they dismiss the possibility of an empty ALT. To summarize the post, they recommend picking the shortest REF+ALT combination at the left-most position of the alignment.

For deletions, HGVS nomenclature recommends picking the most 3' of all the equivalent descriptions a variant can be assigned. For example, if ATGCACACACATGG were to lose a CA, HGVS recommends it should be c.10_11delCA, and not c.4_5delCA, c.6_7delCA or c.8_9delCA, because for each of those there is at least one position further 3' where the exact same deletion results in an identical mutant sequence.

For the above case, VCF normalization recommends POS 3 with REF GCA and ALT G, thus picking the most 5' position.
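
To make the example concrete, here is a minimal Python sketch. The helper names `delete` and `shift` are made up for illustration, and position 1 is assumed to be the first base of the sequence, as in the c. numbering above. It checks that all four CA deletions give the same mutant sequence, then slides one of them to either end of the repeat.

```python
REF = "ATGCACACACATGG"  # example sequence from the post; position 1 = first base


def delete(seq, start, end):
    """Remove the 1-based, inclusive interval start..end from seq."""
    return seq[:start - 1] + seq[end:]


def shift(seq, start, end, three_prime=True):
    """Slide a deletion as far 3' (or 5') as possible without changing the result."""
    if three_prime:
        # Moving one base 3' is neutral when the first deleted base equals
        # the base immediately after the deletion.
        while end < len(seq) and seq[start - 1] == seq[end]:
            start, end = start + 1, end + 1
    else:
        # Moving one base 5' is neutral when the base immediately before
        # the deletion equals the last deleted base.
        while start > 1 and seq[start - 2] == seq[end - 1]:
            start, end = start - 1, end - 1
    return start, end


# All four candidate deletions give one and the same mutant sequence.
mutants = {delete(REF, s, s + 1) for s in (4, 6, 8, 10)}
print(mutants)  # {'ATGCACACATGG'}

# HGVS-style: most 3' position.
s3, e3 = shift(REF, 6, 7, three_prime=True)
print(f"c.{s3}_{e3}del{REF[s3 - 1:e3]}")  # c.10_11delCA

# VCF-style: most 5' position, plus the preceding base as anchor.
s5, e5 = shift(REF, 6, 7, three_prime=False)
print(s5 - 1, REF[s5 - 2:e5], REF[s5 - 2])  # 3 GCA G
```

Both conventions describe the same event; they only differ in which end of the repeat they treat as canonical.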

I understand that VCF and HGVS nomenclature have different aims, but how do we address this as we canonicalize variant notation across the board?

Including a few people that I've seen working on variants and HGVS conventions: dandan Jeremy Leipzig

hgvs variant • 3.8k views

I've only recently worked on variant analysis, so I'm ignorant, but interested. In the example, it seems that c.10_11delCA is a different variant from c.4_5delCA, c.6_7delCA or c.8_9delCA. Hence the choice of which of these HGVS names to use should be determined by which of the variants actually occurred, rather than by the convention of selecting the most 3' option that leads to the same final mutated sequence. Whether it is possible to identify which of the 4 possible HGVS variants occurred would depend on the specificity of the assay, for example by using a probe that targets the c.10_11 deletion versus each of the others. In this line of thinking, there is no contradiction between the two nomenclatures if the position of the variant is fully specified. No?

In the linked wiki, to my understanding they dismiss the possibility of an empty ALT because they say "The representation of variants in a VCF file requires that no alleles in the REF and ALT field are represented with an empty string".
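
To illustrate that rule: a deletion written as "REF = deleted bases, ALT = empty" can be turned into a record with no empty allele by prepending the reference base just before the event. This is a small Python sketch with a made-up helper name (`pad_with_anchor`), using the example sequence from the question. For an empty REF it assumes the inserted bases go immediately before POS, which is only a guess about what the author of such a record meant.

```python
REF = "ATGCACACACATGG"  # example sequence from the question


def pad_with_anchor(ref_seq, pos, ref, alt):
    """Return (pos, ref, alt) with no empty allele.

    pos is 1-based. If either allele is empty, the reference base just
    before pos is prepended to both alleles and pos moves one base 5'.
    Assumption: an empty-REF insertion is placed immediately before pos.
    """
    if ref and alt:
        return pos, ref, alt  # already valid, nothing to do
    if pos < 2:
        raise ValueError("no preceding base available to use as anchor")
    anchor = ref_seq[pos - 2]  # base at 1-based position pos - 1
    return pos - 1, anchor + ref, anchor + alt


# Deletion of CA at positions 4-5, written with an empty ALT.
print(pad_with_anchor(REF, 4, "CA", ""))  # (3, 'GCA', 'G')
```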

The mol bio person in my lab said that it is not possible to determine the exact location of a deletion when repeats are involved, hence my assumption that the variants are equivalent. If it is indeed possible, then yes, such attribution makes sense.

And from what I've seen, VCF files use the deleted sequence as REF and a . as ALT for deletions, and the other way around for insertions.

If I may reply to this super old discussion, simply because people are going to find it:

  • When the resulting DNA sequence is the same between different deletions, you won't be able to find out which bases were actually removed, and as such, you should always describe all possibilities as the same variant. For HGVS, this is always fully 3' shifted. This is also the only way the variants can be compared to other resources.
  • VCF is in no way normalized by default, and one variant can have many different descriptions. When 5' normalizing VCF files, make sure the tool you use relies on the reference sequence to fully 5' normalize, rather than only on the context given in the VCF file, which may not be complete. HGVS is always normalized 3', and non-normalized descriptions are not valid HGVS. Comparing VCF to HGVS can only be done by normalizing the variants again after or during conversion.
  • Empty REFs and ALTs are never allowed in VCF files. Sure, empty ALTs you can deal with, but empty REFs are ambiguous, and you'll have to assume where the insertion took place (at, or after the given position?). This can cause immense problems with "invented" variants that I unfortunately have experience with receiving, even (sadly) in the clinical setting.
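
The point about normalizing against the reference can be sketched in Python as follows. This is a trim-and-extend loop in the spirit of left-alignment plus parsimony, not the code of any particular tool; the function name `normalize` is made up, it handles only a single ALT allele, and it uses the example sequence from the question as the reference.

```python
REF_SEQ = "ATGCACACACATGG"  # example reference from the question


def normalize(ref_seq, pos, ref, alt):
    """Left-align and trim one biallelic record (1-based pos) against ref_seq."""
    if ref == alt:
        raise ValueError("REF and ALT are identical, nothing to normalize")
    if ref_seq[pos - 1:pos - 1 + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence")
    changed = True
    while changed:
        changed = False
        # Trim a shared last base from both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base 5' using the
        # reference itself, not just whatever context the record carried.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend beyond the start of the reference")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Three different ways of writing the same CA deletion.
records = [(9, "ACA", "A"), (5, "ACA", "A"), (3, "GCACA", "GCA")]
for rec in records:
    print(rec, "->", normalize(REF_SEQ, *rec))  # each gives (3, 'GCA', 'G')
```

The record at POS 9 only carries ACA as context, so trimming within the record alone could never move it to POS 3; the extension step needs the bases 5' of the record, which only the reference sequence provides.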