Hi, after Represent Precise Deletion In Vcf, I've got some more questions about other structural variants in VCF, so I'll try to put them all into this post.
Duplication
                      123         456
reference genome -----[           ]-------------------------------------------
                      123         456              789         1122
sample genome    -----[           ]----------------[           ]--------------
Here is an example of a duplication, but I don't know how to interpret POS and END. Would END be 456 in this example, or 1122? And what about POS?
I think with breakends it would look like this (let's say the duplication occurs on chromosome 1):
#CHROM  POS  ID  REF  ALT       QUAL  FILTER  INFO
1       788  .   .    .[1:123[  .     .       SVTYPE=BND;EVENT=DUP0
1       789  .   .    ]1:456].  .     .       SVTYPE=BND;EVENT=DUP0
But I would also like to know whether there is a simpler way to represent it.
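To make the two breakend ALT strings above easier to inspect, here is a minimal Python sketch that pulls the mate chromosome and position out of a bracketed breakend. The regular expression and the function name parse_breakend are my own and only hypothetical helpers, not part of any VCF library; the sketch only extracts the fields and does not judge whether the records describe the duplication correctly.

```python
import re

# Bracketed breakend ALT: optional bases, a bracket, chrom:pos,
# the same bracket again, optional bases (e.g. ".[1:123[" or "]1:456].").
BND_RE = re.compile(
    r"^(?P<before>[^\[\]]*)"
    r"(?P<bracket>[\[\]])"
    r"(?P<chrom>[^:\[\]]+):(?P<pos>\d+)"
    r"(?P=bracket)"
    r"(?P<after>[^\[\]]*)$"
)


def parse_breakend(alt):
    """Split a bracketed breakend ALT string into its parts (hypothetical helper)."""
    m = BND_RE.match(alt)
    if m is None:
        raise ValueError(f"not a bracketed breakend ALT: {alt!r}")
    return {
        "mate_chrom": m.group("chrom"),
        "mate_pos": int(m.group("pos")),
        "bracket": m.group("bracket"),
        "bases_before": m.group("before"),  # text to the left of the brackets
        "bases_after": m.group("after"),    # text to the right of the brackets
    }


if __name__ == "__main__":
    for alt in (".[1:123[", "]1:456]."):
        print(alt, parse_breakend(alt))
```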
Translocation
                      123         456
reference genome -----[           ]-------------------------------------------
                                                   789         1122
sample genome    ----------------------------------[           ]--------------
I think I can use a deletion entry plus the same breakend entries as above for the duplication:
#CHROM  POS  ID  REF  ALT       QUAL  FILTER  INFO
1       123  .   .    .<DEL>    .     .       SVTYPE=DEL;END=456;SVLEN=-333;EVENT=TRANS0
1       788  .   .    .[1:123[  .     .       SVTYPE=BND;EVENT=TRANS0
1       789  .   .    ]1:456].  .     .       SVTYPE=BND;EVENT=TRANS0
But maybe there is another way to store this.
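As a quick sanity check on the deletion record above, this small Python sketch splits the INFO column into key/value pairs and confirms that SVLEN equals -(END - POS), which is the convention I assumed when writing SVLEN=-333. The function names parse_info and deletion_length_consistent are hypothetical and exist only for this illustration.

```python
def parse_info(info):
    """Turn 'A=1;B=2;FLAG' into {'A': '1', 'B': '2', 'FLAG': True}."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Keys without '=' are flags and carry no value.
        fields[key] = value if sep else True
    return fields


def deletion_length_consistent(pos, info):
    """Check SVLEN == -(END - POS) for a <DEL> record (assumed convention)."""
    fields = parse_info(info)
    return int(fields["SVLEN"]) == -(int(fields["END"]) - pos)


if __name__ == "__main__":
    info = "SVTYPE=DEL;END=456;SVLEN=-333;EVENT=TRANS0"
    print(parse_info(info))
    print(deletion_length_consistent(123, info))
```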
Insertion
What if I don't know the precise sequence of the insertion? I know that I have to put <INS> into the ALT column, but what about the sequence itself? What first comes to mind is to create a new meta-information line, something like this:
##INFO=<ID=ISEQ,Number=1,Type=String,Description="Imprecise inserted sequence">
Then I can store it in the INFO column and maybe create further meta-information lines that describe the confidence in the beginning and end of this sequence:
##INFO=<ID=CINSBEGIN,Number=1,Type=Integer,Description="Confidence begin of inserted sequence">
##INFO=<ID=CINSEND,Number=1,Type=Integer,Description="Confidence end of inserted sequence">
Example:
#CHROM  POS  ID  REF  ALT     QUAL  FILTER  INFO
1       123  .   .    .<INS>  .     .       SVTYPE=INS;END=123;ISEQ=ATTCGATCA;CINSBEGIN=2;CINSEND=1
I interpret this as an insertion of one of these possible sequences: ATTCGATCA, TTCGATCA, TCGATCA, ATTCGATC, TTCGATC, TCGATC. So I am sure about the insertion of the sequence TCGATC, but there may be additional prefixes (T, AT) and suffixes (A). I hope I made it clear.
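To spell out the interpretation I have in mind, here is a minimal Python sketch that enumerates the candidate sequences from the proposed fields. ISEQ, CINSBEGIN and CINSEND are the tags I am proposing here, not existing VCF tags, and candidate_insertions is just a hypothetical helper name. The sketch assumes CINSBEGIN is the maximum number of uncertain bases at the start and CINSEND the maximum number at the end.

```python
def candidate_insertions(iseq, cinsbegin, cinsend):
    """List every sequence obtained by trimming up to `cinsbegin` bases
    from the start and up to `cinsend` bases from the end of `iseq`."""
    candidates = []
    for trim_end in range(cinsend + 1):
        for trim_start in range(cinsbegin + 1):
            end = len(iseq) - trim_end
            candidates.append(iseq[trim_start:end])
    return candidates


if __name__ == "__main__":
    for seq in candidate_insertions("ATTCGATCA", 2, 1):
        print(seq)
```

With ISEQ=ATTCGATCA, CINSBEGIN=2 and CINSEND=1 this yields the six sequences listed above; the last and shortest one, TCGATC, is the part that is certain.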
Thanks for all your help.
This is not a real answer, but you might find this blog post quite useful.
http://core-genomics.blogspot.com/2011/07/understanding-mutation-nomenclature.html
The terminology and definitions are unexpectedly complicated; it is quite surprising how many corner cases and ambiguities exist.