9.3 years ago
Matt
I am parsing a VCF file (indel not SNP) and stumbled upon this type of entry.
CHROM: 1
POSITION: 3010401
REF: GT
ALT: GGTTTTT
If ALT were GTTTTT, I would say that there is a 5 sequence insertion starting after the first G. But with the leading G, I don't understand. Is this both an insertion and a deletion? What bases are affected?
Please help!
Matt
A G at 3010401 followed by 'T' is replaced by a G at 3010401 followed by 'GTTTTT'. The VCF spec says you need to put the base 'before' the event when there's an indel.
Just a correction regarding the padding base. The VCF spec (from version 4.1 onward) says you need to add a padding base only if either the REF or ALT would otherwise be empty (i.e. a pure insertion or pure deletion). Thus the leading G in this variant is not required for padding reasons. A literal interpretation is that the variant caller has asserted that the base at position 3010401 is a G (as you note).
The notion of padding bases is actually a shortcoming of VCF, as it introduces uncertainty as to whether the caller really means what it is declaring regarding the bases actually present in the sample genome vs having added the base simply for syntactic reasons. This is becoming more important as in the clinical world people care about confidently calling locations as reference.
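To make the padding rule above concrete, here is a minimal sketch of a check for whether a shared leading base is syntactically required. The function name leading_base_is_required_padding is made up for illustration, and the logic assumes simple, single-ALT records with plain base strings.

```python
def leading_base_is_required_padding(ref: str, alt: str) -> bool:
    """Return True if the shared first base is needed to keep REF/ALT non-empty."""
    if not ref or not alt or ref[0] != alt[0]:
        return False  # no shared leading base, so nothing acts as padding
    # If dropping the first base would leave one allele empty, padding is mandatory.
    return len(ref) == 1 or len(alt) == 1


# The record from the question: dropping the leading G leaves T / GTTTTT,
# so neither allele would be empty and the G is not required as padding.
print(leading_base_is_required_padding("GT", "GGTTTTT"))  # False

# A pure insertion written with one anchor base: here the G is required.
print(leading_base_is_required_padding("G", "GGTTTT"))    # True
```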
You are saying that it is as simple as the 2 REF bases are replaced by the 7 ALT bases?
Would it be the same if REF were AGT? The 3 REF bases would be replaced? Or is this an invalid indel record?
He's saying that in VCF's world, this represents a valid insertion between a G at position 3010401 and a T at position 3010402.
The ref bases aren't replaced.
This insertion can be written a bit more sanely as
POS: 3010401
REF: G
ALT: GGTTTT
because the T at position 3010402 is completely irrelevant to position 3010401.
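The simplification described here (dropping the shared trailing T) can be sketched in a few lines of Python. This is only an illustrative trimming routine under the assumption of a single ALT allele; trim_shared_bases is a hypothetical helper, and it does not left-align the variant against the reference the way a full normalization tool would.

```python
def trim_shared_bases(pos: int, ref: str, alt: str):
    """Trim bases shared by REF and ALT, keeping at least one base in each."""
    # Drop shared trailing bases first; POS is unaffected by suffix trimming.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Drop shared leading bases; each removed base shifts POS right by one.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


print(trim_shared_bases(3010401, "GT", "GGTTTTT"))  # (3010401, 'G', 'GGTTTT')
```

After trimming, the remaining G is the anchor base and the inserted sequence is everything in ALT after it (GTTTT).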
If the Ref were AGT, the Ref would be wrong, because 3010401 is a G, not an A (presumably the reference given in your first answer is correct).
VCF makes variant calling confusing because of 2 fundamentally weird decisions. First, ALT always needs to have a base. Second, the ref can be any number of bases long, even for a single position.
Hello Matt!
It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=61540
This is typically not recommended as it runs the risk of annoying people in both communities.
My apologies
Please don't apologize. tl;dr: what you did is the best way to gain exposure. There is nothing wrong with posing the same question to multiple internet communities. This 1) increases your chance of having the question answered, 2) increases the chance that more than one valid perspective will be demonstrated, and 3) increases the chance that your question and the corresponding answers will appear in a Google search. This is especially important when using resources where valid answers cannot be upvoted, and therefore cannot increase in relevance except by word volume and links to / from answers (like seqanswers). It also seems anticompetitive for the creator of one website to ask his users not to use another.
To Pierre's credit, his response "annoyed" me enough to register and answer your remaining question.