While working to get this issue fixed in VarScan, I'm attempting to generate (or rather correct from the original output) a VCF record for two samples, each with a different indel at the same position.
To make it simple, the situation is:
- First reference base: C
- Indel in sample 1: CAA -> C (loss of 2 bases)
- Indel in sample 2: CA -> C (loss of 1 base)
I know from the data that this is likely an artifact (low coverage region) but still I need to generate a proper record for it or my analysis pipeline will not work (the GATK will complain about an invalid record, see the last post in the link for more details).
How would I go about representing this in a VCF? In particular, how should I represent the REF and ALT fields? Should I split this into two records, or keep everything in one?
Thanks!
For now I'm assuming that the reference sequence is the longest (CAA), sample 1 has C as ALT allele, and sample 2 CA as ALT (so ALT is C,CA). Am I going in the right direction?
That's how I would also read the VCF spec (namely, REF=CAA and ALT=CA,C).
What about just using the comma to separate all the possible variants in the ALT column?
But in one case the reference would be CAA, and in the other CA. In both cases the deletion is represented as C, but it is the affected reference sequence that changes.
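One way to reconcile the two different reference spans is to pad every allele out to the longest REF. Here is a minimal Python sketch of that idea; the function name merge_alleles is hypothetical (it is not part of VarScan or the GATK), and it assumes all variants are anchored at the same position, so each shorter REF is a prefix of the longest one.

```python
def merge_alleles(variants):
    """Merge (ref, alt) pairs anchored at one position onto a shared REF.

    Assumption: every ref starts at the same coordinate, so each shorter
    ref must be a prefix of the longest ref.
    """
    longest_ref = max((ref for ref, _ in variants), key=len)
    merged_alts = []
    for ref, alt in variants:
        if not longest_ref.startswith(ref):
            raise ValueError(f"{ref} is not a prefix of {longest_ref}")
        # Append the reference bases that this variant's own REF did not cover.
        padded = alt + longest_ref[len(ref):]
        if padded not in merged_alts:
            merged_alts.append(padded)
    return longest_ref, merged_alts


# Sample 1: CAA -> C (loss of 2 bases); sample 2: CA -> C (loss of 1 base).
ref, alts = merge_alleles([("CAA", "C"), ("CA", "C")])
print(ref, ",".join(alts))  # prints "CAA C,CA"
```

With REF=CAA, the 1-base deletion CA -> C becomes CAA -> CA, which matches the REF=CAA, ALT=C,CA proposal above. In the genotype columns, sample 1 would then refer to ALT allele 1 and sample 2 to ALT allele 2; the zygosity has to come from the data.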
In principle there should be only one reference allele. What is the sequence of the reference genome at NCBI, for that position?
The problem is how to make it "proper" inside the VCF. The first base in the reference is C. Then we have a stretch of As. So (see my comment below) in fact it is the REF bit that should be written in a different way.