Hi Everyone,
I'm confused about how to deal with 0/1 and 1/0 when manually converting genotypes from a VCF file (I know that 0/0 is REF/REF and 1/1 is ALT/ALT). I have read other posts, like "Difference between Genotype 0|1 and 1|0 in VCF file?" and "What Does Genetype ("0/0", "0/1" Or "1/1") In *.Vcf File Represent?", however they do not describe what I need.
For example, here is part of a sample VCF that I would like to manually convert to a regular nucleotide sequence:
REF ALT Sample1
A G 0/0
C T 0/1
A T 0/0
A T 0/0
T A 0/0
A G 0/1
G C 0/1
G A 0/1
A T 0/1
C T 0/1
Is the sequence A,T,A,A,T,G,C,A,T,T, i.e. REF,ALT,REF,REF,REF,ALT,ALT,ALT,ALT,ALT? Please confirm and/or explain. Also, if the genotype is 0/1 or 1/0, do I always choose the ALT (since REF is always 0)?
No, in fact there are two sequences here, because you're looking at a DIPLOID organism.
Yes, it is diploid, you are right, but I'm wondering how these genotypes can be translated into a sequence. Can you help me untangle this?
The two haplotypes of Sample1 are best represented as:
A[CT]AAT[AG][GC][GA][AT][CT]
When one option is picked at each of the [XX] positions, the other haplotype immediately gets the other base. As you can see, this is a combinatorial problem. There are 2^6 solutions to this, one factor of 2 for each of the 6 heterozygous (0/1) sites.
All of the above assumes that the bases you've shown above are the only bases in the organism (which I highly doubt). Without a reference sequence, this operation is kind of meaningless.
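To make the combinatorics concrete, here is a minimal sketch in Python that enumerates every haplotype assignment consistent with the unphased genotypes from the question. The names sites, alleles_at and haplotype_pairs are made up for this illustration, and it assumes biallelic sites with no missing genotypes and treats the ten listed positions as the whole sequence.

```python
from itertools import product

# (REF, ALT, GT) rows copied from the question
sites = [
    ("A", "G", "0/0"),
    ("C", "T", "0/1"),
    ("A", "T", "0/0"),
    ("A", "T", "0/0"),
    ("T", "A", "0/0"),
    ("A", "G", "0/1"),
    ("G", "C", "0/1"),
    ("G", "A", "0/1"),
    ("A", "T", "0/1"),
    ("C", "T", "0/1"),
]

def alleles_at(ref, alt, gt):
    """Return the two bases carried at a site, in the order written in GT."""
    bases = (ref, alt)  # index 0 = REF, index 1 = ALT (biallelic assumption)
    first, second = gt.replace("|", "/").split("/")
    return bases[int(first)], bases[int(second)]

def haplotype_pairs(sites):
    """Yield every (hap1, hap2) assignment consistent with unphased genotypes."""
    per_site = [alleles_at(*site) for site in sites]
    het = [i for i, (a, b) in enumerate(per_site) if a != b]
    # Each heterozygous site can put either base on haplotype 1.
    for flips in product((False, True), repeat=len(het)):
        flip_at = dict(zip(het, flips))
        hap1, hap2 = [], []
        for i, (a, b) in enumerate(per_site):
            if flip_at.get(i, False):
                a, b = b, a
            hap1.append(a)
            hap2.append(b)
        yield "".join(hap1), "".join(hap2)

pairs = list(haplotype_pairs(sites))
print(len(pairs))  # 64
print(pairs[0])    # ('ACAATAGGAC', 'ATAATGCATT')
```

The first pair printed is the all-REF haplotype next to the sequence proposed in the question, but it is only one of the 64 assignments. Since swapping haplotype 1 and haplotype 2 describes the same individual, these collapse to 32 distinct unordered pairs.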
Just to add onto Ram's answer: if instead your VCF had phased variants (genotypes written with | rather than /, such as 0|1 and 1|0), then you actually get two distinct and knowable sequences that represent the two haplotypes of Sample 1. There is no combinatorial explosion; the phasing allows you to get the true sequences. (You would also combine each sequence with the reference genome, as Ram mentions, but the major point is that phasing allows truly separating the sequences.)
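As a rough illustration of the phased case, the sketch below reads the allele to the left of | as haplotype 1 and the allele to the right as haplotype 2. The phased genotypes here are invented for the example (they are not from the thread), the names phased_sites and phased_haplotypes are hypothetical, and it assumes biallelic sites that all belong to one phase block.

```python
# Hypothetical phased version of the question's table: (REF, ALT, GT).
# These GT values are made up for illustration only.
phased_sites = [
    ("A", "G", "0|0"),
    ("C", "T", "0|1"),
    ("A", "T", "0|0"),
    ("A", "T", "0|0"),
    ("T", "A", "0|0"),
    ("A", "G", "1|0"),
    ("G", "C", "0|1"),
    ("G", "A", "1|0"),
    ("A", "T", "0|1"),
    ("C", "T", "1|0"),
]

def phased_haplotypes(sites):
    """Return (hap1, hap2) from phased, biallelic genotypes."""
    hap1, hap2 = [], []
    for ref, alt, gt in sites:
        if "|" not in gt:
            raise ValueError(f"genotype {gt!r} is not phased")
        left, right = gt.split("|")
        bases = (ref, alt)  # index 0 = REF, index 1 = ALT
        hap1.append(bases[int(left)])   # left of | goes to haplotype 1
        hap2.append(bases[int(right)])  # right of | goes to haplotype 2
    return "".join(hap1), "".join(hap2)

hap1, hap2 = phased_haplotypes(phased_sites)
print(hap1)  # ACAATGGAAT
print(hap2)  # ATAATACGTC
```

With phasing, 0|1 and 1|0 are no longer interchangeable: they say which haplotype carries the ALT base, which is why exactly one pair of sequences comes out.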
Yes, thank you, this might be the best way to deal with this situation.