Manual conversion of VCF genotypes to nucleotide sequence
5 weeks ago
endretoth ▴ 40

Hi Everyone,

I'm confused about how to handle 0/1 and 1/0 when manually converting genotypes from a VCF file (I know that 0/0 is REF/REF and 1/1 is ALT/ALT). I have read other posts, such as "Difference between Genotype 0|1 and 1|0 in VCF file?" and "What Does Genetype ("0/0", "0/1" Or "1/1") In *.Vcf File Represent?", but they do not cover what I need.

For example, here is part of a sample VCF that I would like to manually convert to a regular nucleotide sequence:

REF   ALT   Sample1
A     G     0/0
C     T     0/1
A     T     0/0
A     T     0/0
T     A     0/0
A     G     0/1
G     C     0/1
G     A     0/1
A     T     0/1
C     T     0/1

Is the sequence A,T,A,A,T,G,C,A,T,T, i.e. REF,ALT,REF,REF,REF,ALT,ALT,ALT,ALT,ALT? Please confirm and/or explain. Also, if the genotype is 0/1 or 1/0, do I always choose the ALT (since REF is always 0)?

vcf nucleotide sequence • 752 views
ADD COMMENT
1

Is the sequence: A,T,A,A,T,G,C,A,T,T ?

no, in fact there are two sequences here, because you're looking at a DIPLOID organism.

ADD REPLY
0

Yes, it is diploid, you are right, but I'm wondering how a sequence representing these bases can be written out. Can you help me untangle this?

ADD REPLY
2

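The two Sample1 haplotypes are best represented as: A[CT]AAT[AG][GC][GA][AT][CT]. When one option is picked at each of the [XX] positions, the other haplotype automatically gets the other base.

As you can see, this is a combinatorial problem. With 6 heterozygous sites there are 2^6 = 64 ways to assign the bases to the two haplotypes (32 distinct haplotype pairs, since swapping the two haplotypes gives the same pair).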

All of the above assumes that the bases you've shown above are the only bases in the organism (which I highly doubt). Without a reference sequence, this operation is kind of meaningless.
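
To make the counting concrete, here is a minimal Python sketch that builds the bracket notation and enumerates every haplotype pair consistent with the unphased genotypes. The `records` list is just the example table from the question typed in by hand, and the function names are made up for illustration; the sketch assumes biallelic SNPs with no missing genotypes.

```python
from itertools import product

# (REF, ALT, GT) rows copied from the example table in the question
records = [
    ("A", "G", "0/0"), ("C", "T", "0/1"), ("A", "T", "0/0"),
    ("A", "T", "0/0"), ("T", "A", "0/0"), ("A", "G", "0/1"),
    ("G", "C", "0/1"), ("G", "A", "0/1"), ("A", "T", "0/1"),
    ("C", "T", "0/1"),
]

def site_alleles(ref, alt, gt):
    """Bases allowed at a site by an unphased, biallelic genotype."""
    alleles = (ref, alt)
    called = [alleles[int(i)] for i in gt.split("/")]
    return list(dict.fromkeys(called))  # drop duplicates, keep order

def bracket_notation(records):
    """Write homozygous sites as one base and heterozygous sites as [XY]."""
    parts = []
    for ref, alt, gt in records:
        bases = site_alleles(ref, alt, gt)
        parts.append(bases[0] if len(bases) == 1 else "[" + "".join(bases) + "]")
    return "".join(parts)

def haplotype_pairs(records):
    """Yield every (hap1, hap2) assignment consistent with the genotypes."""
    options = []
    for ref, alt, gt in records:
        bases = site_alleles(ref, alt, gt)
        if len(bases) == 1:
            options.append([(bases[0], bases[0])])
        else:
            a, b = bases
            options.append([(a, b), (b, a)])  # either haplotype may carry either base
    for combo in product(*options):
        yield "".join(x for x, _ in combo), "".join(y for _, y in combo)

notation = bracket_notation(records)
pairs = list(haplotype_pairs(records))
unordered = {frozenset(p) for p in pairs}  # ignore which haplotype is "first"

print(notation)
print(len(pairs), "ordered assignments,", len(unordered), "distinct pairs")
```

Without phasing information there is no way to tell which of these pairs is the real one, which is why a single "the sequence" does not exist for this sample.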

ADD REPLY
1

just to add onto Ram's answer, if instead your VCF had phased variants and looked like this

REF   ALT   Sample1
A     G     0|0
C     T     0|1
A     T     0|0
A     T     0|0
T     A     0|0
A     G     0|1
G     C     0|1
G     A     0|1
A     T     0|1
C     T     0|1

then you actually get two distinct and knowable sequences that basically represent the two haplotypes of Sample 1:

ACAATAGGAC
ATAATGCATT

there is no combinatorial explosion: the phasing lets you recover the true sequences (you would also combine these sequences with the reference genome, as Ram mentions, but the major point is that phasing allows the two sequences to be truly separated)
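
A minimal Python sketch of the phased case, assuming biallelic SNPs, no missing genotypes, and that all sites belong to the same phase set (phase is generally only meaningful within one phased block). The `phased_records` list is the phased table above typed in by hand, and `phased_haplotypes` is a made-up helper name.

```python
# (REF, ALT, GT) rows copied from the phased example table
phased_records = [
    ("A", "G", "0|0"), ("C", "T", "0|1"), ("A", "T", "0|0"),
    ("A", "T", "0|0"), ("T", "A", "0|0"), ("A", "G", "0|1"),
    ("G", "C", "0|1"), ("G", "A", "0|1"), ("A", "T", "0|1"),
    ("C", "T", "0|1"),
]

def phased_haplotypes(records):
    """Split phased genotypes into the two haplotype sequences."""
    hap1, hap2 = [], []
    for ref, alt, gt in records:
        alleles = (ref, alt)           # index 0 = REF, index 1 = ALT
        left, right = gt.split("|")    # left allele -> haplotype 1, right -> haplotype 2
        hap1.append(alleles[int(left)])
        hap2.append(alleles[int(right)])
    return "".join(hap1), "".join(hap2)

hap1, hap2 = phased_haplotypes(phased_records)
print(hap1)  # ACAATAGGAC
print(hap2)  # ATAATGCATT
```

Note that 0|1 and 1|0 now mean different things: swapping them at a single site would move the ALT base to the other haplotype.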

ADD REPLY
0

Yes, thank you, this might be the best way to deal with this situation.

ADD REPLY
1
5 weeks ago
dsull ★ 7.3k

I’ve done this conversion before for mouse strains (like lifting over the C57BL/6J genome to castaneus). You just ignore the heterozygous records.

You could just use an existing tool for this purpose btw.

ADD COMMENT
0

By "ignore", do you mean using REF for all 0/1 and 1/0 genotypes?

ADD REPLY
0

Yes, that’s the same thing, is it not? (I’m assuming you are doing the conversion from the ref genome).
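
As a rough Python sketch of that rule (take ALT only where the sample is homozygous ALT, keep REF everywhere else), assuming biallelic sites. The `records` list is the example table from the question and `homozygous_only_sequence` is a hypothetical helper name, not part of any existing tool.

```python
# (REF, ALT, GT) rows copied from the example table in the question
records = [
    ("A", "G", "0/0"), ("C", "T", "0/1"), ("A", "T", "0/0"),
    ("A", "T", "0/0"), ("T", "A", "0/0"), ("A", "G", "0/1"),
    ("G", "C", "0/1"), ("G", "A", "0/1"), ("A", "T", "0/1"),
    ("C", "T", "0/1"),
]

def homozygous_only_sequence(records):
    """Use ALT only for homozygous-ALT calls; heterozygous calls fall back to REF."""
    seq = []
    for ref, alt, gt in records:
        called = set(gt.replace("|", "/").split("/"))  # accept phased or unphased GT
        seq.append(alt if called == {"1"} else ref)
    return "".join(seq)

consensus = homozygous_only_sequence(records)
print(consensus)  # ACAATAGGAC
```

For the example table this just returns the REF bases, because the sample has no 1/1 calls. The heterozygous information is discarded, so this is a simplification rather than a faithful representation of a diploid sample.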

ADD REPLY
0

My VCF file is from a de novo assembly; unfortunately, I do not have any reference at all.

ADD REPLY
1
4 weeks ago
cmdcolin ★ 4.2k

as other comments have mentioned, this type of operation involves various considerations, assumptions, and simplifications, but all that said:

you can "apply" a VCF to a "reference genome" using: bcftools consensus https://samtools.github.io/bcftools/bcftools.html#consensus

ADD COMMENT
0

Thank you for the suggestion; however, the problem is that my species doesn't have a reference genome, which is why we used de novo assembly.

ADD REPLY
0

it is unclear to me what you mean by this. a VCF file will have coordinates that indicate the positions of variants relative to some sort of "reference". that reference can be your de novo assembly.
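
To illustrate the idea of coordinates relative to a reference, here is a toy Python sketch that applies SNVs to contig sequences by their CHROM and POS. It is only an illustration of the concept, not how bcftools consensus is implemented; the contig name, sequence, and variants are invented for the example, and it handles single-base substitutions only (no indels).

```python
def apply_snvs(reference, variants):
    """Substitute ALT bases into reference contigs.

    reference: dict mapping contig name -> sequence (e.g. de novo assembly contigs)
    variants:  iterable of (contig, pos, ref, alt) with 1-based POS, as in VCF
    """
    seqs = {name: list(seq) for name, seq in reference.items()}
    for contig, pos, ref, alt in variants:
        idx = pos - 1  # VCF positions are 1-based, Python indices are 0-based
        if seqs[contig][idx] != ref:
            raise ValueError(f"REF mismatch at {contig}:{pos}")
        seqs[contig][idx] = alt
    return {name: "".join(bases) for name, bases in seqs.items()}

# Hypothetical assembly contig and variants, for illustration only
assembly = {"contig1": "GGACGTTA"}
variants = [("contig1", 3, "A", "T"), ("contig1", 6, "T", "C")]

result = apply_snvs(assembly, variants)
print(result["contig1"])  # GGTCGCTA
```

The REF check is the useful part: if the REF column of the VCF does not match the sequence at that position, the VCF was called against a different reference than the one being edited.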

ADD REPLY
