Hello All,
We are designing a pipeline that will take phased data as input. We will ultimately be using a phased dataset provided by another group, but until it arrives we would like to practice with some phased data of our own. We have data in VCF files to which we would like to add phase information, so the output should also be in VCF format. Can anyone recommend a fast way to phase data that starts, and should end up, in VCF format? The emphasis here is on speed: we want phased dummy data as quickly as possible and are not concerned with the error rate (just this once). Thank you in advance for your comments.
Best,
Rubal
If your VCF is generated by GATK (with its read-backed phasing), phased genotypes are written with a pipe separator, e.g. 1|0 or 0|1.
See: http://www.broadinstitute.org/gsa/wiki/index.php/Read-backed_phasing_algorithm
In what format will they supply the phased data? Are you sure it is VCF? I was under the impression that VCF does not retain phase (the alleles are swappable, with no guarantee that their order is maintained).
No, VCF maintains the phase. If the two alleles of a genotype are separated by a pipe (e.g. 0|1), the genotype is phased; if they are separated by a slash (e.g. 0/1), it is unphased. See http://www.1000genomes.org/node/101
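To make the pipe-versus-slash rule concrete, here is a minimal Python sketch that reads the GT entries from one VCF data line and reports which are phased. The function names `gt_is_phased` and `sample_genotypes` and the example record are made up for illustration; the sketch assumes a tab-separated line with a FORMAT column that contains GT.

```python
def gt_is_phased(gt):
    """Return True if a GT string such as '0|1' uses only the phased separator."""
    # '|' marks a phased genotype, '/' an unphased one.
    # A haploid call such as '1' has no separator and is reported as not phased here.
    return "|" in gt and "/" not in gt


def sample_genotypes(vcf_line):
    """Extract the GT string of every sample on one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    fmt_keys = fields[8].split(":")          # FORMAT column, e.g. GT:GQ
    gt_index = fmt_keys.index("GT")
    return [sample.split(":")[gt_index] for sample in fields[9:]]


# Illustrative record with one phased and one unphased sample (invented values).
line = "20\t14370\t.\tG\tA\t29\tPASS\t.\tGT:GQ\t0|1:48\t0/1:48"
for gt in sample_genotypes(line):
    print(gt, "phased" if gt_is_phased(gt) else "unphased")
```

For real files a dedicated VCF parser is probably the safer choice; this only illustrates the convention.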
I changed the title of your question because I understood that you are asking how to get phasing data from VCF files. Please correct it if I am wrong.
I actually meant: how do I phase unphased data that is in VCF format? Sorry, I was away from this post for a while, but I am still interested in an answer.
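Since the goal is only dummy data and the error rate does not matter, one very crude option is to skip real phasing and simply rewrite each unphased diploid genotype as phased with a randomly chosen allele order. The Python sketch below does that for a single VCF line. To be clear, this is not a phasing algorithm: the resulting haplotypes are arbitrary and only useful for exercising a pipeline. The function name `dummy_phase_line` is hypothetical, and the sketch assumes tab-separated lines whose FORMAT column contains GT.

```python
import random


def dummy_phase_line(vcf_line, rng):
    """Rewrite unphased diploid GTs on one VCF line as phased, in random allele order.

    Placeholder only: the output carries no real haplotype information.
    """
    if vcf_line.startswith("#"):
        return vcf_line                      # header lines pass through unchanged
    ending = "\n" if vcf_line.endswith("\n") else ""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return vcf_line                      # no sample columns
    fmt_keys = fields[8].split(":")
    if "GT" not in fmt_keys:
        return vcf_line
    gt_index = fmt_keys.index("GT")
    for i in range(9, len(fields)):
        parts = fields[i].split(":")
        alleles = parts[gt_index].split("/")
        # Only touch fully called, unphased diploid genotypes such as 0/1.
        if len(alleles) == 2 and "." not in alleles:
            if rng.random() < 0.5:
                alleles.reverse()            # arbitrary haplotype assignment
            parts[gt_index] = "|".join(alleles)
            fields[i] = ":".join(parts)
    return "\t".join(fields) + ending


# Example use on a file (paths are placeholders):
# rng = random.Random(42)
# with open("in.vcf") as src, open("dummy_phased.vcf", "w") as dst:
#     for line in src:
#         dst.write(dummy_phase_line(line, rng))
```

Passing in a seeded `random.Random` keeps the dummy output reproducible between runs. Genotypes that are already phased, missing, or haploid are left as they are.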
I found this description to be the most helpful for understanding how phasing information is represented in a VCF file: http://gatkforums.broadinstitute.org/gatk/discussion/45/purpose-and-operation-of-read-backed-phasing
It has nice intuitive examples of what the file actually looks like for phased and unphased variants.