I am trying to convert some files from IMPUTE2 phased output (haps files) to another format.
Most of the lines look like this:
--- SNP BP A1 A2 ID1HAP1 ID2HAP2 ...
--- rs62224609 16051249 T C 0 0 0 0 0 0 0 1 0 0 1
--- rs62224610 16051347 G C 0 1 0 1 1 0 0 1 0 0 1
--- rs143503259 16051453 A C 0 0 0 0 0 0 0 1 0 0
--- rs192339082 16051477 C A 0 0 0 0 0 0 0 0 0 0
--- rs79725552 16051480 T C 0 0 0 0 0 0 0 0 0 0 0
--- rs141578542 16051497 A G 0 1 0 1 1 0 0 1 0 0
--- rs201906224 16051722 TA T 0 0 0 0 0 0 0 0 0 0
--- rs2843213 16051882 C T 0 0 0 0 0 0 0 0 0 0 0
--- rs4965031 16052080 G A 0 0 0 0 0 0 0 0 0 0 0
--- rs6518413 16052239 A G 0 0 0 0 0 1 0 0 0 0 0
Question 1: Does 0 stand for allele 1 and 1 for allele 2? For example, in this line:
--- rs6518413 16052239 A G 0 0 0 0 0 1 0 0 0 0 0
Does 0 stand for A and 1 for G here?
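To make question 1 concrete, here is a minimal Python sketch of the decoding being asked about. It assumes that 0 means the A1 allele and 1 means the A2 allele, which is exactly the point to be confirmed; the function name `decode_haps_line` is made up for illustration.

```python
def decode_haps_line(line):
    """Split one haps line and map the 0/1 codes to allele strings.

    ASSUMPTION (the subject of question 1): 0 -> A1 and 1 -> A2.
    """
    fields = line.split()
    # Columns: placeholder, SNP ID, position, A1, A2, then one code per haplotype.
    snp_id, position, a1, a2 = fields[1], int(fields[2]), fields[3], fields[4]
    code_to_allele = {"0": a1, "1": a2}
    # Any code other than "0" or "1" raises KeyError instead of being guessed.
    alleles = [code_to_allele[code] for code in fields[5:]]
    return snp_id, position, alleles


line = "--- rs6518413 16052239 A G 0 0 0 0 0 1 0 0 0 0 0"
snp_id, position, alleles = decode_haps_line(line)
print(snp_id, position, "".join(alleles))
```

Under that assumption, the sixth haplotype in this line would carry G and all the others A.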
Question 2: Some lines contain multiple characters in the A1 or A2 column. What do those characters stand for? (I am guessing they are indels, but I am not sure.)
--- SNP BP A1 A2 ID1HAP1 ID2HAP2 ...
--- chr22:16078656 16078656 G GTGTC 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
--- rs199998412 16134558 ATAACT A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
--- rs201164934 16151190 TGCCTA T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
--- rs199662619 16166919 ATATTTTCTGCACATATT A 0 0 0 0 0 0 0 0 0 0 0 0
--- rs200691780 16197677 TAAAG T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
--- chr22:16231367 16231367 G GAGAA 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
--- rs201020033 16368171 CAGAG C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
--- chr22:16380919 16380919 A AAAAT 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
--- rs141841004 16432988 GTACT G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
--- rs200126408 16459572 TATATATAG T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
For example, in this line:
--- chr22:16078656 16078656 G GTGTC 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
what does GTGTC stand for?
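The indel guess can be expressed as a simple length comparison. The sketch below is only an illustration of that guess: it assumes the allele strings follow a VCF-like convention in which the longer allele contains the extra bases, and it treats A1 as the baseline allele, which may not be correct. `classify_variant` is a hypothetical helper name.

```python
def classify_variant(a1, a2):
    """Label a variant by comparing the lengths of its two allele strings.

    ASSUMPTION: multi-character alleles are indels written VCF-style,
    and A1 is taken as the baseline when naming insertion vs. deletion.
    """
    if len(a1) == 1 and len(a2) == 1:
        return "SNP"
    if len(a2) > len(a1):
        return "insertion"  # A2 carries extra bases relative to A1
    if len(a1) > len(a2):
        return "deletion"   # A2 lacks bases that are present in A1
    return "other"          # same length, longer than one base


print(classify_variant("G", "GTGTC"))
print(classify_variant("ATAACT", "A"))
```

Under this reading, `G GTGTC` would be a 4-base insertion (TGTC after the G) and `ATAACT A` a 5-base deletion.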
Here is a typical phased output from mach:
232->232 HAPLO1 TTGACCCCGATGTGTTAAGACCGTATCACTCCACTCTTCAAGTCGGGATTGTC
232->232 HAPLO2 TGACCCCCAACGTGCCAAGACCGTGCTGACCAGTCCTTTACACCGAAATTATT
2921->2921 HAPLO1 TTGAACCCAACGTGCCAATGCTATGCTACCCAGCTTCCCAAGCCGAAGTTG
2921->2921 HAPLO2 TGGCCCCCAATGTGCCAATGCTATGCTACCCAGTCCTCCGAGCTGAAGTTG
3370->3370 HAPLO1 TTGAACCCGACGTGTTAAGACCGTATCACTCCATCCTTCAAGTCGGGACTA
For each individual, there are two lines, one line for each haplotype. The third column contains genotype data, one letter for each SNP.
Question 3: If I want to convert IMPUTE2 phased output to MaCH phased output, how should I represent variants that are more complicated than a simple bi-allelic SNP?
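For reference, the conversion itself amounts to transposing the haps matrix, as in the rough Python sketch below. It makes several assumptions that are not confirmed anywhere above: 0 means A1 and 1 means A2, the haplotype columns are ordered as individual 1 haplotype 1, individual 1 haplotype 2, and so on, and the sample IDs come from a separate source. It simply skips any variant whose alleles are longer than one base, which is a placeholder and not an answer to the question. The function name, the sample IDs, and the shortened input lines are all illustrative.

```python
def haps_to_mach(haps_lines, sample_ids):
    """Transpose haps lines (one per variant) into MaCH-style haplotype lines."""
    columns = []  # one list of allele letters per retained SNP
    for line in haps_lines:
        fields = line.split()
        a1, a2, codes = fields[3], fields[4], fields[5:]
        if len(a1) != 1 or len(a2) != 1:
            # Placeholder handling: one letter per marker cannot hold an indel.
            continue
        if len(codes) != 2 * len(sample_ids):
            raise ValueError(f"unexpected number of haplotypes at {fields[1]}")
        # ASSUMPTION: 0 -> A1, 1 -> A2.
        columns.append([a1 if code == "0" else a2 for code in codes])

    mach_lines = []
    for i, sample in enumerate(sample_ids):
        for hap in (0, 1):
            # ASSUMPTION: columns 2*i and 2*i+1 belong to individual i.
            sequence = "".join(col[2 * i + hap] for col in columns)
            mach_lines.append(f"{sample}->{sample} HAPLO{hap + 1} {sequence}")
    return mach_lines


# Toy input: lines from above, shortened to four haplotypes (two individuals).
toy_haps = [
    "--- rs62224609 16051249 T C 0 0 0 0",
    "--- rs62224610 16051347 G C 0 1 0 1",
    "--- rs201906224 16051722 TA T 0 0 0 0",  # skipped by this sketch
]
for mach_line in haps_to_mach(toy_haps, ["ID1", "ID2"]):
    print(mach_line)
```

The third toy line is dropped here, which is exactly the loss of information the question is about.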
Thanks! About 3(2): MaCH also has a phased output format in which 1, 2, 3 and 4 represent A, C, G and T, but I don't know how to handle indels in it.