Convert Impute2 phased output files to mach format
10.2 years ago
kindlychung ▴ 60

I am trying to convert some files from IMPUTE2 phased output (haps files) to MaCH format.

Most of the lines look like this:

        --- SNP BP A1 A2 ID1HAP1   ID2HAP2 ...
        --- rs62224609 16051249 T C 0 0 0 0 0 0 0 1 0 0 1
        --- rs62224610 16051347 G C 0 1 0 1 1 0 0 1 0 0 1
        --- rs143503259 16051453 A C 0 0 0 0 0 0 0 1 0 0
        --- rs192339082 16051477 C A 0 0 0 0 0 0 0 0 0 0
        --- rs79725552 16051480 T C 0 0 0 0 0 0 0 0 0 0 0
        --- rs141578542 16051497 A G 0 1 0 1 1 0 0 1 0 0
        --- rs201906224 16051722 TA T 0 0 0 0 0 0 0 0 0 0
        --- rs2843213 16051882 C T 0 0 0 0 0 0 0 0 0 0 0
        --- rs4965031 16052080 G A 0 0 0 0 0 0 0 0 0 0 0
        --- rs6518413 16052239 A G 0 0 0 0 0 1 0 0 0 0 0

Question 1: Does 0 stand for allele 1 and 1 for allele 2? For example, in this line:

            --- rs6518413 16052239 A G 0 0 0 0 0 1 0 0 0 0 0

Does 0 stand for A and 1 for G here?

Question 2: Some lines contain multiple characters in the A1 or A2 column. What do those characters stand for? (I am guessing an indel, but I am not sure.)

    --- SNP BP A1 A2 ID1HAP1 ID2HAP2 ...
    --- chr22:16078656 16078656 G GTGTC 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    --- rs199998412 16134558 ATAACT A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    --- rs201164934 16151190 TGCCTA T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    --- rs199662619 16166919 ATATTTTCTGCACATATT A 0 0 0 0 0 0 0 0 0 0 0 0
    --- rs200691780 16197677 TAAAG T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    --- chr22:16231367 16231367 G GAGAA 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    --- rs201020033 16368171 CAGAG C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    --- chr22:16380919 16380919 A AAAAT 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    --- rs141841004 16432988 GTACT G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    --- rs200126408 16459572 TATATATAG T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

For example, in this line:

    --- chr22:16078656 16078656 G GTGTC 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

what does GTGTC stand for?

Here is a typical phased output from MaCH:

        232->232 HAPLO1 TTGACCCCGATGTGTTAAGACCGTATCACTCCACTCTTCAAGTCGGGATTGTC
        232->232 HAPLO2 TGACCCCCAACGTGCCAAGACCGTGCTGACCAGTCCTTTACACCGAAATTATT
        2921->2921 HAPLO1 TTGAACCCAACGTGCCAATGCTATGCTACCCAGCTTCCCAAGCCGAAGTTG
        2921->2921 HAPLO2 TGGCCCCCAATGTGCCAATGCTATGCTACCCAGTCCTCCGAGCTGAAGTTG
        3370->3370 HAPLO1 TTGAACCCGACGTGTTAAGACCGTATCACTCCATCCTTCAAGTCGGGACTA

For each individual there are two lines, one for each haplotype. The third column contains the alleles on that haplotype, one letter per SNP.

So, question 3: if I want to convert IMPUTE2 phased output to MaCH phased output, how should I represent variants more complicated than a single-nucleotide bi-allelic polymorphism, such as the indels above?

imputation impute2 mach • 3.2k views
10.2 years ago

Question 1: I think you are right. To be sure, count the columns after the first five (placeholder, SNP ID, position, A1, A2). If that count equals the number of samples, each number is a genotype. If it is twice the number of samples, each number is an allele.

Question 2: These are insertions/deletions (indels). Pick one variant and look it up in UCSC. For example, rs199998412 is a deletion (-/TAACT).

Question 3: 1) Why convert IMPUTE2 output to MaCH output? 2) MaCH output values are numbers between 0 and 2: A1A1 in IMPUTE2 becomes 0 in MaCH, A1A2 becomes 1, and A2A2 becomes 2.
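
To make the column check and the 0/1/2 coding concrete, here is a minimal Python sketch. It assumes that 0 means A1 and 1 means A2 (as guessed in the question), that the first five columns are the placeholder, SNP ID, position, A1 and A2, and that consecutive pairs of the remaining columns belong to the same individual. The function names and the example line are made up for illustration.

```python
def parse_haps_line(line, n_samples):
    """Split one haps line into its variant fields and per-sample allele pairs."""
    fields = line.split()
    snp_id, pos, a1, a2 = fields[1], int(fields[2]), fields[3], fields[4]
    calls = [int(x) for x in fields[5:]]
    # Phased data should have two allele columns per sample.
    if len(calls) != 2 * n_samples:
        raise ValueError(
            f"{snp_id}: expected {2 * n_samples} allele columns, found {len(calls)}"
        )
    # Assumption: columns 2k and 2k+1 are the two haplotypes of sample k.
    pairs = list(zip(calls[0::2], calls[1::2]))
    return snp_id, pos, a1, a2, pairs


def decode_alleles(pairs, a1, a2):
    """Translate 0/1 codes into allele strings (assumes 0 = A1, 1 = A2)."""
    alleles = (a1, a2)
    return [(alleles[h1], alleles[h2]) for h1, h2 in pairs]


def a2_counts(pairs):
    """Number of A2 copies per sample: A1A1 -> 0, A1A2 -> 1, A2A2 -> 2."""
    return [h1 + h2 for h1, h2 in pairs]


# Illustrative line for 5 samples (10 allele columns).
line = "--- rs6518413 16052239 A G 0 0 0 0 0 1 0 0 0 0"
snp_id, pos, a1, a2, pairs = parse_haps_line(line, n_samples=5)
print(decode_alleles(pairs, a1, a2))
print(a2_counts(pairs))
```

If `parse_haps_line` raises for your file, the columns are probably not one allele per haplotype, and the interpretation above does not apply.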


Thanks! About 3(2), MaCH also has a phased output format, in which you can use 1,2,3 and 4 to represent A, C, G and T. But I don't know how to handle indels here.
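
For the single-nucleotide variants, the conversion is essentially a transpose: haps files have one row per variant, while the MaCH phased output shown in the question has one row per haplotype. Below is a rough Python sketch of that transpose. It is not an official converter: the `ID->ID HAPLOn` layout is copied from the example in the question, the function name and sample IDs are hypothetical, 0 = A1 and 1 = A2 is assumed, and indels are simply skipped, which sidesteps the problem rather than solving it.

```python
def haps_to_mach_lines(haps_lines, sample_ids):
    """Transpose haps rows (one per variant) into one allele string per haplotype.

    Returns (output_lines, kept_snp_ids). Variants whose A1 or A2 is longer
    than one character (indels) are skipped; that is a workaround chosen for
    this sketch, not something MaCH prescribes.
    """
    n_hap = 2 * len(sample_ids)
    haplotypes = [[] for _ in range(n_hap)]
    kept = []
    for line in haps_lines:
        fields = line.split()
        snp_id, a1, a2 = fields[1], fields[3], fields[4]
        calls = fields[5:]
        if len(calls) != n_hap:
            raise ValueError(
                f"{snp_id}: expected {n_hap} allele columns, found {len(calls)}"
            )
        if len(a1) != 1 or len(a2) != 1:
            continue  # indel: no single-letter representation, so skip it
        kept.append(snp_id)
        alleles = {"0": a1, "1": a2}  # assumes 0 = A1, 1 = A2
        for hap, call in zip(haplotypes, calls):
            hap.append(alleles[call])
    out = []
    for i, sid in enumerate(sample_ids):
        for j in (0, 1):
            seq = "".join(haplotypes[2 * i + j])
            out.append(f"{sid}->{sid} HAPLO{j + 1} {seq}")
    return out, kept


# Tiny made-up input: two samples, three variants, one of them an indel.
example = [
    "--- rs1 100 T C 0 1 0 0",
    "--- rs2 200 TA T 0 0 1 0",
    "--- rs3 300 G A 1 1 0 1",
]
lines, kept = haps_to_mach_lines(example, ["232", "2921"])
print("\n".join(lines))
```

The returned `kept` list records which variants made it into the strings, so a matching SNP list can be written alongside. If the indels matter downstream, dropping them is not acceptable and a different encoding would be needed.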
