How to bypass SNPs with identical A1 and A2 alleles in PLINK?
9.9 years ago

Hello,

I am trying to merge two imputed binary filesets (prephased with SHAPEIT, imputed with IMPUTE2) using PLINK's --bmerge command, but I get this error:

Error: Identical A1 and A2 alleles on line 1

I am fairly sure my data contain many monomorphic (single-allele) SNPs, so is there a quick way around this problem? I checked the PLINK manual, but there seems to be no option to ignore or correct such SNPs.

If possible, I would like to avoid looking for a solution in SHAPEIT or IMPUTE2, because prephasing and imputation already took a very long time to run.

Any ideas?

plink imputation merge shapeit impute2 • 4.4k views

This might be due to an incompatibility between PLINK's Oxford import and the latest IMPUTE2 output format. Can you send me a small .gen/.sample fileset that generates this problem? (You can probably omit all lines in the .gen file past the first 3-4.)

9.9 years ago

Sure:

Gen (impute2) file (these are actually the first lines of the file - SNPs in the converted plink file have a different order):

--- 1:55565:G:A 55565 G A 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1
--- 1:55582:T:C 55582 T C 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1
--- 1:55588:T:C 55588 T C 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1

Sample file:

ID_1 ID_2 missing sex 1
0 0 0 D 1
_0005432396f6cb97_FAM _0005432396f6cb97 0 2 1
_0022055d7384cd08_FAM _0022055d7384cd08 0 2 1
_0121e9eb137f1c02_FAM _0121e9eb137f1c02 0 1 1
_0187a92841fa96a2_FAM _0187a92841fa96a2 0 2 1
_018e122ad9751cf7_FAM _018e122ad9751cf7 0 2 1

I hope this helps

Yorgos


Did any of these SNPs end up with equal A1/A2 alleles? (I just tried importing and merging them, and did not have any problems.)

  • If only other SNPs were affected, could you find the corresponding .gen line for one of them?
  • If you actually did have problems with these specific SNPs, can you list the commands you used to import and merge your files?

Hi again,

No, I didn't have any problems with these specific SNPs. They were simply the first ones in the IMPUTE2 file, and I did exactly what you asked (i.e. pasted the first 3-4 lines). I guess the SNP order changed when I converted from .gen to PLINK format.

I have spotted the line in the impute2 file where the monomorphic SNP is:

1 rs12564807 734462 A A 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0
  • All the lines I've pasted here are truncated; the real ones are considerably longer and cover many more individuals.
  • I used grep and awk with a regex to build a list of such monomorphic SNPs, and then removed them with PLINK's --exclude. This does resolve the issue. I was just wondering whether PLINK has a built-in flag to ignore this problem, which, by the way, only appears when I use the --merge commands.
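For anyone who prefers a script to grep + awk, the filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the commands actually used in this thread. It assumes the .gen layout seen in the pasted lines (ID in column 2, the two allele codes in columns 4 and 5); the function name `identical_allele_ids` is made up for the example.

```python
def identical_allele_ids(gen_lines):
    """Return the IDs (column 2) of .gen records whose two allele codes match.

    Assumes whitespace-separated .gen records laid out as in this thread:
    col 1 = chromosome or '---', col 2 = SNP ID, col 3 = position,
    col 4 = allele A, col 5 = allele B, then genotype probabilities.
    """
    ids = []
    for line in gen_lines:
        # Only the first five columns matter; leave the rest unsplit.
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        snp_id, allele_a, allele_b = fields[1], fields[3], fields[4]
        if allele_a == allele_b:
            ids.append(snp_id)
    return ids
```

The returned IDs could then be written one per line to a text file and passed to PLINK's --exclude, as described above.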

There currently isn't a built-in command, since this generally indicates a data processing error that should be fixed at the source. But do you know what caused impute2 to generate a file with identical A1 and A2 allele codes here? If it's a routine occurrence, and it only happens with monomorphic SNPs, I will modify the .gen (and .bgen, if necessary) import routines to automatically zero out one of the allele codes here.
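Until something like that exists in the importer, the same idea could be approximated as a preprocessing step on the .gen file. The sketch below is only an illustration and is not PLINK's code. It assumes that setting the second allele to '0' is an acceptable way to mark it as missing, and the helper name `zero_duplicate_allele` is hypothetical.

```python
def zero_duplicate_allele(gen_line):
    """Return the .gen record with allele B set to '0' when it equals allele A.

    Note: fields are re-joined with single spaces, so any original
    whitespace layout is normalised.
    """
    fields = gen_line.split()
    if len(fields) >= 5 and fields[3] == fields[4]:
        # Assumption: '0' is read downstream as a missing allele code.
        fields[4] = "0"
    return " ".join(fields)
```

Records whose two allele codes already differ pass through with their content unchanged.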


Hi,

After looking back at all the files, I found the following:

1. The original .bim file I used as input when prephasing with SHAPEIT had this line for this specific SNP:

1    rs12564807    0    734462    0    A

2. Using that file, SHAPEIT returned this haplotype line for the SNP (only part of it is shown here):

1 rs12564807 734462 A A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

I guess the '0' allele code in the .bim file (where I expected 'G') is responsible for the issue.

Y.
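A check like the following could catch such variants before prephasing. It is a minimal sketch that assumes the standard six-column .bim layout shown above (chromosome, ID, genetic distance, position, A1, A2); the function name `variants_with_missing_allele` is invented for the example.

```python
def variants_with_missing_allele(bim_lines):
    """Return (variant ID, A1, A2) for .bim records where either allele is '0'."""
    flagged = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        variant_id, a1, a2 = fields[1], fields[4], fields[5]
        if a1 == "0" or a2 == "0":
            flagged.append((variant_id, a1, a2))
    return flagged
```

Any variant it reports has only one known allele in the input data, which appears to be the situation that produced the "A A" line in the SHAPEIT output here.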
