PLINK --update-name command error due to multiple SNPs with same chr:pos but different rs numbers in reference dataset
9.0 years ago
dam4l ▴ 200

Hi,

I have a data file containing autosomal SNPs imputed from the 1000 Genomes data. The SNPs in my file are named as chr:pos, but I want them to be named by rs number. I downloaded the 1000 Genomes phase 1 data from the PLINK resources site, excluded the sex chromosomes, and organized the file so that it has 2 columns: chr:pos (column 1) and the corresponding rs number (column 2). I then tried to use the PLINK --update-name command to update the SNP names in my file:

./plink --bfile my_data --update-name 1000_genomes_chrpos_rs.txt --make-bed --out my_data2

I got back the following error message:

Error: Duplicate variant ID '1:2351395' in --update-name file

In the 1000 Genomes file, this chr:pos (and likely others) corresponds to multiple rs numbers. Is there a way to fix the file or modify the PLINK command so that I can rename the SNPs in my data file from chr:pos to rs number?

Thanks so much!

PLINK SNP • 9.7k views

Hi,

I have exactly the same problem now. Did you figure out how to remove the duplicate variant IDs?

Thank you.


I am also wondering how you solved this problem.


Hi,

You can use a Unix/Linux command to rename or remove duplicated (or triplicated) values in your file. The example here assumes that you want to make column 2 unique (test it on a small file first). It appends _0, _1, _2, etc. to repeated values. For example, if your file has 2 columns:

18 15
44 16
55 15
77 15

it will be turned into:

18 15_0
44 16_0
55 15_1
77 15_2

(note that only column 2 changes). The next step in the pipe, sed 's/_0//', removes the _0 suffix and keeps _1, _2, etc.:

18 15
44 16
55 15_1
77 15_2

so the second column ends up with unique values.

The command is (I'm assuming you have 6 columns; if the number is different, remove or add $3, $4, etc.):

awk '{print $1, $2"_"x[$2]++, $3, $4, $5, $6}' update_file_name | sed 's/_0//' > result_file_name

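If you would like to remove the remaining suffixed entries (15_1, 15_2, etc.) altogether, you can extend the pipe with grep -v '_'. Note that this drops every line containing an underscore, so it only works if your original IDs have none. To make column 1 unique instead, replace $1, $2"_"x[$2]++ with $1"_"x[$1]++, $2.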
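As an alternative to adding suffixes, here is a minimal sketch that simply keeps the first line seen for each chr:pos. It assumes a whitespace-delimited two-column file with chr:pos in column 1. The file names and rs numbers are made up for the demo. Which rs number survives depends only on line order, so it is worth looking at the list of duplicated IDs first.

```shell
# Hypothetical demo input: chr:pos in column 1, rs number in column 2.
printf '%s\n' \
  '1:2351395 rs111' \
  '1:2351395 rs222' \
  '1:5000000 rs333' > demo_update_name.txt

# List the chr:pos IDs that occur more than once, for inspection.
awk '{ n[$1]++ } END { for (id in n) if (n[id] > 1) print id }' \
  demo_update_name.txt > demo_duplicated_ids.txt

# Keep only the first line for each chr:pos; later duplicates are dropped.
awk '!seen[$1]++' demo_update_name.txt > demo_update_name_unique.txt
```

The de-duplicated file can then be passed to --update-name in place of the original one.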

I hope this helps. Thanks

9.0 years ago

PLINK can't choose which rs number to use when your --update-name file has two rs numbers at the same position, so you need to fix the --update-name file before running PLINK. You could use R to merge your .bim file with the PLINK reference file and, where there are duplicates, keep the rs number whose alleles match those of your SNP.
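The same allele-matching idea can be sketched with awk instead of R. This is only an illustration under assumptions. The .bim file uses the standard six columns (chromosome, variant ID, cM, bp, allele 1, allele 2). The reference file is assumed to have four columns (chr:pos, rs number, allele 1, allele 2), a hypothetical layout you would need to build from your own reference data. File names and rs numbers are made up, and strand flips are not handled.

```shell
# Hypothetical demo .bim: chr, variant ID, cM, bp, allele 1, allele 2.
printf '%s\n' \
  '1 1:2351395 0 2351395 A G' \
  '1 1:5000000 0 5000000 C T' > demo.bim

# Hypothetical reference: chr:pos, rs number, allele 1, allele 2.
printf '%s\n' \
  '1:2351395 rs111 A C' \
  '1:2351395 rs222 G A' \
  '1:5000000 rs333 C T' > demo_ref.txt

# Pass 1 (demo.bim): store each variant's allele pair in sorted order,
# so that A/G and G/A compare as equal.
# Pass 2 (demo_ref.txt): print chr:pos and rs number when the reference
# allele pair matches, at most once per chr:pos.
awk '
  NR == FNR { a[$2] = ($5 < $6) ? $5 SUBSEP $6 : $6 SUBSEP $5; next }
  ($1 in a) && !done[$1] {
    k = ($3 < $4) ? $3 SUBSEP $4 : $4 SUBSEP $3
    if (k == a[$1]) { print $1, $2; done[$1] = 1 }
  }
' demo.bim demo_ref.txt > demo_update_name_by_allele.txt
```

With the demo data, the line for rs111 is skipped because its alleles (A/C) do not match the SNP in the .bim file (A/G). The resulting two-column file has one rs number per chr:pos and can be given to --update-name.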
