Plink Error: Duplicate ID 'chr:pos:A1:A2' generated by --set-missing-var-ids


6.5 years ago

jamespower ▴ 100

Hi,

I am trying to work with the 1000 Genomes data starting from the VCF file, in which many variant IDs are missing.

So, I thought of adding the flag --set-missing-var-ids @:#:\$1:\$2,

but this gives me the error:

Error: Duplicate ID generated by --set-missing-var-ids.

So I guess I first need to find the duplicate variants, identified by chr, pos, A1, and A2.

Is there a way to do this? I know there is --list-duplicate-vars, but its output does not give me usable IDs because so many of mine are missing.

Any help would be much appreciated!

plink duplicates missing id • 8.6k views



You're much better off using plink 2.0 for this operation, since (i) it lets you specify REF/ALT alleles when constructing the IDs, avoiding the duplicate indel problem, and (ii) it forces you to specify how very long allele codes should be handled instead of silently truncating them, preventing some nasty surprises later on.

It's also reasonable to use awk/cut for this, but if you do, you'll need to be careful with very long indels.
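To make the long-indel caveat concrete, here is a minimal Python sketch of the same idea as the awk/cut route. It is not plink's implementation: the function name make_variant_id and the default limit of 23 characters are assumptions chosen for illustration. It builds a CHROM:POS:REF:ALT ID from one tab-separated VCF data line and leaves the ID missing (".") when any allele is longer than the limit, rather than silently truncating it.

```python
def make_variant_id(vcf_line, max_allele_len=23):
    """Build a CHROM:POS:REF:ALT ID from one VCF data line.

    Returns "." (missing) if REF or any comma-separated ALT allele is
    longer than max_allele_len. Hypothetical helper, for illustration only.
    """
    # The first five VCF columns are CHROM, POS, ID, REF, ALT.
    chrom, pos, _old_id, ref, alt = vcf_line.rstrip("\n").split("\t")[:5]
    alleles = [ref] + alt.split(",")
    if any(len(allele) > max_allele_len for allele in alleles):
        return "."
    return f"{chrom}:{pos}:{ref}:{alt}"


print(make_variant_id("1\t100\t.\tA\tG\t.\tPASS\t."))
print(make_variant_id("1\t200\t.\t" + "A" * 30 + "\tA\t.\tPASS\t."))
```

The first call yields a normal ID, while the second has a 30-base REF allele and so keeps a missing ID.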

Reply • 6.5 years ago by chrchang523 11k


Thanks! Using plink2 with --set-missing-var-ids @:#\$r:\$a together with --new-id-max-allele-len 23 missing (where "missing" is the mode argument of that flag, not a separate word) works great!

Reply • 6.5 years ago by jamespower ▴ 100


Have you considered a simple awk or cut followed by sort | uniq -c? It might also be worth looking into various options that bcftools offers.
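For anyone who prefers a script to a shell pipeline, the cut followed by sort | uniq -c idea could be sketched in Python roughly as below. The function name duplicate_keys and the example records are made up for illustration. The input is assumed to be an iterable of tab-separated VCF lines, and a "duplicate" here means two or more records sharing the same CHROM, POS, REF and ALT.

```python
from collections import Counter


def duplicate_keys(vcf_lines):
    """Count (CHROM, POS, REF, ALT) keys and return those seen more than once."""
    counts = Counter()
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _old_id, ref, alt = line.rstrip("\n").split("\t")[:5]
        counts[(chrom, pos, ref, alt)] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Made-up records: the first two share the same chr, pos, REF and ALT.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\tPASS\t.",
    "1\t100\t.\tA\tG\t.\tPASS\t.",
    "1\t100\t.\tA\tT\t.\tPASS\t.",
]
print(duplicate_keys(example))
```

For a real file, the same function can be fed an open file handle, for example one from gzip.open(path, "rt") if the VCF is bgzipped or gzipped.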

Reply • 6.5 years ago by Ram 45k


Indeed, I deal with this issue outside plink, using BCFtools. Take a look at Step 4 here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

Reply • 6.5 years ago by Kevin Blighe 89k
