5.8 years ago
jamespower
Hi,
I am trying to work with the 1000Genomes data starting from the vcf file, which has missing IDs.
So, I thought of adding the flag --set-missing-var-ids @:#:\$1:\$2, but this gives me the error:
Error: Duplicate ID generated by --set-missing-var-ids.
So I guess first we want to find the duplicate IDs, but by chr, pos, A1, and A2.
Is there a way to do this? I know there is a --list-duplicate-vars flag, but this does not give me IDs since I have many missing IDs.
Any help would be very appreciated!
You're much better off using plink 2.0 for this operation, since (i) it lets you specify REF/ALT alleles when constructing the IDs, avoiding the duplicate indel problem, and (ii) it forces you to specify how very long allele codes should be handled instead of silently truncating them, preventing some nasty surprises later on.
It's also reasonable to use awk/cut for this, but if you do, you'll need to be careful with very long indels.
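For the awk route, here is a minimal sketch, assuming a tab-separated VCF with the standard column order (CHROM, POS, ID, REF, ALT). The file name toy.vcf and the variable max_len are made up for illustration. Records whose ID is "." get chr:pos:REF:ALT, and records with an allele longer than max_len are left missing, which loosely imitates the "missing" mode of plink2's --new-id-max-allele-len rather than reproducing it exactly.

```shell
# Build a small toy VCF (hypothetical data) in a temporary directory.
tmpdir=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t100\t.\tA\tG\t.\tPASS\t.\n'
  printf '1\t200\trs123\tC\tT\t.\tPASS\t.\n'
  printf '1\t300\t.\tAT\tA\t.\tPASS\t.\n'
  printf '1\t300\t.\tATT\tA\t.\tPASS\t.\n'
  printf '1\t400\t.\tACGTACGTACGTACGTACGTACGTACGT\tA\t.\tPASS\t.\n'
} > "$tmpdir/toy.vcf"

max_len=23   # hypothetical cutoff for allele length

awk -v max="$max_len" 'BEGIN { FS = OFS = "\t" }
  /^#/ { print; next }                     # pass header lines through unchanged
  $3 == "." && length($4) <= max && length($5) <= max {
    $3 = $1 ":" $2 ":" $4 ":" $5           # chr:pos:REF:ALT
  }
  { print }' "$tmpdir/toy.vcf" > "$tmpdir/named.vcf"

# Show the resulting ID column.
grep -v '^#' "$tmpdir/named.vcf" | cut -f3
```

The two indels at position 300 end up with distinct IDs because REF and ALT are both part of the name, while the record with the 28-base allele keeps its missing ID and would need separate handling.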
Thanks! Using plink2 with --set-missing-var-ids @:#\$r:\$a and --new-id-max-allele-len 23 missing works great!
Have you considered a simple awk or cut followed by sort | uniq -c? It might also be worth looking into various options that bcftools offers.
Indeed, I deal with this issue outside plink and using BCFtools. Take a look at Step 4, here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2
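The cut followed by sort | uniq -c idea from the comments can be sketched as follows, assuming a tab-separated VCF with the standard column order. The toy records and the file name toy.vcf are made up for illustration. It lists every chr, pos, REF, ALT combination that occurs more than once, regardless of whether the ID column is filled in.

```shell
# Build a small toy VCF (hypothetical data) containing one exact duplicate.
tmpdir=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t100\t.\tA\tG\t.\tPASS\t.\n'
  printf '1\t300\t.\tAT\tA\t.\tPASS\t.\n'
  printf '1\t300\t.\tAT\tA\t.\tPASS\t.\n'
  printf '1\t300\t.\tATT\tA\t.\tPASS\t.\n'
  printf '2\t100\t.\tA\tG\t.\tPASS\t.\n'
} > "$tmpdir/toy.vcf"

# Keep CHROM, POS, REF, ALT; count identical rows; report those seen more than once.
dups=$(grep -v '^#' "$tmpdir/toy.vcf" \
  | cut -f1,2,4,5 \
  | sort \
  | uniq -c \
  | awk '$1 > 1 { print $1, $2 ":" $3 ":" $4 ":" $5 }')

echo "$dups"
```

Each output line is a count followed by the chr:pos:REF:ALT key. Variants that share a position but differ in their alleles, such as the two indels at 1:300 here, are not reported as duplicates.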