I converted a VCF file to PLINK bed files, and some SNPs are duplicated. I would like to remove the duplicates but keep one copy of each. For example, if rs1234 appears 5 times, I want to keep only one record (perhaps the first one).
Right now I use --write-snplist to get the SNP list of the bed file, then use R to check the frequency of each SNP and generate a list of duplicated SNPs. With that list, I use --extract to get a bed file of the duplicated SNPs, and --exclude to get a bed file without any duplicated SNPs.
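For reference, the counting step described above (listing every ID that occurs more than once) could be sketched in Python roughly as follows. This is only a sketch under assumptions: the input is taken to be a plain text file with one variant ID per line, and `duplicated_ids` and `write_duplicate_list` are hypothetical helper names, not part of PLINK.

```python
from collections import Counter


def duplicated_ids(ids):
    """Return every ID that occurs more than once, in order of first appearance."""
    cleaned = (line.strip() for line in ids)
    counts = Counter(i for i in cleaned if i)  # skip blank lines
    return [i for i, n in counts.items() if n > 1]


def write_duplicate_list(snplist_path, out_path):
    """Read one ID per line and write the duplicated IDs, one per line."""
    with open(snplist_path) as src:
        dups = duplicated_ids(src)
    with open(out_path, "w") as dst:
        dst.writelines(f"{i}\n" for i in dups)
    return len(dups)
```

The resulting file could then be passed to --extract or --exclude in the same way as the list generated in R.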
But how can I keep one record for each duplicated SNP? Also, is there a way to do the above steps in PLINK, without switching to R to generate the duplicate SNP list?
Thank you.
I meant that the same SNP is reported multiple times in a VCF file (1000G), on the same chromosome.
One example: rs35404796 at chr22:18496882 is reported three times. The REF allele is always G, but the ALT alleles differ ("GAC", "GACACAC", "GACACACAC").
Another case: rs7410429 is reported twice at different positions (chr22:18003597 and chr22:18004254), while the REF and ALT alleles are the same.
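The two kinds of duplicate described here (same position with different ALT alleles, versus the same ID at different positions) can be told apart automatically. A minimal Python sketch, assuming tab-separated VCF records with CHROM, POS, ID, REF, ALT as the first five columns; `classify_duplicates` is a hypothetical name, and the function keeps every record in memory, so it is probably best run one chromosome at a time:

```python
from collections import defaultdict


def classify_duplicates(lines):
    """Split duplicated IDs into same-position and multi-position groups."""
    records = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, vid, ref, alt = line.rstrip("\n").split("\t", 5)[:5]
        if vid == ".":
            continue  # a missing ID is not a duplicate of anything
        records[vid].append((chrom, int(pos), ref, alt))

    same_site, multi_site = {}, {}
    for vid, recs in records.items():
        if len(recs) < 2:
            continue
        positions = {(chrom, pos) for chrom, pos, _, _ in recs}
        target = same_site if len(positions) == 1 else multi_site
        target[vid] = recs
    return same_site, multi_site
```

With the two examples above, rs35404796 would end up in `same_site` and rs7410429 in `multi_site`.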
Your first example is not of SNPs; they are insertions, and they are different.
I'm not sure why you would want to do what you want to do, but I would write a program to iterate through the VCF file line by line, maintain a hashset of RSIDs, and only retain lines whose RSID has not been seen previously.
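A minimal Python sketch of that idea, assuming a tab-separated VCF with the ID in the third column. The names `dedupe_by_id`, `open_text`, and `dedupe_file` are hypothetical, and keeping every record whose ID is "." (missing) is my own assumption rather than something stated above:

```python
import gzip


def dedupe_by_id(lines):
    """Yield VCF lines, keeping only the first record seen for each ID."""
    seen = set()
    for line in lines:
        if line.startswith("#"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.split("\t", 3)
        if len(fields) < 3:
            yield line  # not a normal record; leave it alone
            continue
        vid = fields[2]
        if vid != "." and vid in seen:
            continue  # later copy of an ID already written
        seen.add(vid)
        yield line


def open_text(path, mode):
    """Open a plain or gzip-compressed file in text mode ('r' or 'w')."""
    if path.endswith(".gz"):
        return gzip.open(path, mode + "t")  # plain gzip, not bgzip
    return open(path, mode)


def dedupe_file(src_path, dst_path):
    """Stream src_path to dst_path, dropping repeated IDs."""
    with open_text(src_path, "r") as src, open_text(dst_path, "w") as dst:
        dst.writelines(dedupe_by_id(src))
```

Only the set of IDs is held in memory, so the file is streamed in a single pass. Note that for records that share an ID but have different ALT alleles, this keeps the first ALT and silently drops the others.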
I was calculating LD using --r2, but it returns Error: Duplicate ID 'rs10656307'.
It seems that these are also insertions: the two rs10656307 records have the same chr:pos (chr22:28698027) and the same REF (A), but different ALT (AAAT and AAATAAT). Therefore I want to exclude duplicate records.
Oh, interesting; that's unfortunate. Well, I still recommend writing a quick program to remove the duplicate RSIDs, as I mentioned above. But if there are only a handful you could easily remove all copies of them via grep instead.
It does not seem to be a handful, and it is a large dataset. Iterating through the VCF line by line sounds very slow, but I will see if I can do that. Thanks.
Any tool which accomplished the task would have to iterate through the VCF line by line, though :)
Have you solved the duplicate problem?
You gave an example in which rs7410429 is reported twice in the 1000 Genomes VCF file at different positions: once at chr22:18003597 and once at chr22:18004254.
I ran into the same situation. I found that rs13406140 (on chromosome 2) occurs twice in the 1000 Genomes VCF, and the coordinates for the same RSID are unbelievably different, as follows:
2 90430223 rs13406140 G A 100 PASS
2 91651998 rs13406140 A G 100 PASS
I looked this up in the official 1000 Genomes Q&A and found a possibly relevant answer: "Why are there duplicate calls in the phase 3 call set?" http://www.internationalgenome.org/category/variants/
I'm still unsure about this: how can one RSID map to two different positions? Can anyone help?