Question

Keep one SNP for duplicate SNPs

7.3 years ago

janhuang.cn ▴ 230

I have converted a VCF file to PLINK bed files, and some SNPs appear more than once. I would like to remove the duplicates but keep one record per SNP. For example, if rs1234 appears 5 times, I want to keep only one record (say, the first one).

Right now I use --write-snplist to get the SNP list of the bed file, then use R to count how often each SNP ID occurs and to generate a list of the duplicated IDs. With that list, I use --extract to get a bed file containing only the duplicated SNPs, and --exclude to get a bed file without any of them.
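The counting step can also be done with a few lines of Python instead of R. This is a minimal sketch, assuming the input is either a one-ID-per-line file such as the one from --write-snplist, or a .bim file (where the variant ID is the second column). The function name `duplicate_ids` and the `id_column` parameter are illustrative and not part of plink.

```python
from collections import Counter


def duplicate_ids(lines, id_column=1):
    """Return variant IDs that occur more than once, in first-seen order.

    id_column is the 0-based column holding the ID:
    1 for a .bim file, 0 for a one-ID-per-line SNP list.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) <= id_column:
            continue  # skip blank or malformed lines
        counts[fields[id_column]] += 1
    # Counter keeps insertion order, so IDs come out in first-seen order
    return [vid for vid, n in counts.items() if n > 1]


# Made-up .bim-style rows (chromosome, ID, cM, bp, allele 1, allele 2)
example = [
    "22 rs1234 0 100 A G",
    "22 rs1234 0 100 A T",
    "22 rs5678 0 200 C T",
]
print(duplicate_ids(example))  # prints ['rs1234']
```

Written one per line to a file, the returned IDs can be used with --extract and --exclude in the same way as the list generated in R.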

But how can I keep one record for each duplicated SNP? Also, is there a way to do the above steps in plink, without switching to R to generate the duplicate SNP list?

duplicate SNP • 4.7k views

ADD COMMENT • link updated 6.9 years ago by Biostar 20 • written 7.3 years ago by janhuang.cn ▴ 230

score 0 · Answer 1 · 2017-08-23

7.3 years ago

prasundutta87 ▴ 670

What do you mean by duplicate SNPs? Are they reported multiple times on the same chromosome/contig, at the same position and with the same REF and ALT alleles as well?

This post may be helpful:

How to filter out duplicate records in a vcf with bcftools?

ADD COMMENT • link 7.3 years ago by prasundutta87 ▴ 670

Thank you.

I meant that the same SNP was reported multiple times in a VCF file (1000G), on the same chromosome.

One example: chr22:18496882 rs35404796 is reported three times; the REF allele is always G, but the ALT alleles differ ("GAC", "GACACAC", "GACACACAC").

Another case: rs7410429 is reported twice with different chr:pos (chr22:18003597 and chr22:18004254), while the REF and ALT are the same.
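These two cases can be told apart automatically. A rough sketch, assuming a tab-separated VCF with the standard CHROM, POS, ID, REF, ALT column order; the function name and the two labels are made up for illustration. The REF/ALT values for rs7410429 in the example are placeholders, since only its positions are given here.

```python
from collections import defaultdict


def classify_duplicate_ids(vcf_lines):
    """Label each repeated ID as 'same_site' or 'different_sites'."""
    sites_by_id = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[2] == ".":
            continue  # malformed record or missing ID
        sites_by_id[fields[2]].append((fields[0], fields[1]))
    labels = {}
    for vid, sites in sites_by_id.items():
        if len(sites) < 2:
            continue  # ID occurs once, not a duplicate
        # One distinct (chrom, pos) means the records share a site
        labels[vid] = "same_site" if len(set(sites)) == 1 else "different_sites"
    return labels


records = [
    "22\t18496882\trs35404796\tG\tGAC\n",
    "22\t18496882\trs35404796\tG\tGACACAC\n",
    "22\t18496882\trs35404796\tG\tGACACACAC\n",
    "22\t18003597\trs7410429\tA\tG\n",  # placeholder alleles
    "22\t18004254\trs7410429\tA\tG\n",  # placeholder alleles
]
print(classify_duplicate_ids(records))
```

IDs labelled same_site are typically several alleles listed at one position, while different_sites means one ID has been attached to more than one coordinate.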

ADD REPLY • link 7.3 years ago by janhuang.cn ▴ 230

Your first example is not of SNPs; they are insertions, and they are different.

I'm not sure why you would want to do what you want to do, but I would write a program to iterate through the VCF file line by line, maintain a hashset of RSIDs, and only retain lines whose RSID has not been seen previously.
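A minimal sketch of that approach in Python, assuming a tab-separated VCF with the ID in the third column. The names `keep_first_by_id`, `open_text` and `dedup_vcf` are illustrative. Records whose ID is "." are all kept here, which is an assumption rather than something specified above, and an ID field holding several semicolon-separated IDs is compared as a single string.

```python
import gzip


def keep_first_by_id(vcf_lines):
    """Yield header lines plus the first record seen for each ID."""
    seen = set()  # IDs already written out
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # header lines pass through untouched
            continue
        fields = line.split("\t", 3)
        vid = fields[2] if len(fields) > 2 else "."
        if vid != ".":  # assumption: missing IDs are never treated as duplicates
            if vid in seen:
                continue  # later copy of an ID we already kept
            seen.add(vid)
        yield line


def open_text(path, mode):
    """Open a plain or gzip-compressed file in text mode ('r' or 'w')."""
    if path.endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)


def dedup_vcf(in_path, out_path):
    """Stream in_path to out_path, keeping one record per ID."""
    with open_text(in_path, "r") as src, open_text(out_path, "w") as dst:
        dst.writelines(keep_first_by_id(src))


example = [
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "22\t18496882\trs35404796\tG\tGAC\n",
    "22\t18496882\trs35404796\tG\tGACACAC\n",
    "22\t18496882\trs35404796\tG\tGACACACAC\n",
]
for line in keep_first_by_id(example):
    print(line, end="")
```

The file is streamed, so memory use grows only with the number of distinct IDs, not with the size of the genotype columns. Output written to a .gz path this way is ordinary gzip rather than bgzip, so it may need recompressing before indexing.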

ADD REPLY • link 7.3 years ago by Brian Bushnell 20k

I was calculating LD using --r2, but it returns Error: Duplicate ID 'rs10656307'. This one also seems to be an insertion: the two rs10656307 records have the same chr:pos (chr22:28698027) and the same REF (A), but different ALT (AAAT and AAATAAT). That is why I want to exclude duplicate records.

ADD REPLY • link 7.3 years ago by janhuang.cn ▴ 230

Oh, interesting; that's unfortunate. Well, I still recommend writing a quick program to remove the duplicate RSIDs, as I mentioned above. But if there are only a handful you could easily remove all copies of them via grep instead.

ADD REPLY • link 7.3 years ago by Brian Bushnell 20k

It does not seem to be just a handful, and it is a large dataset. Iterating through the VCF line by line sounds very slow, but I will see if I can do that. Thanks.

ADD REPLY • link 7.3 years ago by janhuang.cn ▴ 230

Any tool that accomplishes the task would have to iterate through the VCF line by line, though :)

ADD REPLY • link 7.3 years ago by Brian Bushnell 20k

Have you solved the duplicate problem?

You gave an example in which rs7410429 was reported twice in the 1000 Genomes VCF file with different chr:pos: chr22:18003597 and chr22:18004254.

I ran into the same situation. I found that rs13406140 (on chromosome 2) occurs twice in the 1000 Genomes VCF, and the coordinates for the same RSID are surprisingly far apart:

2 90430223 rs13406140 G A 100 PASS
2 91651998 rs13406140 A G 100 PASS

I looked this up in the official 1000 Genomes Q&A and found a possibly relevant answer: Why are there duplicate calls in the phase 3 call set http://www.internationalgenome.org/category/variants/

I'm still unsure about this: how can one RSID map to two different positions? Can anyone help?

ADD REPLY • link 6.3 years ago by keryruo ▴ 20