Keep one SNP for duplicate SNPs
7.3 years ago
janhuang.cn ▴ 230

I have converted a VCF file to PLINK bed files, and some SNP IDs are duplicated. I would like to remove the duplicates but keep one record per ID. For example, if rs1234 appears 5 times, I want to keep only one record (maybe the first one).

Right now I use --write-snplist to get the SNP list of the bed file, then use R to count how often each SNP ID occurs and to generate a list of duplicated IDs. With that list, I use --extract to get a bed file of the duplicated SNPs and --exclude to get a bed file without any duplicated SNP.

But how can I keep one record for each duplicated SNP? Also, is there a way to do the above steps in PLINK, without switching to R to generate the duplicate SNP list?

duplicate SNP • 4.7k views
7.3 years ago
prasundutta87 ▴ 670

What do you mean by duplicate SNPs? Are they reported multiple times on the same chromosome/contig, at the same position, and with the same REF and ALT alleles as well?

This post may be helpful:

How to filter out duplicate records in a vcf with bcftools?


Thank you.

I meant that the same SNP is reported multiple times in a VCF file (1000G), on the same chromosome.

One example is chr22:18496882 rs35404796, which is reported three times; the REF allele is always G, but the ALT alleles differ ("GAC", "GACACAC", "GACACACAC").

In another case, rs7410429 is reported twice with the same REF and ALT alleles but at different positions: chr22:18003597 and chr22:18004254.


Your first example is not of SNPs; they are insertions, and they are different.

I'm not sure why you would want to do what you want to do, but I would write a program to iterate through the VCF file line by line, maintain a hashset of RSIDs, and only retain lines whose RSID has not been seen previously.
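The line-by-line approach could look roughly like the following. This is a minimal sketch in Python, not a tested tool: `keep_first_per_id` is a hypothetical helper name, the ID column is compared as a whole string (so a multi-ID field such as `rs1;rs2` is treated as one key), records with a missing ID (`.`) are all kept, and the QUAL/FILTER/INFO values in the sample are placeholders. The sample records reuse the positions and alleles quoted in this thread.

```python
def keep_first_per_id(lines):
    """Yield VCF lines, keeping only the first record seen for each ID."""
    seen = set()
    for line in lines:
        if line.startswith("#"):
            yield line  # header lines pass through unchanged
            continue
        var_id = line.rstrip("\n").split("\t")[2]  # third column is ID
        if var_id != "." and var_id in seen:
            continue  # this ID was already emitted, drop the record
        seen.add(var_id)
        yield line


# Small in-memory sample; QUAL, FILTER and INFO are placeholder dots.
sample = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "22\t18496882\trs35404796\tG\tGAC\t.\t.\t.\n",
    "22\t18496882\trs35404796\tG\tGACACAC\t.\t.\t.\n",
    "22\t18496882\trs35404796\tG\tGACACACAC\t.\t.\t.\n",
    "22\t28698027\trs10656307\tA\tAAAT\t.\t.\t.\n",
    "22\t28698027\trs10656307\tA\tAAATAAT\t.\t.\t.\n",
]

kept = list(keep_first_per_id(sample))
print("".join(kept), end="")
```

For a real file, the same function can be given an open file handle instead of a list (for a compressed VCF, `gzip.open(path, "rt")`), so only the set of IDs is held in memory, not the file itself.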


I was calculating LD using --r2, but it returns "Error: Duplicate ID 'rs10656307'". This one also appears to be an insertion: the two rs10656307 records have the same chr:pos (chr22:28698027) and the same REF (A), but different ALT alleles (AAAT and AAATAAT). That is why I want to exclude duplicate records.


Oh, interesting; that's unfortunate. Well, I still recommend writing a quick program to remove the duplicate RSIDs, as I mentioned above. But if there are only a handful you could easily remove all copies of them via grep instead.
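If removing every copy is acceptable, the list of repeated IDs can be built without R. A minimal sketch in Python, assuming one ID per line as in the --write-snplist output mentioned in the question; `duplicated_ids` is a hypothetical helper name and the ID list below is a made-up sample using IDs quoted in this thread.

```python
from collections import Counter


def duplicated_ids(ids):
    """Return the IDs that occur more than once, in order of first appearance."""
    counts = Counter(i.strip() for i in ids if i.strip())
    return [var_id for var_id, n in counts.items() if n > 1]


# Made-up sample standing in for the lines of a .snplist file.
snplist = [
    "rs1234",
    "rs10656307",
    "rs7410429",
    "rs10656307",
    "rs35404796",
    "rs7410429",
]

dups = duplicated_ids(snplist)
print("\n".join(dups))
```

Written to a text file, this list is the kind of input the --exclude step described in the question expects; note that it drops all copies of each repeated ID rather than keeping one.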


It does not seem to be just a handful, and it is a large dataset. Iterating through the VCF line by line sounds very slow, but I will see if I can do that. Thanks.


Any tool that accomplishes the task would have to iterate through the VCF line by line, though :)


Have you solved the duplicate problem?

You gave an example in which rs7410429 is reported twice in the 1000 Genomes VCF file at different positions: one is chr22:18003597, the other chr22:18004254.

I ran into the same situation. I found that rs13406140 (on chromosome 2) occurs twice in the 1000 Genomes VCF, and the coordinates for the same RSID are very different:

2 90430223 rs13406140 G A 100 PASS
2 91651998 rs13406140 A G 100 PASS

I looked this up in the official 1000 Genomes Q&A and found a likely answer: "Why are there duplicate calls in the phase 3 call set" http://www.internationalgenome.org/category/variants/

I'm still in doubt about this: how can one RSID map to two different positions? Can anyone help?
