How to remove duplicate SNPs from PLINK ped/vcf files?
6.4 years ago
Tania ▴ 180

Hi everyone,

I get this error when I try to split a VCF that I generated from PLINK format:

[E::bcf_hdr_add_sample] Duplicated sample name '103_Sp676'

These are my steps to remove duplicates, however I still get the same error:

data.bed: 
        plink --file data --maf 0.05 --make-bed --out data

data.DuplicatesRemoved.bed: 
        plink --bfile data --list-duplicate-vars ids-only suppress-first
        plink --bfile data --exclude plink.dupvar --make-bed --out data.DuplicatesRemoved

data.DuplicatesRemoved.ped: 
        plink --bfile data.DuplicatesRemoved --recode --tab --out data.DuplicatesRemoved

data.DuplicatesRemoved.vcf: 
        plink --file data.DuplicatesRemoved --recode vcf --out data.DuplicatesRemoved

Any help on how to fix this?

Thanks

SNP plink • 5.3k views

I also added the plink tag to your post. That way, it may be picked up by the person who is much more experienced in plink than anyone else here on Biostars.

6.4 years ago

Your issue is a duplicate sample name, not duplicate variants. The duplicated sample is 103_Sp676.

Please try to understand why this sample is duplicated, and then manage the issue appropriately.
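
As a starting point for that check, here is a minimal sketch that lists every sample ID occurring more than once in a .fam file. It assumes the VCF sample name is the family ID and the within-family ID joined by an underscore, which is how `--recode vcf` builds it by default; the function name `list_duplicate_samples` is made up for illustration.

```shell
#!/bin/sh
# List sample IDs that occur more than once in a PLINK .fam file.
# Assumption: the VCF sample name is FID and IID joined by "_",
# which is the default behaviour of `--recode vcf`.
list_duplicate_samples() {
    # $1: path to a .fam file (column 1 = FID, column 2 = IID)
    awk '{ print $1 "_" $2 }' "$1" | sort | uniq -d
}

# Example (data.fam is the file written by --make-bed in the question):
# list_duplicate_samples data.fam
```

Each ID is printed once, however many times it is repeated, so the number of output lines is the number of distinct duplicated samples.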

Kevin


Thanks Kevin. I have many samples duplicated like this. How can I find out the reason for the duplication? Is it something to check in the data generation itself, or something computational that I should look for? Sorry if this seems naive, but I am completely new here :)


Oh hey Tania. I answered without realising it was you! I would have been nicer :)

What is the source of the data?


No worries :) Thanks a lot for helping me :)

This is SNP-array data for some patients that was handed to me a few days ago. I have to find out the reason for the phenotype they have, so I am trying to get a VCF and go from there. Each of these codes (Spxxx, Fxxx) is a patient, so I really don't know why they are duplicated, especially since the data in the ped is sometimes slightly different between the duplicates. For example, at some position it is a G in one copy, but a zero at the same position in the duplicate. So the copies are not even identical, and I can't simply remove one manually.


Hmm... maybe replicates of the same sample?
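
If they are replicates, one rough way to compare the copies is to count the missing allele calls (coded `0` in the ped) for each row. The sketch below assumes a standard .ped layout with six metadata columns followed by allele columns; `count_missing_calls` is a hypothetical helper name, not a PLINK command.

```shell
#!/bin/sh
# Count missing allele calls ("0") per row of a PLINK .ped file.
# Assumption: columns 1-6 are sample metadata and columns 7+ are alleles,
# separated by spaces and/or tabs.
count_missing_calls() {
    # $1: path to a .ped file; prints "FID IID number_of_missing_alleles"
    awk '{
        miss = 0
        for (i = 7; i <= NF; i++) if ($i == "0") miss++
        print $1, $2, miss
    }' "$1"
}

# Example:
# count_missing_calls data.DuplicatesRemoved.ped
```

Rows that share an ID but differ in their missing counts would support the idea that they are separate runs of the same sample rather than exact copies.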

For recoding as VCF, you may want to try the `vcf-fid` or `vcf-iid` modifiers of `--recode`, as mentioned here: https://www.cog-genomics.org/plink/1.9/data#recode

That will most likely produce the same issue, in which case you could update the sample IDs: https://www.cog-genomics.org/plink/1.9/data#update_indiv

For that to work smoothly, you should know the exact order of the samples in the PED file.
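
When two rows carry exactly the same IDs, an ID-based update cannot tell them apart, so an alternative is to rename by row position instead. The sketch below is a swapped-in approach, not the `--update-ids` route from the link: it rewrites the .fam file and appends a suffix to the second and later occurrences of each FID+IID pair. The `-rep` suffix and the function name `make_fam_ids_unique` are assumptions chosen for illustration.

```shell
#!/bin/sh
# Make repeated FID+IID pairs in a PLINK .fam file unique by suffixing the IID.
# Assumption: row order is left untouched, so the rewritten .fam still lines up
# with the genotype rows of the matching .bed file.
make_fam_ids_unique() {
    # $1: input .fam; the renamed copy is written to stdout
    awk '{
        key = $1 SUBSEP $2
        seen[key]++
        # First occurrence keeps its ID; later ones get -rep2, -rep3, ...
        if (seen[key] > 1) $2 = $2 "-rep" seen[key]
        print
    }' "$1"
}

# Example:
# make_fam_ids_unique data.fam > data.unique.fam
```

The renamed file could then be paired with the existing genotype files through PLINK's separate `--bed`, `--bim` and `--fam` flags, leaving the original data.fam in place as a backup.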

One wonders how they created the duplicate samples in the first place.


Thanks so much, Kevin. I will follow the links you mentioned and see how it goes.
