Question

Remove duplicate SNPs by allele frequency in PLINK

3.6 years ago

rem • 0

Summary:

I have about 2,000 SNPs that occur twice in my plink binary files. That is, they have the same chromosome and position but different alternate alleles. Using plink, I would like to keep only the variant with the highest alternate allele frequency at each duplicated chromosome:position pair. I can think of a few ways of doing this, but they are all a bit hacky. What approach do you suggest? Is there any approach that does not require scripting outside of plink?

What I've tried:

I already have a text file of these duplicates, which was generated automatically when plink2 failed to concatenate the files. I'm aware I can remove duplicates automatically using the --rm-dup flag. However, the closest mode this flag offers to what I want is 'force-first', which would only work if the SNPs were already ordered by alternate allele frequency, and I'm not sure how to do that or whether it's possible.

I also thought of calculating the allele frequencies and generating a list of SNPs to remove with --exclude, but I would only want to exclude the SNPs with a particular alternate allele, and I couldn't find how to do this either. Finally, I think I could implement this by first making the variant IDs unique using --set-missing-var-ids, which would resolve the issue with the previous approach, but this strikes me as quite a hacky solution to what seems like a simple problem.
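For concreteness, here is a rough sketch of the selection step in that last approach, done outside plink. It assumes the variant IDs have already been made unique (the chrom:pos:ref:alt IDs below are only an illustration) and that a table of chromosome, position, ID and alternate allele frequency is already available, for example from a frequency report joined with positions. The function name `ids_to_exclude` is made up for this example. On ties, the variant seen first is kept.

```python
def ids_to_exclude(records):
    """Return IDs of variants to drop, keeping the highest ALT frequency per site.

    records: iterable of (chrom, pos, var_id, alt_freq) tuples, where var_id
    is assumed to be unique per variant (hypothetical input layout).
    """
    best = {}      # (chrom, pos) -> (alt_freq, var_id) of the variant kept so far
    dropped = []   # IDs that lost to another variant at the same site
    for chrom, pos, var_id, alt_freq in records:
        key = (chrom, pos)
        if key not in best:
            best[key] = (alt_freq, var_id)
        elif alt_freq > best[key][0]:
            # New variant is more frequent: drop the one kept so far.
            dropped.append(best[key][1])
            best[key] = (alt_freq, var_id)
        else:
            # Lower or equal frequency: drop the new variant.
            dropped.append(var_id)
    return dropped


records = [
    ("1", 1000, "1:1000:A:G", 0.12),
    ("1", 1000, "1:1000:A:T", 0.03),
    ("2", 500, "2:500:C:T", 0.40),
]
to_drop = ids_to_exclude(records)
print("\n".join(to_drop))
```

The returned IDs could then be written one per line to a text file and passed to --exclude.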

Any suggestions will be much appreciated!

QC plink • 934 views

updated 3.6 years ago by chrchang523 11k • written 3.6 years ago by rem • 0

Can you join these variants (with e.g. "bcftools norm -m +any") instead? plink2 can handle multiallelic variants.

Otherwise, you're stuck with information-losing hacks.

3.6 years ago by chrchang523 11k