Summary:
I have about 2,000 SNPs that each occur twice in my plink binary files. That is, they have the same chromosome and position, but different alternate alleles. Using plink, I would like to keep only the one with the highest alternate allele frequency for each identical chromosome:position pair. I can think of a few ways of doing this, but they are all a bit hacky. What approach do you suggest for this problem? Is there any approach that does not require scripting outside of plink?
What I've tried:
I already have a text file of these duplicates that was automatically generated when plink2 failed to concatenate the files. I'm aware I can remove duplicates automatically using the --rm-dup flag. However, the closest option this flag has to my desired behaviour is 'force-first', which would only work if the SNPs were already ordered by alternate allele frequency, and I'm not sure how to do this or whether it's possible.
I also thought of calculating the allele frequencies and generating a list of SNPs to remove using --exclude, but I would only want to exclude SNPs with a particular alternate allele, and I couldn't find how to do this either. Finally, I think I could implement this by first making the variant IDs unique using --set-missing-var-ids, which would resolve the issue with the previous approach, but this strikes me as quite a hacky solution to what seems like a simple problem.
Any suggestions will be much appreciated!
Can you join these variants (with e.g. "bcftools norm -m +any") instead? plink2 can handle multiallelic variants.
Otherwise, you're stuck with information-losing hacks.
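If joining is not an option, the frequency-based exclusion described in the question could be sketched roughly as below. This is only an illustration, not a plink feature: it assumes the variant IDs have already been made unique, and that you have prepared a tab-separated table with columns named CHROM, POS, ID and ALT_FREQS (these column names and the helper ids_to_exclude are assumptions; plink2's own frequency report may need its header adjusted and a position column added to match).

```python
import csv
import io


def ids_to_exclude(rows):
    """Return the IDs of every variant except the one with the highest
    ALT frequency at each (chromosome, position) pair.

    `rows` is an iterable of dicts with the assumed keys
    CHROM, POS, ID and ALT_FREQS. On ties, the first variant seen is kept.
    """
    best = {}  # (chrom, pos) -> (alt_freq, variant_id) currently kept
    drop = []  # variant IDs to exclude
    for row in rows:
        key = (row["CHROM"], row["POS"])
        freq, vid = float(row["ALT_FREQS"]), row["ID"]
        if key not in best:
            best[key] = (freq, vid)
        elif freq > best[key][0]:
            # New variant wins: drop the one kept so far.
            drop.append(best[key][1])
            best[key] = (freq, vid)
        else:
            drop.append(vid)
    return drop


# Made-up example table (hypothetical IDs and frequencies).
lines = [
    ["CHROM", "POS", "ID", "ALT_FREQS"],
    ["1", "100", "1:100:G:A", "0.10"],
    ["1", "100", "1:100:G:T", "0.25"],
    ["2", "500", "2:500:A:C", "0.40"],
]
table = "\n".join("\t".join(fields) for fields in lines)
reader = csv.DictReader(io.StringIO(table), delimiter="\t")
to_drop = ids_to_exclude(reader)
print(to_drop)  # ['1:100:G:A']
```

The resulting IDs can be written one per line to a text file and passed to --exclude. Note that this discards the lower-frequency allele entirely, which is exactly the kind of information loss mentioned above.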