Question

How to remove duplicated SNPs based on MAF with Plink?

0

Entering edit mode

11 months ago

giulia.trauzzi ▴ 30

Hi everyone,

I am trying to remove duplicated SNPs from my pgen dataset. These duplicated SNPs are the result of splitting multiallelic loci but now I just want to retain only the genotype that has higher maf, the most common. Is there a way to do this with Plink2? Considering that the most common genotype is not always the first instance in the data so I cannot use the --rm-dup first etc... Is there a way I can do this?

Many thanks.

Giulia

SNPs Plink2 duplication maf • 1.6k views

ADD COMMENT • link updated 5 days ago by Kevin Blighe 89k • written 11 months ago by giulia.trauzzi ▴ 30

1

Entering edit mode

Since the MAF values are different, is it possible to filter your result based on MAF first? And then see whether the duplicates are still there or not.

ADD REPLY • link 11 months ago by hakimbazol ▴ 60

0

Entering edit mode

Hi! That is exactly what I ended up doing. I filtered based on MAF and was left with a bunch of multiallelic SNPs... for some of them the first occurrence was the allele with lower AF and for some it wasn't, but in the end, I had like 8 multiallelic SNPs left in the data compared to the 60000 of the dataset so I just used rm-dup as is.

Thanks!

ADD REPLY • link 9 months ago by giulia.trauzzi ▴ 30

1

Entering edit mode

# Calculate allele frequencies for all variants
plink2 --pfile ${input_prefix} \
       --freq \
       --out ${params.output_prefix}

Python script to process the frequency file:

import pandas as pd
# Read the frequency file
df = pd.read_csv("${freq_file}", delim_whitespace=True)

# Extract the base variant ID (removing the split variant suffix if present)
df['base_id'] = df['ID'].str.split('_').str[0]

# For each group of duplicates, keep the one with highest MAF
# MAF is min(ALT_FREQS, 1-ALT_FREQS)
df['MAF'] = df['ALT_FREQS'].apply(lambda x: min(float(x), 1-float(x)))
to_keep = df.sort_values('MAF', ascending=False).drop_duplicates('base_id', keep='first')

# Write the list of variants to keep
to_keep['ID'].to_csv("${params.output_prefix}.keep.variants", index=False, header=False)

Create the final filtered dataset

plink2 --pfile ${input_prefix} \
       --extract ${variants_to_keep} \
       --make-pgen \
       --out ${params.output_prefix}

I generated this answer using amplicon.ai, a tool I've been building to make writing biofinformatics code easier. Feel free to try it out enter image description here

ADD REPLY • link 11 months ago by Semir ▴ 60

0

Entering edit mode

Hi, this is actually pretty cool and useful! Thanks for posting this. I did not use this for my preliminary analyses but I will definitely look this up in the future. Many thanks!

ADD REPLY • link 9 months ago by giulia.trauzzi ▴ 30

1

Entering edit mode

I would combine this with logic to generate a new .pvar, with MAFs included in the variant IDs. --extract cannot distinguish between two variants with identical IDs.

ADD REPLY • link 9 months ago by chrchang523 11k

score 0 · Answer 1 · 2025-11-07

Hi Giulia,

PLINK2 does not have a direct flag to prune duplicates while prioritising the higher-MAF variant, but you can handle this in a few steps. First, run --rm-dup list to generate a file of duplicate variant IDs (this identifies groups without removing anything). Next, use --freq to export allele frequencies across your dataset, noting the MAF for each variant. Then, for every duplicate group in the list, manually select the one with the highest MAF and compile those into a simple keep-variants text file (one ID per line). Finally, apply --extract using that file to retain only the chosen genotypes.

This approach works irrespective of the variants' order in the file.

Kevin