How to remove duplicated SNPs based on MAF with Plink?
18 days ago

Hi everyone,

I am trying to remove duplicated SNPs from my pgen dataset. These duplicates are the result of splitting multiallelic loci, and for each locus I now want to retain only the variant with the highest MAF, i.e. the most common one. Is there a way to do this with Plink2? The most common variant is not always the first instance in the data, so I cannot rely on --rm-dup keeping the first instance.

Many thanks.

Giulia

SNPs Plink2 duplication maf • 473 views

Since the MAF values are different, is it possible to filter your result based on MAF first? And then see whether the duplicates are still there or not.
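
The check suggested here can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `sites_still_duplicated`, the `(chrom, pos, alt_freq)` tuple layout, and the toy frequencies are all hypothetical and not taken from the original dataset.

```python
from collections import Counter


def maf(alt_freq):
    """Minor allele frequency of a biallelic record."""
    return min(alt_freq, 1.0 - alt_freq)


def sites_still_duplicated(variants, min_maf):
    """Return the (chrom, pos) sites that still have more than one
    record after dropping records with MAF below min_maf.

    variants: iterable of (chrom, pos, alt_freq) tuples (assumed layout).
    """
    kept = [(chrom, pos) for chrom, pos, freq in variants if maf(freq) >= min_maf]
    counts = Counter(kept)
    return sorted(site for site, n in counts.items() if n > 1)


# Toy data: two split multiallelic sites and one ordinary SNP.
variants = [
    ("1", 1000, 0.30),
    ("1", 1000, 0.002),
    ("1", 2000, 0.20),
    ("1", 2000, 0.10),
    ("2", 500, 0.45),
]
remaining = sites_still_duplicated(variants, min_maf=0.01)
print(remaining)
```

In this toy case the site at position 1000 is resolved by the filter, but position 2000 is not, because both of its records pass the threshold. A MAF filter alone therefore only helps when at most one record per site is common.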

# Calculate allele frequencies for all variants
plink2 --pfile ${input_prefix} \
       --freq \
       --out ${params.output_prefix}

Python script to process the frequency file:

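import pandas as pd

# Read the frequency file written by --freq (whitespace-delimited).
# sep=r"\s+" replaces delim_whitespace=True, which is deprecated in recent pandas.
df = pd.read_csv("${freq_file}", sep=r"\s+")

# Extract the base variant ID by dropping everything after the first "_".
# This assumes split variants are named <base>_<suffix> and that base IDs contain no "_".
df['base_id'] = df['ID'].str.split('_').str[0]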

# For each group of duplicates, keep the one with highest MAF
# MAF is min(ALT_FREQS, 1-ALT_FREQS)
df['MAF'] = df['ALT_FREQS'].apply(lambda x: min(float(x), 1-float(x)))
to_keep = df.sort_values('MAF', ascending=False).drop_duplicates('base_id', keep='first')

# Write the list of variants to keep
to_keep['ID'].to_csv("${params.output_prefix}.keep.variants", index=False, header=False)

Create the final filtered dataset:

plink2 --pfile ${input_prefix} \
       --extract ${variants_to_keep} \
       --make-pgen \
       --out ${params.output_prefix}
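
One caveat on the script above: it only works if the split variants carry an `_` suffix in their IDs. If they do not, a hedged alternative is to group records by chromosome and position instead. The sketch below assumes a tab-delimited frequency table with `#CHROM`, `POS`, `ID` and `ALT_FREQS` columns; the `POS` column is not in the default --freq output (I believe it can be requested through the `cols=` modifier, but please check the plink2 documentation). It also assumes variant IDs are unique, since --extract selects by ID. The function name `best_variant_per_site` and the example table are made up for illustration.

```python
import csv
import io


def maf(alt_freq):
    """Minor allele frequency of a biallelic record."""
    return min(alt_freq, 1.0 - alt_freq)


def best_variant_per_site(afreq_lines):
    """Return one variant ID per (chrom, pos): the record with the highest MAF.

    afreq_lines: iterable of tab-delimited lines with a header containing
    #CHROM, POS, ID and ALT_FREQS (assumed column names).
    """
    reader = csv.DictReader(afreq_lines, delimiter="\t")
    best = {}  # (chrom, pos) -> (maf, variant_id)
    for row in reader:
        site = (row["#CHROM"], row["POS"])
        row_maf = maf(float(row["ALT_FREQS"]))
        # Keep the first record seen for a site unless a later one has a strictly higher MAF.
        if site not in best or row_maf > best[site][0]:
            best[site] = (row_maf, row["ID"])
    return [variant_id for _, variant_id in best.values()]


# Made-up example: one site split into two records, plus one ordinary SNP.
example = io.StringIO(
    "#CHROM\tPOS\tID\tREF\tALT\tALT_FREQS\tOBS_CT\n"
    "1\t1000\t1:1000:A:C\tA\tC\t0.05\t200\n"
    "1\t1000\t1:1000:A:G\tA\tG\t0.30\t200\n"
    "1\t2000\t1:2000:T:C\tT\tC\t0.90\t200\n"
)
keep_ids = best_variant_per_site(example)
print("\n".join(keep_ids))
```

Writing `keep_ids` one per line gives a file that can be passed to --extract as in the last step. Ties are resolved in favour of the first record in file order, which may or may not be what you want.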

I generated this answer using amplicon.ai, a tool I've been building to make writing bioinformatics code easier. Feel free to try it out.
