I am working with PLINK to analyse SNP chip data.
Does anyone know how to remove duplicated SNPs (duplicated by position)?
I am working with PLINK to analyse SNP chip data.
Does anyone know how to remove duplicated SNPs (duplicated by position)?
Options in effect:
--bfile HighDensity
--exclude supress-first plink.dupvar
--out HighDensity.DuplicatesRemoved
Error: Invalid --exclude parameter sequence.
it doesn't accept exclude
, any other option?
Oh, sorry, "suppress-first" is a --list-duplicate-vars modifier, not an --exclude modifier. You also want to include "ids-only" in your --list-duplicate-vars step when your goal is just to generate a file for --exclude's use.
So, the following should work:
plink --bfile HighDensity --list-duplicate-vars ids-only suppress-first
plink --bfile HighDensity -exclude plink.dupvar --out HighDensity.DuplicatesRemoved
Hi Kevin and chrchang523, many thanks for prompt help. Well, my first problem is sorted with above commands, thanks for that. My ultimate goal is to merge two plink files (HighDensity and LowDensity), after removing the duplicates in HighDensity, when I tried to merge two files I still got this warnings:
Warning: Variants 'oar3_OAR1_169947942' and 'OAR1_183082360.1' have the same position.
Warning: Variants 'oar3_OAR1_224944169' and 'OAR1_242468071.1' have the same position.
Warning: Variants 'oar3_OAR2_34247858' and 'OAR2_35595985.1' have the same position.
Warning: Variants 's46929.1' and 'oar3_OAR2_74519162' have the same position.
Warning: Variants 's73098.1' and 'oar3_OAR3_20274083' have the same position.
Warning: Variants 'oar3_OAR3_34411557' and 'OAR3_36830729.1' have the same position.
It seems some variants (42) still don't have right position, any idea how i can fix that?
What if you try without ids-only
?
plink --bfile HighDensity --list-duplicate-vars suppress-first
plink --bfile HighDensity -exclude plink.dupvar --out HighDensity.DuplicatesRemoved
Also, if you try the merge command without removing duplicates, from my experience, will automatically identify the duplicates between the datasets you're merging and will output them to a file, which can then be used with --exclude
It doesn't work without ids-only
, same error as with ids-only
. Well, I tried to merge without removing duplicates, but same warning messages. I am using PLINK v1.90b3v 64-bit (15 Jul 2015). Do you think it will make difference if i try some other version? what version you were using for merging? I use this for merge:
plink --bfile LowDensity --bmerge HighDensity --make-bed --out mergewithduplicates
Hey, I go through the process of duplicate removal and then merging of datasets here: Produce PCA for 1000 Genomes Phase III in VCF format
In summary, how I normally do it is:
plink --noweb --bfile MyData --list-duplicate-vars ;
cut -f4 plink.dupvar | cut -f1 -d" " > Duplicates.list ;
plink --noweb --bfile MyData --exclude Duplicates.list --make-bed --out MyDataLessDuplicates ;
For merging, I use:
plink --merge-list ForMerge.list --out MergedData ;
ForMerge.list just contains the PLINK datasets that I want to merge (BIM files).
Nota bene:
I think that the key that my colleague and I may have missed is that the output of --list-duplicate-vars
, i.e., plink.dupvar, is not compatible with --exclude
. --exclude
just expects a single-column file of variant IDs that you want to exclude (or variant IDs separated by spaces). So, something like:
oar3_OAR1_169947942
OAR1_183082360.1
oar3_OAR1_224944169
OAR1_242468071.1
oar3_OAR2_34247858
OAR2_35595985.1.
s46929.1
oar3_OAR2_74519162
s73098.1
oar3_OAR3_20274083
oar3_OAR3_34411557
With --list-duplicate-vars
I would still add suppress-first
after it, just so that 1 copy of the duplicate will be retained (otherwise, both ar removed).
I also use PLINK 1.9, and virtually the same release as you:
/Programs/plink1.90/plink
PLINK v1.90b3.38 64-bit (7 Jun 2016) https://www.cog-genomics.org/plink2
(C) 2005-2016 Shaun Purcell, Christopher Chang GNU General Public License v3
--list-duplicate-vars
's ids-only
modifier bridges the gap with --exclude
.
The --merge-equal-pos
flag causes variants with identical positions to be merged, keeping the IDs in the first file. However, this is conservative: if there's a single pair of same-position variants which have more than 2 distinct alleles between them, the merge will error out. plink 1.x is not designed to handle triallelic SNPs.
Yes, this is why I always normalise VCFs with:
bcftools norm -m-any My.VCF | bcftools norm --check-ref w -f MyRefGenome.fasta | bcftools annotate -Ob -x 'ID' -I +'%CHROM:%POS:%POS:%REF:%ALT' > Normalised.bcf ;
bcftools index Normalised.bcf ;
That code also assigns a unique identifier to each variant in the ID field of the VCF, as chr:pos:pos:ref:alt.
...and then read these into PLINK.
Also, why do you use -x ID -I +<string>
? Would just -I <string>
suffice?
https://samtools.github.io/bcftools/bcftools-man.html#annotate
Yes, but the rs ID system of naming variants is problematic because it's not curated sufficiently. A single rs ID can relate to multiple variants at the same position. Also, a single rs ID can relate to variants at more than 1 position in the genome. The fundamental issue being that rs IDs are not unique at all...
It's too risky working with rs IDs on clinical data, where mix-ups / mess-ups just aren't allowed...
In the code above, you have to use -x
to first delete the ID field. Without -x
, it will not update it to what you specify with -I
.
Yeah, we don't use rsIDs for anything significant, all annotation is done by position. The manual says By default all existing IDs are replaced. If the format string is preceded by "+", only missing IDs will be set.
, so I don't see why -x
would be needed unless the documentation is wrong.
plink --bfile HighDensity --list-duplicate-vars --out test
plink --bfile HighDensity --exclude test.dupvar --out test1
Sorry, I couldn't get where --exclude suppress-first
comes in above command, can you please elaborate?
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
The --list-duplicate-vars suppress-first parameter allows you to identify them, with --exclude in a following command then allowing you to remove them. They are identified by having the same position and ref/var calls.
Why would a VCF have variants with the same chr, bp AND alleles? Spliiting multi-allelic variants can produce position-based "duplicates", but position AND alleles duplicated? That is just dirty data!
It happens quite frequently, unfortunately. Sometimes, even the same SNP rs ID can refer to 2 different locations in the genome. It's due to different genome builds.