Question

How do I remove duplicated by position SNPs using PLink?

1

Entering edit mode

7.7 years ago

bha ▴ 80

I am working with PLINK to analyse SNP chip data.

Does anyone know how to remove duplicated SNPs (duplicated by position)?

SNP plink PLINK • 33k views

ADD COMMENT • link updated 7.7 years ago by Kevin Blighe 89k • written 7.7 years ago by bha ▴ 80

0

Entering edit mode

The --list-duplicate-vars suppress-first parameter allows you to identify them, with --exclude in a following command then allowing you to remove them. They are identified by having the same position and ref/var calls.

ADD REPLY • link 5.8 years ago by Kevin Blighe 89k

0

Entering edit mode

Why would a VCF have variants with the same chr, bp AND alleles? Spliiting multi-allelic variants can produce position-based "duplicates", but position AND alleles duplicated? That is just dirty data!

ADD REPLY • link 7.7 years ago by Ram 45k

0

Entering edit mode

It happens quite frequently, unfortunately. Sometimes, even the same SNP rs ID can refer to 2 different locations in the genome. It's due to different genome builds.

ADD REPLY • link 7.7 years ago by Kevin Blighe 89k

Ram · Answer 1 · 2017-11-03

1

Entering edit mode

7.7 years ago

bha ▴ 80

Options in effect:
  --bfile HighDensity
  --exclude supress-first plink.dupvar
  --out HighDensity.DuplicatesRemoved

Error: Invalid --exclude parameter sequence.

it doesn't accept exclude, any other option?

ADD COMMENT • link updated 7.7 years ago by Ram 45k • written 7.7 years ago by bha ▴ 80

0

Entering edit mode

"suppress-first" is misspelled in your command line.

ADD REPLY • link 7.7 years ago by chrchang523 11k

0

Entering edit mode

I tried with correct spelling, and --extract as well, but same error message

Options in effect:
  --bfile HighDensity
  --extract suppress-first plink.dupvar
  --out HighDensity.DuplicatesRemoved

Error: Invalid --extract parameter sequence.

ADD REPLY • link updated 7.7 years ago by Ram 45k • written 7.7 years ago by bha ▴ 80

3

Entering edit mode

Oh, sorry, "suppress-first" is a --list-duplicate-vars modifier, not an --exclude modifier. You also want to include "ids-only" in your --list-duplicate-vars step when your goal is just to generate a file for --exclude's use.

So, the following should work:

plink --bfile HighDensity --list-duplicate-vars ids-only suppress-first
plink --bfile HighDensity -exclude plink.dupvar --out HighDensity.DuplicatesRemoved

ADD REPLY • link 7.7 years ago by chrchang523 11k

1

Entering edit mode

Here, you might be have a bug. When there is duplicated SNP with same ID, --exclude will remove all the SNPs simultaneously. If you want to --exclude, be sure to change the ID name to different one advanced.

ADD REPLY • link 7.0 years ago by Shicheng Guo ★ 9.6k

0

Entering edit mode

Hi bha, I would highly recommend that you follow the guidance of chrchang523 for this particular problem.

ADD REPLY • link 7.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin and chrchang523, many thanks for prompt help. Well, my first problem is sorted with above commands, thanks for that. My ultimate goal is to merge two plink files (HighDensity and LowDensity), after removing the duplicates in HighDensity, when I tried to merge two files I still got this warnings:

Warning: Variants 'oar3_OAR1_169947942' and 'OAR1_183082360.1' have the same position.
Warning: Variants 'oar3_OAR1_224944169' and 'OAR1_242468071.1' have the same position.
Warning: Variants 'oar3_OAR2_34247858' and 'OAR2_35595985.1' have the same position.
Warning: Variants 's46929.1' and 'oar3_OAR2_74519162' have the same position.
Warning: Variants 's73098.1' and 'oar3_OAR3_20274083' have the same position.
Warning: Variants 'oar3_OAR3_34411557' and 'OAR3_36830729.1' have the same position.

It seems some variants (42) still don't have right position, any idea how i can fix that?

ADD REPLY • link updated 7.7 years ago by GenoMax 152k • written 7.7 years ago by bha ▴ 80

0

Entering edit mode

What if you try without ids-only?

plink --bfile HighDensity --list-duplicate-vars suppress-first
plink --bfile HighDensity -exclude plink.dupvar --out HighDensity.DuplicatesRemoved

Also, if you try the merge command without removing duplicates, from my experience, will automatically identify the duplicates between the datasets you're merging and will output them to a file, which can then be used with --exclude

ADD REPLY • link 7.7 years ago by Kevin Blighe 89k

0

Entering edit mode

It doesn't work without ids-only, same error as with ids-only. Well, I tried to merge without removing duplicates, but same warning messages. I am using PLINK v1.90b3v 64-bit (15 Jul 2015). Do you think it will make difference if i try some other version? what version you were using for merging? I use this for merge:

plink --bfile LowDensity --bmerge HighDensity --make-bed --out mergewithduplicates

ADD REPLY • link updated 7.7 years ago by Ram 45k • written 7.7 years ago by bha ▴ 80

1

Entering edit mode

Hey, I go through the process of duplicate removal and then merging of datasets here: Produce PCA for 1000 Genomes Phase III in VCF format

In summary, how I normally do it is:

plink --noweb --bfile MyData --list-duplicate-vars ;
cut -f4 plink.dupvar | cut -f1 -d" " > Duplicates.list ;
plink --noweb --bfile MyData --exclude Duplicates.list --make-bed --out MyDataLessDuplicates ;

For merging, I use:

plink --merge-list ForMerge.list --out MergedData ;

ForMerge.list just contains the PLINK datasets that I want to merge (BIM files).

Nota bene:

I think that the key that my colleague and I may have missed is that the output of --list-duplicate-vars, i.e., plink.dupvar, is not compatible with --exclude. --exclude just expects a single-column file of variant IDs that you want to exclude (or variant IDs separated by spaces). So, something like:

oar3_OAR1_169947942

OAR1_183082360.1

oar3_OAR1_224944169

OAR1_242468071.1

oar3_OAR2_34247858

OAR2_35595985.1.

s46929.1

oar3_OAR2_74519162

s73098.1

oar3_OAR3_20274083

oar3_OAR3_34411557

With --list-duplicate-vars I would still add suppress-first after it, just so that 1 copy of the duplicate will be retained (otherwise, both ar removed).

I also use PLINK 1.9, and virtually the same release as you:

/Programs/plink1.90/plink
PLINK v1.90b3.38 64-bit (7 Jun 2016)       https://www.cog-genomics.org/plink2
(C) 2005-2016 Shaun Purcell, Christopher Chang   GNU General Public License v3

ADD REPLY • link updated 7.7 years ago by Ram 45k • written 7.7 years ago by Kevin Blighe 89k

1

Entering edit mode

--list-duplicate-vars's ids-only modifier bridges the gap with --exclude.

The --merge-equal-pos flag causes variants with identical positions to be merged, keeping the IDs in the first file. However, this is conservative: if there's a single pair of same-position variants which have more than 2 distinct alleles between them, the merge will error out. plink 1.x is not designed to handle triallelic SNPs.

ADD REPLY • link updated 7.7 years ago by Ram 45k • written 7.7 years ago by chrchang523 11k

1

Entering edit mode

Why would multiple single position variant entries exist for biallelic variants? To me, that just sounds like a dirty VCF file, not a legitimate use case.

ADD REPLY • link 7.7 years ago by Ram 45k

0

Entering edit mode

Yes, this is why I always normalise VCFs with:

bcftools norm -m-any My.VCF | bcftools norm --check-ref w -f MyRefGenome.fasta | bcftools annotate -Ob -x 'ID' -I +'%CHROM:%POS:%POS:%REF:%ALT' > Normalised.bcf ;

bcftools index Normalised.bcf ;

That code also assigns a unique identifier to each variant in the ID field of the VCF, as chr:pos:pos:ref:alt.

...and then read these into PLINK.

ADD REPLY • link 6.9 years ago by Kevin Blighe 89k

0

Entering edit mode

But then you end up losing the rs# annotation no?

ADD REPLY • link 7.7 years ago by Ram 45k

0

Entering edit mode

Also, why do you use -x ID -I +<string>? Would just -I <string> suffice?

https://samtools.github.io/bcftools/bcftools-man.html#annotate

ADD REPLY • link 7.7 years ago by Ram 45k

0

Entering edit mode

Yes, but the rs ID system of naming variants is problematic because it's not curated sufficiently. A single rs ID can relate to multiple variants at the same position. Also, a single rs ID can relate to variants at more than 1 position in the genome. The fundamental issue being that rs IDs are not unique at all...

It's too risky working with rs IDs on clinical data, where mix-ups / mess-ups just aren't allowed...

In the code above, you have to use -x to first delete the ID field. Without -x, it will not update it to what you specify with -I.

ADD REPLY • link 7.7 years ago by Kevin Blighe 89k

1

Entering edit mode

Yeah, we don't use rsIDs for anything significant, all annotation is done by position. The manual says By default all existing IDs are replaced. If the format string is preceded by "+", only missing IDs will be set., so I don't see why -x would be needed unless the documentation is wrong.

ADD REPLY • link 7.7 years ago by Ram 45k

Ram · Answer 2 · 2017-11-03

0

Entering edit mode

7.7 years ago

bha ▴ 80

plink --bfile HighDensity --list-duplicate-vars --out test
plink --bfile HighDensity --exclude test.dupvar --out test1

Sorry, I couldn't get where --exclude suppress-first comes in above command, can you please elaborate?

ADD COMMENT • link updated 7.7 years ago by Ram 45k • written 7.7 years ago by bha ▴ 80

1

Entering edit mode

Please do not add answers unless you intend answering your own question. This post belongs as a comment on Kevin's answer. Also, please use the 101010 button on the formatting bar to format your code. See for instance Kevin's comment below and you'll see the difference.

ADD REPLY • link 7.7 years ago by Ram 45k

0

Entering edit mode

Hi bha,

It can be executed like this::

plink --bfile HighDensity --list-duplicate-vars
plink --bfile HighDensity --exclude suppress-first plink.dupvar --out HighDensity.DuplicatesRemoved

ADD REPLY • link updated 7.7 years ago by Ram 45k • written 7.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Options in effect: --bfile HighDensity --exclude supress-first plink.dupvar --out HighDensity.DuplicatesRemoved
Error: Invalid --exclude parameter sequence.

it doesn't accept exclude, any other option? I also tried with --extract, but doesn't work

ADD REPLY • link updated 7.7 years ago by Ram 45k • written 7.7 years ago by bha ▴ 80

1

Entering edit mode

Can you paste the contents of plink.dupvar and also mention your PLINK version?

If you can, also, output a few lines from your map file?

ADD REPLY • link 7.7 years ago by Kevin Blighe 89k