I'm analyzing some Illumina genotype data and noticing that almost 2% of variants on the array are typed more than once. I consider these measurements to be replicates if they measure the same chromosome, position, reference allele, and alternate allele as specified in the PLINK BIM file. It seems that the typical thing people do is remove all but the first replicate in the file using the --list-duplicate-vars option in PLINK. Are there any tools that implement something slightly smarter than this? For example, a desired behavior might be to take a consensus vote amongst the calls from the replicates to decide which one to use. Thanks in advance!
Thanks for the tip! Are you referring to the --merge-equal-pos option in PLINK 1.9? If so, do you know how it performs the merge? The documentation is ambiguous, stating:
If two variants have the same position, PLINK 1.9's merge commands will always notify you. If you wish to try to merge them, use --merge-equal-pos. (This will fail if any of the same-position variant pairs do not have matching allele names.) Unplaced variants (chromosome code 0) are not considered by --merge-equal-pos.
Note that you are permitted to merge a fileset with itself; doing so with --merge-equal-pos can be worthwhile when working with data containing redundant loci for quality control purposes.
There is no reference to this in PLINK 2, as far as I'm aware.
Also, as a quick note, the command:
plink --bmerge inputdata --bfile inputdata --merge-equal-pos --out outputdata
fails when (as in my case) some variants at the same position are replicates while others are multi-allelic variants and so have different ALT alleles.
For now, I'd use PLINK 2.0's --set-all-var-ids flag to assign brand new chrom/pos/ref/alt-based IDs to every variant, and then follow up with PLINK 1.9's merge function (since yes, merge is not yet implemented in 2.0).
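If no existing tool does exactly what you want, the consensus vote described in the question could also be prototyped outside PLINK. Below is a minimal Python sketch, not an existing tool: the function names `group_replicates` and `consensus_call` are hypothetical, and it assumes the genotypes have already been loaded as 0/1/2 allele counts with None for missing calls. It groups BIM rows by chromosome, position, and the two allele columns, then takes a majority vote per sample. It does not handle replicates whose alleles are listed in swapped order.

```python
from collections import Counter, defaultdict


def group_replicates(bim_lines):
    """Map (chrom, pos, allele1, allele2) -> row indices of variants sharing that key.

    Each BIM line has six whitespace-separated fields:
    chromosome, variant ID, cM position, bp position, allele 1, allele 2.
    """
    groups = defaultdict(list)
    for idx, line in enumerate(bim_lines):
        chrom, _variant_id, _cm, pos, a1, a2 = line.split()
        groups[(chrom, pos, a1, a2)].append(idx)
    # Keep only keys that were typed more than once.
    return {key: rows for key, rows in groups.items() if len(rows) > 1}


def consensus_call(calls):
    """Majority vote over non-missing calls for one sample.

    Returns None if every call is missing or if the top two calls are tied
    (assumption: a tie is safer treated as missing than resolved arbitrarily).
    """
    counts = Counter(c for c in calls if c is not None)
    if not counts:
        return None
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

With this, each replicate group from `group_replicates` would be collapsed to a single variant by calling `consensus_call` once per sample on that sample's calls across the group's rows. Treating ties as missing is a design choice; picking the call from the replicate with the lowest missingness would be another reasonable option.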