Hi all, I am working with VCF files of imputed data from the Sanger imputation server, and ran into a problem where some entries share the same SNP ID but have different positions and alleles. An example is below:
1 5265430 rs71574343 G C . PASS RefPanelAF=0.321086;AN=210;AC=105;INFO=0.940381 GT:ADS:DS:GP 0|0:0.05,0:0.05:0.95,0.05,0 ...
1 5265438 rs71574343 C T . PASS RefPanelAF=0.302516;AN=210;AC=41;INFO=0.940706 GT:ADS:DS:GP 1|1:0.75,1:1.75:0,0.25,0.75 0|0:0,0:0:1,0,0 0|0:0,0:0:1,0,0 0|1:0,0.95:0.95:0.05,0.95,0 1|0:1,0.05:1.05:0,0.95,0.05 ...
How should I handle these? I am looking to carry out QC with PLINK, but these duplicates cause errors. I have already used the following commands beforehand to remove duplicates and IDs that are '.':
bcftools norm -d all in.dose.vcf.gz -o out.vcf
bcftools view -e 'ID=="."' in.vcf -o out.vcf
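To see how widespread the problem is, the duplicated IDs can be listed before deciding what to do with them. The following is only a minimal Python sketch, not part of the commands above: the function name `find_duplicate_ids` is made up for illustration, and it assumes each record has whitespace-separated CHROM, POS, ID, REF and ALT as its first five fields.

```python
from collections import defaultdict


def find_duplicate_ids(vcf_lines):
    """Return {ID: [(chrom, pos, ref, alt), ...]} for IDs seen at more than one site.

    Hypothetical helper for illustration; header lines and '.' IDs are skipped.
    """
    sites_by_id = defaultdict(list)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        # Only the first five columns are needed: CHROM POS ID REF ALT
        chrom, pos, vid, ref, alt = line.split(None, 5)[:5]
        if vid == ".":
            continue  # missing IDs are not informative here
        site = (chrom, int(pos), ref, alt)
        if site not in sites_by_id[vid]:
            sites_by_id[vid].append(site)
    # Keep only IDs attached to more than one distinct site
    return {vid: sites for vid, sites in sites_by_id.items() if len(sites) > 1}


# The two records from the example above, truncated after FILTER
example = [
    "1 5265430 rs71574343 G C . PASS",
    "1 5265438 rs71574343 C T . PASS",
]
dups = find_duplicate_ids(example)
for vid, sites in dups.items():
    print(vid, sites)
```

For a real file, the lines could come from an opened (and, if needed, gzip-decompressed) VCF rather than a list.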
I would venture to ask why you have duplicate IDs. Unless you are merging different versions of identifiers, this shouldn't be happening. According to dbSNP, the hg19 location of rs71574343 is position 5265438, so I would imagine your first entry should not be attributed to that dbSNP ID. Furthermore, it seems the rs ID has since been updated and merged into a new ID, so updating your reference database may be necessary.
Finally, after looking into it, the first SNP actually has its own ID in dbSNP: rs6603820, so that is definitely very concerning.
Indeed, please elaborate on how they were annotated.
Thanks for the replies. I am not too sure how to answer on the annotation process so please bear with me. These are files I got as output from the Sanger imputation server. These SNPs were not present in my pre-imputation VCF files which I used as the input, and are present only in the imputed output.
"Holy smokes Batman!" ... perhaps you could contact Sanger imputation Server about it. Maybe their reference panel is outdated.
dbSNP, as a database, has quite a few issues, though, including duplicate (i.e., the same) rs IDs targeting different positions.
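Given that, one pragmatic way to get past the PLINK errors is to stop relying on the rs ID wherever it is ambiguous and use a position-based ID instead. Below is a rough Python sketch of that idea, assuming tab-separated VCF lines held in memory; the function name `uniquify_ids`, the `CHROM:POS:REF:ALT` naming scheme, and the third record with its placeholder ID are all made up for illustration. bcftools annotate also has a `--set-id` option that can assign this kind of ID to records, which is probably the better route for large imputed files.

```python
from collections import Counter


def uniquify_ids(vcf_lines):
    """Replace any ID used by more than one record with CHROM:POS:REF:ALT.

    Hypothetical helper for illustration; header lines pass through unchanged.
    """
    def is_record(line):
        return bool(line.strip()) and not line.startswith("#")

    # First pass: count how many records carry each ID (column 3)
    counts = Counter(line.split("\t")[2] for line in vcf_lines if is_record(line))

    out = []
    for line in vcf_lines:
        if is_record(line):
            fields = line.rstrip("\n").split("\t")
            chrom, pos, vid, ref, alt = fields[:5]
            if counts[vid] > 1:
                # Ambiguous ID: fall back to a position-and-allele based name
                fields[2] = f"{chrom}:{pos}:{ref}:{alt}"
            out.append("\t".join(fields))
        else:
            out.append(line.rstrip("\n"))
    return out


lines = [
    "##fileformat=VCFv4.2",
    "1\t5265430\trs71574343\tG\tC\t.\tPASS",
    "1\t5265438\trs71574343\tC\tT\t.\tPASS",
    "1\t5265500\trsHYPOTHETICAL\tA\tG\t.\tPASS",  # made-up record, not real data
]
fixed = uniquify_ids(lines)
print("\n".join(fixed))
```

Note that this renames both records sharing the ID, including the one at 5265438 that dbSNP does assign to rs71574343; whether to keep the rs ID on that one is a judgment call.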
Hmm, looking at the VCF headers, there is this line that seems to be their annotation process.
Could this be referring to dbSNP build 144, as opposed to the latest build 153?
Looks like it, yes, and also GRCh37 co-ordinates.