VCF files - how to handle entries with identical ID but different positions and alleles?
4.7 years ago
Volka ▴ 180

Hi all, I am working with VCF files of imputed data from the Sanger Imputation Server, and ran into a problem where some entries share the same SNP ID but have different positions and alleles. An example is below:

1 5265430 rs71574343 G C . PASS RefPanelAF=0.321086;AN=210;AC=105;INFO=0.940381 GT:ADS:DS:GP 0|0:0.05,0:0.05:0.95,0.05,0 ...

1 5265438 rs71574343 C T . PASS RefPanelAF=0.302516;AN=210;AC=41;INFO=0.940706 GT:ADS:DS:GP 1|1:0.75,1:1.75:0,0.25,0.75 0|0:0,0:0:1,0,0 0|0:0,0:0:1,0,0 0|1:0,0.95:0.95:0.05,0.95,0 1|0:1,0.05:1.05:0,0.95,0.05 ...

How should I handle these? I am looking to carry out QC with PLINK, but these duplicates cause errors. I have already used the following commands beforehand to remove duplicates and IDs that are '.':

bcftools norm -d all in.dose.vcf.gz -o out.vcf

bcftools view -e 'ID=="."' in.vcf -o out.vcf

Sanger duplicate VCF ID allele • 1.5k views

I would venture to ask why you have duplicate IDs. Unless you are merging different versions of identifiers, this shouldn't be happening. According to dbSNP, the hg19 location of rs71574343 is position 5265438, so I would imagine your first entry should not be attributed to that dbSNP ID. Furthermore, it seems the rs ID has since been updated and merged into a new ID, so updating your reference database may be necessary.

Finally, after looking into it, the first SNP actually has its own ID in dbSNP, rs6603820, so that is definitely very concerning.
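To see how widespread the problem is, it may help to list every ID that is attached to more than one site. Below is a minimal sketch in plain Python, assuming an uncompressed VCF read line by line; the function name `find_conflicting_ids` and the third example record (`rs_hypothetical`) are made up for illustration, and the INFO and genotype columns are dropped from the example for brevity.

```python
from collections import defaultdict


def find_conflicting_ids(vcf_lines):
    """Return {ID: [(chrom, pos, ref, alt), ...]} for IDs seen at more than one site."""
    sites = defaultdict(list)
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, vid, ref, alt = line.split(None, 5)[:5]
        # Records without an ID cannot collide by ID.
        if vid == ".":
            continue
        key = (chrom, int(pos), ref, alt)
        if key not in sites[vid]:
            sites[vid].append(key)
    return {vid: locs for vid, locs in sites.items() if len(locs) > 1}


# The two records from the question, plus one hypothetical unique record.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t5265430\trs71574343\tG\tC\t.\tPASS\t.",
    "1\t5265438\trs71574343\tC\tT\t.\tPASS\t.",
    "1\t5265500\trs_hypothetical\tA\tG\t.\tPASS\t.",
]
conflicts = find_conflicting_ids(example)
print(conflicts)
```

For a real file, the same function could be fed an open file handle (for example via `gzip.open(path, "rt")` for a `.vcf.gz`), since it only iterates over lines once.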


Indeed, please elaborate on how they were annotated.


Thanks for the replies. I am not too sure how to answer on the annotation process so please bear with me. These are files I got as output from the Sanger imputation server. These SNPs were not present in my pre-imputation VCF files which I used as the input, and are present only in the imputed output.


"Holy smokes Batman!" ... perhaps you could contact the Sanger Imputation Server team about it. Maybe their reference panel is outdated.

dbSNP, as a database, has quite a few issues, though, including duplicate (i.e., the same) rs IDs pointing to different positions.
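If the duplicated rs IDs cannot be fixed at the source, one possible workaround for PLINK is to rename only the colliding records to a positional ID. The sketch below is one way to do that in plain Python, not a recommended standard: the function name `uniquify_ids` and the `CHROM:POS:REF:ALT` naming scheme are assumptions, and it expects a list of tab-separated VCF lines (it iterates over them twice). Exact duplicates (same position and alleles) would still collide, so they should be removed first, as was already done with `bcftools norm -d all`.

```python
from collections import Counter


def uniquify_ids(vcf_lines):
    """Rename records whose ID occurs more than once to CHROM:POS:REF:ALT."""
    # First pass: count how often each ID appears among the data lines.
    counts = Counter(
        line.split("\t")[2] for line in vcf_lines if not line.startswith("#")
    )
    out = []
    # Second pass: rewrite only the colliding IDs, leave everything else as-is.
    for line in vcf_lines:
        if line.startswith("#"):
            out.append(line)
            continue
        fields = line.split("\t")
        chrom, pos, vid, ref, alt = fields[:5]
        if vid != "." and counts[vid] > 1:
            fields[2] = f"{chrom}:{pos}:{ref}:{alt}"
        out.append("\t".join(fields))
    return out


# The two records from the question, plus one hypothetical unique record.
records = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t5265430\trs71574343\tG\tC\t.\tPASS\t.",
    "1\t5265438\trs71574343\tC\tT\t.\tPASS\t.",
    "1\t5265500\trs_hypothetical\tA\tG\t.\tPASS\t.",
]
renamed = uniquify_ids(records)
for line in renamed:
    print(line)
```

Renaming both records rather than keeping the rs ID on one of them avoids having to decide which site the rs ID really belongs to; that decision would need a lookup against a current dbSNP release.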


Hmm, looking at the VCF headers, there is this line that seems to be their annotation process.

##bcftools_annotateCommand=annotate -Ou -c CHROM,POS,ID,REF,ALT -a resources/dbsnp/dbsnp_144.b37.tab.gz -Oz -o a62924e395b66aeb44e4527938df3c

Could this be referring to dbSNP build 144, as opposed to the latest build, 153?
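As a rough way to check this across several files, the annotation source can be pulled out of the header programmatically. The following is a minimal Python sketch, assuming the header contains a `##bcftools_annotateCommand=` line like the one above and that the annotation file follows a `dbsnp_<build>.<assembly>.` naming pattern; that pattern is inferred from this single example, and the function name `dbsnp_annotation_source` is made up for illustration.

```python
import re


def dbsnp_annotation_source(header_lines):
    """Return (annotation_path, build, assembly) from a bcftools annotate header line.

    build and assembly are None if the file name does not match the assumed pattern;
    the whole result is None if no annotate command line with -a is found.
    """
    for line in header_lines:
        if not line.startswith("##bcftools_annotateCommand="):
            continue
        # The -a option of bcftools annotate names the annotation file.
        source = re.search(r"-a\s+(\S+)", line)
        if source is None:
            continue
        path = source.group(1)
        # Assumed naming pattern: dbsnp_<build>.<assembly>.<extension>
        name = re.search(r"dbsnp_(\d+)\.(\w+?)\.", path)
        if name is None:
            return (path, None, None)
        return (path, name.group(1), name.group(2))
    return None


header = [
    "##fileformat=VCFv4.2",
    "##bcftools_annotateCommand=annotate -Ou -c CHROM,POS,ID,REF,ALT "
    "-a resources/dbsnp/dbsnp_144.b37.tab.gz -Oz -o a62924e395b66aeb44e4527938df3c",
]
result = dbsnp_annotation_source(header)
print(result)
```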


Looks like it, yes, and also GRCh37 co-ordinates.
