This should be easy to do by now, but... we have SNP data from an Illumina exome array given to us in PLINK format. The BIM file looks like this:
1 exm2253575 0 881627 G A
1 exm269 0 881918 A G
1 exm340 0 888659 T C
1 exm348 0 889238 A G
1 exm2264981 0 894573 G A
1 exm773 0 909238 G C
1 exm782 0 909309 C T
1 exm912 0 949608 A G
1 exm991 0 977028 T G
1 exm1024 0 978762 A G
And I have all of the SNPs in dbSNP 138 downloaded as a large VCF file:
#CHROM POS ID REF ALT QUAL FILTER INFO
1 10019 rs376643643 TA T . . RS=376643643;RSPOS=10020;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000020001000002000200;WGT=1;VC=DIV;R5;OTHERKG
1 10054 rs373328635 CAA C,CA . . RS=373328635;RSPOS=10055;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000020001000002000210;WGT=1;VC=DIV;R5;OTHERKG;NOC
1 10109 rs376007522 A T . . RS=376007522;RSPOS=10109;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000020001000002000100;WGT=1;VC=SNV;R5;OTHERKG
1 10139 rs368469931 A T . . RS=368469931;RSPOS=10139;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000020001000002000100;WGT=1;VC=SNV;R5;OTHERKG
1 10144 rs144773400 TA T . . RS=144773400;RSPOS=10145;dbSNPBuildID=134;SSR=0;SAO=0;VP=0x050000020001000002000200;WGT=1;VC=DIV;R5;OTHERKG
1 10146 rs375931351 AC A . . RS=375931351;RSPOS=10147;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000020001000002000200;WGT=1;VC=DIV;R5;OTHERKG
I want to match them up so that each SNP in the BIM is identified from the VCF file. This is mostly for renaming them with proper dbSNP names. I have been trying to match them by formatting both as BED files and using BEDTOOLS, restricting the VCF to variants that are SNVs. The problem is that some SNPs share the same chromosome and start position. Is there an easy way to rename or identify the SNPs by including allele information with VCFTOOLS, BEDTOOLS, PLINK, or another common tool? I get matches for about 99% of the SNPs using BEDTOOLS and command-line options, but there must be an easier or more standard way to get this right.
Thanks,
Ryan
Did you figure this out?
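For anyone landing here, a minimal sketch of the position-plus-allele matching described in the question, in plain Python. It is not a standard tool, only one way to do it under some assumptions: both files are on the same genome build, chromosome codes are written the same way in both (PLINK may use 23 where dbSNP uses X), and the BIM alleles may be on either strand, so the complemented pair is also accepted. The function names (`match_rsids`, `read_bim`, `allele_keys`) are made up for this example. A/T and C/G SNPs are strand-ambiguous and monomorphic markers coded with a `0` allele will not match, so treat the output as a first pass.

```python
import gzip

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def allele_keys(a1, a2):
    """Return the unordered allele pair as given and as its strand complement."""
    forward = frozenset((a1, a2))
    reverse = frozenset(COMPLEMENT.get(a, a) for a in (a1, a2))
    return {forward, reverse}


def read_bim(path):
    """Index BIM records by (chromosome, position)."""
    by_pos = {}
    with open_text(path) as fh:
        for line in fh:
            if not line.strip():
                continue
            chrom, snp_id, _cm, pos, a1, a2 = line.split()
            by_pos.setdefault((chrom, int(pos)), []).append((snp_id, a1, a2))
    return by_pos


def match_rsids(bim_path, vcf_path):
    """Map BIM IDs to dbSNP IDs where position and alleles agree.

    Only single-base REF/ALT pairs are considered, so indels that share
    a position with an SNV are skipped. The first matching rsID wins.
    """
    by_pos = read_bim(bim_path)
    renames = {}
    with open_text(vcf_path) as fh:
        for line in fh:
            if not line.strip() or line.startswith("#"):
                continue
            # Only the first five VCF columns are needed.
            chrom, pos, rsid, ref, alt = line.split(None, 5)[:5]
            candidates = by_pos.get((chrom, int(pos)))
            if not candidates:
                continue
            for alt_allele in alt.split(","):
                if len(ref) != 1 or len(alt_allele) != 1:
                    continue  # not an SNV
                vcf_pair = frozenset((ref, alt_allele))
                for snp_id, a1, a2 in candidates:
                    if vcf_pair in allele_keys(a1, a2):
                        renames.setdefault(snp_id, rsid)
    return renames
```

Writing the returned dictionary out as two whitespace-separated columns (old ID, new ID) should give a file that PLINK's `--update-name` option can take, though it is worth checking that against the PLINK version you are running. Markers that end up with no entry in the dictionary are the ones to inspect by hand.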