Hi--
I have some GWAS data I'd like to impute, but first every SNP needs to be aligned to the forward strand of the reference genome. This is not as simple as it sounds, because many SNPs are strand-ambiguous (A/T or C/G), so a flip cannot be detected from the alleles alone.
So I've looked at both the strand files for the chip and the SNP manifest, comparing them against the SNPs that are flipped, but I cannot see any pattern. What I'm looking for is a rule in these files that explains why the SNPs in the first list below disagree with the reference genome. If you spot anything, or need more information, please ask.
P.S. The data may have been botched by the researchers who originally used them (they have long since moved on).
Here is the head of a list of SNPs that are flipped relative to the reference (name, chr, position, a1, a2, reference nucleotide):
rs1774963 1 21703207 C T G
rs2257576 1 83736947 T C A
rs315041 1 77055775 A G C
rs3094315 1 752565 C T G
rs3737728 1 1021414 T C A
rs11721 1 1152630 T G C
rs2887286 1 1156130 G A T
rs3813199 1 1158276 T C G
rs3766186 1 1162434 T G C
Here are the corresponding entries from the strand file (http://www.well.ox.ac.uk/~wrayner/strand/):
rs1774963 1 21703208 99.1735537190083 + AG
rs2257576 1 83736948 100 + AG
rs315041 1 77055776 99.1735537190083 - AG
rs3094315 1 752566 99.1735537190083 + AG
rs3737728 1 1021415 100 + AG
rs11721 1 1152631 99.1735537190083 + AC
rs2887286 1 1156131 100 - AG
rs3813199 1 1158277 99.1735537190083 + AG
rs3766186 1 1162435 99.1735537190083 + AC
Here are the corresponding entries from the SNP table/manifest:
Name SNP ILMN Strand Customer Strand
rs1774963 [A/G] TOP BOT
rs2257576 [A/G] TOP BOT
rs315041 [T/C] BOT TOP
rs3094315 [T/C] BOT TOP
rs3737728 [A/G] TOP BOT
rs11721 [A/C] TOP BOT
rs2887286 [T/C] BOT TOP
rs3813199 [A/G] TOP BOT
rs3766186 [A/C] TOP BOT
What is the rule that explains why the SNPs in the first list are on the opposite strand from the reference genome? Or might these data be nonsensical?
I'd have thought that the alleles in the first list were reported on the reverse strand (as dbSNP does for e.g. rs3737728), whereas they are given on the forward strand elsewhere (e.g. Ensembl). I have not checked all of the variants above, but this holds for the ones I did check. Check this FAQ.
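If it helps, the reverse-strand idea can be checked mechanically for the rows you posted. Below is a minimal sketch in Python, not a definitive tool: it complements a1/a2 and tests whether the reference nucleotide then appears among them. The names `classify` and `classify_rows` are my own, the rows are copied from the first list in the question, and the column order (name, chr, position, a1, a2, reference nucleotide) is taken from the description there. A/T and C/G pairs are reported as ambiguous because this test cannot resolve them.

```python
# Complement map for single nucleotides.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Rows copied from the first list in the question:
# name, chr, position, a1, a2, reference nucleotide
ROWS = """\
rs1774963 1 21703207 C T G
rs2257576 1 83736947 T C A
rs315041 1 77055775 A G C
rs3094315 1 752565 C T G
rs3737728 1 1021414 T C A
rs11721 1 1152630 T G C
rs2887286 1 1156130 G A T
rs3813199 1 1158276 T C G
rs3766186 1 1162434 T G C
"""


def classify(a1, a2, ref):
    """Guess the strand of an allele pair from the reference nucleotide.

    Hypothetical helper: returns "ambiguous", "forward", "reverse" or "mismatch".
    """
    alleles = {a1, a2}
    # A/T and C/G pairs look identical on both strands.
    if alleles in ({"A", "T"}, {"C", "G"}):
        return "ambiguous"
    if ref in alleles:
        return "forward"
    # Complement both alleles and test against the reference again.
    if ref in {COMPLEMENT[a] for a in alleles}:
        return "reverse"
    return "mismatch"


def classify_rows(text):
    """Map each SNP name in a whitespace-separated listing to its strand call."""
    results = {}
    for line in text.strip().splitlines():
        name, _chrom, _pos, a1, a2, ref = line.split()
        results[name] = classify(a1, a2, ref)
    return results


if __name__ == "__main__":
    for name, call in classify_rows(ROWS).items():
        print(name, call)
```

For the nine rows shown, every SNP comes out as "reverse", which is consistent with the alleles simply being reported on the opposite strand. Anything that comes out as "mismatch" on the full list would point to a different problem, such as a wrong position or genome build.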