Here is the problem: Illumina calls its SNPs AA, AB, or BB. The meaning of A and B depends on what Illumina calls the "top" or "bottom" strand. One of the problems I am facing is that I don't have the original data. All I have is the processed Illumina SNP file with the SNP number and genotype call (AA, AB, BB). These calls should be uniquely translatable into nucleotides.
1) let's assume for a moment that the SNP calls are from an ILMN_Human_1M chip
2) let's say for rs13536 I have a call of BB
3) what nucleotides does this correspond to on the positive strand of the reference genome?
According to Illumina:
Top Strand, Bottom Strand
1: A-G , T-C
2: A-C , T-G
So if I go to dbSNP for rs13536, and I see T/C, I'm dealing with the bottom strand, and I can use this to get the nucleotides.
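The table lookup above can be sketched in Python. This is only an illustration under assumptions that are not in the post: it takes allele A to be the A nucleotide for a top-strand SNP and the T nucleotide for a bottom-strand SNP (my reading of Illumina's TOP/BOT convention), it handles only the four unambiguous pairs from the table (A/T and C/G SNPs need the flanking sequence), and the function names are made up for the example. The nucleotides it returns are on the strand where the alleles were read, which is not necessarily the positive strand of the reference genome.

```python
# Hypothetical helpers; names and the A/B rule are assumptions (see text above).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def top_or_bot(alleles):
    """Classify an unambiguous SNP such as 'T/C' as 'TOP' or 'BOT'."""
    pair = set(alleles.upper().split("/"))
    if len(pair) != 2 or not pair <= set("ACGT"):
        raise ValueError("expected two different nucleotides, e.g. 'A/G'")
    if pair in ({"A", "T"}, {"C", "G"}):
        raise ValueError("ambiguous SNP: flanking sequence is needed")
    # Remaining pairs are A/C, A/G (top) and T/C, T/G (bottom).
    return "TOP" if "A" in pair else "BOT"


def translate(call, alleles):
    """Turn 'AA', 'AB' or 'BB' into nucleotides on the strand of `alleles`."""
    pair = set(alleles.upper().split("/"))
    allele_a = "A" if top_or_bot(alleles) == "TOP" else "T"  # assumed rule
    allele_b = (pair - {allele_a}).pop()
    lookup = {"A": allele_a, "B": allele_b}
    return "".join(lookup[letter] for letter in call.upper())


def complement(nucleotides):
    """Complement each base, for reporting the call on the opposite strand."""
    return "".join(COMPLEMENT[base] for base in nucleotides)


if __name__ == "__main__":
    print(top_or_bot("T/C"), translate("BB", "T/C"))
```

Whether `complement` has to be applied depends on the orientation of the dbSNP record relative to the reference genome, which this sketch does not determine.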
I see that I can solve my problem by determining if the call is top or bottom, by following these instructions:
1 You can compute the top/bottom designation yourself using the data in the /organisms/human_9606/GWAS_arrays/ directory on the dbSNP FTP site.
2 You can look at dbSNP's top/bottom assignment, which you can access if you download the SubSNP.bcp file located in the /database/organism_data/ directory for human. The field that includes the top/bottom data is called SubSNP.top_or_bot_strand. You can access the table DDL for SubSNP in the /database/organism_schema directory.
I do both to make sure my answers are consistent. I grab:
1) ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/GWAS_arrays/ILLUMINA.ILLUMINA_Human_1M.xml.gz
2) ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/database/organism_data/SubSNP_top_or_bot.bcp.gz
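Once the second file is downloaded, the lookup could be done with a short Python sketch like the one below. It assumes the .bcp file is plain whitespace-delimited text and that the ID being searched for is in the first column; the meaning of the other columns is not confirmed here and should be checked against the table DDL. The function name and the local file path are hypothetical.

```python
import gzip


def find_rows(path, snp_id):
    """Yield the fields of every row whose first column equals snp_id.

    Assumption: one record per line, fields separated by whitespace.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if fields and fields[0] == str(snp_id):
                yield fields


if __name__ == "__main__":
    # Hypothetical local copy of the downloaded file.
    for row in find_rows("SubSNP_top_or_bot.bcp.gz", 13536):
        print(row)
```

Reading the file line by line avoids loading the whole table into memory, which matters for a file of this size.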
In ILLUMINA.ILLUMINA_Human_1M.xml, rs13536 is top:
<ss batchid="33668" buildid="127" handle="ILLUMINA" linkouturl="http://www.illumina.com/products/arraysreagents/wgghuman1.ilmnHuman1-rs13536" locsnpid="Human1-rs13536" methodclass="other" moltype="genomic" orient="forward" ssid="65715089" strand="top" subsnpclass="snp" validated="by-submitter">
<Sequence>
<Seq5>TTTCGAACCGAGACAGATGGCAGCTAAATGAAGTTTAATTAAAGAATGAG</Seq5>
<Observed>C/T</Observed>
<Seq3>GCTGGGGCCCTTTTTATTGGGTACTGCATCTACTTCGACCACAAAAGACG</Seq3>
</Sequence>
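To compare many entries rather than reading them by eye, the strand attribute and observed alleles can be pulled out with Python's standard XML parser. This is a minimal sketch: the snippet keeps only some of the attributes from the entry above, the closing `</ss>` tag is added here so that the fragment parses on its own, and the function name is made up for the example.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the rs13536 entry; </ss> is added so the fragment is well-formed.
SNIPPET = """<ss handle="ILLUMINA" locsnpid="Human1-rs13536" orient="forward"
    ssid="65715089" strand="top" subsnpclass="snp">
  <Sequence>
    <Seq5>TTTCGAACCGAGACAGATGGCAGCTAAATGAAGTTTAATTAAAGAATGAG</Seq5>
    <Observed>C/T</Observed>
    <Seq3>GCTGGGGCCCTTTTTATTGGGTACTGCATCTACTTCGACCACAAAAGACG</Seq3>
  </Sequence>
</ss>"""


def summarize_ss(xml_text):
    """Return the fields of one <ss> element that matter for top/bottom checks."""
    ss = ET.fromstring(xml_text)
    return {
        "locsnpid": ss.get("locsnpid"),
        "ssid": ss.get("ssid"),
        "orient": ss.get("orient"),
        "strand": ss.get("strand"),
        "observed": ss.findtext("Sequence/Observed"),
    }


if __name__ == "__main__":
    print(summarize_ss(SNIPPET))
```

For the full ILLUMINA.ILLUMINA_Human_1M.xml file, an incremental parser such as `ET.iterparse` would probably be a better fit than loading everything at once.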
But Illumina states that C/T is bottom. Why is it top here?
In SubSNP_top_or_bot.bcp, rs13536 is bottom, which is consistent with C/T:
13536 B 5
Why is there a conflict between the files?
dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs13536) shows bottom for both ILLUMINA assays. Why is ILLUMINA.ILLUMINA_Human_1M.xml in conflict with these?
Thanks. It seems to me that this naming convention conflicts with what is presented in ILLUMINA.ILLUMINA_Human_1M.xml. In the naming convention file, rs536477 is A/G and TOP. However, the rs536477 entries in the XML file are A/G and strand='bottom'. Does strand have a different meaning in the XML file?