I am designing a genotype panel to distinguish several Mus musculus strains (FVB/N, 129/SV, C57BL/6) from Mus spretus at ~400 loci spread through the mouse genome. Illumina sells a mouse genotype panel, but it is not designed against spretus, and distinguishing spretus is the primary requirement. It would be convenient, but not required, to use polymorphisms that are already in dbSNP.
Jax Informatics has a nice query interface, apparently back-ended by dbSNP, but I can't see any way to ask it for only those loci where there is a known genotype for all four strains and where spretus is distinct from the other three. I know the information I want must be in dbSNP. I don't think I can download the polymorphisms from Jax, presumably because they get them from dbSNP. I could pull down the results of a lot of queries from Jax and grind through them, but that's not very elegant. Any ideas on how to solve this in code?
EDITED TO SHOW AN EXAMPLE:
rs4222137, at chr1, 4,678,222 is G for Mus spretus and A for B6, 129, and FVB (reverse strand). Output ideally would be the rs ID, the location, and the genotypes:
SNP CHR LOCATION B6 129 FVB SPR
rs4222137 chr1 4678222 A A A G
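The selection rule itself is simple once the per-strain alleles are in hand. Here is a minimal Python sketch, assuming each locus has already been reduced to one allele per strain. The names `is_informative` and `format_row`, and the use of the column labels as dictionary keys, are illustrative choices and not part of any dbSNP or Jax API:

```python
# Column labels taken from the example table above (assumed dictionary keys).
MUSCULUS = ("B6", "129", "FVB")
SPRETUS = "SPR"


def is_informative(genotypes):
    """Return True if all four strains have a call and spretus differs
    from each of the three musculus strains."""
    strains = MUSCULUS + (SPRETUS,)
    # Reject loci with a missing or empty call for any strain.
    if any(not genotypes.get(strain) for strain in strains):
        return False
    spretus_allele = genotypes[SPRETUS]
    return all(genotypes[strain] != spretus_allele for strain in MUSCULUS)


def format_row(rs_id, chrom, position, genotypes):
    """Format one locus in the SNP CHR LOCATION B6 129 FVB SPR layout."""
    alleles = [genotypes[strain] for strain in MUSCULUS + (SPRETUS,)]
    return " ".join([rs_id, chrom, str(position)] + alleles)


example = {"B6": "A", "129": "A", "FVB": "A", "SPR": "G"}
if is_informative(example):
    print(format_row("rs4222137", "chr1", 4678222, example))
```

This version only requires spretus to differ from each musculus strain; it does not require the three musculus strains to agree with each other, which could be added as an extra check if the panel needs it.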
The XML files that Peter linked may solve it. They have the form:
<SnpInfo rsId="3023491" observed="C/T">
<SsInfo ssId="4319850" locSnpId="X86368_367C_4" ssOrientToRs="fwd">
<ByPop popId="1064" hwProb="0.001" hwChi2="15" hwDf="1" sampleSize="30">
<GTypeByInd gtype="T/T" indId="2920"/>
<GTypeByInd gtype="C/C" indId="4464"/>
...
Where I think 2920 is C3H/HEJ and 4464 is CAST/EI (these are mouse strains). Time to break out the SAX parser...
In this case, don't use SAX; use StAX. Also, it is still not clear to me: the three musculus strains are FVB/N, 129/SV, and C57BL/6, but what is the strain for spretus?
Don't use SAX here; use StAX (Streaming API for XML).
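StAX is a Java API. The same pull-style streaming idea can be sketched in Python with `xml.etree.ElementTree.iterparse`, which is a different library, not StAX. This is only a sketch under several assumptions: the closing tags and the `Root` wrapper in the sample are guesses that complete the truncated snippet from the question, the real files may use an XML namespace (stripped here), and the `indId` values for the strains of interest have to be looked up separately. The function name `iter_genotypes` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' prefix, if present, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_genotypes(source, wanted_ind_ids):
    """Stream the file and yield (rs_id, {ind_id: gtype}) per SnpInfo,
    keeping only the individuals listed in wanted_ind_ids."""
    calls = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "GTypeByInd":
            ind_id = elem.get("indId")
            if ind_id in wanted_ind_ids:
                # If an individual appears more than once, the last call wins.
                calls[ind_id] = elem.get("gtype")
        elif name == "SnpInfo":
            yield "rs" + elem.get("rsId"), calls
            calls = {}
            elem.clear()  # free the finished subtree to keep memory low


# Sample built from the snippet in the question; the closing tags and the
# Root wrapper are assumptions, not the confirmed file layout.
SAMPLE = b"""<Root>
<SnpInfo rsId="3023491" observed="C/T">
<SsInfo ssId="4319850" locSnpId="X86368_367C_4" ssOrientToRs="fwd">
<ByPop popId="1064" hwProb="0.001" hwChi2="15" hwDf="1" sampleSize="30">
<GTypeByInd gtype="T/T" indId="2920"/>
<GTypeByInd gtype="C/C" indId="4464"/>
</ByPop>
</SsInfo>
</SnpInfo>
</Root>"""

records = list(iter_genotypes(io.BytesIO(SAMPLE), {"2920", "4464"}))
print(records)
```

Each yielded dictionary is keyed by `indId`, so a separate lookup from `indId` to strain name is still needed before filtering for loci where spretus differs from the other three strains.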