Hi,
I downloaded ENCODE genotype calls from the UCSC Table Browser. The genotypes were obtained with the Illumina 1M-Duo, but they come in the AA/AB/BB format.
I have been trying to understand how to convert them to A,C,G,T...
I guess there should be an allele table that would tell me which nucleotides the A and B alleles correspond to for each SNP?
This is the file I downloaded.
Thanks!
Ines
It can get a little more complicated than that because of strand issues. This isn't specific to ENCODE; it is an Illumina format. Illumina has a PDF technote for part of this issue here.
If you know which Illumina chip was used to get those genotypes, you can always try to find the map file needed for the allele translation yourself on their website.
Hi! Actually, I found some files on the Illumina website, but they don't really say anything about allele A and allele B. They say something about TOP/BOT alleles. I was hoping someone would have made an R package or some kind of script to deal with this issue in a more straightforward way.
Once you have the allele translations in a file, it should be very simple to build a mapping variable (a hash in Perl, for instance), which you would then use to parse your data file. If you state here which Illumina file you're looking at, or even paste some example lines, it would be easier to give you further advice.
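To illustrate the mapping idea, here is a minimal sketch in Python (a dict playing the role of the Perl hash). The SNP IDs and alleles below are made up for illustration, and the function name translate_call is hypothetical; the real A/B alleles would have to come from the annotation file for the chip, whose layout is not shown in this thread.

```python
# Hypothetical allele map: SNP id -> (allele A, allele B).
# These entries are invented; in practice they would be loaded from
# the chip's annotation file.
allele_map = {
    "rs0000001": ("A", "G"),
    "rs0000002": ("C", "T"),
}


def translate_call(snp_id, ab_call, allele_map):
    """Translate an AA/AB/BB call into nucleotides.

    Returns None when the SNP is not in the map or the call contains
    anything other than the letters A and B (for example a no-call).
    """
    alleles = allele_map.get(snp_id)
    if alleles is None:
        return None
    lookup = {"A": alleles[0], "B": alleles[1]}
    try:
        return "".join(lookup[letter] for letter in ab_call)
    except KeyError:
        return None


if __name__ == "__main__":
    for snp_id, call in [("rs0000001", "AB"), ("rs0000002", "BB")]:
        print(snp_id, call, translate_call(snp_id, call, allele_map))
```

Note that the translated genotypes are only on whatever strand the map file reports, so the strand issue mentioned above still has to be handled separately.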
There is a relationship between Illumina's TOP/BOT designation and their A/B designation; however, I don't think it maps to dbSNP's top/bottom designation for a SNP. Is there any way to get the ENCODE data in another format? I am sure that when they originally did the genotyping, they should have been able to export the data from Illumina GenomeStudio both in the A/B format and as raw genotypes. I'd be surprised if they didn't offer the dataset in the alternative format. Depending on what you are going to do with the data, you may find it simpler (if it is possible) to just work with it in the A/B format. Many programs will accept it quite readily.