Hi, I have a file with three columns. One is an rs ID; the others are Allele1 and Allele2. Alleles are presented as nucleotides:
A,G
G,C
G,G
etc.
I need to create a file in Illumina AB format, so I need to convert AG, GC, GG to AB, AA, or BB (depending on the SNP). Can someone explain the best way to do that? And is it even possible with only the information I have?
NB May 16, 2019 - although I discuss A and B alleles mostly in relation to major and minor allele in my comment (below), on the Illumina genotyping arrays, A and B relate to TOP and BOT (coding and non-coding strands).
---------------------------------
It is not an easy feat because A and B alleles can mean different things in different contexts. The common interpretation is that A relates to the major allele, whereas B relates to the minor allele. This raises the question: in which cohort are these the major and minor alleles? The usual reference for this is 1000 Genomes data, but it can also be your study cohort.
Most likely, there will be an annotation file available for the microarray platform that was used, which will [hopefully] contain information on which allele is A or B - ask your colleagues if they know anything about this. If they know nothing, determine the microarray platform that was used and search for the annotation file online.
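As a rough sketch of what the conversion could look like once you have such an annotation file: the Python below assumes you have already parsed it into a dictionary that maps each rs ID to its (A allele, B allele) pair. The function name `to_ab`, the dictionary `ab_lookup`, and the rs IDs and allele assignments in it are all made up for illustration; the real column names depend on your platform's file, and your alleles must be reported on the same strand as the annotation.

```python
def to_ab(rsid, allele1, allele2, ab_lookup):
    """Convert a nucleotide genotype to AB notation.

    ab_lookup maps rs ID -> (nucleotide called A, nucleotide called B).
    Returns 'AA', 'AB', 'BB', or None when the SNP or an allele is unknown.
    """
    if rsid not in ab_lookup:
        return None
    a_nt, b_nt = ab_lookup[rsid]
    code = {a_nt: "A", b_nt: "B"}
    if allele1 not in code or allele2 not in code:
        return None  # e.g. a no-call, or a strand mismatch with the annotation
    # Sort so that heterozygotes are always written 'AB', never 'BA'
    return "".join(sorted(code[allele1] + code[allele2]))


# Hypothetical annotation: made-up rs IDs and A/B assignments
ab_lookup = {"rs0001": ("A", "G"), "rs0002": ("C", "G"), "rs0003": ("T", "G")}

# The three example genotypes from the question, with placeholder rs IDs
rows = [("rs0001", "A", "G"), ("rs0002", "G", "C"), ("rs0003", "G", "G")]
results = [to_ab(rsid, a1, a2, ab_lookup) for rsid, a1, a2 in rows]
print(results)
```

Returning None instead of guessing makes it easy to list the SNPs that need a manual look, which is usually where strand problems show up.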
My other suggestion to you: confirm with your colleagues why AB format is required and which results are actually being requested. Do those results necessitate a conversion to AB format?
Finally, if all else fails, annotate each of your records for 1000 Genomes Phase III allele frequencies, and then set A and B alleles manually based on the allele frequencies for each (A = major; B = minor). This will take you a bit of extra work; however, it is feasible to do.
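To illustrate the frequency-based assignment, here is a minimal Python sketch. It assumes you have already annotated each SNP with the frequencies of its two alleles; the function name `assign_ab`, the rs IDs, and the frequency values are hypothetical and only there to show the logic (A = major, B = minor).

```python
def assign_ab(freqs):
    """Given {nucleotide: frequency} for a biallelic SNP, return (A, B),
    where A is the major (more frequent) allele and B the minor allele."""
    if len(freqs) != 2:
        raise ValueError("expected exactly two alleles")
    # Order the two nucleotides by frequency, highest first
    major, minor = sorted(freqs, key=freqs.get, reverse=True)
    return major, minor


# Hypothetical allele frequencies (not real 1000 Genomes values)
allele_freqs = {
    "rs0001": {"A": 0.80, "G": 0.20},
    "rs0002": {"C": 0.35, "G": 0.65},
}

ab_alleles = {rsid: assign_ab(f) for rsid, f in allele_freqs.items()}
print(ab_alleles)
```

Note that SNPs with frequencies close to 0.5 can flip major and minor between cohorts, so it is worth recording which reference population the assignment was based on.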
Kevin
Thanks for the answers! I learned something about Illumina. AB format is needed because the database works in it. I have since been told that all alleles in the document are TOP alleles. Does that change anything, or do I still need to use the manifest file and try to deal with it in R? Best regards!
In that case, your data is just A alleles. While the chip likely originally included A and B alleles, for downstream processing, sometimes we filter out all SNPs from one strand, i.e., those on the non-coding strand, as I mention in step 6, here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2