Dear experts, I have genotype data that I want to use for GWAS. The data contains all the usual columns except the allele columns, i.e. the Ref and Alt alleles. It has all the other information, such as chromosome number, chromosome position, and the alleles in my samples. It has already been aligned to the reference genome, but I am confused about the Ref and Alt alleles. Is there any way to get them? Is there any software that can extract the reference and alternative alleles? The data is not in any standard format; it's just a text file. I need to find the alleles for association.
Can you please post a small sample of what this data looks like?
Sorry, my genotype data looks like this,
There is no allele column. I tried filling in NAs to make it HapMap format and then converted it to VCF using Tassel. I filled every information column with NAs, including the allele column, because the allele column is not needed for association in some packages, but it is needed for annotating the GWAS results. I need to find the alleles before doing GWAS. This is the format I made by filling in NAs. It is HapMap format. I also converted it to VCF.
Sorry, I am unable to upload the complete picture; there is no option to upload pictures.
Do you mean that in the top sample above you want to know (for example) whether T or A is the REF allele (with the other being the ALT)?
Yes, that's what I am trying to find. I tried a method in Tassel to find the Ref and Alt alleles, but I am not sure whether it is right or wrong. I converted my text file to HapMap format to make it readable by Tassel, filling in NAs in the allele column, and then converted it to VCF. This way, it gives Ref and Alt alleles. Tassel assigns alleles on the basis of allele frequency, i.e. the major allele becomes the REF allele. Is there any other method that can accurately find the alleles?
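To make the frequency-based assignment described above concrete, here is a minimal sketch of counting major and minor alleles at one site. It is not Tassel's code; the function name `major_minor` and the missing-data codes are assumptions for illustration. It also shows why this approach can disagree with the reference genome: the major allele in a panel is simply the most common one, which need not be the reference base.

```python
from collections import Counter

def major_minor(genotypes, missing_calls=("NA", "NN", "--")):
    """Return (major, minor) alleles for one SNP from calls such as 'AA' or 'AT'.

    Assumption: calls are two-letter strings and missing data is coded
    as one of `missing_calls`. This is an illustration, not Tassel's method.
    """
    counts = Counter()
    for call in genotypes:
        call = call.upper()
        if call in missing_calls:
            continue
        for allele in call:
            if allele in "ACGT":
                counts[allele] += 1
    ranked = [allele for allele, _ in counts.most_common()]
    major = ranked[0] if ranked else None
    minor = ranked[1] if len(ranked) > 1 else None
    return major, minor

# Toy calls for a single site across five samples (made-up data).
site_calls = ["AA", "AA", "TT", "AT", "NN"]
print(major_minor(site_calls))
```

If the reference genome carries T at this position, a major-allele rule would still label A as REF, so the result is only a stand-in for the true reference allele.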
I can think of a couple of ways. Is this human data? If so, you can probably use the rsIDs to look up the ref and alt alleles in dbSNP using (I would guess) BioMart. Or if it's not human, but you have a VCF of the known SNP locations in the genome, you can go through and match them up.
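The matching idea could look roughly like the sketch below, assuming a known-variants VCF exists for the species. It reads only the fixed VCF columns (CHROM, POS, REF, ALT) into a dictionary keyed by chromosome and position, then looks up each SNP. The function names and the toy records are hypothetical.

```python
import io

def load_known_sites(handle):
    """Read the fixed VCF columns into {(chrom, pos): (ref, [alt, ...])}."""
    sites = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        sites[(chrom, pos)] = (ref, alt.split(","))
    return sites

def annotate(snps, sites):
    """Attach (ref, alts) to each (chrom, pos) pair; None when the site is unknown."""
    return {key: sites.get(key) for key in snps}

# Toy stand-in for a known-variants VCF (hypothetical sites).
toy_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1000\t.\tT\tA\t.\t.\t.\n"
    "1\t2500\t.\tG\tC,T\t.\t.\t.\n"
)
sites = load_known_sites(toy_vcf)
result = annotate([("1", 1000), ("1", 9999)], sites)
print(result)
```

Chromosome names have to match exactly between the two files (for example "1" versus "chr1"), and positions must refer to the same genome assembly, otherwise every lookup will come back empty.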
Finally, in the absence of all that, I'd guess you could write a script that uses the chromosome and location to look up the reference genome sequence at that position, then marks that base as the REF allele and the other as ALT.
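A script along those lines might look like the following sketch. It assumes the positions are 1-based, the genotypes are on the same strand as the reference, missing calls are coded with N, and the reference FASTA fits in memory. The function names and the toy sequence are made up for illustration.

```python
import io

def read_fasta(handle):
    """Parse a FASTA handle into {sequence_name: sequence}."""
    genome, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                genome[name] = "".join(chunks)
            name = line[1:].split()[0]  # keep the text up to the first space
            chunks = []
        else:
            chunks.append(line.upper())
    if name is not None:
        genome[name] = "".join(chunks)
    return genome

def observed_alleles(genotypes):
    """Collect the distinct bases seen across sample calls such as 'AA' or 'AT'."""
    return {base for call in genotypes for base in call.upper() if base in "ACGT"}

def assign_ref_alt(genome, chrom, pos, observed):
    """Return (ref, alts, ref_seen) for a 1-based position.

    ref_seen is False when the reference base is absent from the observed
    alleles, which may point to a strand or coordinate problem.
    """
    ref = genome[chrom][pos - 1]
    alts = sorted(set(observed) - {ref})
    return ref, alts, ref in observed

# Toy reference and calls (made-up data).
toy_fasta = io.StringIO(">chr1 toy sequence\nACGTAC\nGGTA\n")
genome = read_fasta(toy_fasta)
alleles = observed_alleles(["AA", "TT", "AT"])
call = assign_ref_alt(genome, "chr1", 4, alleles)
print(call)
```

For a few hundred thousand SNPs this is one dictionary lookup and one string index per site. With a very large genome, an indexed FASTA reader would avoid loading every chromosome at once.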
How many lines do you have of this?
Thanks, I have 300K SNPs in my dataset. I cannot search by rsIDs because it is plant data. I think plant SNPs don't have rsIDs like human SNPs do. Therefore, I need to look for other ways to find the alleles.
Can you tell us what AA in sample1 or TT in sample2 is?
AA and TT are SNPs in my sample.