I want to use software for detecting signals of positive selection using genome-wide SNP data. As input, I need the ancestral and derived state for each SNP in my data.
I found this FAQ which points to a file which has the ancestral state for over 60 million SNPs.
I found almost all of my SNPs in that file, but when I checked the ancestral alleles I encountered two problems:
First, some ancestral alleles didn't match my file. For example, in my data this SNP is either C or A:
21 rs17003832 22619035 C A
But the ancestral data file lists T for rs17003832.
Second, some SNPs in the ancestral data file had two alleles reported as ancestral, for example:
rs16997069 C
rs16997069 T
So what should I do? Maybe the first problem could be solved by a strand flip but what about the second one?
Any help would be greatly appreciated.
For your first question, maybe check which version of the reference data the ancestral file is based on.
BTW, I have the same question as you, and I wonder what you did with the SNPs that are not in the ancestral allele data?
Thanks for the comment. Yes, I checked the reference data, and I am now working with the ancestral data for my assembly. Alas, I still get some rs IDs with two ancestral alleles in that file. I am not sure what causes this. Maybe discrepancies between the different outgroups used to infer the ancestral state? I am planning to drop all SNPs with this problem, as well as those not present in the file. For SNPs where I didn't get an exact match, I am planning to first flip the strand and see if the allele then matches, or else drop them. You could also check manually on the dbSNP website to see if it's really a strand flip problem.
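In case it helps, here is a rough Python sketch of that filtering plan. It is only an illustration, not what any particular tool does: the function name `resolve_ancestral`, the in-memory input shapes, and the alleles given for rs16997069 in my data are made up for the example, and I am assuming the ancestral file has already been parsed into (rsID, allele) pairs. Alleles are reported on the strand of my own data.

```python
# Complement table used for the strand flip.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def resolve_ancestral(snp_alleles, ancestral_records):
    """Assign ancestral/derived alleles, dropping SNPs that cannot be resolved.

    snp_alleles: dict mapping rsID -> (allele1, allele2) from my own data.
    ancestral_records: iterable of (rsID, allele) pairs from the ancestral file.
    Returns (resolved, dropped):
      resolved maps rsID -> (ancestral, derived),
      dropped maps rsID -> reason for dropping.
    """
    # Collect every ancestral allele reported per rsID (normalised to upper case).
    ancestral = {}
    for rsid, allele in ancestral_records:
        ancestral.setdefault(rsid, set()).add(allele.upper())

    resolved, dropped = {}, {}
    for rsid, (a1, a2) in snp_alleles.items():
        calls = ancestral.get(rsid)
        if calls is None:
            dropped[rsid] = "not in ancestral file"
            continue
        if len(calls) > 1:
            dropped[rsid] = "conflicting ancestral alleles"
            continue

        anc = next(iter(calls))
        if anc not in (a1, a2):
            # No direct match: try the complementary strand.
            # Anything that is not A/C/G/T becomes None and is dropped below.
            anc = COMPLEMENT.get(anc)

        if anc == a1:
            resolved[rsid] = (a1, a2)
        elif anc == a2:
            resolved[rsid] = (a2, a1)
        else:
            dropped[rsid] = "no match after strand flip"
    return resolved, dropped


# The two SNPs from the question, plus one hypothetical SNP missing from the file.
snps = {
    "rs17003832": ("C", "A"),
    "rs16997069": ("C", "T"),  # alleles in my data are hypothetical here
    "rs_missing": ("G", "A"),  # hypothetical rsID
}
records = [("rs17003832", "T"), ("rs16997069", "C"), ("rs16997069", "T")]

resolved, dropped = resolve_ancestral(snps, records)
print(resolved)
print(dropped)
```

With these inputs, rs17003832 is resolved by the flip (T becomes A, so A is ancestral and C derived), while the other two are dropped. One caveat: for A/T and C/G SNPs a strand flip cannot be detected from the alleles alone, so a "direct match" there may be spurious; those are worth checking by hand or dropping as well.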