Hello, I am new to genomics and a bit confused about an issue that might sound trivial to many of you. It has to do with ancestral and derived alleles in SNP data.
I have 2 data files (df1 and df2) containing SNP information from two human populations, generated by PLINK (.frq format). These files contain the minor (A1) and major (A2) alleles for every SNP. As I understand it, minor and major alleles correspond to alternate and reference alleles, respectively (is this correct?). Moreover, I understand that reference alleles are USUALLY the same as ancestral alleles.
So, for my SNP list, in order to get their ancestral alleles, I downloaded the file "SNPAncestralAllele.bcp" following the instructions given here: http://www.ncbi.nlm.nih.gov/books/NBK44409/#Build.allele_bcp_gz_and_snpancestralalle
Now comes the part where I am confused. When I match the ref (major) alleles in df1 or df2 against the ancestral alleles in "SNPAncestralAllele.bcp", approx. 2/5 of the ref alleles do not match the ancestral allele listed in that file. I am surprised that so many do not match. Could this be because the major alleles in df1 and df2 are reported on the (-) DNA strand?
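To make the comparison concrete, here is a minimal Python sketch of the kind of check I mean. The function name classify_snp and the category labels are made up for illustration. It assumes single-base alleles, with A1 = minor and A2 = major as in the PLINK .frq file, and it does not parse any real file. A/T and C/G SNPs are set aside, because for those a strand flip cannot be detected from the alleles alone.

```python
# Hypothetical helper: compare one SNP's minor/major alleles with its
# ancestral allele, also trying the complementary (-) strand.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def classify_snp(a1_minor, a2_major, ancestral):
    """Return an illustrative label describing how the ancestral allele
    relates to the minor (A1) and major (A2) alleles."""
    a1 = a1_minor.upper()
    a2 = a2_major.upper()
    anc = ancestral.upper()

    # Ancestral allele missing or not a single base (e.g. "N", "-", "").
    if anc not in COMPLEMENT:
        return "unknown"

    # A/T and C/G SNPs look identical on both strands, so a flip
    # cannot be told apart from a real minor/major swap.
    if COMPLEMENT.get(a1) == a2:
        return "strand_ambiguous"

    # Direct match on the strand as reported.
    if anc == a2:
        return "major_is_ancestral"
    if anc == a1:
        return "minor_is_ancestral"

    # Match only after complementing, i.e. a possible strand flip.
    flipped = COMPLEMENT[anc]
    if flipped == a2:
        return "major_is_ancestral_flipped"
    if flipped == a1:
        return "minor_is_ancestral_flipped"

    return "no_match"


if __name__ == "__main__":
    # Toy examples only, not real data: (A1 minor, A2 major, ancestral).
    examples = [("A", "G", "G"), ("A", "G", "A"), ("A", "G", "C"), ("A", "T", "A")]
    for a1, a2, anc in examples:
        print(a1, a2, anc, classify_snp(a1, a2, anc))
```

Counting how many SNPs fall into each label would show whether the mismatches are mostly explained by strand ("..._flipped") or by the minor allele being the ancestral one.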
So, is my reasoning correct? And can I assume that, for the SNPs that do not match, the alternate (minor) alleles represent the derived alleles in my data?
Thanks in advance for your help!