For a positive selection test that I want to use, I need the ancestral states of all SNPs present in my data.
I checked this FAQ from NCBI, followed the instructions, and downloaded a file that contains the rs number, physical position, and ancestral state of over 60 million SNPs. However, as a simple test, when I tried to match some SNPs in my data on rs number and physical position, I didn't get any match. But when I entered the SNP on the dbSNP website, I could find it with a putative ancestral state and a matching physical position.
The last update of the downloaded file is March 2014, but I couldn't find a reference to the build.
Are there other places where I could get the ancestral states of SNPs? Or find an updated file from dbSNP?
Thank you in advance.
EXTRA INFORMATION FOR COMMENT
Example of an rsSNP in my data:
This is an rsSNP present in my data with its physical position based on the GRCh37 assembly.
rs2823639 17576565
When I check the SNPAncestralAllele.bcp.gz file for this rsSNP I get these matches:
rs2823639 0 A
rs2823639 1050982 A
rs2823639 1052591 A
rs2823639 1056295 A
rs2823639 1056571 A
rs2823639 1061835 A
However, the information on the dbSNP website is this:
GRCh38 16204245
GRCh37.p13 17576565
Ancestral allele: A
The ancestral state is the same but the physical position is not.
Can you post some sample rs#s from your dataset? Also what is the name of the file you downloaded?
Did you have a look at this instruction for getting ancestral SNP state?
http://www.ncbi.nlm.nih.gov/sites/books/NBK44409/#Build.how_do_i_download_a_flat_file_that
Yes, I checked the instructions and downloaded two files: Allele.bcp.gz and SNPAncestralAllele.bcp.gz. See my edited post for an example of an rs# from my sample.
Thanks for the help!
I am not sure we are seeing the same SNPAncestralAllele file.
The column definitions for the SNPAncestralAllele file from human_9606_table.sql is
The second column in the table you posted is not the chromosomal position but the batch_id.
The chromosome position can be obtained from the b142_SNPChrPosOnRef_106.bcp file (for GRCh38). The column definitions for this file (again from human_9606_table.sql) is
The chromosome position for rs2823639 from b142_SNPChrPosOnRef_106.bcp file is
The reason for the -1 difference in chromosome position in .bcp file (compared to the dbSNP website) is explained here
The FTP files I linked are for the GRCh38. You can get the corresponding files for GRCh37.p13 here
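As a rough illustration of the position check described above, here is a minimal Python sketch. It assumes you have already parsed the SNPChrPosOnRef file into (snp_id, position) pairs; the function name `match_positions` and the input layout are hypothetical, and the chromosome column is left out for brevity. The only fact taken from this thread is that the position in the .bcp file is 1 lower than the position shown on the dbSNP website.

```python
from typing import Dict, Iterable, Tuple


def match_positions(
    my_snps: Dict[int, int],
    bcp_rows: Iterable[Tuple[int, int]],
) -> Dict[int, bool]:
    """Check whether each SNP in my_snps has a matching position in the bcp rows.

    my_snps  : {rs number without the 'rs' prefix: position as shown on the website}
    bcp_rows : (snp_id, position as stored in the .bcp file) pairs (assumed layout)
    """
    # Add 1 to the .bcp position so it is comparable to the website coordinate.
    website_pos = {snp_id: pos + 1 for snp_id, pos in bcp_rows}
    # A SNP missing from the bcp rows is reported as not matching.
    return {rs: website_pos.get(rs) == pos for rs, pos in my_snps.items()}


# rs2823639 at GRCh37 position 17576565, from the example in the question.
my_data = {2823639: 17576565}
# The same SNP as it would appear in the .bcp file, one lower.
bcp_sample = [(2823639, 17576564)]

result = match_positions(my_data, bcp_sample)
```

In a real run you would also compare the chromosome, since a position alone is not unique across chromosomes.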
Thank you so much Siva! I downloaded the new files for GRCh37 and will try to match my rs numbers and physical positions to them. I have another question though: b142_SNPChrPosOnRef_105.bcp and SNPAncestralAllele.bcp have different numbers of rows. Shouldn't they be the same?
You are welcome. The b142_SNPChrPosOnRef_105.bcp file has unique rows (chromosome position) for each snp_id, whereas there can be more than one row (multiple submissions/batch_ids) for the same snp_id in the SNPAncestralAllele.bcp file. In the example you posted in your original post, there are 6 batch_ids for 1 snp_id.
Thank you so much! This solved all my questions!
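Because SNPAncestralAllele.bcp can hold several rows per snp_id, one way to reconcile it with the one-row-per-SNP position file is to collapse the batch rows first. The sketch below is only an illustration: it assumes the rows have been parsed into (snp_id, batch_id, allele) tuples in the order shown in the question, and the function name `collapse_batches` is hypothetical. It does not decide which batch to trust; SNPs whose batches disagree are simply set aside.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def collapse_batches(
    rows: Iterable[Tuple[int, int, str]],
) -> Tuple[Dict[int, str], Dict[int, List[str]]]:
    """Collapse (snp_id, batch_id, allele) rows to one entry per snp_id.

    Returns (consistent, conflicting):
      consistent  : snp_id -> the single allele reported by every batch
      conflicting : snp_id -> sorted list of the differing alleles
    """
    alleles = defaultdict(set)
    for snp_id, _batch_id, allele in rows:
        alleles[snp_id].add(allele)

    consistent = {s: next(iter(a)) for s, a in alleles.items() if len(a) == 1}
    conflicting = {s: sorted(a) for s, a in alleles.items() if len(a) > 1}
    return consistent, conflicting


# The six rows for rs2823639 posted in the question.
rows = [
    (2823639, 0, "A"),
    (2823639, 1050982, "A"),
    (2823639, 1052591, "A"),
    (2823639, 1056295, "A"),
    (2823639, 1056571, "A"),
    (2823639, 1061835, "A"),
]

consistent, conflicting = collapse_batches(rows)
```

For the example SNP all six batches agree, so it ends up in `consistent` with a single allele and nothing is flagged.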
Hi Siva, I just encountered another problem. For several rsSNPs I found that different batches point to different ancestral alleles. Which batch number should I trust? The latest one? I searched for information on batches on the dbSNP website but couldn't find anything.