Hello!
I am using Ensembl's public MySQL server (http://uswest.ensembl.org/info/data/mysql.html) to get coordinates and alleles for a set of rsIDs. I am querying the variation_feature table in the homo_sapiens_variation_85_38 database. I have noticed that for a subset of variants, seq_region_end is less than seq_region_start. For example, for rs2066847, seq_region_start is 50729868 and seq_region_end is 50729867. It looks as if the start and end coordinates have been swapped, but the alleles returned are correct ("-/C") and are not the reverse complement. See http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=2066847.
With two simple queries, I found that there are 5336568 entries where seq_region_end is less than seq_region_start and 4996794 entries where seq_region_end is greater than seq_region_start.
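For reference, the two counts can be reproduced with queries along the lines of the sketch below. It runs against a small in-memory SQLite table rather than the Ensembl MySQL server, so the reduced schema, the helper names (build_demo_db, count_by_coordinate_order) and every row except rs2066847 are assumptions made for illustration; the same SELECT statements should carry over to the real variation_feature table.

```python
import sqlite3


def build_demo_db():
    """Create a tiny, hypothetical stand-in for variation_feature."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variation_feature ("
        "variation_name TEXT, "
        "seq_region_start INTEGER, "
        "seq_region_end INTEGER, "
        "allele_string TEXT)"
    )
    rows = [
        ("rs2066847", 50729868, 50729867, "-/C"),  # values quoted in the post
        ("rs_demo_snv", 1000, 1000, "A/G"),        # made-up row
        ("rs_demo_del", 2000, 2002, "TTC/-"),      # made-up row
    ]
    conn.executemany("INSERT INTO variation_feature VALUES (?, ?, ?, ?)", rows)
    return conn


def count_by_coordinate_order(conn):
    """Return (rows with end < start, rows with end > start)."""
    end_before_start = conn.execute(
        "SELECT COUNT(*) FROM variation_feature "
        "WHERE seq_region_end < seq_region_start"
    ).fetchone()[0]
    end_after_start = conn.execute(
        "SELECT COUNT(*) FROM variation_feature "
        "WHERE seq_region_end > seq_region_start"
    ).fetchone()[0]
    return end_before_start, end_after_start


if __name__ == "__main__":
    print(count_by_coordinate_order(build_demo_db()))
```

Rows where start equals end fall into neither count, so the two numbers do not add up to the size of the table.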
Can anyone explain what's going on here or point me to some documentation explaining this? Does this have to do with forward vs. reverse strand? From what I've seen so far, it seems like all insertions have start and end reversed - is that true? If so, why?
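One way to test the insertion hypothesis would be to split the end-before-start rows by allele string, as in the sketch below. It assumes that an allele string beginning with "-/" marks an insertion, which is only a heuristic, and it again uses a hypothetical in-memory SQLite table with made-up rows (apart from rs2066847) instead of the real database.

```python
import sqlite3


def build_demo_db():
    """Create a tiny, hypothetical stand-in for variation_feature."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variation_feature ("
        "variation_name TEXT, "
        "seq_region_start INTEGER, "
        "seq_region_end INTEGER, "
        "allele_string TEXT)"
    )
    rows = [
        ("rs2066847", 50729868, 50729867, "-/C"),  # values quoted in the post
        ("rs_demo_ins", 3001, 3000, "-/AT"),       # made-up insertion
        ("rs_demo_snv", 1000, 1000, "A/G"),        # made-up SNV
        ("rs_demo_del", 2000, 2002, "TTC/-"),      # made-up deletion
    ]
    conn.executemany("INSERT INTO variation_feature VALUES (?, ?, ?, ?)", rows)
    return conn


def reversed_rows_by_allele_type(conn):
    """Among rows with end < start, count (insertion-like, other) alleles.

    'Insertion-like' here means the allele string starts with '-/',
    which is an assumption, not an Ensembl definition.
    """
    query = """
        SELECT
            SUM(CASE WHEN allele_string LIKE '-/%' THEN 1 ELSE 0 END),
            SUM(CASE WHEN allele_string NOT LIKE '-/%' THEN 1 ELSE 0 END)
        FROM variation_feature
        WHERE seq_region_end < seq_region_start
    """
    insertion_like, other = conn.execute(query).fetchone()
    # SUM returns NULL (None) when no rows match the WHERE clause.
    return insertion_like or 0, other or 0


if __name__ == "__main__":
    print(reversed_rows_by_allele_type(build_demo_db()))
```

If the second number came back as zero on the real table, that would be consistent with every end-before-start entry being an insertion, though it would not explain why the coordinates are stored that way.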
Thanks!
Jon
I found this in the Ensembl documentation: "Most of our SNPs and short insertion-deletions are from NCBI dbSNP. Variants in dbSNP can be on either the forward or reverse strand. Ensembl determines the forward-stranded allele and reports it."
So when Ensembl imports variants from dbSNP, they correct the alleles to reflect the forward strand but do not correct the coordinates? Am I missing something?