I have a list of SNPs identified by their rsIDs (SNP names), for example rs6699871. From that list, I would like to retrieve the gene sequence variants that result from each SNP. I would expect something like this as output:
rs6699871 var1 CTACATCGATTTGCAGCACCCAGCTGA[C]CCAGAAATCGACAAGTCGACCCGTGCTAGATCGACGA <- (this is the COMPLETE gene or ORF sequence, not the X bp flanking sequence)
rs6699871 var2 CTACATCGATTTGCAGCACCCAGCTGA[T]CCAGAAATCGACAAGTCGACCCGTGCTAGATCGACGA
rs6699871 var3 CTACATCGATTTGCAGCACCCAGCTGA[G]CCAGAAATCGACAAGTCGACCCGTGCTAGATCGACGA
where the [X] marks the location of the SNP and its different forms (C, T, or G). Each line represents one variant of the sequence of a given gene resulting from the SNP rs6699871. The variant could lie in either an intron or an exon.
So my question is: what database would have this SNP versus gene variations map that I'm looking for, or what database could have enough information on SNPs that could allow me to retrieve this output easily?
I have experience with the Linux terminal, Perl, and C++, so if I know where to look I can write a script to retrieve this data automatically for a large list of SNPs.
Thanks!
Thanks for answering, but in dbSNP I can't find the complete sequence (introns and exons included) of the gene that each SNP belongs to. Where exactly do you click to find that?
The dbSNP report for rs6699871 includes 30bp of flanking sequence on each side of the SNP (section 'Submitter records for this refSNP cluster'), which matches your example. The 'Gene View' section says 'Function class: rs6699871 is located in the intron region of XM_017002858.1'. The sequence and annotations (introns/exons) for XM_017002858.1 are available from the NCBI Nucleotide Database, if you truly require the complete gene sequence. Instructions for batch processing of dbSNP and NCBI are available on their respective websites.
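For scripted rather than click-through retrieval from the NCBI Nucleotide Database, one option is the NCBI E-utilities `efetch` endpoint. The following is only a minimal sketch: the helper names `efetch_fasta_url` and `fetch_fasta` are made up for illustration, and it assumes the accession is still current. Note also that an XM_ accession is a predicted mRNA record, so for an unspliced gene sequence a genomic accession or genomic coordinates would probably be needed instead.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession):
    """Build an efetch URL requesting a nucleotide record as FASTA text."""
    params = {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


def fetch_fasta(accession):
    """Download the FASTA record (needs network access; not called here)."""
    with urlopen(efetch_fasta_url(accession)) as response:
        return response.read().decode("utf-8")


# Only the URL is printed, so the sketch runs without a network connection.
print(efetch_fasta_url("XM_017002858.1"))
```

For a list of 100000 SNPs, requests would need to be batched and throttled according to NCBI's usage guidelines.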
Using dbSNP you can get the coordinates of your SNP. You can add some padding to those coordinates (say 20 nucleotides upstream and downstream) and create a BED file containing these intervals. Then you can use bedtools getfasta to extract the sequences. You'll still have to insert the [X] allele markers yourself, but that should be straightforward since you already know the forms (from the dbSNP download).
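The padding step could be sketched roughly as follows. The function name `snp_to_bed_line` and the coordinates are hypothetical (they are not the real position of rs6699871); the only firm assumption is that BED intervals are 0-based and half-open while the SNP position is 1-based.

```python
def snp_to_bed_line(chrom, pos_1based, name, pad=20):
    """Return one BED line covering the SNP plus `pad` bases on each side.

    BED is 0-based and half-open, so a 1-based SNP position p with
    padding spans [p - 1 - pad, p + pad).
    """
    start = max(0, pos_1based - 1 - pad)  # clamp at the chromosome start
    end = pos_1based + pad
    return f"{chrom}\t{start}\t{end}\t{name}"


# Hypothetical coordinates, for illustration only.
snps = [("chr1", 1000, "rs_example_1"), ("chr1", 10, "rs_example_2")]
bed_lines = [snp_to_bed_line(c, p, n) for c, p, n in snps]
print("\n".join(bed_lines))
```

Written to a file (say `snps.bed`, a placeholder name), these lines could then be passed to something like `bedtools getfasta -fi genome.fa -bed snps.bed -name`, where `genome.fa` stands for whatever reference assembly the dbSNP coordinates refer to.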
But perhaps @harold.smith.tarheel had a different approach in mind.
Thanks for your reply! Sorry, maybe I didn't explain myself well in my example: I want the COMPLETE GENE sequence, including introns and exons, not the X bp flanking sequence. And I want an automated method, not a website that I have to click through, because I will have to retrieve this info for more than 100000 SNPs.
Thanks!
Does that mean the spliced gene sequence? What about intronic variants?
You'll need the coordinates and then some bedtools magic to get the right interval(s), then bedtools getfasta to extract the FASTA sequence.
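Once the sequence for a gene interval has been extracted, producing the bracketed variants is mostly index arithmetic. This is a rough sketch with a hypothetical function name (`mark_alleles`) and a toy sequence; it assumes the sequence is the plus-strand interval beginning at the given 1-based start and that the alleles are reported on that same strand, which would need checking for genes on the minus strand.

```python
def mark_alleles(gene_seq, gene_start_1based, snp_pos_1based, alleles):
    """Return one copy of gene_seq per allele, with the SNP base bracketed.

    gene_seq is assumed to be the plus-strand sequence starting at
    gene_start_1based; alleles are assumed to be on the same strand.
    """
    offset = snp_pos_1based - gene_start_1based  # 0-based index into gene_seq
    if not 0 <= offset < len(gene_seq):
        raise ValueError("SNP position falls outside the gene interval")
    left, right = gene_seq[:offset], gene_seq[offset + 1:]
    return [f"{left}[{allele}]{right}" for allele in alleles]


# Toy sequence and coordinates, for illustration only.
variants = mark_alleles("ACGTACGT", 101, 104, ["T", "C", "G"])
for i, variant in enumerate(variants, start=1):
    print(f"rs_example var{i} {variant}")
```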
I mean the unspliced gene sequences (introns + exons). I found the biomaRt R package, which seems to do what I want, but I'm still learning how to use it. I will share the info here if biomaRt is able to extract the complete ORF/gene.
As stated previously, both dbSNP and NCBI support batch queries.
Oh I see it. Thanks! But I still couldn't find a way to automatically download the complete ORF using these queries.
What did you attempt?