Api Variation Tools And 1000Genomes Dbsnp132
2
2
Entering edit mode
14.0 years ago
Krisr ▴ 470

I am interested in generating the flanking sequences (10 nt from each side) from a list of ~500,000 SNPs.

I am using the perl API variation to extract this information, however the script throws an error when it encounters many of the recently discovered SNPs (from the 1000genomes data).

CODE:

 use strict;
use warnings;
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::Utils::Slice qw(split_Slices);


my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

my $va_adaptor = $registry->get_adaptor('human', 'variation', 'variation'); #get the different adaptors for the different objects needed
my $vf_adaptor = $registry->get_adaptor('human', 'variation', 'variationfeature');
my $gene_adaptor = $registry->get_adaptor('human', 'core', 'gene');

my @rsIds = qw(rs6688221 rs113821789 rs112155239 rs10147700 rs74917091);
foreach my $id (@rsIds){
# get Variation object
  my $var = $va_adaptor->fetch_by_name($id); #get the Variation from the database using the name
  &get_VariationFeatures($var);
}

sub get_VariationFeatures{
  my $var = shift;
  # get all VariationFeature objects: might be more than 1 !!!
  foreach my $vf (@{$vf_adaptor->fetch_all_by_Variation($var)}){
      print $vf->variation_name(),","; # print rsID
      print $vf->allele_string(),","; # print alleles
      #print join(",",@{$vf->get_consequence_type()}),","; # print consequenceType
      print substr($var->five_prime_flanking_seq,-10) , "[",$vf->allele_string,"]"; #print the allele string
      print substr($var->three_prime_flanking_seq,0,10), ","; # print RefSeq
      print $vf->seq_region_name, ":", $vf->start,"-",$vf->end, "\n"; # print position in Ref in format Chr:start-end
      #&get_TranscriptVariations($vf); # get Transcript information
  }
}

OUTPUT: MSG: Bio::EnsEMBL::Variation::Variation arg expected STACK Bio::EnsEMBL::Variation::DBSQL::VariationFeatureAdaptor::fetchallbyVariation /Users/ensembl/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/VariationFeatureAdaptor.pm:189 STACK main::getVariationFeatures tt.pl:28 STACK toplevel tt.pl:22 Ensembl API version = 60

Looking at the code in VariationFeatureAdaptor.pm:189, it appears these new rs #s are not found in the database being referenced. Does anyone know if there is/or will be an update, or if this data can be accessed another way to get the sequences - in bulk?

api variation snp genome perl • 4.6k views
ADD COMMENT
4
Entering edit mode
14.0 years ago

Hi Kris,

You saw this answer at helpdesk@ensembl.org Just to add it to the discussion:

The variation API is designed to work against the Ensembl variation database (only). We currently have dbSNP 131 in release 60 (live version). We will be getting dbSNP 132 in release 61 (due end of January, 2011). If you can wait until then, you can use the API against dbSNP 132 data.

ADD COMMENT
0
Entering edit mode

FYI - this is now updates and works very well. I've been able to retrieve all SNPs and surrounding seq.

One other question; does anyone know how I might convert a transcript ID (ENSG#####) to a gene name (BRCA1, MYC, etc)? I'd like to add this into the script, so a perl solution would be ideal.

ADD REPLY
0
Entering edit mode

I just sent this answer to you on helpdesk- here you are:

Slice->get_all_Genes(). You can get the gene name via the external_name() subroutine on a gene object.

Hope that helps!

ADD REPLY
1
Entering edit mode
14.0 years ago

If all your snps are some rs### from dbsnp what about using the NCBI E-Utilities to fetch the flanking sequence ?

for S in 6688221 113821789 112155239 10147700 74917091
do
   echo -n "rs$S";
   curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=$S&rettype=fasta&retmode=text"|\
   egrep -v "^1" | tr -d " " |\
   awk '/^>/ {s=$0; i=index(s,"|pos=");i+=5; s=substr(s,i); i=index(s,"|"); s=substr(s,1,i-1); len=int(s);seq=""; next;}
   {seq=sprintf("%s%s",seq,$0);}
END { printf("\t%s\t%s\t%s\n",substr(seq,len-10,10),substr(seq,len,1),substr(seq,len+1,10)); }'
done

rs6688221       GTGGCAGAGC      Y       TGAGCCGTTC
rs113821789     ATAGCATTAG      R       AGAAATACCT
rs112155239     CTAACCCTAA      M       CCCTAACCCT
rs10147700      TTTTTTTTTT      K       TTTGTGCATT
rs74917091      TTCTGGGATT      W       TTTTTTTTTT
ADD COMMENT
0
Entering edit mode

Thanks. I'm new to Eutilities, is there a command to extract the major and minor alleles?

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

I am interested in generating sequences containing the allele which is not in the current human reference assembly. Can this data be extracted?

ADD REPLY
0
Entering edit mode

That will be tricky this because you will have to handle the strands of the snp vs the genome. However, you will find the position of the SNP in the XML definition of the SNP ( see NCBI EFetch doc), and to get the reference base use a library (bioperl...) or see this other question (slower)

ADD REPLY

Login before adding your answer.

Traffic: 2497 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6