I am interested in generating the flanking sequences (10 nt from each side) from a list of ~500,000 SNPs.
I am using the perl API variation to extract this information, however the script throws an error when it encounters many of the recently discovered SNPs (from the 1000genomes data).
CODE:
use strict;
use warnings;
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::Utils::Slice qw(split_Slices);
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
-host => 'ensembldb.ensembl.org',
-user => 'anonymous'
);
my $va_adaptor = $registry->get_adaptor('human', 'variation', 'variation'); #get the different adaptors for the different objects needed
my $vf_adaptor = $registry->get_adaptor('human', 'variation', 'variationfeature');
my $gene_adaptor = $registry->get_adaptor('human', 'core', 'gene');
my @rsIds = qw(rs6688221 rs113821789 rs112155239 rs10147700 rs74917091);
foreach my $id (@rsIds){
# get Variation object
my $var = $va_adaptor->fetch_by_name($id); #get the Variation from the database using the name
&get_VariationFeatures($var);
}
sub get_VariationFeatures{
my $var = shift;
# get all VariationFeature objects: might be more than 1 !!!
foreach my $vf (@{$vf_adaptor->fetch_all_by_Variation($var)}){
print $vf->variation_name(),","; # print rsID
print $vf->allele_string(),","; # print alleles
#print join(",",@{$vf->get_consequence_type()}),","; # print consequenceType
print substr($var->five_prime_flanking_seq,-10) , "[",$vf->allele_string,"]"; #print the allele string
print substr($var->three_prime_flanking_seq,0,10), ","; # print RefSeq
print $vf->seq_region_name, ":", $vf->start,"-",$vf->end, "\n"; # print position in Ref in format Chr:start-end
#&get_TranscriptVariations($vf); # get Transcript information
}
}
OUTPUT: MSG: Bio::EnsEMBL::Variation::Variation arg expected STACK Bio::EnsEMBL::Variation::DBSQL::VariationFeatureAdaptor::fetchallbyVariation /Users/ensembl/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/VariationFeatureAdaptor.pm:189 STACK main::getVariationFeatures tt.pl:28 STACK toplevel tt.pl:22 Ensembl API version = 60
Looking at the code in VariationFeatureAdaptor.pm:189, it appears these new rs #s are not found in the database being referenced. Does anyone know if there is/or will be an update, or if this data can be accessed another way to get the sequences - in bulk?
FYI - this is now updates and works very well. I've been able to retrieve all SNPs and surrounding seq.
One other question; does anyone know how I might convert a transcript ID (ENSG#####) to a gene name (BRCA1, MYC, etc)? I'd like to add this into the script, so a perl solution would be ideal.
I just sent this answer to you on helpdesk- here you are:
Slice->get_all_Genes(). You can get the gene name via the external_name() subroutine on a gene object.
Hope that helps!