Hi all. Does anyone know how to fetch all the 3'UTR sequences of the Ensembl EPO mammals using the Ensembl Perl API? Thanx in advance.
Hi all. Does anyone know how to fetch all the 3'UTR sequences of the Ensembl EPO mammals using the Ensembl Perl API? Thanx in advance.
Here's something I coded that should do the job :)
#!/usr/bin/env perl
# Perl script to retrieve the 3'-UTR sequence for 12 eutherian mammals
# Coded by Steve Moss (gawbul [at] gmail [dot] com)
# http://stevemoss.ath.cx/
# make things easier
use strict;
use warnings;
#import EnsEMBL and BioPerl modules
use Bio::EnsEMBL::Registry;
use Bio::SeqIO;
use Data::Dumper;
# setup array of all epo mammals
my @epo_mammals = ("homo_sapiens", "gorilla_gorilla", "pan_troglodytes", "pongo_abelii", "macaca_mulatta", "callithrix_jacchus", "mus_musculus", "rattus_norvegicus", "equus_caballus", "canis_familiaris", "sus_scrofa", "bos_taurus");
# connect to EnsEMBL
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db( -host => 'ensembldb.ensembl.org',
-user => 'anonymous');
# use Bio::SeqIO for write sequence to STDOUT
my $outseq = Bio::SeqIO->new( -fh => \*STDOUT,
-format => 'FASTA');
# loop over the mammals
foreach my $mammal (@epo_mammals) {
# get gene adaptor
my $gene_adaptor = $registry->get_adaptor($mammal, 'Core', 'Gene');
# fetch all genes
my @gene_ids = @{$gene_adaptor->list_stable_ids()};
# traverse through genes
foreach my $gene_id (@gene_ids) {
# get gene
my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);
# get canonical transcript
my $transcript = $gene->canonical_transcript();
# check gene has a transcript associated
unless (defined $transcript) {
next;
}
# get 3'-UTR - returns a Bio::Seq object
my $tputr = $transcript->three_prime_utr();
# check transcript has a 3'-UTR annotated
unless (defined $tputr) {
next;
}
# print to STDOUT
print $outseq->write_seq($tputr) . "\n";
}
}
This currently prints the sequence in FASTA format to STDOUT, but you could replace -fh => \STDOUT,* with -file => ">filename", to write to a file instead.
Also, the default Bio::Seq object returned by the transcript object's three_prime_utr method, contains only the sequence and the associated transcript stable_id as the display_id. I would probably create a custom Seq object with gene stable_id, transcript stable_id, and the start and end coordinates of the 3'-UTR. As the three_prime_utr method only returns the Bio::Seq object, it is necessary to determine the start and end of the 3'-UTR based on the start of the transcript and the start of the coding region respectively (or the end of both, if it is on the alternate strand).
Did you look at the Ensembl API Tutorial? There is an example just about retrieving 3' UTRs: http://www.ensembl.org/info/docs/api/core/core_tutorial.html#genes
You can just copy & paste the example in the tutorial and change the name of the species. Note that the example fetches all the UTR in the genes in a chromosomal region: if you want to specify a single gene, you can just use a gene adaptor instead of a slice adaptor.
If you have any specific problem in using the APIs, you can get better answers by editing your question and specifying what is causing you trouble.
I have a script in R to fetch 3UTR via biomaRt: http://ygc.name/2011/03/02/easiest-utr-sequence/
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Hello Steve! Thanks so much for the help, the code works just fine. I also need a list of the transcripts for which there are no annotated UTRs. What would be the modification to the code? I'm not familiar with OO Perl at all. Again than you so much.
In the part that says "unless (defined $tputr) {", you could add "print $transcript->stableid();" above the "next;" statement? You might want to add "my $fputr = $transcript->fiveprime_utr();" below the "my $tputr..." bit and check to see if both $tputr and $fputr are defined? It depends if you want to print the transcript ID if both, or one or the other of the UTRs aren't annotated. Bare in mind that it is likely to be annotation, rather than biologically real lack of UTRs.
It would be useful to get a basic grasp of Perl though, if you want to get to grips with the API better. You can find information on the individual parts of the (Core) API here http://www.ensembl.org/info/docs/Doxygen/core-api/index.html.
Hey again, Steve! Thanks for all the help, my friend. Yeah, I'll start on object-oriented Perl. I can do some Perl but only for text manipulation, not CGI or DBI. Thanks again.
No problem :) Any of the O'Reilly Perl books are great for getting to grips with things - http://shop.oreilly.com/category/browse-subjects/programming/perl.do and http://search.oreilly.com/?q=bioinformatics&x=0&y=0
I'm gonna get some of them no matter what. Two more questions(the last ones i promise): will the script get all of the transcripts reported for each gene? And how to print the resulting fasta file like:
At the moment this is only getting the canonical transcript that is annotated for each gene. In order to get all transcripts you would need to replace my $transcript = $gene->canonicaltranscript(); with my @transcripts = @{$gene->getall_Transcripts()}; and then iterate over this using foreach my $transcript (@transcripts) { ... } where ... Is the code for retrieving the UTRs.
I think you can just add $tputr->desc = $transcript->stableid . "|" . $gene->stableid; and it should add those to the fasta description line (not currently at the 'puter to check that) :-)
Done, Steve. The latter gives the error "Can't modify non-lvalue subroutine call". But I modified the script to print in another file EnsemblTranscriptID|EnsemblGeneID lines (print AMPL ${geneid} . "|" . $transcript->stableid() . "\n";). And added it to each transcript with another simple script. Zillion thanks one more time. Next step for me: Obj-oriented Perl.