I'd like to find transcription start sites for Ensembl genes (mouse, build 67, NCBIM37). Are these annotations available from Ensembl's site, and if so, what file(s) should I be looking at?
Via FTP
Here's the script I wrote to pull mouse exons from Ensembl, release 67 via FTP and filter them for TSS positions. I treat the start position of the first forward strand exon and the stop position of the first reverse strand exon as the position of the TSS:
#!/usr/bin/env perl
use strict;
use warnings;
use File::Path;
sub trimWhitespace($)
{
my $string = shift;
$string =~ s/^\s+//;
$string =~ s/\s+$//;
return $string;
}
sub trimQuotes($)
{
my $string = shift;
$string =~ s/^"//;
$string =~ s/"$//;
return $string;
}
my $ensemblURL = "ftp://ftp.ensembl.org//pub/release-67/gtf/mus_musculus/Mus_musculus.NCBIM37.67.gtf.gz";
my $dataDir = "../data"; if (! -d $dataDir} { mkpath $dataDir; }
my $ensemblGzFn = "$dataDir/Mus_musculus.NCBIM37.67.gtf.gz";
my $ensemblBedFn = "$dataDir/Mus_musculus.NCBIM37.67.gtf.bed";
if (! -e $ensemblGzFn) {
print STDERR "retrieving mouse records...\n";
system ("wget --output-document=$ensemblGzFn $ensemblURL") == 0 or die "could not retrieve Ensembl records from URL: $ensemblURL\n";
}
if (! -e $ensemblBedFn) {
print STDERR "converting mouse records from GTF to sorted BED...\n";
system ("gunzip --stdout $ensemblGzFn | gtf2bed - | sort-bed - > $ensemblBedFn") == 0 or die "could not convert Ensembl record to sorted BED\n";
open my $allRecFh, "<", $ensemblBedFn or die "could not open all records\n";
while (<$allRecFh>) {
chomp;
my ($chr, $start, $stop, $id, $score, $strand, $source, $feature, $frame, $attributesStr) = split("\t", $_);
if ($feature eq "exon") {
my ($tssChr, $tssStart, $tssStop, $tssId, $tssScore, $tssStrand);
my @attributes = map (trimWhitespace($_), split(";", $attributesStr));
my $attributesRef;
foreach my $attributePair (@attributes) {
my @pairItems = split(" ", $attributePair);
my $key = $pairItems[0];
my $value = trimQuotes($pairItems[1]);
$attributesRef->{$key} = $value;
}
# find TSS start position
if (exists $attributesRef->{"gene_name"}) {
$tssChr = "chr$chr";
if ($strand eq "+") {
$tssStart = $start;
$tssStop = $tssStart + 1;
}
elsif ($strand eq "-") {
$tssStart = $stop;
$tssStop = $tssStart + 1;
}
$tssId = $attributesRef->{"gene_name"};
$tssScore = $attributesRef->{"exon_number"};
$tssStrand = $strand;
my $tssGeneId = $attributesRef->{"gene_id"};
if ($tssScore == 1) {
print STDOUT join("\t", ($tssChr,
$tssStart,
$tssStop,
$tssId,
$tssScore,
$tssStrand,
$tssGeneId
))."\n";
}
}
}
}
close $allRecFh;
}
In running this FTP-sourced data search, I get 97639 records (which matches Malachi's answer via BioMart):
$ ./get_mm9_ensembl_tss_records.pl > ../data/mm9.ensembl67.tss.bed
$ wc -l ../data/mm9.ensembl67.tss.bed
97639
Records seem to match up:
$ grep Gm16627 ../data/mm9.ensembl67.tss.bed
chr17 71274336 71274337 Gm16627 1 - ENSMUSG00000085299
$ grep Vmn2r-ps113 ../data/mm9.ensembl67.tss.bed
chr17 18186447 18186448 Vmn2r-ps113 1 + ENSMUSG00000067978
chr17 18201000 18201001 Vmn2r-ps113 1 + ENSMUSG00000067978
I'll mark both answers as correct, since Malachi gives me a "sanity check" result and Emily provided the FTP URL to the Ensembl mouse data.
Via Perl API
From the Ensembl-dev mailing list, I was pointed to the following version of the ensembl portion of the Ensembl Perl API setup process:
http://www.ensembl.org/cvsdownloads/ensembl-67.tar.gz
The remaining components were installed per Emily's instructions in her answer below.
Additionally, the useastdb.ensembl.org
host only supports Releases 71 and 70. It is necessary to use ensembldb.ensembl.org
to access older releases or host the databases locally.
After making these changes to my setup, here is the script I used to pull all exons for all genes from mouse release 67, printing them in a BED6-like format:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use Bio::EnsEMBL::DBSQL::DBAdaptor;
my $host = 'ensembldb.ensembl.org';
my $user = 'anonymous';
my $dbname = 'mus_musculus_core_67_37';
my $port = 5306;
my $db = new Bio::EnsEMBL::DBSQL::DBAdaptor(-host => $host,
-user => $user,
-dbname => $dbname,
-port => $port);
my $slice_adaptor = $db->get_SliceAdaptor();
my $slices = $slice_adaptor->fetch_all('chromosome');
foreach my $slice (@{$slices}) {
my $chr = "chr".$slice->seq_region_name();
my $genes = $slice->get_all_Genes();
foreach my $gene (@{$genes}) {
my $exons = $gene->get_all_Exons();
my $id = $gene->external_name();
my $exon_index = 1;
my $exon_number = $exon_index;
my $exon_count = scalar(@{$exons});
foreach my $exon (@{$exons}) {
my $start = $exon->start();
my $end = $exon->end();
my $stable_id = $exon->stable_id();
my $strand = $exon->strand();
if ($strand == 1) {
$strand = "+";
$exon_number = $exon_index;
}
elsif ($strand == -1) {
$strand = "-";
$exon_number = $exon_count - $exon_index + 1;
}
else {
die "unknown value for strand\n";
}
print STDOUT join("\t", ($chr, $start, $end, $id, $exon_number, $strand))."\n";
$exon_index++;
}
}
}
This does not fully recapitulate the GTF output, and access to this remote database is slow, but it should be enough to filter the output from this script on the score column, retrieving only 1
-scored elements and looking at the strand in order to get the position of the TSS, using awk
or something else.
I'll need to verify that this can output the correct number of single-exons-via-genes, and locally hosting databases is probably a good idea in order to get a quick answer, but this is probably a pretty decent start.
Thanks for the warning. I have observed this in the past as well. Although in this case, I obtained results for 97,639 transcripts. This is the reported number for Ensembl 67 mouse and as far as I could tell there were no missing values in my output. Anyway I agree that it is straight forward to obtain this kind of info via the API, especially if you already have a local MySQL instance of the databases you need to query. On the other hand, if BioMart could be improved to handle larger queries, I'm sure it would be greatly appreciated by the community. I recognize that this can be a challenge though.
Yes, it's quite unpredictable. Sometimes it does it, sometimes it doesn't. If there were a hard and fast rule as to how and when it gives up then at least we could give sensible warnings
We would love to be able to increase its capacity. Either that or replace it with something that works in a similar way, but with higher capacity.
Thanks for your help and the links to the tutorial.
To test the installation, I downloaded/unpacked the BioPerl and Ensembl modules, changed my
PERL5LIB
environment variable and tried running the following test script:However, this does not return any data. From the
print Dumper $registry->get_all_DBAdaptors()
line, I get the following output:As this list is empty, the
foreach
block does not print anything.Is there something I am missing about the setup? I re-read the instructions and I seem to have covered all the listed steps, with the exception you noted regarding the change from
bioperl-live
tobioperl-1.2.3
.Hi Alex. I see Andy from our core team has replied to your query on the Ensembl dev mailing list. Has he solved your problem?
For completeness of answer, would Biostar members like to see what he said written here?
I will try out his suggestions and try to follow up with a write-up here. It may be useful for others who wish to learn the Perl API.