Downloading all introns from Ensembl for Grch37
3
0
Entering edit mode
8.4 years ago
mmacd ▴ 20

I was wondering if anyone could give me advice on acquiring all introns from human Grch37 from Ensembl? I can get the introns from human Grch38 through Perl API using the following code:

use strict;
use warnings;

use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::Utils::SeqDumper;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(-host => 'ensembldb.ensembl.org',
                                 -port => 5306,
                                 -user => 'anonymous',
                                 -passwd => undef,
                                 -db_version => '84');

my $gene_adapter = $registry->get_adaptor('human', 'Core', 'Gene');

my $dumper = Bio::EnsEMBL::Utils::SeqDumper->new();

while(my $gene_id = shift(@{$gene_adapter->list_stable_ids()})) {
    my $gene = $gene_adapter->fetch_by_stable_id($gene_id);
    my $canonical_transcript = $gene->canonical_transcript();
    while(my $intron = shift(@{$canonical_transcript->get_all_Introns()})) {
        $dumper->dump($intron->feature_Slice(), 'FASTA', 'introns.fasta');
    }
}

However, when I try changing the -db_version to a db version that uses Ghrc37, such as release 64, I get the following error:

For homo_sapiens_core_64_37 there is a difference in the software release (84) and the database release (64). You should update one of these to ensure that your script does not crash. DBD::mysql::st execute failed: Unknown column 'stable_id' in 'field list' at /home/.../src/ensembl/modules/Bio/EnsEMBL/DBSQL/BaseAdaptor.pm line 312.
    -------------------- EXCEPTION -------------------- 
MSG: Detected an error whilst executing SQL 'SELECT `stable_id` FROM `gene`': DBD::mysql::st execute failed: Unknown column 'stable_id' in 'field list' at /home/.../src/ensembl/modules/Bio/EnsEMBL/DBSQL/BaseAdaptor.pm line 312.
    STACK Bio::EnsEMBL::DBSQL::BaseAdaptor::_list_dbIDs /home/.../src/ensembl/modules/Bio/EnsEMBL/DBSQL/BaseAdaptor.pm:313 STACK Bio::EnsEMBL::DBSQL::GeneAdaptor::list_stable_ids /home/.../src/ensembl/modules/Bio/EnsEMBL/DBSQL/GeneAdaptor.pm:152 STACK toplevel get_introns.pl:19 Date (localtime)    = Thu Jul 14 16:04:48 2016 Ensembl API version = 84
    ---------------------------------------------------

Anyone have advice on this or can give an easier way to get the introns? Thank you!

EnsEMBL Introns Intron Homo Sapiens Perl API • 3.2k views
ADD COMMENT
2
Entering edit mode
8.4 years ago
Denise CS ★ 5.2k

Get the API for the previous assembly from the Ensembl GRCh37 archive. You can obtain the API code as a gzipped tarball. You should use the homo_sapiens_core_84_37. Also, use port 3337, which will connect to the GRCh37 db on the latest schema.

ADD COMMENT
0
Entering edit mode

Denise is correct about using port 3337, however she is wrong about the API itself. You can use the same API to access GRCh37 as GRCh38.

ADD REPLY
0
Entering edit mode

So I am using the same code as I posted above but with port 3337 and homo_sapiens_core_84_37 as the db. I am using the newest API. However, now im getting the following error:

 -------------------- WARNING ----------------------
MSG: Human is not a valid species name (check DB and API version)
FILE: Bio/EnsEMBL/Registry.pm LINE: 1327
CALLED BY: Bio/EnsEMBL/Registry.pm  LINE: 1102
Date (localtime)    = Fri Jul 15 14:24:00 2016
Ensembl API version = 84
---------------------------------------------------

-------------------- EXCEPTION --------------------
MSG: Can not find internal name for species 'Human'
STACK Bio::EnsEMBL::Registry::get_adaptor /home/.../src/ensembl/modules/Bio/EnsEMBL/Registry.pm:1104
STACK toplevel get_introns.pl:15
Date (localtime)    = Fri Jul 15 14:24:00 2016
Ensembl API version = 84
---------------------------------------------------

I also tried replacing human with homo sapiens with no luck. Any thoughts?

ADD REPLY
0
Entering edit mode

Oh ignore the post before. If I use port 3337 with no db version specified, I get introns from GRCh37. Thank you both so much for your help!

I am having an additional issue now. My code seems to hang on the first canonical transcript's intron. It prints it multiple times and then hangs.

Here the updated code:

use strict;
use warnings;

use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::Utils::SeqDumper;

sub feature2string
{
    my $feature = shift;
    my $gene_id = shift;

    my $seq_region = $feature->slice->seq_region_name();
    my $start      = $feature->start();
    my $end        = $feature->end();
    my $strand     = $feature->strand();

    return sprintf( "%s %s:%d-%d (%+d)",
        $gene_id, $seq_region, $start, $end, $strand );
}

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(-host => 'ensembldb.ensembl.org',
                                 -port => 3337,
                                 -user => 'anonymous',
                                 -passwd => undef);

my $gene_adapter = $registry->get_adaptor('human', 'Core','Gene');

my $dumper = Bio::EnsEMBL::Utils::SeqDumper->new();

while(my $gene_id = shift @{($gene_adapter->list_stable_ids())}) {

    my $gene = $gene_adapter->fetch_by_stable_id($gene_id);
    my $canonical_transcript = $gene->canonical_transcript();

    while(my $intron = shift @{($canonical_transcript->get_all_Introns())}) {
        $dumper->dump($intron->feature_Slice(), 'FASTA', 'introns.fasta');
        my $can_transcript_id = $canonical_transcript->stable_id();
        my $istring = feature2string($intron, $can_transcript_id);
        print "$istring\n";
    }
}

Any help is appreciated.

ADD REPLY
0
Entering edit mode

Changing the while loops to foreach seemed to do the trick.

ADD REPLY
0
Entering edit mode
ADD COMMENT
1
Entering edit mode

I want the introns associated with ENSEMBL ids specifically, not UCSC ids (this is actually what we were using originally but UCSC ids are hard to convert to other ids or to give them functional meaning, so we are trying to move away from using them). Thanks though!

ADD REPLY
0
Entering edit mode

In table browser you have options for that too. Example follow the first link except in your case select track: 'Ensembl genes' and table: 'ensGene' and you will get result with Ensembl Ids.

Sample Output:

chr1 66999090 66999928 ENST00000237247_intron_0_0_chr1_66999091_f 0 +

chr1 67000051 67091529 ENST00000237247_intron_1_0_chr1_67000052_f 0 +

chr1 67091593 67098752 ENST00000237247_intron_2_0_chr1_67091594_f 0 +

chr1 67098777 67099762 ENST00000237247_intron_3_0_chr1_67098778_f 0 +

chr1 67099846 67105459 ENST00000237247_intron_4_0_chr1_67099847_f 0 +

chr1 67105516 67108492 ENST00000237247_intron_5_0_chr1_67105517_f 0 +

chr1 67108547 67109226 ENST00000237247_intron_6_0_chr1_67108548_f 0 +

Follow this link for complete output (GrCh37):

https://genome-euro.ucsc.edu/cgi-bin/hgTables?hgsid=216552565_gR3lVbaZ3inTsmltiXKxN2MZzFVa&boolshad.hgta_printCustomTrackHeaders=0&hgta_ctName=tb_ensGene&hgta_ctDesc=table+browser+query+on+ensGene&hgta_ctVis=pack&hgta_ctUrl=&fbUpBases=200&fbExonBases=0&fbQual=intron&fbIntronBases=0&fbDownBases=200&hgta_doGetBed=get+BED


For converting UCSC ids to Ensembl, table browser can return table like this,

name value

uc022bqo.2 ENST00000389680

uc004cor.1 ENST00000387342

uc004cos.5 ENST00000387347

uc031tga.1 ENST00000361624

uc011mfi.2 ENST00000361739

By following these three simple steps,

  1. Track: Ensembl genes

  2. Table: KnownToEnsembl

  3. Get Output

It is FUN playing with UCSC table browser. Try it :-)

ADD REPLY
0
Entering edit mode

Tried that too, since keeping UCSC and then converting would have been much easier, but only a very small portion (less than 10%) of my UCSC ids could be converted! The table browser is nice and easy to use, I just hope they get an updated KnownToEnsembl table soon where more ids can be converted.

ADD REPLY
0
Entering edit mode

But how about the first procedure, did you manage to get introns with ensembl ids ?

ADD REPLY
0
Entering edit mode

Do you know if the coordinates for Hg19 and Grch37 the same? Since that gives me intron coordinates based on Hg19. If they are the same, then i think that would have worked.

ADD REPLY
0
Entering edit mode

Yes, they are the same. GRCh37 = hg19

ADD REPLY
0
Entering edit mode
8.4 years ago
igor 13k

There is a much simpler solution. If you have the Ensembl GTF, you can use grep to create 2 BED files from it - 1 for "gene" (includes everything) and 1 for "exon" (includes just coding). Then use something like bedtools to subtract exons from genes.

ADD COMMENT
0
Entering edit mode

Seems interesting !! Is this proceture possible to handle transcript level intron information (since you specify 'gene')? It will be really useful, if you don't mind please provide us with step by step procedure with some commands (example). So that it can be reference for future.

ADD REPLY
0
Entering edit mode

A gene is a combination of all transcripts. Thus, if you subtract all exons, you get everything that's not an exon in any transcript. If an exon of one transcript overlaps an intron of another, than that gets subtracted. It really depends on an application to say if that is desired or not.

Anyway, my commands:

cat genes.gtf | cut -f 1-5 | grep -w "gene" | cut -f 1,4,5 > genes.bed
cat genes.gtf | cut -f 1-5 | grep -w "exon" | cut -f 1,4,5 > exons.bed
bedtools subtract -a genes.bed -b exons.bed

You can add the strand column if you need that.

ADD REPLY

Login before adding your answer.

Traffic: 2607 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6