Getting All Exons From Ensembl...
5
9
Entering edit mode
14.3 years ago

Hi,

I'm interested in whether anyone has worked up a solution for retrieving all unique exons from Ensembl in relation to the gene IDs and not the transcript IDs.

I have already used the Perl Ensembl Core API to retrieve all exons, for all transcripts, for all genes, but this results in redundant data, due to alternative splicing in different transcripts. Some exons therefore overlap or are replicated and therefore the true exon data is exaggerated. I want the number of exons per gene, not the number of exons for all transcripts.

It just confuses me because on the Assembly and Genebuild page for Genome Statistics (e.g. http://www.ensembl.org/Takifugu_rubripes/Info/StatsTable) it has the number of gene exons listed at 322,585, but when I download using BioMart or the Perl API I get nearly 650,000.

I guess I'm going to have to either choose the transcript with the most exons, or work on a solution that removes any redundancy by amending overlapping regions and removing complete duplicates? I suspect the former will be the easiest and hopefully not exhibit too many errors?

Cheers,

Steve

Update

A simple way to do this is by using the canonical_transcript method for the gene! So we call:

my $can_tr = $gene->canonical_transcript();
my $exons = $can_tr->get_all_Exons();
transcript ensembl exon api gene • 17k views
ADD COMMENT
1
Entering edit mode

I think you need to provide a definition of "unique" for your purposes. Naively this might mean "having an unique sequence within a gene", in which case creating a hash-table with an appropriate key within your script would do it, but it seems you want something more complex.

ADD REPLY
0
Entering edit mode

By unique I mean I want to retrieve all the non-redundant i.e. non-overlapping and non-replicated exons for each gene. The exons appear to be defined by transcripts; as new studies merge new transcripts, the exon number can increase and also become redundant.

ADD REPLY
0
Entering edit mode

I've hit another obstacle here in that the DNA sequences for the sequence region ids of each exon, aren't actually in the dna table.

I contacted the Ensembl Help-desk and I am told it would be an extremely complex task to retrieve the exon sequence data using MySQL.

Where are these missing DNA sequences? For danio_rerio_core_58_5d for example I am missing 73,448 to 89,002 (select seq_region_id from dna;), which correlates exactly with the exon sequence region ids.

Has anyone done this previously?

ADD REPLY
1
Entering edit mode

Gawbul: You may add these as a separate question or include in your question for more attention from BioStar members.

ADD REPLY
0
Entering edit mode

Thanks for the comment Khader. Thinking about etiquette, I was worried about posting lots of questions, if they were all connected in relevance. I know on some forums this can be an issue. I'll certainly bare that in mind for future posts though

ADD REPLY
0
Entering edit mode

If you've got the genome position and strand for each exon, could you maybe retrieve the sequences in Bioconductor, there seems to be D.rerio genome package.

ADD REPLY
0
Entering edit mode

Hi Cass, thanks for the comment! I need to retrieve the data for five fish and it looks like only Danio rerio is available on bioconductor? I think I'm going to parse the unique IDs I've retrieved from BioMart and then extract only the information for those exon IDs from the exon fasta file I currently have. I'll then need to build coordinates for the introns from the exons and do the same with the intron fasta file I created using the Perl Core API?

ADD REPLY
5
Entering edit mode
14.3 years ago
Cassj ★ 1.3k

What did you use as your biomart query? If you take the unique results from a query with no filters and attributes of Ensembl Gene ID, Ensembl Exon ID, Chr Start, Chr End and Chr Name then you get back 322622 exons. Results I got are here.

You'll still get overlapping exons though - I think ensembl defines an exon on the basis of its splice sites, so you can have unique exons that only differ in a few bps at one end.

ADD COMMENT
1
Entering edit mode

But you get slightly closer to the expected answer! Comparing the two sets, our results cover exactly the same genes, though. Possibly some exons have different exon IDs, but the same coordinates. The correct result depends on your definition of "unique".

ADD REPLY
0
Entering edit mode

which is exactly what Keith has already said while I was typing. Sorry!

ADD REPLY
0
Entering edit mode

By unique I mean I want to retrieve all the non-redundant i.e. non-overlapping and non-replicated exons for each gene. I think I will need to do some sort of redundancy checks on the exon coordinates by the look of things to ensure that none of the output contradicts this criteria.

I had done something similar previously, checking for redundancy in repeat elements by creating a hash table (dict in python) for the start and end coordinates with a matching incremental integer value for each pair. I then sorted by the start coordinates and matched up the pairs again before comparing them all.

ADD REPLY
0
Entering edit mode

Not sure if that was the most optimal solution, but always welcome to suggestions :)

ADD REPLY
4
Entering edit mode
14.3 years ago

The SQL tables for ensembl are available here.

Here, you'll find the definition of the tables in ftp://ftp.ensembl.org/pub/current_mysql/homo_sapiens_core_58_37c/homo_sapiens_core_58_37c.sql.gz

There is a table named exons:

CREATE TABLE `exon` (
  `exon_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `seq_region_id` int(10) unsigned NOT NULL,
  `seq_region_start` int(10) unsigned NOT NULL,
  `seq_region_end` int(10) unsigned NOT NULL,
  `seq_region_strand` tinyint(2) NOT NULL,
  `phase` tinyint(2) NOT NULL,
  `end_phase` tinyint(2) NOT NULL,
  `is_current` tinyint(1) NOT NULL DEFAULT '1',
  `is_constitutive` tinyint(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`exon_id`),
  KEY `seq_region_idx` (`seq_region_id`,`seq_region_start`)
) ENGINE=MyISAM AUTO_INCREMENT=1536937 DEFAULT CHARSET=latin1;

Its data are here.

and the seq_region_id (chromosome) is defined in here.

You can access those data using mysql , see this other question.

Another solution using UCSC, not ensembl:

curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/knownGene.txt.gz" |\
  gunzip -c |\
  awk -F '\t' '{
        split($9,S,"[,]");
        split($10,E,"[,]");
        n=int($8);
        for(i=1;i<=n;++i) printf("%s\t%d\t%d\n",$2,int(S[i]),int(E[i]));
        }'
  chr1    1115    2090
  chr1    2475    2584
  chr1    3083    4121
  chr1    1115    2090
  chr1    2475    4272
  chr1    4268    4692
  chr1    4832    4901
  chr1    5658    5810
  chr1    6469    6628
  chr1    6716    6918
  (...)
ADD COMMENT
2
Entering edit mode

It's because the data from the API and Biomart are denormalized. If you look at the number of rows in the exon_transcript link table (635796), you'll see each exon in the exon table is linked to multiple transcripts.

ADD REPLY
0
Entering edit mode

Hi Pierre,

I'm finishing off a PhD students work, who used Ensembl, so not sure UCSC will be relevant, although it is helpful to know about, thank you!

According to a gentleman from the Ensembl helpdesk, there is no way to retrieve all exons per gene using BioMart or the Core Perl API; as it needs to retrieve all transcripts for all gene ids and then all exons for all transcript ids.

ADD REPLY
0
Entering edit mode

Looking at the MySQL tables however (e.g. ftp://ftp.ensembl.org/pub/current_mysql/takifugu_rubripes_core_58_4l/exon.txt.gz) I get more like what I am expecting, so I don't quite understand how BioMart and the Perl Core API are retrieving such vastly different numbers (closer to 650,000 exons for Takifugu).

Cheers,
Steve

ADD REPLY
0
Entering edit mode

Yeah, I've been investigating it for the last couple of hours and had noticed that. It just makes it a little awkward to retrieve the information I need, but I can get it directly from the MySQL database I guess!?

ADD REPLY
3
Entering edit mode
14.3 years ago
Darked89 4.7k

Few thoughts:

  • transcript with largest number of exons still may miss some of them
  • there are transcripts with equal number of exons but not all of them will be common
  • simple overlap checking will not work in possible cases (I am not sure if these are listed in Ensembl as genes) of reverse transcripts regulating a given gene. If you restrict yourself to only protein coding genes you should be OK at least in vertebrates, I think.
ADD COMMENT
3
Entering edit mode
14.3 years ago

You can do this from Biomart, despite what the man on the EnsEMBL helpdesk said, provided you limit your query attributes strictly to those that identify the gene and the exon start, end and strand.

E.g. select Takifugu rubripes genes, select no filters and attributes Structures plus Ensembl Gene ID, Gene Strand, Exon Chr Start (bp) and Exon Chr End (bp). Check the box for unique results only.

This will give a file like this:

Ensembl Gene ID Exon Chr Start (bp)     Exon Chr End (bp)       Strand
ENSTRUG00000000001      2982    3983    -1

This contains 321610 lines (including the header). I don't think this contains the 703 RNA genes, so 321609 + 703 = 322,312. Not sure where the other 273 are to make the 322585 total. Unfortunately, the stats table doesn't say exactly how their total was calculated.

ADD COMMENT
0
Entering edit mode

Many thanks! As with Cass below, I had realised that removing the Transcript ID on BioMart gave me 322,622 exons and I simply assumed this was due to minor revisions since the last release?

ADD REPLY
0
Entering edit mode
14.3 years ago

If you have already downloaded all exons, you may run a CD-HIT at a specific threshold to get non-redundant sequences. I have never tried CD-HIT with HIT, but as per the help page it should work with nucleotide sequence as well.

ADD COMMENT
0
Entering edit mode

That might work? I have used CD-HIT previously, although I think UClust 3.x is probably faster now? I have actually developed my own novel solution in Python called remred.py that is faster than both of them, but only works for 100% sequence identity, although that should work in this case, because I aren't wanting any fuzzy matches. It searches for smaller sequences within larger sequences too.

ADD REPLY
0
Entering edit mode

Actually, saying that, if an exon sequence matches in a different gene, then it may result in an under-representation of the true dataset? I could possibly due redundancy checking per gene, but it would mean splitting up my multifasta into separate gene ids first.

ADD REPLY
0
Entering edit mode

Saying that if a sequence matches in a different exon, then it could result in an underestimation of the true dataset? I could possibly do redundancy checks per gene id to get around this? But then, what if exon sequences match within genes? I think checking coordinates may actually be better!

ADD REPLY
0
Entering edit mode

It's more to do with redundancy in the position of the data, as opposed to the actual sequence itself.

ADD REPLY

Login before adding your answer.

Traffic: 1492 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6