Hi,
I'm interested in whether anyone has worked up a solution for retrieving all unique exons from Ensembl in relation to the gene IDs and not the transcript IDs.
I have already used the Perl Ensembl Core API to retrieve all exons, for all transcripts, for all genes, but this results in redundant data because of alternative splicing across transcripts. Some exons overlap or are replicated, so the exon count is inflated. I want the number of exons per gene, not the number of exons summed over all transcripts.
It just confuses me because the Assembly and Genebuild page for Genome Statistics (e.g. http://www.ensembl.org/Takifugu_rubripes/Info/StatsTable) lists the number of gene exons as 322,585, but when I download using BioMart or the Perl API I get nearly 650,000.
I guess I'm going to have to either choose the transcript with the most exons, or work on a solution that removes any redundancy by amending overlapping regions and removing complete duplicates? I suspect the former will be the easiest and hopefully not exhibit too many errors?
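For the second option (removing complete duplicates and merging overlapping regions), here is a minimal sketch of the coordinate logic in plain Python rather than the Perl API. It assumes the exons have already been exported as (gene ID, start, end) tuples with 1-based inclusive coordinates, and that all exons of a gene sit on the same sequence region and strand. The function name and the example gene IDs are made up for illustration.

```python
from collections import defaultdict


def merge_exons_per_gene(exons):
    """Collapse duplicate and overlapping exons into non-redundant intervals.

    exons: iterable of (gene_id, start, end), 1-based inclusive coordinates.
    Assumes every exon of a gene is on the same seq region and strand.
    """
    by_gene = defaultdict(set)
    for gene_id, start, end in exons:
        # Using a set drops exact duplicates shared between transcripts.
        by_gene[gene_id].add((start, end))

    merged = {}
    for gene_id, intervals in by_gene.items():
        result = []
        for start, end in sorted(intervals):
            if result and start <= result[-1][1]:
                # Overlaps the previous interval: extend it if needed.
                result[-1][1] = max(result[-1][1], end)
            else:
                result.append([start, end])
        merged[gene_id] = [tuple(interval) for interval in result]
    return merged


# Hypothetical input: geneA has one duplicated and one overlapping exon.
example = [
    ("geneA", 100, 200),
    ("geneA", 100, 200),
    ("geneA", 150, 250),
    ("geneA", 400, 500),
    ("geneB", 10, 50),
]
merged = merge_exons_per_gene(example)
exon_counts = {gene_id: len(ivs) for gene_id, ivs in merged.items()}
```

Note that merging changes what is being counted: two alternative exons that overlap become a single exonic region, so the result is a count of non-overlapping exonic regions per gene, which may or may not be how the StatsTable figure is derived.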
Cheers,
Steve
Update
A simple way to do this is by using the canonical_transcript method for the gene! So we call:
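# $gene is a Bio::EnsEMBL::Gene object fetched earlier
my $can_tr = $gene->canonical_transcript();
# Array reference of Bio::EnsEMBL::Exon objects for that transcript only
my $exons = $can_tr->get_all_Exons();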
I think you need to provide a definition of "unique" for your purposes. Naively this might mean "having a unique sequence within a gene", in which case creating a hash table with an appropriate key within your script would do it, but it seems you want something more complex.
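To make the naive definition concrete, here is a small Python sketch of the hash-table idea, keyed on (gene ID, sequence). The input layout, the function name, and the IDs are assumptions for illustration only; the same pattern works with a Perl hash.

```python
def unique_exons_by_sequence(exons):
    """Keep one exon per distinct sequence within each gene.

    exons: iterable of (gene_id, exon_id, sequence).
    Returns (kept, counts): kept maps (gene_id, sequence) to the first
    exon_id seen with that sequence; counts maps gene_id to the number
    of distinct exon sequences.
    """
    kept = {}
    for gene_id, exon_id, sequence in exons:
        # Normalise case so soft-masked sequence compares equal.
        key = (gene_id, sequence.upper())
        kept.setdefault(key, exon_id)

    counts = {}
    for gene_id, _sequence in kept:
        counts[gene_id] = counts.get(gene_id, 0) + 1
    return kept, counts


# Hypothetical records: exonA2 repeats the sequence of exonA1.
records = [
    ("geneA", "exonA1", "ATGGCC"),
    ("geneA", "exonA2", "atggcc"),
    ("geneA", "exonA3", "TTGACA"),
    ("geneB", "exonB1", "ATGGCC"),
]
kept, counts = unique_exons_by_sequence(records)
```

This only removes exact replicates. Exons that overlap but differ in length still count separately, which is why a coordinate-based definition is probably closer to what is being asked for.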
By unique I mean I want to retrieve all the non-redundant, i.e. non-overlapping and non-replicated, exons for each gene. The exons appear to be defined by transcripts; as new studies merge in new transcripts, the exon number can increase and the exons can also become redundant.
I've hit another obstacle here: the DNA sequences for the sequence region IDs of each exon aren't actually in the dna table.
I contacted the Ensembl helpdesk and was told it would be an extremely complex task to retrieve the exon sequence data using MySQL.
Where are these missing DNA sequences? For danio_rerio_core_58_5d, for example, I am missing 73,448 to 89,002 (from select seq_region_id from dna;), which correlates exactly with the exon sequence region IDs. Has anyone done this previously?
Gawbul: You may add these as a separate question or include in your question for more attention from BioStar members.
Thanks for the comment, Khader. Thinking about etiquette, I was worried about posting lots of questions if they were all closely related. I know on some forums this can be an issue. I'll certainly bear that in mind for future posts though.
If you've got the genome position and strand for each exon, could you maybe retrieve the sequences in Bioconductor? There seems to be a D. rerio genome package.
Hi Cass, thanks for the comment! I need to retrieve the data for five fish and it looks like only Danio rerio is available on Bioconductor? I think I'm going to parse the unique IDs I've retrieved from BioMart and then extract only the information for those exon IDs from the exon FASTA file I currently have. I'll then need to build coordinates for the introns from the exons and do the same with the intron FASTA file I created using the Perl Core API?
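For the step of building intron coordinates from the exons, a rough Python sketch might look like the following. It assumes the exons of one gene (or one transcript) are given as non-overlapping (start, end) pairs with 1-based inclusive coordinates on a single sequence region; the function name is made up for illustration.

```python
def intron_coordinates(exons):
    """Return the gaps between consecutive exons as (start, end) pairs.

    exons: list of (start, end), 1-based inclusive, assumed non-overlapping
    and on the same seq region. Order of the input does not matter.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Skip directly adjacent exons, which leave no gap between them.
        if next_start - prev_end > 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Hypothetical exon coordinates for a single gene.
introns = intron_coordinates([(400, 500), (100, 250), (600, 650)])
```

Coordinates are returned in ascending genomic order, so for reverse-strand genes the list would need to be reversed to follow transcript order. Strictly speaking introns are a per-transcript feature; with gene-level merged exons the gaps are regions that are never exonic in any transcript.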