Question

Best Practice To Find Exon, Intron, CDS Sequences Of All Eukaryotic Species?

1

13.2 years ago

Boboppie ▴ 550

The only approach I can think of is to download the annotation and sequence files from UCSC/Ensembl for all eukaryotic genomes and extract the sequences programmatically. But that seems like a naive solution.

Any better suggestion?

sequence annotation • 4.4k views

ADD COMMENT • link updated 10.9 years ago by Biostar 20 • written 13.2 years ago by Boboppie ▴ 550

1

Sounds like a perfectly reasonable solution to me. This kind of large-scale data extraction generally requires download to local files and some type of scripting solution.

ADD REPLY • link 13.2 years ago by Neilfws 49k

0

If you use Ensembl BioMart you might be able to skip downloading all the databases. But what do you want to do with all that data?

ADD REPLY • link 13.2 years ago by Michael 55k

0

@Michael Dondrup: Hi Mike, my friend would like to do a multiple sequence alignment with those sequences and generate phylogenetic trees based on exons, introns and CDS, respectively. Does that make sense?

ADD REPLY • link 13.2 years ago by Boboppie ▴ 550

0

Note that BioMart is not designed to extract e.g. all gene sequences for an organism; it just doesn't scale and there is a chance you end up with a truncated set of results. Introns you cannot get from BioMart anyway. So, I would opt for using the Ensembl Perl API (http://www.ensembl.org/info/docs/api/index.html).
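
For readers working from downloaded annotation files rather than the API, intron coordinates can also be derived as the gaps between consecutive exons of a transcript. The following is only a minimal Python sketch, assuming a GTF file with `exon` features and a `transcript_id` attribute; the function names `parse_gtf_exons` and `introns_from_exons` are hypothetical helpers made up for illustration, not part of Ensembl or any library.

```python
from collections import defaultdict


def parse_gtf_exons(lines):
    """Group exon intervals by transcript from GTF lines.

    GTF coordinates are 1-based and inclusive. Returns a dict mapping
    (transcript_id, chrom, strand) to a list of (start, end) tuples.
    """
    exons = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        # Attribute column looks like: gene_id "G1"; transcript_id "T1";
        attrs = {}
        for item in fields[8].split(";"):
            item = item.strip()
            if item:
                key, _, value = item.partition(" ")
                attrs[key] = value.strip().strip('"')
        transcript = attrs.get("transcript_id")
        if transcript is None:
            continue  # skip records without a transcript_id
        exons[(transcript, chrom, strand)].append((start, end))
    return dict(exons)


def introns_from_exons(exon_list):
    """Return the gaps between consecutive exons (1-based, inclusive)."""
    ordered = sorted(exon_list)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start - prev_end > 1:  # adjacent exons leave no intron
            introns.append((prev_end + 1, next_start - 1))
    return introns


if __name__ == "__main__":
    # Tiny made-up example; a real run would read lines from a GTF file.
    demo = [
        'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
        'chr1\tdemo\texon\t301\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    ]
    for key, exon_list in parse_gtf_exons(demo).items():
        print(key, introns_from_exons(exon_list))
```

Intron coordinates are reported in genomic order regardless of strand; the strand is kept in the key so that sequences can be reverse-complemented later if needed.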

ADD REPLY • link 13.2 years ago by Bert Overduin ★ 3.7k

0

Building phylogenetic trees from all exons, introns and CDSes from all eukaryotic genomes will keep all of the computers in the world busy until the next century.

Your friend should probably narrow it down to a few selected genes, in which case the easiest way to grab them all will not involve downloading full genomes.

ADD REPLY • link 13.2 years ago by Eric Fournier ★ 1.4k

0

@boboppie: As Eric said, no. A multiple alignment of so many sequences is intractable, and it doesn't make sense IMHO.

ADD REPLY • link 13.2 years ago by Michael 55k

0

@Michael Dondrup @Eric Fournier Thanks for sharing the insight :) I was concerned about the logic and feasibility of carrying out such a task.

ADD REPLY • link 13.2 years ago by Boboppie ▴ 550

0

Hi, EnsemblCompara clusters genes into families and reconstructs an MSA and a tree for each. You can get this data from their dumps: ftp://ftp.ensembl.org/pub/current_emf/ensembl-compara/homologies/

There are other efforts, like PhylomeDB, that do similar things.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 13.1 years ago by Leszek 4.2k

score 2 · Answer 1 · 2011-09-21

2

13.2 years ago

Istvan Albert 101k

Public interfaces are designed to be queried for specific, limited amounts of information. None of them is optimized for the type of task you are after, and the overhead of fetching everything through queries is likely to make that approach infeasible.

Your proposed solution is the right one; there is nothing naive about it. Using a few Unix tools, such as parallel, together with fastaFromBed from BEDTools, you could get the job done relatively easily.
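
To make the idea concrete, here is a rough Python sketch of what interval-based extraction boils down to: slice each BED interval out of the genome sequence and reverse-complement it when the feature is on the minus strand. This is a plain-Python stand-in written for illustration, not BEDTools itself, and the function names (`read_fasta`, `reverse_complement`, `extract_bed_intervals`) are hypothetical. It assumes tab-separated BED with 0-based, half-open coordinates and a genome small enough to hold in memory; for whole genomes fastaFromBed is the better choice.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict of {sequence_name: sequence}."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name = line[1:].split()[0]  # keep only the ID before whitespace
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_bed_intervals(genome, bed_lines):
    """Return (name, sequence) pairs for each BED interval.

    BED coordinates are 0-based and half-open, so they map directly
    onto Python slicing. Minus-strand features are reverse-complemented.
    """
    records = []
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        strand = fields[5] if len(fields) > 5 else "+"
        seq = genome[chrom][start:end]
        if strand == "-":
            seq = reverse_complement(seq)
        records.append((name, seq))
    return records


if __name__ == "__main__":
    # Made-up toy data; a real run would read the FASTA and BED files.
    genome = read_fasta([">chr1 toy", "ACGTAC", "GTTTGG"])
    bed = ["chr1\t0\t4\texon1\t0\t+", "chr1\t8\t12\texon2\t0\t-"]
    for name, seq in extract_bed_intervals(genome, bed):
        print(f">{name}\n{seq}")
```

Exon, intron and CDS sets differ only in which intervals go into the BED file, so the same extraction step can be reused for all three and run per genome in parallel.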

ADD COMMENT • link 13.2 years ago by Istvan Albert 101k

0

@Istvan Albert thanx!

ADD REPLY • link 13.2 years ago by Boboppie ▴ 550

0

If you like the response or find it useful, please consider voting it up. If it becomes your accepted approach to solving the problem, then click the check mark next to the response so that others will also know that it is the accepted answer. Thanks! This is an important and valuable feature of BioStar.

ADD REPLY • link 13.1 years ago by Larry_Parnell 16k