Best Practice To Find Exon, Intron, CDS Sequences Of All Eukaryotic Species?
13.2 years ago
Boboppie ▴ 550

What I can think of is to download the annotation and sequence files from UCSC/Ensembl for all eukaryotic genomes and programmatically extract the sequences. But it seems like a dummy solution.

Any better suggestions?

sequence annotation • 4.4k views

Sounds like a perfectly reasonable solution to me. This kind of large-scale data extraction generally requires download to local files and some type of scripting solution.


If you use Ensembl BioMart you might be able to skip downloading all the databases. But what do you want to do with all that data?


@Michael Dondrup: Hi Mike, my friend would like to do a multiple sequence alignment with those sequences and generate a phylogenetic tree based on exons, introns and CDSs, respectively. Does that make sense?


Note that BioMart is not designed to extract e.g. all gene sequences for an organism; it just doesn't scale and there is a chance you end up with a truncated set of results. Introns you cannot get from BioMart anyway. So, I would opt for using the Ensembl Perl API (http://www.ensembl.org/info/docs/api/index.html).
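If you work from a downloaded annotation file instead, introns are not usually listed as features, but they can be derived as the gaps between consecutive exons of the same transcript. A minimal sketch of that idea, assuming exon records have already been parsed into (transcript ID, start, end) tuples with 1-based inclusive coordinates as in GFF3/GTF; the function name and the toy records are illustrative only:

```python
from collections import defaultdict


def introns_from_exons(exons):
    """Derive intron intervals from exon intervals.

    exons: iterable of (transcript_id, start, end), 1-based inclusive.
    Returns a dict mapping transcript_id to a list of (start, end) introns.
    """
    by_transcript = defaultdict(list)
    for transcript_id, start, end in exons:
        by_transcript[transcript_id].append((start, end))

    introns = {}
    for transcript_id, intervals in by_transcript.items():
        intervals.sort()  # order exons by genomic position
        introns[transcript_id] = [
            (prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:])
            if next_start - prev_end > 1  # skip directly adjacent exons
        ]
    return introns


# Toy data (made up for illustration): tx1 has three exons, tx2 has one.
exons = [
    ("tx1", 100, 200),
    ("tx1", 301, 400),
    ("tx1", 501, 600),
    ("tx2", 50, 80),
]
result = introns_from_exons(exons)
print(result)
```

The resulting intervals can then be used to pull the intron sequences from the genome FASTA in the same way as exons.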


Building phylogenetic trees from all exons, introns and CDSes from all eukaryotic genomes will keep all of the computers in the world busy until the next century.

Your friend should probably narrow it down to a few selected genes, in which case the easiest way to grab them all will not involve downloading full genomes.


@boboppie: As Eric said, no: a multiple alignment of so many sequences is intractable, and it doesn't make sense imho.


@Michael Dondrup @Eric Fournier Thanks for sharing the insight :) I was concerned about the logic and feasibility of carrying out such a task.


Hi, EnsemblCompara clusters genes into families and reconstructs an MSA and a tree for each. You can get such data from their dumps: ftp://ftp.ensembl.org/pub/current_emf/ensembl-compara/homologies/

There are other efforts, like PhylomeDB, that do similar things.

13.2 years ago

Public interfaces were designed to be queried for specific and limited amounts of information. None of these interfaces is optimized for the type of task that you are after; the overhead of getting all the information via queries is likely to make the approach unfeasible.

Your proposed solution is the right one, and there is nothing dummy about it. With a few Unix tools, such as parallel and fastaFromBed from BEDTools, you could get the job done relatively easily.
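To make the extraction step concrete, here is a minimal pure-Python sketch of what it involves: read a genome FASTA, then cut out each BED interval and reverse-complement the ones on the minus strand. It is only a stand-in to show the logic, not a replacement for fastaFromBed; the function names and the toy data are assumptions, and BED coordinates are taken to be 0-based and half-open.

```python
import io

# Translation table for complementing DNA (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(handle):
    """Read a FASTA file into a dict of {sequence_name: sequence}."""
    seqs, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # Keep only the ID, i.e. the text before the first whitespace.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def extract_bed(genome, bed_handle):
    """Return a list of (name, sequence) for each BED interval."""
    records = []
    for line in bed_handle:
        fields = line.split()
        if len(fields) < 3 or line.startswith("#"):
            continue  # skip blank, comment, or malformed lines
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        strand = fields[5] if len(fields) > 5 else "+"
        seq = genome[chrom][start:end]  # BED is 0-based, half-open
        if strand == "-":
            seq = seq.translate(COMPLEMENT)[::-1]  # reverse complement
        records.append((name, seq))
    return records


# Toy inputs (made up); real runs would open the downloaded files instead.
genome = read_fasta(io.StringIO(">chr1 toy\nACGTACGT\nTTGGCCAA\n"))
bed = io.StringIO("chr1\t0\t4\texon1\t0\t+\nchr1\t4\t10\texon2\t0\t-\n")
records = extract_bed(genome, bed)
for name, seq in records:
    print(f">{name}\n{seq}")
```

For whole genomes, loading every chromosome into memory like this is wasteful, which is one reason to prefer an indexed tool such as fastaFromBed and to run one job per species in parallel.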


@Istvan Albert thanx!


If you like the response or find it useful, please consider voting it up. If it becomes your accepted approach to solving the problem, then click the check mark next to the response so that others will also know that it is the accepted answer. Thanks! This is an important and valuable feature of BioStar.
