What I can think of is to download annotation and sequence files from UCSC/Ensembl for all eukaryotic genomes, and programmatically extract the sequences. But it seems like a dumb solution.
Any better suggestions?
Public interfaces were designed to be queried for specific and limited amounts of information. None of these interfaces is optimized for the type of task you are after; the overhead of getting all the information via queries is likely to make the approach infeasible.
Your proposed solution is the right one; there is nothing dumb about it. Using a few Unix tools, such as parallel and fastaFromBed from BEDTools, you could get the job done relatively easily.
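To make the extraction step concrete, here is a minimal Python sketch of what a tool like fastaFromBed does: slice features out of a genome FASTA using BED coordinates. It is an illustration, not the BEDTools implementation. The function names `read_fasta` and `extract_bed_features` are made up for this example, and it assumes tab-separated BED with 0-based, half-open coordinates and a genome small enough to hold in memory.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Translation table used to complement a DNA sequence (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {sequence_name: sequence}.

    The name is the first whitespace-delimited word after '>'.
    """
    chunks: Dict[str, List[str]] = {}
    name: Optional[str] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {n: "".join(parts) for n, parts in chunks.items()}


def extract_bed_features(
    genome: Dict[str, str], bed_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (label, sequence) for each BED record.

    Assumes tab-separated BED: chrom, start, end, then optionally
    name (column 4) and strand (column 6). Features on the minus
    strand are reverse-complemented.
    """
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        label = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        strand = fields[5] if len(fields) > 5 else "+"
        seq = genome[chrom][start:end]
        if strand == "-":
            seq = seq.translate(_COMPLEMENT)[::-1]
        yield label, seq
```

For real genomes, running one such job per species (for example under parallel) keeps each process limited to a single genome at a time.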
Sounds like a perfectly reasonable solution to me. This kind of large-scale data extraction generally requires download to local files and some type of scripting solution.
If you use Ensembl BioMart you might be able to skip downloading all the databases. But what do you want to do with all that data?
@Michael Dondrup: Hi Mike, my friend would like to do a multiple sequence alignment with those sequences and generate phylogenetic trees based on exons, introns and CDSs, respectively. Does it make sense?
Note that BioMart is not designed to extract e.g. all gene sequences for an organism; it just doesn't scale and there is a chance you end up with a truncated set of results. Introns you cannot get from BioMart anyway. So, I would opt for using the Ensembl Perl API (http://www.ensembl.org/info/docs/api/index.html).
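Since introns are not directly available there, one common workaround is to derive them from exon coordinates once the annotation is local. The sketch below is only an illustration under simple assumptions: the exons belong to a single transcript on one chromosome, coordinates are 0-based and half-open, and the function name `introns_from_exons` is made up for this example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based, half-open


def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between the exons of one transcript.

    Exons may be given in any order; overlapping or adjacent exons
    produce no intron between them.
    """
    ordered = sorted(exons)
    introns: List[Interval] = []
    if not ordered:
        return introns
    covered_end = ordered[0][1]
    for start, end in ordered[1:]:
        if start > covered_end:
            # Uncovered stretch between the previous exons and this one.
            introns.append((covered_end, start))
        covered_end = max(covered_end, end)
    return introns
```

The resulting intervals can be written out as BED and passed through the same sequence-extraction step used for exons.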
Building phylogenetic trees from all exons, introns and CDSes from all eukaryotic genomes will keep all of the computers in the world busy until the next century.
Your friend should probably narrow it down to a few selected genes, in which case the easiest solution to grab them all will not involve downloading full genomes.
@boboppie: as Eric said, no: a multiple alignment of so many sequences is intractable, and it doesn't make sense imho.
@Michael Dondrup @Eric Fournier Thanks for sharing the insight :) I was concerned about the logic and feasibility of carrying out such a task.
Hi, EnsemblCompara clusters genes into families and reconstructs an MSA and a tree for each. You can get such data from their dumps: ftp://ftp.ensembl.org/pub/current_emf/ensembl-compara/homologies/
There are other efforts, like PhylomeDB, that do similar stuff.