Hi everyone
I am trying to download gene sequences of all Ensembl organisms in fasta format including:
Vertebrates,
Metazoa,
Plants,
Fungi,
Bacteria and
Protists
In the past I worked only with vertebrates, so I downloaded the whole genome and the matching GFF file from the Ensembl FTP site.
I then extracted the protein-coding gene sequences using the bedtools getfasta command.
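For reference, the coordinate step of that workflow can be sketched in Python roughly as follows. This is a minimal sketch, not the exact script I used: the function name `gff3_gene_to_bed` is made up for illustration, and it assumes Ensembl-style GFF3 where protein-coding genes have the feature type `gene` and a `biotype=protein_coding` attribute.

```python
def gff3_gene_to_bed(line, biotype="protein_coding"):
    """Return a BED6 line for a matching gene feature, or None otherwise.

    Assumes Ensembl-style GFF3: feature type 'gene' in column 3 and a
    'biotype' key in the column 9 attributes (an assumption, check your file).
    """
    if line.startswith("#") or not line.strip():
        return None  # skip header/comment and blank lines
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "gene":
        return None  # not a well-formed gene record
    # Column 9 is a semicolon-separated list of key=value pairs.
    attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
    if attrs.get("biotype") != biotype:
        return None
    # GFF3 coordinates are 1-based inclusive; BED is 0-based half-open,
    # so only the start needs shifting.
    start = int(cols[3]) - 1
    name = attrs.get("ID", ".")
    return "\t".join([cols[0], str(start), cols[4], name, ".", cols[6]])
```

Writing the non-None results to a file such as genes.bed gives an input for `bedtools getfasta -fi genome.fa -bed genes.bed -s -name`, where `-s` reverse-complements genes on the minus strand.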
Now I want to do the same for all the organisms, not only vertebrates, but the problem is that I am limited by memory.
Is there any way to download the gene sequences directly with no need to download the whole genome file?
Downloading the cDNA sequences wouldn't help, since I am also interested in the introns.
Thanks
You can find all the FASTA sequences on the Ensembl FTP site (http://www.ensembl.org/index.html); for example, bacteria: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-46.
I know, that is exactly what I did before.
The problem is that there are no gene sequences on the FTP site, only whole genomes or cDNAs.
Hi Elisheva,
Unfortunately, there is no file for the 'genomic sequences of genes' that you need available on the Ensembl FTP site. I think a combination of using the GTF/GFF files from the Ensembl FTP site to retrieve the genomic coordinates, then using the Perl API with genomic slices defined using the coordinates will get you the sequences you need most efficiently: http://www.ensembl.org/info/docs/api/core/core_tutorial.html#slices
Best wishes
Ben Ensembl Helpdesk
Thanks a lot, I deeply appreciate your response.
Is there any course or tutorial to study how to work with the Perl API?
I am completely new to this (and to Perl in general), and after reading the documentation at the link you attached I still couldn't work out how to retrieve the sequences using the GFF file.
Hi Elisheva,
My suggestion regarding the GTF or GFF files was to first parse these files to retrieve the genomic coordinates of the genes. Then, in a second step, to use the Perl API to retrieve the genomic sequences relating to the regions defined by the coordinates from step 1.
However, you could perform both steps 1 and 2 using the Perl API. You mentioned in your original question that you wanted to do this for all vertebrate and non-vertebrate species in Ensembl/Ensembl Genomes, so creating a script to complete steps 1 and 2 for each species using the Perl API will likely be the most efficient solution.
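To make step 2 concrete, here is a rough illustration of what a "slice" defined by gene coordinates amounts to. Note that this is plain Python working on a local FASTA file, not the Perl API; the helper names `read_fasta` and `gene_slice` are hypothetical, and the coordinates are assumed to be 1-based and inclusive, as in GTF/GFF files.

```python
# Translation table for complementing DNA (upper and lower case, N kept as N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(handle):
    """Yield (name, sequence) one record at a time from FASTA lines.

    Only one sequence is held in memory at once, which matters when the
    whole genome does not fit.
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gene_slice(sequence, start, end, strand):
    """Return the genomic sequence of a gene, introns included.

    start/end are 1-based inclusive; minus-strand genes are
    reverse-complemented so the result reads 5' to 3'.
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("coordinates fall outside the sequence")
    region = sequence[start - 1:end]
    if strand == "-":
        region = region.translate(_COMPLEMENT)[::-1]
    return region
```

Combined with the coordinates from step 1, each chromosome or contig can be processed in turn and then discarded. The Perl API does the equivalent against the Ensembl databases, so no genome file needs to be stored locally.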
There is a filmed Perl API workshop available through EBI TrainOnline: https://www.ebi.ac.uk/training/online/course/ensembl-filmed-api-workshop
It's a little old but the central concepts and the Core modules remain relevant.
There are also installation and tutorial instructions in the documentation: http://www.ensembl.org/info/docs/api/index.html
Genomax's video tutorial relates to the Ensembl REST API, which is a language-agnostic API for accessing Ensembl data. You could consider this option as well, but I don't think it will be an effective method for retrieving the data you need on such a large scale.
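If you want to try the REST route on a handful of genes first, a minimal Python sketch might look like the one below. It uses the `sequence/id` endpoint with `type=genomic`, which returns the gene's genomic sequence including introns. The function names are made up for illustration, and the sketch assumes you already have the gene stable IDs (for example from the GTF/GFF files) and that the server recognises them.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen

REST_SERVER = "https://rest.ensembl.org"


def gene_sequence_url(stable_id, server=REST_SERVER):
    """Build the REST URL for one gene's genomic sequence in FASTA format."""
    # safe="/" keeps the content type readable as text/x-fasta in the URL.
    query = urlencode({"type": "genomic", "content-type": "text/x-fasta"}, safe="/")
    return f"{server}/sequence/id/{quote(stable_id, safe='')}?{query}"


def fetch_gene_fasta(stable_id, timeout=30):
    """Download one gene as a FASTA string (one HTTP request per gene)."""
    with urlopen(gene_sequence_url(stable_id), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `fetch_gene_fasta("ENSG00000139618")` requests the human BRCA2 gene. Because this makes one request per gene and the public server is rate-limited, it is convenient for spot checks but slow for every gene in every species.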
Best wishes
Ben Ensembl Helpdesk