Retrieve genomic sequences of genes of all Ensembl organisms
4.7 years ago
elisheva ▴ 120

Hi everyone,
I am trying to download the gene sequences of all Ensembl organisms in FASTA format, covering:
Vertebrates, Metazoa, Plants, Fungi, Bacteria and Protists.
In the past I worked only with vertebrates, so I downloaded the whole genome and the matching GFF file from the Ensembl FTP site.
I then extracted the protein-coding gene sequences with the bedtools getfasta command.
Now I want to repeat the process for all organisms, not only vertebrates, but I am limited by memory.
Is there any way to download the gene sequences directly, without downloading the whole genome file?
Downloading the cDNA sequences would not help, since I am also interested in the introns.

Thanks

sequence sequencing ensembl ftp gene

You can find all the FASTA sequences on the Ensembl FTP site (http://www.ensembl.org/index.html). For example, for bacteria: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-46.


I know, that's exactly what I did before.
The problem is that the FTP site has no gene sequences, only whole genomes or cDNAs.


Hi Elisheva,

Unfortunately, the Ensembl FTP site has no file containing the 'genomic sequences of genes' that you need. I think the most efficient approach is a combination: use the GTF/GFF files from the Ensembl FTP site to retrieve the genomic coordinates, then use the Perl API to fetch genomic slices defined by those coordinates: http://www.ensembl.org/info/docs/api/core/core_tutorial.html#slices

Best wishes

Ben Ensembl Helpdesk


Thanks a lot, I deeply appreciate your response.
Is there any course or tutorial on how to work with the Perl API?
I am completely new to this (and to Perl in general). From the documentation at the linked page, I couldn't understand how to get from the GFF file to downloading the sequences.


Hi Elisheva,

My suggestion regarding the GTF or GFF files was to first parse these files to retrieve the genomic coordinates of the genes, and then, in a second step, to use the Perl API to retrieve the genomic sequences for the regions defined by those coordinates.

However, you could perform both steps using the Perl API. You mentioned in your original question that you want to do this for all vertebrate and non-vertebrate species in Ensembl/Ensembl Genomes, so a script that completes steps 1 and 2 for each species using the Perl API will likely be the most efficient solution.

There is a filmed Perl API workshop available through EBI TrainOnline: https://www.ebi.ac.uk/training/online/course/ensembl-filmed-api-workshop

It's a little old but the central concepts and the Core modules remain relevant.

There are also installation and tutorial instructions in the documentation: http://www.ensembl.org/info/docs/api/index.html

Genomax's video tutorial relates to the Ensembl REST API, which is a language-agnostic API for accessing Ensembl data. You could consider this option as well, but I don't think it will be an effective method for retrieving the data you need on such a large scale.
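For a handful of genes, the REST route could look roughly like the Python sketch below. It builds a request for the `sequence/region/:species/:region` endpoint, which returns the genomic sequence (introns included) for a coordinate range. The server name `https://rest.ensembl.org` and the helper names `build_region_url` and `fetch_fasta` are assumptions for illustration; check that the species you need is served there. The public server rate-limits requests, which is one reason this does not scale well to every gene of every species.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the public Ensembl REST server.
REST_SERVER = "https://rest.ensembl.org"


def build_region_url(species: str, seq_region: str,
                     start: int, end: int, strand: str = "+") -> str:
    """Build a sequence/region URL that asks for FASTA output.

    Coordinates are 1-based and inclusive, as in GFF3; the strand is
    encoded as 1 (forward) or -1 (reverse) at the end of the region.
    """
    strand_code = -1 if strand == "-" else 1
    region = f"{seq_region}:{start}..{end}:{strand_code}"
    return (f"{REST_SERVER}/sequence/region/"
            f"{quote(species)}/{quote(region, safe=':.')}"
            "?content-type=text/x-fasta")


def fetch_fasta(url: str, timeout: float = 30.0) -> str:
    """Download one FASTA record; needs network access, so not called here."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Build (but do not send) a request for an illustrative region.
url = build_region_url("homo_sapiens", "1", 1000, 5000, "-")
print(url)
```

Calling `fetch_fasta(url)` would then return the sequence as text; one such call per gene is needed, so a whole-division download means a very large number of requests.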

Best wishes

Ben Ensembl Helpdesk
