Question

How to download FASTA protein sequences for Escherichia coli strains from Ensembl?

4.8 years ago

agata88 ▴ 870

Hi all!

I would like to download FASTA protein sequences for all Escherichia coli strains. On the Ensembl Bacteria FTP page I see that I would need to download 2,681 files (https://bacteria.ensembl.org/info/website/ftp/index.html).

I would like to do this programmatically using the Ensembl REST API. Unfortunately, I cannot find suitable API endpoints. Can anyone suggest the best solution?

Here are the API Endpoints: http://rest.ensembl.org

Many thanks for any suggestions,

Best,

Agata

ensembl ensembl rest api bacteria sequence • 2.2k views

ADD COMMENT • link updated 4.8 years ago by Ben Moore ★ 2.4k • written 4.8 years ago by agata88 ▴ 870

How about using NCBI and the ncbi-genome-download tool by Kai Blin? The data are the same everywhere.

As simple as

ncbi-genome-download --genus "Escherichia coli" bacteria
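Note that the command above fetches the tool's default file format rather than protein FASTA. A minimal sketch for protein sequences follows, assuming a recent ncbi-genome-download release in which the relevant flags are spelled `--genera`, `--formats` and `--output-folder` (older releases used `--genus`); the function name and the default folder name are hypothetical, so check `ncbi-genome-download --help` for your installed version.

```shell
# Sketch only: flag names assume a recent ncbi-genome-download release.
# download_ecoli_proteins [OUTDIR] -> fetches protein FASTA for E. coli assemblies
download_ecoli_proteins() {
    outdir="${1:-ecoli_proteins}"   # hypothetical default output folder
    ncbi-genome-download \
        --genera "Escherichia coli" \
        --formats protein-fasta \
        --output-folder "$outdir" \
        bacteria
}
```

Calling `download_ecoli_proteins my_proteins` would then start the download into `my_proteins`.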

ADD REPLY • link 4.8 years ago by GenoMax 147k

score 3 · Accepted Answer · 2020-02-18

Hi Agata,

GenoMax's solution seems the most straightforward, but I thought I'd add information about how to do this with Ensembl. This isn't possible with the Ensembl REST API. You will have to use a combination of the Perl API and curl/wget:

(1) Extracting the E. coli species

The LookUp module, given the parent taxon ID for E. coli (562), will return all the E. coli genomes. The LookUp module lives in the ensemblgenomes-api repository, so this git repo needs to be in your PERL5LIB too: https://github.com/EnsemblGenomes/ensemblgenomes-api Documentation is here: http://ensemblgenomes.org/info/access/eg_api
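As a rough sketch of the PERL5LIB setup, assuming the repository has been cloned into the current directory and that its modules sit under a `lib/` subdirectory (an assumption; adjust the path to match your checkout). The helper name `add_to_perl5lib` is hypothetical. The Ensembl core Perl API is assumed to be installed and on PERL5LIB already.

```shell
# add_to_perl5lib DIR: prepend DIR to PERL5LIB without leaving a stray colon
add_to_perl5lib() {
    PERL5LIB="$1${PERL5LIB:+:$PERL5LIB}"
    export PERL5LIB
}

# Run this first (not executed here):
#   git clone https://github.com/EnsemblGenomes/ensemblgenomes-api
# Assumption: the Perl modules live under lib/ inside the clone.
add_to_perl5lib "$PWD/ensemblgenomes-api/lib"
```

A quick check such as `perl -MBio::EnsEMBL::LookUp -e 1` should then exit silently if the module can be found.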

(2) Getting the peptide FASTA files from the FTP site:

We organise our bacteria into groups called “collections”. This is just to help us manage the data volumes, and our FTP server is organised to reflect these groupings. With each release, we provide the file below, which lists the collection each species belongs to: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-46/species_EnsemblBacteria.txt

You can use this file to work out the right URL for the files you want from the FTP server, and then fetch them with curl or wget. The code snippet below does the lookup and download:

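use strict;
use warnings;
use Bio::EnsEMBL::LookUp;

# Local copy of species_EnsemblBacteria.txt, downloaded from the release-46 FTP directory
my $species_file = 'species_EnsemblBacteria.txt';
my $ftp_base = 'ftp://ftp.ensemblgenomes.org/pub/bacteria/release-46/fasta';

# Map each species to its collection. Column 13 holds the core database name;
# column 2 is assumed to hold the species name returned by $dba->species().
# An exact match avoids one species name matching several longer strain names.
my %collection_for;
open(my $fh, '<', $species_file) or die "Cannot open $species_file: $!";
while (my $line = <$fh>) {
        chomp $line;
        my @cols = split /\t/, $line;
        next unless defined $cols[12] && $cols[12] =~ /(bacteria_[0-9]+_collection)/;
        $collection_for{ lc $cols[1] } = $1;
}
close($fh);

# Build a helper to query the Ensembl public MySQL instance
my $lookup = Bio::EnsEMBL::LookUp->new();

# 562 is the NCBI taxon ID for Escherichia coli
my @dbas = @{$lookup->get_all_by_taxon_branch(562)};

foreach my $dba (@dbas){

        my $species = $dba->species();
        my $collection_name = $collection_for{ lc $species };
        if($collection_name){

                my $ftp_pep = "$ftp_base/$collection_name/$species/pep/*.pep.all.fa.gz";
                print "Fetching PEP file for $species: $ftp_pep\n";
                # List form bypasses the shell, so the FTP wildcard reaches wget unexpanded
                system('wget', $ftp_pep) == 0 or warn "wget failed for $species\n";
        }

        $dba->dbc()->disconnect_if_idle(); # Important to disconnect so that you do not accidentally flood the server with unused connections

}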
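To check a single species by hand, the URL construction from step (2) can also be sketched in shell. This assumes, as `cut -f 13` in the script implies, that column 13 of species_EnsemblBacteria.txt holds the core database name, and additionally assumes that column 2 holds the species name used in FTP paths. The function name and the example line are hypothetical, and the line's values are invented for illustration.

```shell
# Base of the release-46 peptide FASTA tree, as used in the post
BASE_URL="ftp://ftp.ensemblgenomes.org/pub/bacteria/release-46/fasta"

# pep_url_for_line LINE -> prints the wildcard URL for one tab-separated line,
# or nothing if column 13 contains no bacteria_N_collection name
pep_url_for_line() {
    printf '%s\n' "$1" | awk -F '\t' -v base="$BASE_URL" '
        match($13, /bacteria_[0-9]+_collection/) {
            collection = substr($13, RSTART, RLENGTH)
            printf "%s/%s/%s/pep/*.pep.all.fa.gz\n", base, collection, $2
        }'
}

# Hypothetical example line (invented values, 14 tab-separated columns)
line=$(printf 'Example strain\texample_species\tEnsemblBacteria\t562\tASM1\tGCA_1\tx\tN\tN\tN\tN\tN\tbacteria_0_collection_core_46_99_1\t1')
pep_url_for_line "$line"
```

The printed URL can be passed to wget in single quotes so that the shell leaves the wildcard for wget's FTP globbing.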

Best wishes

Ben Ensembl Helpdesk