Hello everybody!
I'm working with metaproteomics samples and I want to use different search engines to look for all the proteins. To do this, I need a good protein database. I tried to download it from the NCBI web page, but the dataset I want (all bacterial proteins) is too large to download via the web, and there is no pre-built set available for download via FTP.
I found this Perl script on the internet, but the download would take months.
#!/usr/bin/perl -w
# Based on www.ncbi.nlm.nih.gov/entrez/query/static/eutils_example.pl
# Usage: perl ncbi_fetch.pl > output_file
use strict;
use LWP::Simple;
use URI::Escape;

my $utils  = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils";
my $db     = ask_user("Database", "nuccore|nucest|protein|pubmed");
my $query  = ask_user("Query", "Entrez query");
my $report = ask_user("Report", "fasta|genbank|abstract|acc");

# Run the search once and keep the result set on the NCBI history server
my $esearch        = "$utils/esearch.fcgi?db=$db&usehistory=y&term=";
my $esearch_result = get($esearch . uri_escape($query));
defined $esearch_result or die "esearch request failed\n";
$esearch_result =~ m|<Count>(\d+)</Count>.*<QueryKey>(\d+)</QueryKey>.*<WebEnv>(\S+)</WebEnv>|s
    or die "Could not parse esearch result\n";
my ($Count, $QueryKey, $WebEnv) = ($1, $2, $3);
print STDERR "Count = $Count; QueryKey = $QueryKey; WebEnv = $WebEnv\n";

# Fetch the records in batches of $retmax
my $retstart = 0;
my $retmax   = 100000;
while ($retstart < $Count) {
    my $efetch = "$utils/efetch.fcgi?"
        . "rettype=$report&retmode=text&retstart=$retstart&retmax=$retmax&"
        . "db=$db&query_key=$QueryKey&WebEnv=$WebEnv";
    print STDERR "Downloading database $retstart / $Count\n";
    my $efetch_result = get($efetch) // "";

    # Count the records actually returned in this batch
    my $copy      = $efetch_result;
    my $countSeqs = 0;
    if ($report eq 'fasta') {
        $countSeqs = $copy =~ tr/>//;            # one '>' per FASTA record
    }
    elsif ($report eq 'genbank') {
        $countSeqs = () = $copy =~ m|^//|mg;     # one '//' terminator per record
    }
    elsif ($report eq 'acc') {
        $countSeqs = $copy =~ tr/\n//;           # one accession per line
    }

    # The last batch may hold fewer than $retmax records
    my $expected = $retmax;
    if ($retstart > $Count - $retmax) {
        $expected = $Count - $retstart;
    }

    # Accept the batch only if it looks complete (abstracts are not counted);
    # otherwise retry the same batch
    if ($report eq 'abstract' || $countSeqs >= $expected) {
        print $efetch_result;
        $retstart += $retmax;
    }
    else {
        print STDERR "ERROR...TRYING AGAIN ($countSeqs / $expected)\n";
    }
}

sub ask_user {
    print STDERR "$_[0] [$_[1]]: ";
    my $rc = <STDIN>;
    die "Error: Empty field: $_[0]\n" unless defined $rc;
    chomp $rc;
    die "Error: Empty field: $_[0]\n" if $rc eq "";
    return $rc;
}
Do you know another way to do this?
Thanks in advance
This Python script that I wrote downloads the FASTA sequences of all proteins matching a keyword, across all species. It is configurable, though, so see if you can make use of it: A: How to download all sequences of a list of proteins for a particular organism
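For reference, the core of such a downloader with Biopython looks roughly like this. This is a minimal sketch, not the linked script itself; the e-mail address, the query term "keratin", the batch size and the output file name are placeholders:

from Bio import Entrez

Entrez.email = "you@example.com"   # NCBI asks for a contact address

# Search once and keep the result set on the NCBI history server
handle = Entrez.esearch(db="protein", term="keratin", usehistory="y")
search = Entrez.read(handle)
handle.close()

count = int(search["Count"])
webenv, query_key = search["WebEnv"], search["QueryKey"]

# Fetch the matching records in batches as FASTA
batch = 5000
with open("proteins.fasta", "w") as out:
    for start in range(0, count, batch):
        fetch = Entrez.efetch(db="protein", rettype="fasta", retmode="text",
                              retstart=start, retmax=batch,
                              webenv=webenv, query_key=query_key)
        out.write(fetch.read())
        fetch.close()

The history server (usehistory="y") keeps the full result set on the NCBI side, so efetch can page through it with retstart/retmax instead of passing huge ID lists around.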
Edit: You are looking for the actual amino acid sequence, I presume?
Thank you! I used your script and it works perfectly. One question, though: if I search NCBI for "human", there are many results that I'm not interested in.
In this case: Animals (1,419,740), Plants (4,494), Fungi (898,540), Protists (203,856), Bacteria (84,639,903), Archaea (6,043), Viruses (1,749,114).
In my case, I would like to use "Homo sapiens"[Organism], but with that query your script doesn't work. Is there any solution for this?
Thanks again
I also noticed that only 20 sequences are downloaded for human :S
Any solution to this problem?
Yes, for human data, just replace this line:
...with this:
txid9606 is the NCBI taxonomy ID for Homo sapiens.
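In Entrez query syntax that means restricting the search term by taxonomy ID, roughly like this (a sketch; "query" is just an illustrative name for whatever variable holds the script's search term):

# Restrict the protein search to Homo sapiens by NCBI taxonomy ID
query = "txid9606[Organism:exp]"
handle = Entrez.esearch(db="protein", term=query, usehistory="y")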
Thanks for your reply.
I'm still having the same problem: only 20 sequences are downloaded, the same ones that appear on the website.
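For what it's worth, esearch returns only the first 20 IDs by default, so a script that fetches exactly the IDs it got back will stop at 20 records, the same 20 shown on the website. A sketch of the usual workaround, assuming that is what is happening here:

# esearch returns only 20 IDs unless told otherwise; the history server
# lets efetch page through the whole result set instead
handle = Entrez.esearch(db="protein", term="txid9606[Organism:exp]",
                        usehistory="y")
search = Entrez.read(handle)
handle.close()
print(search["Count"])   # total number of matches, not just 20
# ...then efetch in batches with webenv/query_key, as in the sketch above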