How to retrieve large datasets for proteins
8.5 years ago
Naresh ▴ 60

In the Entrez Programming Utilities Help, Application 3 describes how to retrieve large datasets. It uses chimpanzee as the example and retrieves mRNA sequences. I want to retrieve protein sequences for my analysis instead. I tried the same script in Perl, replacing mRNA with protein, and also tried .faa.gz files.

But I cannot get any output. Please guide me.

Thanks, Naresh


If you have the full list of GIs for your proteins of interest, you can use the E-utilities EFetch option to retrieve any number of sequences.


Tell us more precisely what you've tried. My guess is that you didn't write the URL correctly. To get a protein sequence given a GI, use http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&id=...
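
A minimal Perl sketch of that request for a list of IDs, in case it helps. The helper name `efetch_protein_url` and the placeholder IDs are made up for illustration, and the base URL uses https on the assumption that NCBI now serves the E-utilities over HTTPS. The download step is left commented out so the snippet runs without network access.

```perl
use strict;
use warnings;

# Base URL for the E-utilities (https assumed; older examples use http).
my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

# Hypothetical helper: build an efetch URL that asks for FASTA text
# for a comma-separated list of protein IDs.
sub efetch_protein_url {
    my @ids = @_;
    return $base
        . 'efetch.fcgi?db=protein&rettype=fasta&retmode=text&id='
        . join(',', @ids);
}

# Placeholder IDs for illustration only; substitute your own list.
my @ids = ('ID_1', 'ID_2', 'ID_3');
my $url = efetch_protein_url(@ids);
print "$url\n";

# To download, uncomment the lines below. They need LWP::Simple
# (plus LWP::Protocol::https for https URLs).
# use LWP::Simple;
# my $fasta = get($url);
# die "efetch request failed\n" unless defined $fasta;
# print $fasta;
```

For long ID lists, sending the IDs in batches rather than in one very long URL is probably safer.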

use LWP::Simple;
$query = 'Schizosaccharomyces pombe[orgn]+AND+biomol+mrna[prop]';

#assemble the esearch URL
$base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
$url = $base . "esearch.fcgi?db=nucleotide&term=$query&usehistory=y";

#post the esearch URL
$output = get($url);

#parse WebEnv, QueryKey and Count (# records retrieved)
$web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);
$key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);
$count = $1 if ($output =~ /<Count>(\d+)<\/Count>/);

#open output file for writing
open(OUT, ">Schizosaccharomyces pombe.fna") || die "Can't open file!\n";

#retrieve data in batches of 500
$retmax = 500;
for ($retstart = 0; $retstart < $count; $retstart += $retmax) {
        $efetch_url = $base ."efetch.fcgi?db=nucleotide&WebEnv=$web";
        $efetch_url .= "&query_key=$key&retstart=$retstart";
        $efetch_url .= "&retmax=$retmax&rettype=fasta&retmode=text";
        $efetch_out = get($efetch_url);
        print OUT "$efetch_out";
}

The opening and closing tags in the regexes don't match, e.g. you have <webenv> and <\/WebEnv>. Make sure that they match what's actually returned in $output; otherwise, you won't get anything.

EDIT: For proteins, you need $query = 'Schizosaccharomyces pombe[orgn]'; and $url = $base . "esearch.fcgi?db=protein&... and $efetch_url = $base ."efetch.fcgi?db=protein&WebEnv=$web";
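
Put together, the protein version of the script might look something like the sketch below. It is not the official NCBI example, only an adaptation under a few assumptions: https in the base URL, a hand-rolled `url_escape` helper (hypothetical name) so the space and brackets in the query are encoded, and the output file name taken from the command line. `esearch_url` and `efetch_url` are also names chosen here for illustration.

```perl
use strict;
use warnings;

my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $db   = 'protein';
my $term = 'Schizosaccharomyces pombe[orgn]';

# Percent-encode everything except unreserved URL characters (ASCII input assumed).
sub url_escape {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $text;
}

# Assemble the esearch URL; usehistory=y stores the result set on the history server.
sub esearch_url {
    my ($db, $term) = @_;
    return $base . "esearch.fcgi?db=$db&term=" . url_escape($term) . '&usehistory=y';
}

# Assemble the efetch URL for one batch of FASTA records.
sub efetch_url {
    my ($db, $web, $key, $retstart, $retmax) = @_;
    return $base . "efetch.fcgi?db=$db&WebEnv=$web&query_key=$key"
        . "&retstart=$retstart&retmax=$retmax&rettype=fasta&retmode=text";
}

# Run the download only when an output file name is given on the command line.
my $outfile = shift @ARGV;
if (defined $outfile) {
    require LWP::Simple;    # https URLs also need LWP::Protocol::https
    my $output = LWP::Simple::get(esearch_url($db, $term));
    die "esearch request failed\n" unless defined $output;

    # Parse WebEnv, QueryKey and Count; tag names are case-sensitive.
    my ($web)   = $output =~ /<WebEnv>(\S+)<\/WebEnv>/;
    my ($key)   = $output =~ /<QueryKey>(\d+)<\/QueryKey>/;
    my ($count) = $output =~ /<Count>(\d+)<\/Count>/;
    die "could not parse esearch reply\n"
        unless defined $web && defined $key && defined $count;

    open(my $out, '>', $outfile) or die "Can't open $outfile: $!\n";

    # Retrieve data in batches of 500.
    my $retmax = 500;
    for (my $retstart = 0; $retstart < $count; $retstart += $retmax) {
        my $fasta = LWP::Simple::get(efetch_url($db, $web, $key, $retstart, $retmax));
        die "efetch failed at retstart=$retstart\n" unless defined $fasta;
        print {$out} $fasta;
    }
    close($out) or die "Can't close $outfile: $!\n";
}
else {
    print "Would search: ", esearch_url($db, $term), "\n";
}
```

Saved under a name of your choice, e.g. `fetch_proteins.pl`, it would be run as `perl fetch_proteins.pl spombe.faa`. Without an argument it only prints the search URL, which is a quick way to check the query in a browser first.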


This is the output:

C:\Users\Naresh\Documents>perl C:\Users\Naresh\perl_tests\hello_worldpl
Bareword found where operator expected at C:\Users\Naresh\perl_tests\hello_worldpl line 6, near ""esearch.fcgi?db=protein&$efetch_url =$base . "efetch"
        (Missing operator before efetch?)
Operator or semicolon missing before &WebEnv at C:\Users\Naresh\perl_tests\hello_worldpl line 6.
Ambiguous use of & resolved as operator & at C:\Users\Naresh\perl_tests\hello_worldpl line 6.
String found where operator expected at C:\Users\Naresh\perl_tests\hello_worldpl line 6, near "open(OUT, ""
        (Missing semicolon on previous line?)
String found where operator expected at C:\Users\Naresh\perl_tests\hello_worldpl line 6, near "faa") || die ""
Bareword found where operator expected at C:\Users\Naresh\perl_tests\hello_worldpl line 6, near "") || die "Can't"
        (Missing operator before Can't?)
Precedence problem: open file should be open(file) at C:\Users\Naresh\perl_tests\hello_worldpl line 6.
String found where operator expected at C:\Users\Naresh\perl_tests\hello_worldpl line 6, near "$efetch_url = $base .""
        (Missing semicolon on previous line?)
Operator or semicolon missing before &WebEnv at C:\Users\Naresh\perl_tests\hello_worldpl line 6.
Ambiguous use of & resolved as operator & at C:\Users\Naresh\perl_tests\hello_worldpl line 6.
String found where operator expected at C:\Users\Naresh\perl_tests\hello_worldpl line 6, near "$efetch_url .= ""
        (Missing semicolon on previous line?)
Operator or semicolon missing before &retstart at C:\Users\Naresh\perl_tests\hello_worldpl line 6.
Ambiguous use of & resolved as operator & at C:\Users\Naresh\perl_tests\hello_worldpl line 6.
String found where operator expected at C:\Users\Naresh\perl_tests\hello_worldpl line 6, near "$efetch_url .= ""
        (Missing semicolon on previous line?)
Operator or semicolon missing before &rettype at C:\Users\Naresh\perl_tests\hello_worldpl line 6.
Ambiguous use of & resolved as operator & at C:\Users\Naresh\perl_tests\hello_worldpl line 6.
Operator or semicolon missing before &retmode at C:\Users\Naresh\perl_tests\hello_worldpl line 6.
Ambiguous use of & resolved as operator & at C:\Users\Naresh\perl_tests\hello_worldpl line 6.
String found where operator expected at C:\Users\Naresh\perl_tests\hello_worldpl line 6, near "print OUT ""
        (Missing semicolon on previous line?)
String found where operator expected at C:\Users\Naresh\perl_tests\hello_worldpl line 6, at end of line
        (Missing semicolon on previous line?)
syntax error at C:\Users\Naresh\perl_tests\hello_worldpl line 6, near ""esearch.fcgi?db=protein&$efetch_url =$base . "efetch"
Can't find string terminator '"' anywhere before EOF at C:\Users\Naresh\perl_tests\hello_worldpl line 6.

C:\Users\Naresh\Documents>
use LWP::Simple;
$query = 'Schizosaccharomyces pombe[orgn]';

#assemble the esearch URL
$base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
$url = $base . "esearch.fcgi?db=protein&$efetch_url =$base . "efetch.fcgi?db=protein&WebEnv=$web";

#post the esearch URL
$output = get($url);

#parse WebEnv, QueryKey and Count (# records retrieved)
$web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);
$key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);
$count = $1 if ($output =~ /<Count>(\d+)<\/Count>/);

#open output file for writing
open(OUT, ">Schizosaccharomyces pombe.faa") || die "Can't open file!\n";

#retrieve data in batches of 500
$retmax = 500;
for ($retstart = 0; $retstart < $count; $retstart += $retmax) {
        $efetch_url = $base ."efetch.fcgi?db=nucleotide&WebEnv=$web";
        $efetch_url .= "&query_key=$key&retstart=$retstart";
        $efetch_url .= "&retmax=$retmax&rettype=fasta&retmode=text";
        $efetch_out = get($efetch_url);
        print OUT "$efetch_out";
}

Don't just copy/paste code from a web page. It may not be properly formatted. For example, things like 'assemble the esearch URL' are comments, not code, so they should be written as proper Perl comments. Also, you still haven't corrected the regexes. The difference between upper and lower case is meaningful. If you don't know how to program in Perl, I suggest you at least have a quick look at a tutorial. Also, when posting code, please try to format it for readability.
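
To make the point about case concrete, here is a small sketch that runs both a lowercase and a correctly cased pattern against a made-up eSearch reply. The XML string and its values are invented for illustration, not real server output.

```perl
use strict;
use warnings;

# Hypothetical eSearch reply, trimmed; the values are made up for illustration.
my $output = '<eSearchResult><Count>1234</Count><QueryKey>1</QueryKey>'
           . '<WebEnv>EXAMPLE_WEBENV_STRING</WebEnv></eSearchResult>';

# Tag names are case-sensitive: the lowercase opening tag does not match the reply.
my ($wrong) = $output =~ /<webenv>(\S+)<\/WebEnv>/;
my ($right) = $output =~ /<WebEnv>(\S+)<\/WebEnv>/;

print(defined($wrong) ? "lowercase pattern matched\n" : "lowercase pattern: no match\n");
print "WebEnv: $right\n";
```

This prints "lowercase pattern: no match" followed by "WebEnv: EXAMPLE_WEBENV_STRING". Starting scripts with `use strict;` and `use warnings;` also tends to surface an unset `$web` early instead of silently building a broken efetch URL.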


I have never used Perl. I will learn it now. Sorry for not formatting the code for readability.
