6.2 years ago
MAPK
I have a text file called org.txt. I also have the Perl script below. The script works if I hard-code $name="SS1G_03709"; but it doesn't work when I try to loop over all the gene symbols. I tried to loop over each $name and print the output to the test_organism_seqs.fa file, but there seems to be something wrong with reading the file and with the looping step. I am new to Perl, so I would really appreciate it if someone could help me resolve this issue. Thanks!
org.txt
SS1G_03709
SS1G_07286
SS1G_06430
SS1G_01676
SS1G_08825
SS1G_01347
code:
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
open (my $OUT, '>', '/home/owner/test_organism_seqs.fa') || die "Can't open file:$!";
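# Read gene symbols from org.txt, one per line (adjust the path if the file lives elsewhere)
open(my $IN, '<', 'org.txt') || die "Can't open org.txt: $!";
while (my $line = <$IN>) {
chomp $line;
# Split on whitespace; this also skips blank lines
my @names = split(' ', $line);
foreach my $name (@names) {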
my $db = 'nuccore';
my $query = "$name+AND+srcdb_refseq[PROP]";
#base URL (NCBI E-utilities are served over HTTPS; LWP needs LWP::Protocol::https for this)
my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $url= $base . "esearch.fcgi?db=$db&term=$query&usehistory=y";
#Run the search using the URL created above
my $output = get($url);
#Web Environment. This parameter specifies the Web Environment that
#contains the UID list to be provided as input to ESummary. Usually
#this WebEnv value is obtained from the output of a previous ESearch,
#EPost or ELink call. The WebEnv parameter must be used in
#conjunction with query_key.
my ($web) = $output =~ /<WebEnv>(\S+)<\/WebEnv>/;
#Query key. This integer specifies which of the UID lists attached to the given
#Web Environment will be used as input to ESummary. Query keys are obtained
# from the output of previous ESearch, EPost or ELink calls. The query_key
#parameter must be used in conjunction with WebEnv.
my ($key) = $output =~ /<QueryKey>(\d+)<\/QueryKey>/;
$url = $base . "esummary.fcgi?db=$db&query_key=$key&WebEnv=$web";
#Run the search using the esummary URL created above
my $docsums = get($url);
$url = $base . "efetch.fcgi?db=$db&query_key=$key&WebEnv=$web";
$url.= "&rettype=fasta&retmode=text";
#Run the search using the efetch URL created above.
my $data = get($url);
print $OUT "$data";
}
}
close $OUT;
exit;
You don't have to use this Perl script for retrieving sequences. See this recent comment for inspiration. It should be possible to figure out the changes you need in that command line (hints: change the database, use efetch -format fasta, etc.).
Thanks, so the db can be "gene"? I don't want to fetch all the different nucleotide records for the given gene symbol, only the gene sequence.
Your identifiers do not appear to be in the gene database, so you can't use that. nuccore is your choice there. Pay attention to the query, since you will need to change it unless you want to download entire genome sequences.
@genomax Sorry, but I found it a bit tricky. I tried something like this:
efetch -format fasta -db nuccore -query "SS1G_03709"+AND+srcdb_refseq[PROP]+AND+[GENE]"
but that doesn't return anything.
-query is not a legal parameter for efetch. You need to use it with esearch first, then pipe the output to efetch. Also, your query string should be changed to indicate that SS1G_03709 is the term for [GENE]. With those changes, the command will be:
Note, I used -format acc with efetch here for brevity. Change it to -format fasta for sequence in FASTA format. That said, do you need the sequence of the genomic RefSeq as well? If not, add AND biomol_rna[PROP] to your query and that'll return only RefSeq RNAs.
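For anyone wanting to run this over the whole org.txt list, here is a minimal sketch of the esearch-then-efetch approach described above. It assumes NCBI EDirect (esearch, efetch) is installed and on the PATH. The helper names build_query and fetch_all are made up for illustration, and the query string is only one reading of the advice in this thread, so check a single symbol by hand before trusting the batch output.

```shell
#!/bin/sh
# Hypothetical helper: build the Entrez query for one gene symbol,
# restricting the term to the [GENE] field and to RefSeq records.
build_query() {
    printf '%s[GENE] AND srcdb_refseq[PROP]' "$1"
}

# Hypothetical helper: fetch FASTA records for every symbol in a file
# (one symbol per line, blank lines skipped).
fetch_all() {
    while IFS= read -r name; do
        [ -n "$name" ] || continue
        # esearch gets /dev/null on stdin so it cannot swallow
        # the remaining lines of the symbol list.
        esearch -db nuccore -query "$(build_query "$name")" < /dev/null |
            efetch -format fasta
    done < "$1"
}

# Usage (needs EDirect and network access):
#   fetch_all org.txt > test_organism_seqs.fa
```

To get only RefSeq RNAs, as suggested above, append AND biomol_rna[PROP] to the string printed by build_query.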