Question

How do I make this perl script work to fetch sequences from NCBI using gene symbols?

1

Entering edit mode

6.2 years ago

MAPK ★ 2.1k

I have a text file called org.txt. I also have this perl script below. This script works if I just have $name="SS1G_03709";, but doesn't work when I want to loop over all gene symbols. I tried to loop over each $name and print the output in test_organism_seqds.fa file, but there seems to be something wrong reading file and in looping step. I am new in perl so I would really appreciate if someone could help me resolve this issue? thanks!

org.txt

SS1G_03709
SS1G_07286
SS1G_06430
SS1G_01676
SS1G_08825
SS1G_01347

code:

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;

open (my $OUT, '>', '/home/owner/test_organism_seqs.fa') || die "Can't open file:$!";

open(INFILE,"</org.txt>){
    chomp;
    my @names = split('\n', $_);
    foreach my $name(@names){

my $db = 'nuccore';
my $query = "$name+AND+srcdb_refseq[PROP]";

#base URL
my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $url= $base . "esearch.fcgi?db=$db&term=$query&usehistory=y";

#Run the search using the URL created above
my $output = get($url);

#Web Environment. This parameter specifies the Web Environment that 
#contains the UID list to be provided as input to ESummary. Usually 
#this WebEnv value is obtained from the output of a previous ESearch, 
#EPost or ELink call. The WebEnv parameter must be used in 
#conjunction with query_key.
my $web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);

#Query key. This integer specifies which of the UID lists attached to the given 
#Web Environment will be used as input to ESummary.  Query keys are obtained
# from the output of previous ESearch, EPost or ELink calls.  The query_key 
#parameter must be used in conjunction with WebEnv.
my $key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);

$url = $base . "esummary.fcgi?db=$db&query_key=$key&WebEnv=$web";

#Run the search using the esummary URL created above
my $docsums = get($url);

$url = $base . "efetch.fcgi?db=$db&query_key=$key&WebEnv=$web";
$url.= "&rettype=fasta&retmode=text";

#Run the search using the efetch URL created above.
my $data = get($url);
print $OUT "$data";
}
}

close $OUT;
exit;

perl eutils • 3.0k views

ADD COMMENT • link updated 5.7 years ago by josev.die ▴ 70 • written 6.2 years ago by MAPK ★ 2.1k

1

Entering edit mode

You don't have to use this perl script for retrieving sequences. See this recent comment for inspiration. It should be possible to figure out changes you need in that command line (hints: change database, use efetch -format fasta etc).

ADD REPLY • link 6.2 years ago by GenoMax 148k

0

Entering edit mode

Thanks, so the db can be "gene"? I don't want to fetch all different nucleotides for the given gene symbol, but only gene sequence.

ADD REPLY • link 6.2 years ago by MAPK ★ 2.1k

1

Entering edit mode

Your identifiers do not appear to be in the gene database so you can't use that. nuccore is your choice there. Pay attention to the query since you will need to change it unless you want to download entire genome sequences.

ADD REPLY • link 6.2 years ago by GenoMax 148k

0

Entering edit mode

@genomax Sorry, but found it a bit tricky. Tried something like this efetch -format fasta -db nuccore -query "SS1G_03709"+AND+srcdb_refseq[PROP]+AND+[GENE]", but won't get anything.

ADD REPLY • link 6.2 years ago by MAPK ★ 2.1k

1

Entering edit mode

-query is not a legal parameter for efetch. You need to use use it with esearch first, then pipe the output to efetch. Also, your query string should be changed to indicate that the term SS1G_03709 is the term for [GENE]. With those changes, the command will be:

esearch -db nuccore -query "SS1G_03709[GENE] AND srcdb_refseq[PROP]" | efetch -format acc
NW_001820832.1
XM_001595570.1

Note, I used -format acc with efetch here for brevity. Change it to -fromat fasta for sequence in FASTA format. That said, do you need the sequence of the genomic RefSeq as well? If not, add AND biomol_rna[PROP] to your query and that'll return only RefSeq RNAs.

ADD REPLY • link 6.2 years ago by vkkodali_ncbi ★ 3.8k

score 0 · Answer 1 · 2019-04-14

If you are open to use R, this code will do the work for you.

#Dependency
library(rentrez)

 #Loop over your gene symbols, fetch the protein sequence and put it into a fasta file
save_AAfasta <- function(IDs, nameFile) {
     for(i in seq(length(IDs))) {
      myterm = paste0(org[1],'[ALL]')
      mysearch = entrez_search(db = 'gene', term = myterm)
      prot.link = entrez_link(dbfrom = 'gene', id = mysearch$ids, db = 'protein')
      prot_id = prot.link$links$gene_protein_refseq
      protein_fasta <- entrez_fetch(db="protein", id=prot_id, rettype="fasta")
      write(protein_fasta, file= paste(nameFile, ".fasta", sep = ""), append = TRUE)
     }
  }

Now, you can define a vector with your gene symbols and just call the function:

org = c('SS1G_03709','SS1G_07286','SS1G_06430','SS1G_01676','SS1G_08825',
    'SS1G_01347')
save_AAfasta(org, "Downloads/myproteins")