Question

retrieve all nt sequences for a taxid list

2.6 years ago

pe_se ▴ 10

Hi, I have a list of ~2000 taxids and would like to retrieve all available nucleotide sequences for each taxon to build a reference database. Batch Entrez only returns an error, even when I upload a single taxid or accession number (as .txt or .xml): "An illegal character in a token. Possible wrong file format. Request processing canceled." It also doesn't work with this Perl script:

perl -e 'use LWP::Simple;getstore("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&retmode=text&id=".join(",",qw(410645, 410645, ...)),"seqs.fasta");'

where "..." is the list of taxids. Even with only a few IDs, it retrieves far fewer sequences than are available in GenBank. I'm not sure what's wrong there.

Can anyone advise how to compile these (with little to no coding skills...)? Thanks!

ncbi taxid efetch • 880 views

ADD COMMENT • link updated 2.6 years ago by GenoMax 148k • written 2.6 years ago by pe_se ▴ 10

Two options, neither completely trivial: either use the command-line E-utilities in a shell script and loop over all taxids read from a file, or download the whole NT database (which you might already have) together with the NCBI taxonomy, and create an accession list for each taxid to pass to BLAST.

ADD REPLY • link 2.6 years ago by Michael 55k
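
The second option could look roughly like the sketch below. It assumes you have downloaded and unpacked NCBI's nucl_gb.accession2taxid table (tab-separated, one header line, columns accession, accession.version, taxid, gi) and that your taxids sit in a file, one per line. The function name and the .acc output names are made up for illustration. It matches each taxid exactly, so sequences assigned to child taxa (for example strains below a species) are not picked up; for that you would also need the taxonomy dump.

```shell
# Write one accession list per taxid: <taxid>.acc
# usage: make_acc_lists id nucl_gb.accession2taxid
make_acc_lists() {
    ids=$1      # file with one taxid per line
    table=$2    # unpacked accession2taxid table (assumed layout, see above)
    awk -F '\t' '
        # First file: remember the taxids we want
        NR == FNR { if ($1 != "") want[$1] = 1; next }
        # Second file: skip the header, keep rows whose taxid is wanted
        FNR > 1 && ($3 in want) {
            out = $3 ".acc"
            print $2 >> out     # column 2 = accession.version
            close(out)          # avoid running out of open files with many taxids
        }
    ' "$ids" "$table"
}
```

Because the lists are opened in append mode, delete old .acc files before rerunning. Each list could then be handed to a local BLAST installation, for instance via blastdbcmd with its -entry_batch option, to pull the sequences out of a local nt database.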

score 3 · Answer 1 · 2022-06-21

You can use Entrez Direct (EDirect):

$ more id
2104
2093
3256

$ for i in `cat id`; do echo ${i}; esearch -db nuccore -query "${i}[taxID]" ; done
2104
<ENTREZ_DIRECT>
  <Db>nuccore</Db>
  <WebEnv>MCID_62b1aedbfe64814cb17d73bd</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>40356</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
2093
<ENTREZ_DIRECT>
  <Db>nuccore</Db>
  <WebEnv>MCID_62b1aedcb1afcf6af8447fe6</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>119993</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
3256
<ENTREZ_DIRECT>
  <Db>nuccore</Db>
  <WebEnv>MCID_62b1aedc97bf993ae11f75af</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>22818</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
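
With ~2000 taxids it may be worth tabulating the counts first, to see how much data each taxon will pull down. A minimal sketch, assuming the taxids are in a file with one per line; the function name is hypothetical, and sed is used here only to keep the example free of extra tools (EDirect's own xtract is the usual way to pull fields out of this XML).

```shell
# Print "taxid<TAB>count" for every taxid in a file (one taxid per line).
# usage: count_taxids id
count_taxids() {
    while read -r taxid || [ -n "$taxid" ]; do
        [ -n "$taxid" ] || continue     # skip blank lines
        # </dev/null keeps esearch from reading the rest of the ID file
        # from the loop's standard input
        count=$(esearch -db nuccore -query "${taxid}[taxID]" </dev/null |
                sed -n 's:.*<Count>\([0-9]*\)</Count>.*:\1:p')
        printf '%s\t%s\n' "$taxid" "$count"
    done < "$1"
}
```

Running `count_taxids id > counts.tsv` would give a two-column table you can sort or open in a spreadsheet before committing to the full download.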

To actually retrieve the sequences, do something like this:

$ for i in `cat id`; do echo ${i}; esearch -db nuccore -query "${i}[taxID]" | efetch -format fasta > "${i}.fa"; done
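
For a list of ~2000 taxids the loop will run for a long time and some requests may fail along the way, so a version that can be rerun safely might be more convenient. The following is only a sketch under a few assumptions: the function name, the DELAY variable and the failed_taxids.txt log are hypothetical, and a taxid with no sequences at all ends up in the log just like a failed download.

```shell
# Download all nuccore records for each taxid into <taxid>.fa
# usage: fetch_taxids id
fetch_taxids() {
    while read -r taxid || [ -n "$taxid" ]; do
        [ -n "$taxid" ] || continue             # skip blank lines
        out="${taxid}.fa"
        if [ -s "$out" ]; then                  # already downloaded: skip on rerun
            echo "skipping $taxid" >&2
            continue
        fi
        echo "fetching $taxid" >&2
        # Write to a temporary file first so an interrupted download
        # never leaves a truncated <taxid>.fa behind.
        if esearch -db nuccore -query "${taxid}[taxID]" </dev/null |
           efetch -format fasta > "${out}.tmp" && [ -s "${out}.tmp" ]; then
            mv "${out}.tmp" "$out"
        else
            rm -f "${out}.tmp"
            echo "$taxid" >> failed_taxids.txt  # hypothetical log of taxids to retry
        fi
        sleep "${DELAY:-1}"                     # short pause between requests
    done < "$1"
}
```

Paste the function into your shell (or a script) and call `fetch_taxids id`. Afterwards, `cat *.fa > reference.fasta` would combine everything into one file, and any taxids listed in failed_taxids.txt can be retried by running the function again on that file.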