Question

esearch get taxonomy ID from a large list of accession IDs

3.0 years ago

garfield320 ▴ 20

This is my first time running esearch on my computer (Ubuntu 20.04 under WSL2 on Windows 10), so sorry if this is a dumb question.

I have a large text file of GenBank accession IDs that I'm trying to convert to taxonomy IDs. My input file is a plain txt file with one ID per row, like below:

$ cat /mnt/c/Users/username/Desktop/input.txt
SEM89725.1
WP_037213762.1
WP_058500402.1
...

I've gone through past posts on Biostars and found that I can get the taxonomy ID for an individual accession ID (below is what I typed into my Ubuntu terminal):

esearch -db protein -query "SEM89725.1" | elink -target taxonomy | efetch -format uid > output.txt

But I'm hoping to repeat this search for the thousands of GenBank accession IDs in my text file. I tried the following, but it didn't work:

esearch -db protein -query /mnt/c/Users/username/Desktop/input.txt | elink -target taxonomy | efetch -format uid > /mnt/c/Users/username/Desktop/output.txt

1) Is there any way to run all the esearch queries from my input.txt file in one go? 2) Is there also a way to include the query itself in the output, so that I can tell which accession ID was converted to which taxonomy ID? Ideally, my output file would look like:

$ cat /mnt/c/Users/username/Desktop/output.txt
SEM89725.1    1173111
WP_037213762.1    334545
WP_058500402.1    454
...

Any advice would be really appreciated!

esearch ubuntu • 2.3k views

ADD COMMENT • link updated 16 months ago by Bertalan_Takacs ▴ 140 • written 3.0 years ago by garfield320 ▴ 20

score 1 · Answer 1 · 2022-07-26

3.0 years ago

GenoMax 152k

The file named id below contains your IDs, one per line:

$ for i in $(cat id); do printf '%s\t' "$i"; esearch -db protein -query "$i" | elink -target taxonomy | efetch -format docsum | xtract -pattern DocumentSummary -element TaxId; done
SEM89725.1      1173111
WP_037213762.1  334545
WP_058500402.1  454

If you also need the organism name:

$ for i in $(cat id); do printf '%s\t' "$i"; esearch -db protein -query "$i" | elink -target taxonomy | efetch -format docsum | xtract -pattern DocumentSummary -element ScientificName,TaxId; done
SEM89725.1      Lihuaxuella thermophila 1173111
WP_037213762.1  Rickettsia tamurae      334545
WP_058500402.1  Legionella israelensis  454
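
For a long list, the same loop may be easier to manage as a small script. The following is only a sketch of that idea: lookup_taxid and acc2taxid are hypothetical helper names, the pipeline inside is the one from this answer, and the "NA" placeholder for accessions that return nothing is an assumption added here. It also strips Windows line endings, which can sneak in when the ID file is edited outside WSL.

```shell
#!/bin/sh
# Sketch: map each accession in a file to a taxonomy ID, one lookup per line.

# lookup_taxid (hypothetical name) wraps the esearch pipeline from the answer.
lookup_taxid() {
    # </dev/null keeps esearch from consuming the rest of the loop's input.
    esearch -db protein -query "$1" </dev/null |
        elink -target taxonomy |
        efetch -format docsum |
        xtract -pattern DocumentSummary -element TaxId
}

# acc2taxid (hypothetical name) prints "accession<TAB>taxid" for each input line.
acc2taxid() {
    # tr removes carriage returns left by Windows editors.
    tr -d '\r' < "$1" | while IFS= read -r acc || [ -n "$acc" ]; do
        [ -z "$acc" ] && continue              # skip blank lines
        taxid=$(lookup_taxid "$acc")
        printf '%s\t%s\n' "$acc" "${taxid:-NA}"   # NA marks an empty result
    done
}
```

After sourcing the file, something like acc2taxid id > output.txt should give one tab-separated row per accession, so failed lookups stay visible instead of shifting the columns.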

ADD COMMENT • link 3.0 years ago by GenoMax 152k

Thanks for this answer! I am using your method to get taxids for RefSeq assemblies. My problem is that I have 140,000 assembly IDs, so the process takes a very long time. Do you have a suggestion for speeding it up?

ADD REPLY • link 16 months ago by Bertalan_Takacs ▴ 140
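
One possible way to cut the number of requests is to send IDs in batches rather than one at a time. The sketch below only covers the batching step: make_batches is a hypothetical helper name and the default batch size of 200 is an arbitrary assumption. The commented usage is written for the protein accessions from the original question and assumes the protein docsum exposes AccessionVersion and TaxId fields, which is worth verifying on a small batch first; assembly records would need the assembly database and its own field names.

```shell
#!/bin/sh
# make_batches (hypothetical name): print comma-joined groups of IDs.
# $1: file with one ID per line
# $2: IDs per batch (defaults to 200, an arbitrary choice)
make_batches() {
    tr -d '\r' < "$1" | awk -v n="${2:-200}" '
        NF {                                    # ignore blank lines
            buf = (buf == "" ? $1 : buf "," $1)
            if (++k == n) { print buf; buf = ""; k = 0 }
        }
        END { if (buf != "") print buf }        # flush the last partial batch
    '
}

# Possible usage (needs EDirect and network access, so left commented out):
# make_batches id 200 | while IFS= read -r batch; do
#     efetch -db protein -id "$batch" -format docsum </dev/null |
#         xtract -pattern DocumentSummary -element AccessionVersion TaxId
# done
```

Because each row then carries its own accession, the output does not depend on the order in which records come back.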

score 0 · Answer 2 · 2023-03-15

You can also do that with blastdbcmd if you have a local BLAST installation with the nr database:

marco@blast$ blastdbcmd -db nr -entry_batch test.txt -outfmt "%a %T" -target_only
SEM89725.1 1173111
WP_037213762.1 334545
WP_058500402.1 454

The drawback is that the nr database is quite large and needs a decent amount of RAM, so I'm afraid you probably won't get it running on a laptop. Also, -target_only has recently not always worked as expected, so be careful if you go this route (see the post "blastdbcmd error with -target_only option" if you want to know the details). I reported it to NCBI and hopefully they'll fix it soon.
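
Given that caveat, a cheap safeguard is to check that every requested accession shows up exactly once in the blastdbcmd output. This is a minimal sketch, not part of blastdbcmd itself: check_rows is a hypothetical helper name, and it assumes the output was written with the "%a %T" format used above, so the accession is the first column.

```shell
#!/bin/sh
# check_rows (hypothetical name): list accessions that do not appear exactly once.
# $1: file of requested accessions, one per line
# $2: blastdbcmd output in "%a %T" format (accession, then taxid)
# Prints "accession<TAB>row_count" for every accession with 0 or 2+ rows.
check_rows() {
    awk '
        FILENAME == ARGV[1] { seen[$1]++; next }   # first file: count output rows
        NF && seen[$1] != 1 { print $1 "\t" (seen[$1] + 0) }
    ' "$2" "$1"
}
```

For example, check_rows test.txt out.txt prints nothing when every accession has exactly one row; anything it does print is worth looking at by hand.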