esearch get taxonomy ID from a large list of accession IDs
2.5 years ago
garfield320 ▴ 20

This is the first time I'm running esearch on my computer (Ubuntu 20.04, Windows 10, running WSL2) so sorry if this is a dumb question.

I have a large text file of GenBank accession IDs that I'm trying to convert to taxonomy IDs. My input file is a plain txt file with one ID per line, like below:

$ cat /mnt/c/Users/username/Desktop/input.txt
SEM89725.1
WP_037213762.1
WP_058500402.1
...

I went through past posts on Biostars and found that I can get the taxonomy ID for an individual accession ID (below is what I typed into my Ubuntu terminal):

esearch -db protein -query "SEM89725.1" | elink -target taxonomy | efetch -format uid > output.txt

But I'm hoping to repeat this search for the thousands of GenBank accession IDs in my text file. I tried passing the file path as the query, but it didn't work:

esearch -db protein -query /mnt/c/Users/username/Desktop/input.txt | elink -target taxonomy | efetch -format uid > /mnt/c/Users/username/Desktop/output.txt

1) Is there any way to pass my input.txt file once and run all the searches in one go? 2) Is there also a way to include the query itself in the output, so that I can tell which accession ID was converted to which taxonomy ID? Ideally, my output file would look like:

$ cat /mnt/c/Users/username/Desktop/output.txt
SEM89725.1    1173111
WP_037213762.1    334545
WP_058500402.1    454
...

Any advice would be really appreciated!

esearch ubuntu • 1.8k views
2.5 years ago
GenoMax 148k

The file named id below contains your IDs, one per line:

$ for i in $(cat id); do printf '%s\t' "${i}"; esearch -db protein -query "${i}" | elink -target taxonomy | efetch -format docsum | xtract -pattern DocumentSummary -element TaxId; done
SEM89725.1      1173111
WP_037213762.1  334545
WP_058500402.1  454

If you also need the organism name, add ScientificName to the extracted elements:

$ for i in $(cat id); do printf '%s\t' "${i}"; esearch -db protein -query "${i}" | elink -target taxonomy | efetch -format docsum | xtract -pattern DocumentSummary -element ScientificName,TaxId; done
SEM89725.1      Lihuaxuella thermophila 1173111
WP_037213762.1  Rickettsia tamurae      334545
WP_058500402.1  Legionella israelensis  454
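
For a list of thousands of IDs kept on the Windows side of WSL, the same loop could be wrapped roughly as below. This is only a sketch: the function names lookup_taxid and map_ids and the NA placeholder are made up for illustration, and it assumes EDirect is on your PATH. It strips Windows carriage returns from the ID file (a common problem with files edited under /mnt/c) and redirects esearch's stdin so it does not swallow the rest of the list inside a read loop.

```shell
# Look up the taxonomy ID for one protein accession using the same
# EDirect pipeline as the loop above (needs EDirect and network access).
lookup_taxid() {
    # </dev/null stops esearch from consuming the remaining IDs on stdin
    esearch -db protein -query "$1" </dev/null |
        elink -target taxonomy |
        efetch -format docsum |
        xtract -pattern DocumentSummary -element TaxId
}

# Print "accession<TAB>taxid" for every non-empty line of the file in $1.
# Accessions that return nothing get NA (placeholder chosen for this sketch).
map_ids() {
    tr -d '\r' < "$1" | while IFS= read -r acc || [ -n "$acc" ]; do
        if [ -z "$acc" ]; then
            continue
        fi
        taxid=$(lookup_taxid "$acc")
        printf '%s\t%s\n' "$acc" "${taxid:-NA}"
    done
}
```

It would then be called as: map_ids /mnt/c/Users/username/Desktop/input.txt > /mnt/c/Users/username/Desktop/output.txt. It still makes one request chain per accession, so it is no faster than the plain loop.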

Thanks for this answer! I am using your method to get taxids for RefSeq assemblies. My problem is that I have 140,000 assembly IDs, so the process takes a very long time. Do you have any suggestions for speeding it up?

22 months ago
biomarco ▴ 50

You can also do that with blastdbcmd if you have a local BLAST installation with the nr database:

marco@blast$ blastdbcmd -db nr -entry_batch test.txt -outfmt "%a %T" -target_only
SEM89725.1 1173111
WP_037213762.1 334545
WP_058500402.1 454

The drawback is that the nr database is quite large and needs a decent amount of RAM, so on a laptop you probably won't be able to get it running. Also, -target_only has recently not always behaved as expected, so check your results carefully if you go this route (see blastdbcmd error with -target_only option if you want to know the details). I reported it to NCBI and hopefully they'll fix it soon.
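
Since the exact way -target_only misbehaves isn't described here, one cautious option is to cross-check the blastdbcmd output against the ID list you asked for. The sketch below is just that, a sketch: check_results is a hypothetical helper name, and it assumes the output was saved to a file with the accession in the first column, as in the "%a %T" format above. It reports requested accessions that are missing, accessions that appear more than once, and accessions you never asked for.

```shell
# Compare the requested IDs ($1) with saved blastdbcmd output ($2).
# Prints "accession<TAB>status" for anything that looks off, sorted.
check_results() {
    awk '
        # first file: the requested accessions (drop any Windows \r)
        FILENAME == ARGV[1] { sub(/\r$/, ""); if ($1 != "") want[$1] = 1; next }
        # second file: count how often each accession was returned
        $1 != "" { got[$1]++ }
        END {
            for (id in want)
                if (!(id in got)) print id "\tmissing"
            for (id in got) {
                if (!(id in want)) print id "\tunrequested"
                else if (got[id] > 1) print id "\tduplicated"
            }
        }
    ' "$1" "$2" | LC_ALL=C sort
}
```

For example, after saving the command's output with > out.txt, running check_results test.txt out.txt prints nothing when every requested accession came back exactly once.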
