This is the first time I'm running esearch on my computer (Ubuntu 20.04 under WSL2 on Windows 10), so sorry if this is a dumb question.
I have a large text file of GenBank accession IDs that I'm trying to convert to taxonomy IDs. My input file is a plain txt file with one ID per line, like below:
$ cat /mnt/c/Users/username/Desktop/input.txt
SEM89725.1
WP_037213762.1
WP_058500402.1
...
I went through past posts on Biostars and found that I can get the taxonomy ID for an individual accession ID (this is what I typed into my Ubuntu terminal):
esearch -db protein -query "SEM89725.1" | elink -target taxonomy | efetch -format uid > output.txt
But I'm hoping to repeat this search for the thousands of GenBank accession IDs in my text file. I tried the following, but it didn't work:
esearch -db protein -query /mnt/c/Users/username/Desktop/input.txt | elink -target taxonomy | efetch -format uid > /mnt/c/Users/username/Desktop/output.txt
1) Is there any way to run esearch for every ID in my input.txt file in one go?
2) Is there also a way to include the query itself in the output, so that I can tell which accession ID was converted to which taxonomy ID? Ideally, my output file would look like:
$ cat /mnt/c/Users/username/Desktop/output.txt
SEM89725.1 1173111
WP_037213762.1 334545
WP_058500402.1 402
...
Any advice would be really appreciated!
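One possible approach to both questions, as a minimal sketch rather than a definitive answer: -query expects a search string, not a file path, which is presumably why the command with input.txt fails. The loop below runs the single-ID pipeline from above once per line and prints the accession next to the result. The function name acc2taxid is made up for illustration. Printing NA when nothing comes back and keeping only the first returned ID are assumptions about how misses and multiple hits should be handled.

```shell
# Sketch: one lookup per accession, printed as "accession taxid".
# Assumes EDirect (esearch, elink, efetch) is installed and on PATH.
# acc2taxid is a hypothetical helper name, not part of EDirect.
acc2taxid() {
    local infile="$1"
    local acc taxid
    # Strip Windows carriage returns, since the file lives under /mnt/c.
    tr -d '\r' < "$infile" | while IFS= read -r acc || [ -n "$acc" ]; do
        [ -z "$acc" ] && continue      # skip blank lines
        # </dev/null stops esearch from swallowing the rest of the list,
        # which would otherwise end the loop after the first accession.
        taxid=$(esearch -db protein -query "$acc" </dev/null |
                elink -target taxonomy |
                efetch -format uid |
                head -n 1)             # assumption: keep only the first ID
        printf '%s %s\n' "$acc" "${taxid:-NA}"
    done
}
```

It could then be called as: acc2taxid /mnt/c/Users/username/Desktop/input.txt > /mnt/c/Users/username/Desktop/output.txt. This sends one set of requests per accession, so it will be slow for thousands of IDs.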
Thanks for this answer! I am using your method to get taxids for RefSeq assemblies. My problem is that I have 140000 assembly IDs, so the process takes a very long time. Do you have a suggestion for speeding it up?
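One common way to cut the run time is to send accessions in batches instead of one request per ID. The sketch below is hedged and not tested against 140000 IDs. It splits the list into chunks, uploads each chunk with epost, and reads the taxid from the document summaries. The function name batch_taxids, its arguments and the chunk size of 500 are assumptions for illustration. The default field names AccessionVersion and TaxId are those of protein summaries. For the assembly database the fields are probably AssemblyAccession and Taxid, and whether epost accepts assembly accessions with -format acc should be checked on a small sample first.

```shell
# Sketch: resolve accessions in batches rather than one request per ID.
# Assumes EDirect (epost, esummary, xtract) is installed and on PATH.
# batch_taxids and its arguments are hypothetical, not part of EDirect.
batch_taxids() {
    local infile="$1"                              # one accession per line
    local db="${2:-protein}"                       # Entrez database
    local fields="${3:-AccessionVersion,TaxId}"    # docsum elements to print
    local size="${4:-500}"                         # accessions per request (assumed)
    local workdir chunk
    workdir=$(mktemp -d) || return 1
    # Drop carriage returns and blank lines, then cut into fixed-size chunks.
    tr -d '\r' < "$infile" | grep -v '^$' | split -l "$size" - "$workdir/chunk."
    for chunk in "$workdir"/chunk.*; do
        [ -e "$chunk" ] || continue
        # One upload and one summary fetch per chunk.
        epost -db "$db" -format acc < "$chunk" |
            esummary |
            xtract -pattern DocumentSummary -element "$fields"
    done
    rm -rf "$workdir"
}
```

The output is tab-separated, one line per record returned. Accessions that NCBI does not recognize are simply absent from the output, so comparing it against the input list is a sensible check.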