The two duplicates (VDO54167.1, VDM28456.1) are each returned only once. I would like to keep them as separate entries if possible. The order of the list is changed too, but that is only a minor issue.
# One esummary query per accession, so duplicates and input order are kept
for acc in $(cat file.txt); do
    esummary -db protein -id "${acc}" \
    | xtract -pattern DocumentSummary -element Caption,TaxId
done
This loop retains the order of the accessions as well. However, it will be slower than using epost, especially with a long list of accessions, because it makes one request per accession. Alternatively, you can first fetch the taxids with the epost method and then use Unix join to add the taxid to your original list. Perhaps something like this:
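A minimal sketch of that epost-plus-join idea, not a tested pipeline. The function names `fetch_taxids` and `add_taxids` and the file name `acc_taxid.tsv` are made up for illustration. `fetch_taxids` assumes EDirect is on the PATH with network access, and that epost accepts accessions on standard input (some EDirect versions may need `-format acc`).

```shell
# Fetch accession.version and taxid for the unique accessions in one batch.
# Assumption: epost reads one accession per line from stdin.
fetch_taxids() {
    # $1: file with one accession.version per line
    sort -u "$1" \
    | epost -db protein \
    | esummary \
    | xtract -pattern DocumentSummary -element AccessionVersion,TaxId
}

# Attach the taxid to every line of the original list, duplicates included.
# join needs both inputs sorted on the key, so the original order is lost.
add_taxids() {
    # $1: accession list, $2: tab-separated table (accession, taxid)
    tab=$(printf '\t')
    sorted_table=$(mktemp)
    LC_ALL=C sort "$2" > "$sorted_table"
    LC_ALL=C sort "$1" | LC_ALL=C join -t "$tab" - "$sorted_table"
    rm -f "$sorted_table"
}

# Usage (hypothetical file names):
#   fetch_taxids file.txt > acc_taxid.tsv
#   add_taxids file.txt acc_taxid.tsv
```

Because the table is fetched once for the unique accessions, the number of requests does not grow with the number of duplicates in the list.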
Note that I have used AccessionVersion instead of Caption in my xtract command to fetch the full accession.version string rather than just the accession. Also, the join method requires both inputs to be sorted, so it destroys the original order of the accessions, but there are other methods you can use to keep the order of your original data.
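One such order-preserving method can be sketched with awk, assuming a two-column, tab-separated table of accession.version and taxid has already been fetched. The file names `demo_list.txt`, `demo_table.tsv` and `annotated.tsv` are hypothetical, and the taxid values are placeholders, not real NCBI taxids.

```shell
# Hypothetical sample data: the list with duplicates, and a lookup table
# with placeholder taxids (in practice this comes from the xtract output).
printf 'VDO54167.1\nVDM28456.1\nVDO54167.1\nVDM28456.1\n' > demo_list.txt
printf 'VDM28456.1\tTAXID_B\nVDO54167.1\tTAXID_A\n' > demo_table.tsv

# First file (NR==FNR): load accession -> taxid into an array.
# Second file: print every line in its original order, duplicates
# included, with the matching taxid, or NA when the lookup has no entry.
awk -F'\t' -v OFS='\t' '
    NR == FNR { tax[$1] = $2; next }
    { print $1, (($1 in tax) ? tax[$1] : "NA") }
' demo_table.tsv demo_list.txt > annotated.tsv
```

No sorting is involved, so `annotated.tsv` has one line per line of the input list, in the same order.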
Thank you so much for this nice solution! Just out of curiosity, is it known why epost re-sorts the list and removes duplicates? I did not find anything about that in the documentation.
Thanks again, JD