I am trying to access the NCBI database from R by passing a command to system(..., intern = TRUE) and storing the result in a variable. The command is a query that retrieves the taxonomy for a given TaxID. It looks like this:
batch <- system(paste0("esearch -db taxonomy -query \"", taxid.as.string, " [taxID]\" | ",
                       "efetch -format xml | ",
                       "xtract -pattern Taxon -block \"*/Taxon\" -unless Rank -equals \"no rank\" ",
                       "-tab \"\t\" -element Rank,ScientificName"),
                intern = TRUE)
The variable taxid.as.string above is a one-element character vector and looks like:
> taxid.as.string
[1] "7070, 5741, 658858"
The command searches the NCBI database for the TaxIDs 7070, 5741, and 658858 and should return the taxonomy for each.
My problem is that it does not return the taxonomies in the order I supplied. Instead, it returns the results for TaxIDs 5741, 7070, and then 658858. I know that I can keep the vector numeric, loop over it, and make a single query at a time.
Why is this the case? Is it possible to keep the order of the result, even if some taxonomies are returned faster?
Thanks in advance!
This has nothing to do with R. It is the way eutils works and unless they expose an option to preserve order (I highly doubt they would do that, given that it goes against efficiency), there is no way to change this except, as you say, querying one-by-one.
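To make the one-by-one option concrete, here is a minimal R sketch rather than a tested recipe. It splits the comma-separated string into individual IDs and runs the same pipeline once per ID, so the list comes back in input order. The helper names split_taxids, build_cmd and fetch_in_order are made up for illustration, and the runner argument exists only so the loop can be tried without EDirect installed.

```r
# Split "7070, 5741, 658858" into c("7070", "5741", "658858").
split_taxids <- function(x) {
  ids <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  ids[nzchar(ids)]
}

# Build the esearch | efetch | xtract pipeline from the question
# for a single TaxID.
build_cmd <- function(taxid) {
  paste0("esearch -db taxonomy -query \"", taxid, " [taxID]\" | ",
         "efetch -format xml | ",
         "xtract -pattern Taxon -block \"*/Taxon\" ",
         "-unless Rank -equals \"no rank\" ",
         "-tab \"\t\" -element Rank,ScientificName")
}

# Run one query per ID; lapply() returns results in input order.
# `runner` is a hypothetical hook so the loop can be exercised offline.
fetch_in_order <- function(taxids,
                           runner = function(cmd) system(cmd, intern = TRUE)) {
  res <- lapply(taxids, function(id) runner(build_cmd(id)))
  names(res) <- taxids
  res
}

# Usage (needs EDirect on the PATH and network access):
# batch <- fetch_in_order(split_taxids("7070, 5741, 658858"))
```

The trade-off is the one already mentioned: order is guaranteed, but every ID costs a separate round trip to the server.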
I think you should be able to query it all together and sort the result once you get it.
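If each row of the batch output can be tagged with the TaxID it belongs to, restoring the input order afterwards is simple with match(). A minimal sketch, assuming the results have been parsed into a data frame with a taxid column; the function name, the column names and the example rows are hypothetical placeholders, not real query output.

```r
# Reorder batch results to follow the order of the queried IDs.
# `results` must carry a column naming the queried TaxID for each row.
reorder_by_query <- function(results, query_ids, id_col = "taxid") {
  # Position of each row's ID within the query vector.
  pos <- match(as.character(results[[id_col]]), as.character(query_ids))
  # order() leaves ties in their original relative order.
  results[order(pos), , drop = FALSE]
}

# Placeholder rows in the order the server happened to return them.
returned <- data.frame(
  taxid = c("5741", "7070", "658858"),
  label = c("second", "first", "third"),
  stringsAsFactors = FALSE
)
ordered <- reorder_by_query(returned, c("7070", "5741", "658858"))
```

This only works when the output carries the queried ID, which the -element Rank,ScientificName output in the question does not.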
How can the results be sorted by the input order if they only contain Rank and ScientificName?
If you don't have to do it in R, you can use bash similar to this post:
C: Using Entrez to find the taxonomy for an accession number
cat reads a file sequentially, so you can make a file with each taxid on a separate row. The file shouldn't contain any empty rows or any spaces after the taxids.
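For what it's worth, such an ID file can be produced from R as well. A small sketch, assuming the IDs arrive as the comma-separated string from the question; write_taxid_file is a made-up helper name and the file path is up to you.

```r
# Write one TaxID per line, with no blank lines and no trailing spaces.
write_taxid_file <- function(taxid_string, path) {
  ids <- trimws(strsplit(taxid_string, ",", fixed = TRUE)[[1]])
  ids <- ids[nzchar(ids)]  # drop empty entries, e.g. from a trailing comma
  writeLines(ids, path)
  invisible(ids)
}

# Example: write the three IDs from the question to a temporary file.
tmp <- tempfile(fileext = ".txt")
write_taxid_file("7070, 5741, 658858", tmp)
```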
I do not want to run this query line by line for every single taxid because it takes a lot of time; that is why I was querying in batches. Do you know how I can include the taxids in the results?
If I add TaxId to -element, I get the TaxId of every single taxonomy division, but I only want the TaxId of the queried taxon, i.e. the ones in taxid.as.string.
Can't you use grep or sed to filter the output?
Could you give me a sample taxid for this?
Your command does not preserve taxid order, and bash code to preserve taxid order will be cumbersome.
I didn't claim that my command preserves the order :) I'm just trying to understand the question better, and asked OP to provide an example where the command doesn't work. Also, the response in the other thread is actually bash code ;)