I have downloaded locally taxonomic data from NCBI using the E-utilities esearch and efetch. Here is my R script:
search_term <- "'large[All Fields] AND subunit[All Fields] AND ribosomal[All Fields] AND diatoms[All Fields]'"
db <- "nucleotide"
# Searching queries IDs:
system2("esearch", args = c("-db", db, "-query", search_term, "|", "efetch", "-format", "uid", "|", "awk", "'{printf \"%s,\", $0}'", "|", "sed", "'s/,$//' > ../data/ids.txt"), wait = TRUE)
# Fetch taxonomy data for each sequence ID:
system2("efetch", args = c("-db", "taxonomy", "-id", "$(cat ids.txt)", "-format", "xml", ">", "../data/taxonomy_data.xml"), wait = TRUE)
# Extract information from the xml file:
tax_ids <- xml_find_all(doc, "//TaxId") %>% xml_text()
scientific_names <- xml_find_all(doc, "//ScientificName") %>% xml_text()
lineage_names <- xml_find_all(doc, "//Lineage") %>% xml_text()
# Make the final dataframe
taxonomy_data <- data.frame(TaxId = tax_ids, ScientificName = scientific_names, Lineage=lineage_names)
Unfortunately I got an error at the last step, as the number of entries for tax_ids
, scientific_names
, lineage_names
are not the same. I guess it is due to an awkward formatting of the NCBI taxonomic database.
I am trying to find a way to extract these information, and replace missing values in "ScientificName" and "Lineage" fields per "NaN". But I am really not good managing xml files. Would anyone have an advice to do so?
Non-R solutions for future visitors to this thread :
converting taxID to taxonomy
Retrieve species name using taxaIDs of NCBI