Question

Extracting data from a local NCBI taxonomy database

8 months ago

Begonia_pavonina ▴ 200

I have downloaded taxonomic data from NCBI to a local file using the E-utilities esearch and efetch. Here is my R script:

search_term <- "'large[All Fields] AND subunit[All Fields] AND ribosomal[All Fields] AND diatoms[All Fields]'"
db <- "nucleotide"

# Searching queries IDs:
system2("esearch", args = c("-db", db, "-query", search_term, "|", "efetch", "-format", "uid", "|", "awk", "'{printf \"%s,\", $0}'", "|", "sed", "'s/,$//' > ../data/ids.txt"), wait = TRUE)

# Fetch taxonomy data for each sequence ID (read the IDs from the file written above):
system2("efetch", args = c("-db", "taxonomy", "-id", "$(cat ../data/ids.txt)", "-format", "xml", ">", "../data/taxonomy_data.xml"), wait = TRUE)

# Extract information from the xml file:
library(xml2)
library(magrittr)
doc <- read_xml("../data/taxonomy_data.xml")
tax_ids <- xml_find_all(doc, "//TaxId") %>% xml_text()
scientific_names <- xml_find_all(doc, "//ScientificName") %>% xml_text()
lineage_names <- xml_find_all(doc, "//Lineage") %>% xml_text()

# Make the final dataframe
taxonomy_data <- data.frame(TaxId = tax_ids, ScientificName = scientific_names, Lineage=lineage_names)

Unfortunately, I get an error at the last step because tax_ids, scientific_names and lineage_names do not contain the same number of entries. I guess this is due to awkward formatting of the NCBI taxonomy database.

I am trying to find a way to extract this information and replace missing values in the "ScientificName" and "Lineage" fields with "NaN", but I am really not good at handling XML files. Does anyone have advice on how to do this?

R NCBI xml E-utilities Blast • 741 views

updated 8 months ago by josev.die ▴ 70 • written 8 months ago by Begonia_pavonina ▴ 200

Non-R solutions for future visitors to this thread:

converting taxID to taxonomy
Retrieve species name using taxaIDs of NCBI

8 months ago by GenoMax 148k

score 0 · Answer 1 · 2024-04-01

Actually, it was not that complicated. A simple loop over the Taxon nodes does the job:

# Read the XML file
library(xml2)
taxonomy <- read_xml("../data/taxonomy_data.xml")

# Find all Taxon nodes
taxon_nodes <- xml_find_all(taxonomy, "//Taxon")

# Initialize an empty list to store the data
taxonomy_list <- list()

# Iterate over each Taxon node
for (node in taxon_nodes) {
  # Extract TaxId, ScientificName, and Lineage for the current Taxon node
  tax_id <- xml_text(xml_find_first(node, ".//TaxId"))
  scientific_name <- xml_text(xml_find_first(node, ".//ScientificName"))
  lineage <- xml_text(xml_find_first(node, ".//Lineage"))

  # Append the extracted values to the taxonomy_list
  taxonomy_list <- c(taxonomy_list, list(list(TaxId = tax_id, 
                                              ScientificName = scientific_name, 
                                              Lineage = lineage)))
}

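# Convert the list of lists into a data frame (one row per Taxon node);
# do.call(rbind, ...) alone returns a character matrix, hence as.data.frame()
taxonomy_data <- as.data.frame(do.call(rbind, lapply(taxonomy_list, unlist)))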
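One thing to watch for with "//Taxon" and "//TaxId": in the taxonomy XML returned by efetch, each record carries a LineageEx block holding nested Taxon elements (with their own TaxId and ScientificName) for every ancestor. Those nested elements are matched too, which is a likely reason the vectors in the question have different lengths, and they show up in the loop above as extra rows with a missing Lineage. The sketch below uses a tiny made-up XML string (xml_txt and the taxon 999999 are invented for illustration, not taken from the real file) and selects only the top-level records, assuming the root element is TaxaSet:

```r
library(xml2)

# Made-up miniature of an efetch taxonomy response (illustration only)
xml_txt <- '<TaxaSet>
  <Taxon>
    <TaxId>2856</TaxId>
    <ScientificName>Cylindrotheca closterium</ScientificName>
    <Lineage>cellular organisms; Eukaryota</Lineage>
    <LineageEx>
      <Taxon><TaxId>131567</TaxId><ScientificName>cellular organisms</ScientificName></Taxon>
      <Taxon><TaxId>2759</TaxId><ScientificName>Eukaryota</ScientificName></Taxon>
    </LineageEx>
  </Taxon>
  <Taxon>
    <TaxId>999999</TaxId>
    <ScientificName>Hypothetical taxon</ScientificName>
  </Taxon>
</TaxaSet>'
doc <- read_xml(xml_txt)

# "//" also matches the nested ancestor entries, so the counts disagree
length(xml_find_all(doc, "//TaxId"))    # 4
length(xml_find_all(doc, "//Lineage"))  # 1

# Select only the top-level records, then read direct children of each.
# xml_find_first() returns one result per input node, and xml_text() gives
# NA where the child is absent, so the columns always line up.
top <- xml_find_all(doc, "/TaxaSet/Taxon")
taxonomy_data <- data.frame(
  TaxId          = xml_text(xml_find_first(top, "./TaxId")),
  ScientificName = xml_text(xml_find_first(top, "./ScientificName")),
  Lineage        = xml_text(xml_find_first(top, "./Lineage")),
  stringsAsFactors = FALSE
)
```

Here taxonomy_data has two rows, and the second row has NA in Lineage because that record has no Lineage child.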

score 0 · Answer 2 · 2024-04-14

I have modified my previous function get_taxonomy, which takes a vector of taxonomy ids, to also return the lineage. The approach is:

1. get the "nucleotide" ids from a given search
2. link those nucleotide ids to the "taxonomy" database (taxonomy ids)
3. get the taxonomy lineages for those taxonomy ids

The following example works on the first 20 nucleotide ids returned by the search (the default retmax of entrez_search) that have a link into the taxonomy database.

# Dependencies
library(rentrez)
search_term <- "large[All Fields] AND subunit[All Fields] AND ribosomal[All Fields] AND diatoms[All Fields]"
db <- "nucleotide"

# nucleotide ids
esearch <- entrez_search(db = 'nuccore', term = search_term)

# convert nucleotide ids to taxonomy ids
etax <- entrez_link(dbfrom = 'nuccore', db = 'taxonomy', id = esearch$ids)
ids <- etax$links$nuccore_taxonomy

# programmatic access to Entrez: load the helper functions from the script below
source("https://raw.githubusercontent.com/jdieramon/my_scripts/master/byRequest/functions.R")

# extract data as tibble 
get_taxonomy_lineage(ids)

# A tibble: 17 × 3
   tax_id  species                                             lineage
   <chr>   <chr>                                               <chr>
 1 2873770 Tabellaria sp.                                      cellular organisms; Eukaryota; Sar; Stramenopiles; Ochrophyta; Bacill…
 2 2728209 Melosira sp.                                        cellular organisms; Eukaryota; Sar; Stramenopiles; Ochrophyta; Bacill…
 3 1891026 Thalassiosira sp.                                   cellular organisms; Eukaryota; Sar; Stramenopiles; Ochrophyta; Bacill…
 4 1884248 Nitzschia sp. (in: diatoms)                         cellular organisms; Eukaryota; Sar; Stramenopiles; Ochrophyta; Bacill…
 5 1881117 Nitzschia traheaformis                              cellular organisms; Eukaryota; Sar; Stramenopiles; Ochrophyta; Bacill…
 6 1763363 cyanobacterium endosymbiont of Rhopalodia gibberula cellular organisms; Bacteria; Terrabacteria group; Cyanobacteriota/Me…
 7 1526603 Surirella sp.                                       cellular organisms; Eukaryota; Sar; Stramenopiles; Ochrophyta; Bacill…
 8 1357546 Richelia sinica                                     cellular organisms; Bacteria; Terrabacteria group; Cyanobacteriota/Me…
 9 370345  Batillaria attramentaria                            cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bila…
10 210441  Asterionella formosa                                cellular organisms; Eukaryota; Sar; Stramenopiles; Ochrophyta; Bacill…
11 186043  Entomoneis sp.                                      cellular organisms; Eukaryota; Sar; Stramenopiles; Ochrophyta; Bacill…
12 186031  Nitzschia communis                                  cellular organisms; Eukaryota; Sar; Stramenopiles; Ochrophyta; Bacill…
13 185980  Fragilaria capucina                                 cellular organisms; Eukaryota; Sar; Stramenopiles; Ochrophyta; Bacill…
14 79200   Daucus carota                                       cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptoph…
15 35126   Cyclotella sp.                                      cellular organisms; Eukaryota; Sar; Stramenopiles; Ochrophyta; Bacill…
16 2856    Cylindrotheca closterium                            cellular organisms; Eukaryota; Sar; Stramenopiles; Ochrophyta; Bacill…
17 2854    Cylindrotheca sp.                                   cellular organisms; Eukaryota; Sar; Stramenopiles; Ochrophyta; Bacill…
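For readers who prefer not to source a remote script, the three steps of the approach (search, link, fetch) can be sketched with rentrez calls alone. This is only a guess at the general shape of such a helper, not the code behind get_taxonomy_lineage: the function name fetch_taxonomy_xml and the retmax default of 100 are hypothetical choices made for this example, and the function needs network access to NCBI when it is actually called.

```r
# Hypothetical helper (not part of rentrez, not the script sourced above):
# search nuccore, link the hits to taxonomy, and return the raw taxonomy
# XML as a single character string.
fetch_taxonomy_xml <- function(term, retmax = 100) {
  # Step 1: nucleotide ids matching the search term
  hits <- rentrez::entrez_search(db = "nuccore", term = term, retmax = retmax)

  # Step 2: nucleotide ids -> taxonomy ids (several hits may share a taxon)
  links <- rentrez::entrez_link(dbfrom = "nuccore", db = "taxonomy",
                                id = hits$ids)
  tax_ids <- unique(links$links$nuccore_taxonomy)
  if (length(tax_ids) == 0) {
    stop("No taxonomy links found for this search term")
  }

  # Step 3: taxonomy records for those taxonomy ids, as XML text
  rentrez::entrez_fetch(db = "taxonomy", id = tax_ids, rettype = "xml")
}

# Usage (requires network access, so it is left commented out):
# xml_string <- fetch_taxonomy_xml(search_term)
# taxonomy   <- xml2::read_xml(xml_string)
```

Raising retmax lifts the 20-hit default of entrez_search; for very long id lists the requests may need to be split into batches, which this sketch does not attempt.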