There is a rankedlineage.dmp file in the new_taxdump directory at NCBI that has these ranks.
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz
tar -zxvf new_taxdump.tar.gz
Here's one way to format the file using R and the tidyverse packages (you'll still need to figure out how to replace the header). The dmp file is tab- and pipe-separated, so reading it as tab-separated leaves a column of pipes after every value; skip those columns using - in col_types.
library(tidyverse)
# "i" = integer id, "c" = character rank column, "-" = skip the pipe column
tax <- read_tsv("rankedlineage.dmp",
                col_names = c("id", "name", "s", "g", "f", "o", "c", "p", "k", "d"),
                col_types = "i-c-c-c-c-c-c-c-c-c-")
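To see why every second column is skipped, here is a small base R sketch that splits one record by hand. The sample line is rebuilt from the Bromus inermis lineage (tax id 15371) for illustration, assuming fields are separated by tab, pipe, tab and the line ends with tab, pipe; it is not copied from the real file.

```r
# One rankedlineage.dmp-style record, rebuilt by hand for illustration.
# Assumption: fields are joined by "\t|\t" and the line ends with "\t|".
fields_in <- c("15371", "Bromus inermis", "", "Bromus", "Poaceae", "Poales",
               "Liliopsida", "Streptophyta", "Viridiplantae", "Eukaryota")
line <- paste0(paste(fields_in, collapse = "\t|\t"), "\t|")

# Splitting on tabs alone gives 20 columns: 10 values, each followed by "|".
by_tab <- strsplit(line, "\t", fixed = TRUE)[[1]]
length(by_tab)

# The even columns hold only the pipes; these are what "-" skips in col_types.
by_tab[c(FALSE, TRUE)]

# The odd columns are the 10 values we actually want.
values <- by_tab[c(TRUE, FALSE)]
names(values) <- c("id", "name", "s", "g", "f", "o", "c", "p", "k", "d")
values
```

The empty species field shows up as an empty string here; read_tsv turns it into NA, which is what the later filter(!is.na(value)) step relies on.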
Next, gather columns 2 to 10 into two columns with the column name and value, remove NAs, and unite into a single column.
x2 <- gather(tax, "key", "value", 2:10) %>%
  filter(!is.na(value)) %>%
  unite(2:3, col = "name", sep = "=")
x2
# A tibble: 11,697,321 x 2
       id name
    <int> <chr>
1       1 name=root
2  131567 name=cellular organisms
3    2157 name=Archaea
4 1935183 name=Asgard group
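As a side note, gather() still works, but tidyr now recommends pivot_longer() for new code. A minimal sketch of the same reshaping step is below; the toy `tax` table is a stand-in built from two lineages in this thread, not the real file. One difference to be aware of: pivot_longer() keeps the rows of each id together, whereas gather() stacks the table one column at a time, so the row order of x2 will differ.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for `tax`: two records taken from the lineages in this thread.
tax <- tibble(
  id   = c(15371L, 669613L),
  name = c("Bromus inermis", "Koenigia alaskana"),
  s    = NA_character_,
  g    = c("Bromus", "Koenigia"),
  f    = c("Poaceae", "Polygonaceae"),
  o    = c("Poales", "Caryophyllales"),
  c    = c("Liliopsida", NA),
  p    = "Streptophyta",
  k    = "Viridiplantae",
  d    = "Eukaryota"
)

# Reshape every column except id into key/value pairs, dropping NAs on the
# way, then glue key and value together as "key=value".
x2 <- tax %>%
  pivot_longer(-id, names_to = "key", values_to = "value",
               values_drop_na = TRUE) %>%
  unite("name", key, value, sep = "=")
x2
```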
Finally, group by tax id to get the lineage. This groups about 12 million rows, so it will take a while; I did not time it, but maybe 5-10 minutes.
tax2 <- group_by(x2, id) %>%
  summarize(lineage = paste(name, collapse = "; "))
filter(tax2, id %in% c(669613, 15371) )
# A tibble: 2 x 2
      id lineage
   <int> <chr>
1  15371 name=Bromus inermis; g=Bromus; f=Poaceae; o=Poales; c=Liliopsida; p=Streptophyta; k=Viridiplantae; d=Eukaryota
2 669613 name=Koenigia alaskana; g=Koenigia; f=Polygonaceae; o=Caryophyllales; p=Streptophyta; k=Viridiplantae; d=Eukaryota
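If the grouping step is too slow, one possible alternative is to build the lineage string directly on the wide table, so nothing has to be grouped at all. This is only a sketch on a toy `tax` table (two records from this thread); I have not timed it on the full file. It assumes dplyr 1.0 or later for across() and cur_column(), and tidyr 1.0 or later for the na.rm argument of unite().

```r
library(dplyr)
library(tidyr)

# Toy stand-in for `tax`: two records taken from the lineages in this thread.
tax <- tibble(
  id   = c(15371L, 669613L),
  name = c("Bromus inermis", "Koenigia alaskana"),
  s    = NA_character_,
  g    = c("Bromus", "Koenigia"),
  f    = c("Poaceae", "Polygonaceae"),
  o    = c("Poales", "Caryophyllales"),
  c    = c("Liliopsida", NA),
  p    = "Streptophyta",
  k    = "Viridiplantae",
  d    = "Eukaryota"
)

tax2 <- tax %>%
  # Prefix each non-missing value with its column name, e.g. "g=Bromus".
  mutate(across(-id, ~ if_else(is.na(.x), NA_character_,
                               paste0(cur_column(), "=", .x)))) %>%
  # Paste the columns of each row together, skipping the NAs.
  unite("lineage", -id, sep = "; ", na.rm = TRUE)

filter(tax2, id %in% c(669613, 15371))
```

On the toy table this gives the same two lineage strings as the group_by() version.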
Hi Chris,
Good to know about this rankedlineage.dmp file. It must be much faster than what I suggested.
Thank you! This saved me heaps of time.
Hi mariahaguiar001,
I have the same challenge of replacing the headers with the full taxonomic lineage, but I have accession IDs instead of NCBI tax IDs. Did you find a non-manual way to replace the headers?
First you have to convert the accessions to NCBI tax IDs, then you can get the full lineage using the following R function
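One possible way to do the conversion (a sketch only, not necessarily the function meant above) is to join your accessions against an accession-to-taxid table and then against the lineage table. As far as I know, NCBI's accession2taxid files have the columns accession, accession.version, taxid and gi; the toy `acc2tax` table below mimics two of those columns, and its accessions are made up for illustration.

```r
library(dplyr)

# Toy stand-in for an accession2taxid table. The accessions are made up;
# only the column names follow the NCBI layout (an assumption).
acc2tax <- tibble(
  accession.version = c("ACC00001.1", "ACC00002.1"),
  taxid = c(15371L, 669613L)
)

# Toy stand-in for `tax2`, using the two lineages shown in this thread.
tax2 <- tibble(
  id = c(15371L, 669613L),
  lineage = c(
    "name=Bromus inermis; g=Bromus; f=Poaceae; o=Poales; c=Liliopsida; p=Streptophyta; k=Viridiplantae; d=Eukaryota",
    "name=Koenigia alaskana; g=Koenigia; f=Polygonaceae; o=Caryophyllales; p=Streptophyta; k=Viridiplantae; d=Eukaryota"
  )
)

# Hypothetical sequence headers, identified by accession.
headers <- tibble(accession.version = c("ACC00002.1", "ACC00001.1"))

# Accession -> tax id -> lineage, keeping the order of the headers.
result <- headers %>%
  left_join(acc2tax, by = "accession.version") %>%
  left_join(tax2, by = c("taxid" = "id"))
result
```

Accessions that are missing from the lookup table come out with NA in taxid and lineage, so they are easy to spot.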
Hi Chris, this is super helpful. I have a .txt file with close to 1000 taxon IDs. How do I filter the x2 table (tibble? sorry, new to tidyverse) based on the list of taxon IDs in another file? Thanks, Ewelina
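For the question above, a minimal sketch, assuming the .txt file has one taxon ID per line and no header: read the IDs into a vector and use %in% inside filter(), just like the two-ID example earlier. The temporary file and the small `tax2` table here are stand-ins so the snippet runs on its own; the same filter() call should work on x2.

```r
library(dplyr)

# Small stand-in for the lineage table (rows taken from this thread).
tax2 <- tibble(
  id = c(1L, 15371L, 669613L),
  lineage = c(
    "name=root",
    "name=Bromus inermis; g=Bromus; f=Poaceae; o=Poales; c=Liliopsida; p=Streptophyta; k=Viridiplantae; d=Eukaryota",
    "name=Koenigia alaskana; g=Koenigia; f=Polygonaceae; o=Caryophyllales; p=Streptophyta; k=Viridiplantae; d=Eukaryota"
  )
)

# Hypothetical ID file: one taxon ID per line, no header.
id_file <- tempfile(fileext = ".txt")
writeLines(c("15371", "1"), id_file)

# Read the IDs as integers so they match the type of the id column.
ids <- as.integer(readLines(id_file))

# Keep only the rows whose id appears in the file.
wanted <- filter(tax2, id %in% ids)
wanted
```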