Try TaxonKit (Cross-platform and Efficient NCBI Taxonomy Toolkit) with the lineage subcommand (see its usage page), which queries the full lineage of the taxids given in a file.
TaxonKit is a command-line tool written in Go.
Executable binaries for the most popular operating systems are freely available on the download page.
Just download the compressed executable for your operating system, uncompress it, and run it.
It's very fast!
NCBI taxonomy data is needed: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
Example data:
$ cat t.taxid
349741
834
Query lineage:
$ taxonkit lineage --nodes nodes.dmp --names names.dmp t.taxid
349741 cellular organisms;cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
834 cellular organisms;cellular organisms;Bacteria;FCB group;Fibrobacteres;Fibrobacteria;Fibrobacterales;Fibrobacteraceae;Fibrobacter;Fibrobacter succinogenes;Fibrobacter succinogenes subsp. succinogenes
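To make clear what a lineage query involves, here is a minimal sketch that walks parent links by hand with awk. It is not how TaxonKit is implemented, only an illustration under assumptions: the taxids 1, 2, 3, the name Examplephylum, and the helper lineage_of are made up, and the toy nodes.dmp keeps only the first three columns of the real file.

```shell
# Toy files in the taxdump layout: fields separated by "<tab>|<tab>",
# lines terminated by "<tab>|". Taxids and names are illustrative only.
tmpdir=$(mktemp -d)
printf '1\t|\t1\t|\tno rank\t|\n2\t|\t1\t|\tsuperkingdom\t|\n3\t|\t2\t|\tphylum\t|\n' > "$tmpdir/nodes.dmp"
printf '1\t|\troot\t|\t\t|\tscientific name\t|\n2\t|\tBacteria\t|\t\t|\tscientific name\t|\n3\t|\tExamplephylum\t|\t\t|\tscientific name\t|\n' > "$tmpdir/names.dmp"

# Hypothetical helper: print "<taxid><tab><lineage>" for one taxid.
lineage_of() {
  awk -v q="$1" 'BEGIN { FS = "\t[|]\t" }
    { sub(/\t[|]$/, "") }                      # drop the line terminator
    FNR == NR {                                # first file: names.dmp
      if ($4 == "scientific name") name[$1] = $2
      next
    }
    { parent[$1] = $2 }                        # second file: nodes.dmp
    END {
      if (!(q in parent)) exit 1               # unknown taxid
      t = q; path = ""
      while (parent[t] != t) {                 # the root is its own parent
        path = (path == "" ? name[t] : name[t] ";" path)
        t = parent[t]
      }
      print q "\t" path
    }' "$tmpdir/names.dmp" "$tmpdir/nodes.dmp"
}

result=$(lineage_of 3)
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

On the real taxdump files this approach rereads both files for every query, so it is only useful for understanding the data layout; for actual work the taxonkit commands shown here are the better choice.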
A QIIME-like format can be obtained with the -f flag:
$ taxonkit lineage --nodes nodes.dmp --names names.dmp -f t.taxid
349741 k__Bacteria;p__Verrucomicrobia;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia muciniphila
834 k__Bacteria;p__Fibrobacteres;c__Fibrobacteria;o__Fibrobacterales;f__Fibrobacteraceae;g__Fibrobacter;s__Fibrobacter succinogenes;S__Fibrobacter succinogenes subsp. succinogenes
You can also extract custom rank levels with reformat (see its usage page).
The default format is {k};{p};{c};{o};{f};{g};{s}:
$ taxonkit lineage --nodes nodes.dmp --names names.dmp t.taxid | cut -f 2 | taxonkit reformat | cut -f 2
Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
Bacteria;Fibrobacteres;Fibrobacteria;Fibrobacterales;Fibrobacteraceae;Fibrobacter;Fibrobacter succinogenes
The file 'names.dmp' has four columns. The first column is the taxid, the second column is a name, and the fourth column is the class of the name. A taxid may have several names assigned, each tagged with a name class. Every taxid has exactly one name of class 'scientific name', while the other classes are optional. Thus you can restrict your search to lines having 'scientific name' in the fourth column. Compare the output of these two awk searches:
awk -F '|' '$1==9606' names.dmp
awk -F '|' '$1==9606 && $4~/scientific name/' names.dmp
Unfortunately, 'names.dmp' is a bit nasty to parse because of the abundant and unnecessary whitespace in it.
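One way around the whitespace is to treat the whole tab-pipe-tab sequence as the field separator and strip the tab-pipe that ends each line. The sketch below assumes that layout and uses a small made-up sample instead of the real file; the variable names sample and clean are only for illustration.

```shell
# Small names.dmp-style sample: fields separated by "<tab>|<tab>",
# each line terminated by "<tab>|". Rows are illustrative only.
sample=$(mktemp)
printf '9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n' >  "$sample"
printf '9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n'    >> "$sample"
printf '834\t|\tFibrobacter succinogenes subsp. succinogenes\t|\t\t|\tscientific name\t|\n' >> "$sample"

# Split on the full delimiter, remove the line terminator, and print
# taxid and scientific name as clean tab-separated columns.
clean=$(awk 'BEGIN { FS = "\t[|]\t"; OFS = "\t" }
  { sub(/\t[|]$/, "") }
  $4 == "scientific name" { print $1, $2 }' "$sample")
printf '%s\n' "$clean"
rm -f "$sample"
```

With the separator handled this way, the fourth column can be compared exactly with == rather than matched with a regular expression, and the fields carry no stray tabs.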