Hi Biostars,
I'm trying to intersect some taxonomy datasets and have run into an issue with missing data. For example, taxid 106734, Chelonoidis abingdonii, is an island turtle and belongs to the class Reptilia; however, the class is missing from its lineage in the NCBI taxonomy. Does anyone know of other taxonomy references that might be more complete? I have a list of ~700k proteins and, of those, ~30k are missing at least one taxonomic classification, even though all of them have taxids that are found in NCBI.
In the example below I would hope to see 'Reptilia' as the class, but it isn't there. Does anyone know of another place to look for this?
$ cat fullnamelineage.dmp |grep 106734
106734 | Chelonoidis abingdonii | cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Sauropsida; Sauria; Archelosauria; Testudinata; Testudines; Cryptodira; Durocryptodira; Testudinoidea; Testudinidae; Chelonoidis; Chelonoidis nigra species complex; |
$ cat rankedlineage.dmp |grep 106734
106734 | Chelonoidis abingdonii | | Chelonoidis | Testudinidae | Testudines | | Chordata | Metazoa | Eukaryota |
And here is something slightly more readable via taxonkit:
$ echo 106734 | taxonkit lineage -t | csvtk cut -Ht -f 3 | csvtk unfold -Ht -f 1 -s ";" | taxonkit lineage -r -n -L | csvtk cut -Ht -f 1,3,2 | csvtk pretty -H -t
131567 no rank cellular organisms
2759 superkingdom Eukaryota
33154 clade Opisthokonta
33208 kingdom Metazoa
6072 clade Eumetazoa
33213 clade Bilateria
33511 clade Deuterostomia
7711 phylum Chordata
89593 subphylum Craniata
7742 clade Vertebrata
7776 clade Gnathostomata
117570 clade Teleostomi
117571 clade Euteleostomi
8287 superclass Sarcopterygii
1338369 clade Dipnotetrapodomorpha
32523 clade Tetrapoda
32524 clade Amniota
8457 clade Sauropsida
32561 clade Sauria
1329799 clade Archelosauria
2841271 subclass Testudinata
8459 order Testudines
8464 suborder Cryptodira
1579337 clade Durocryptodira
8486 superfamily Testudinoidea
8487 family Testudinidae
904181 genus Chelonoidis
1137846 no rank Chelonoidis nigra species complex
106734 species Chelonoidis abingdonii
Can you describe the exact analysis you are doing and how the absence of the class designation is affecting it? There are enough other classification categories that you could potentially use instead. It is possible that what you see is an oversight in the taxonomy database, and you could write to the NCBI help desk to see if it can be corrected.
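As a rough illustration of using another category when class is absent, here is a minimal Python sketch. It assumes you already have the lineage as (rank, name) pairs, for example parsed from the taxonkit output above; the function name pick_group and the fallback order (class, then subclass, superclass, order) are hypothetical choices for illustration, not anything NCBI or taxonkit prescribes.

```python
def pick_group(lineage, preferred=("class", "subclass", "superclass", "order")):
    """Return (rank, name) for the first rank in `preferred` found in `lineage`.

    `lineage` is a list of (rank, name) pairs ordered from root to leaf.
    Returns None when none of the preferred ranks is present.
    """
    by_rank = {}
    for rank, name in lineage:
        # Ranks such as "clade" or "no rank" repeat; keep the first one seen.
        by_rank.setdefault(rank, name)
    for rank in preferred:
        if rank in by_rank:
            return rank, by_rank[rank]
    return None


# Ranked entries copied from the taxonkit output for taxid 106734.
turtle_lineage = [
    ("superkingdom", "Eukaryota"),
    ("kingdom", "Metazoa"),
    ("phylum", "Chordata"),
    ("subphylum", "Craniata"),
    ("superclass", "Sarcopterygii"),
    ("clade", "Sauropsida"),
    ("subclass", "Testudinata"),
    ("order", "Testudines"),
    ("family", "Testudinidae"),
    ("genus", "Chelonoidis"),
    ("species", "Chelonoidis abingdonii"),
]

print(pick_group(turtle_lineage))  # ('subclass', 'Testudinata')
```

Note that groups built this way mix ranks (some are classes, some subclasses), so whether that is acceptable depends on what the split is for.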
Here's some R code run over the entirety of rankedlineage.dmp showing the missing data.
I'm doing an MSA of protein regions from diamond blastp and would like to split the sequences at different points in the classification (prior to the MSA). About 5% of the sequences (~33k/~700k) are missing a taxonomic classification of some type, most often class, which surprised me given that there is no reason why this should be missing. I guess I'll write to NCBI, but if there is another way to fill in this missing data I'd jump on it.
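For anyone wanting to run a similar tally without R, here is a rough Python sketch that counts empty rank columns in rankedlineage.dmp. It is not the R code referred to above. The column order is an assumption that matches the ten fields visible in the grep output earlier in the thread (tax_id, tax_name, species, genus, family, order, class, phylum, kingdom, superkingdom), so check it against the readme shipped with your taxdump. The function names are hypothetical, and the species column is skipped because it is empty for species-level entries by design.

```python
from collections import Counter

# Assumed column order; verify against the readme in your taxdump download.
RANKED_COLUMNS = [
    "tax_id", "tax_name", "species", "genus", "family",
    "order", "class", "phylum", "kingdom", "superkingdom",
]

# Ranks to check; "species" is left out on purpose (see note above).
CHECKED_RANKS = ("genus", "family", "order", "class",
                 "phylum", "kingdom", "superkingdom")


def parse_ranked_line(line):
    """Split one rankedlineage.dmp line into a dict keyed by column name."""
    fields = [field.strip() for field in line.rstrip("\n").split("|")]
    if fields and fields[-1] == "":
        fields = fields[:-1]  # drop the empty piece after the trailing "|"
    return dict(zip(RANKED_COLUMNS, fields))


def missing_ranks(record, ranks=CHECKED_RANKS):
    """Return the checked ranks whose column is empty for this record."""
    return [rank for rank in ranks if not record.get(rank)]


def count_missing(lines):
    """Tally how many records lack each checked rank."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts.update(missing_ranks(parse_ranked_line(line)))
    return counts


def count_missing_in_file(path="rankedlineage.dmp"):
    """Convenience wrapper: run the tally over a file on disk."""
    with open(path, encoding="utf-8") as handle:
        return count_missing(handle)
```

Calling count_missing_in_file() from the directory holding the dump returns a Counter keyed by rank name. Restricting the input to the taxids in your own protein list gives the per-dataset numbers instead.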