missing data in taxonomy
8 weeks ago
noodle ▴ 640

Hi Biostars,

I'm trying to intersect some taxonomy datasets and have run into missing data. For example, taxid 106734, Chelonoidis abingdonii, is an island turtle that belongs to the class Reptilia, yet the NCBI taxonomy gives it no class. Does anyone know of other taxonomy references that might be more complete? I have a list of ~700k proteins, and ~30k of them are missing at least one taxonomic rank, even though every one has a taxid that is found in NCBI.

In the example below I would hope to see 'Reptilia' as the class, but it isn't there. Does anyone know of another place to look for this?

$ cat fullnamelineage.dmp |grep 106734
106734  |       Chelonoidis abingdonii  |       cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Sauropsida; Sauria; Archelosauria; Testudinata; Testudines; Cryptodira; Durocryptodira; Testudinoidea; Testudinidae; Chelonoidis; Chelonoidis nigra species complex;   |
$ cat rankedlineage.dmp |grep 106734
106734  |       Chelonoidis abingdonii  |               |       Chelonoidis     |       Testudinidae    |       Testudines      |               |       Chordata        |       Metazoa |       Eukaryota       |

And here is something slightly more readable via taxonkit:

$ echo 106734 \
    | taxonkit lineage -t \
    | csvtk cut -Ht -f 3 \
    | csvtk unfold -Ht -f 1 -s ";" \
    | taxonkit lineage -r -n -L \
    | csvtk cut -Ht -f 1,3,2 \
    | csvtk pretty -H -t
131567    no rank        cellular organisms               
2759      superkingdom   Eukaryota                        
33154     clade          Opisthokonta                     
33208     kingdom        Metazoa                          
6072      clade          Eumetazoa                        
33213     clade          Bilateria                        
33511     clade          Deuterostomia                    
7711      phylum         Chordata                         
89593     subphylum      Craniata                         
7742      clade          Vertebrata                       
7776      clade          Gnathostomata                    
117570    clade          Teleostomi                       
117571    clade          Euteleostomi                     
8287      superclass     Sarcopterygii                    
1338369   clade          Dipnotetrapodomorpha             
32523     clade          Tetrapoda                        
32524     clade          Amniota                          
8457      clade          Sauropsida                       
32561     clade          Sauria                           
1329799   clade          Archelosauria                    
2841271   subclass       Testudinata                      
8459      order          Testudines                       
8464      suborder       Cryptodira                       
1579337   clade          Durocryptodira                   
8486      superfamily    Testudinoidea                    
8487      family         Testudinidae                     
904181    genus          Chelonoidis                      
1137846   no rank        Chelonoidis nigra species complex
106734    species        Chelonoidis abingdonii 

"I'm trying to intersect some taxonomy datasets"

Can you describe the exact analysis you are doing and how the absence of the class designation affects it? There are enough other classification categories that you could potentially use one of them instead.

It is possible that what you see is an oversight in the taxonomy database, in which case you could write to the NCBI help desk and ask whether it can be corrected.
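If one of the other classification categories would do, a fallback can be read straight from the full lineage. Below is a minimal Python sketch, assuming the lineage is already available as (taxid, rank, name) rows like the taxonkit table in the question. The function name `class_like_label` and the fallback order in `FALLBACK_RANKS` are my own assumptions for illustration, not an NCBI or taxonkit convention.

```python
# Assumed fallback order: try "class" first, then neighbouring ranks.
# This order is a hypothetical choice; adjust it to suit the analysis.
FALLBACK_RANKS = ("class", "subclass", "infraclass", "superclass")


def class_like_label(lineage, ranks=FALLBACK_RANKS):
    """Return (rank, name) for the first rank in `ranks` present in `lineage`.

    `lineage` is a list of (taxid, rank, name) tuples ordered root first.
    Returns (None, None) when none of the requested ranks is present.
    """
    by_rank = {}
    for _taxid, rank, name in lineage:
        # "clade" and "no rank" occur many times; keep the first name per rank.
        by_rank.setdefault(rank, name)
    for rank in ranks:
        if rank in by_rank:
            return rank, by_rank[rank]
    return None, None


# A subset of the rows shown for taxid 106734 in the question.
lineage_106734 = [
    (7711, "phylum", "Chordata"),
    (8287, "superclass", "Sarcopterygii"),
    (2841271, "subclass", "Testudinata"),
    (8459, "order", "Testudines"),
    (8487, "family", "Testudinidae"),
    (106734, "species", "Chelonoidis abingdonii"),
]

print(class_like_label(lineage_106734))  # ('subclass', 'Testudinata')
```

Returning the rank together with the name keeps it visible downstream that the label is a substitute and not a true class.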


Here's some R code run over the entirety of rankedlineage.dmp, showing the fraction of rows with an empty value (TRUE) at each rank.

> library(data.table)
> rankedlineage <- fread("rankedlineage.dmp",sep="|", quote="\t")
> rankedlineage <- rankedlineage[,1:10]
> colnames(rankedlineage) <- c("tax_id","tax_name","species","genus","family","order","class","phylum","kingdom","superkingdom")
> table((rankedlineage$family==""))/dim(rankedlineage)[1]

    FALSE      TRUE 
0.8694066 0.1305934 
> table((rankedlineage$order==""))/dim(rankedlineage)[1]

     FALSE       TRUE 
0.91729411 0.08270589 
> table((rankedlineage$class==""))/dim(rankedlineage)[1]

     FALSE       TRUE 
0.94146283 0.05853717 
> table((rankedlineage$phylum==""))/dim(rankedlineage)[1]

     FALSE       TRUE 
0.95156568 0.04843432 
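One caveat on these fractions: rankedlineage.dmp has a row for every node, so a row whose own rank sits at or above class would be expected to have an empty class column. A rough Python sketch for separating those rows from ones like taxid 106734, where a lower rank is filled but class is not, follows. It assumes fields are delimited by tab, pipe, tab with a trailing tab and pipe, and it uses the same ten column names as the R code above. The helper names are hypothetical, and the Chordata line is reconstructed for illustration and not copied from the dump.

```python
RANK_COLUMNS = ["tax_id", "tax_name", "species", "genus", "family",
                "order", "class", "phylum", "kingdom", "superkingdom"]


def parse_ranked_line(line):
    """Split one rankedlineage.dmp row into a dict keyed by RANK_COLUMNS."""
    text = line.rstrip("\n")
    if text.endswith("\t|"):
        text = text[:-2]  # drop the row terminator
    # zip() ignores any columns beyond the first ten.
    return dict(zip(RANK_COLUMNS, text.split("\t|\t")))


def class_gap_kind(row):
    """Classify a row by whether its empty class column looks like a real gap."""
    if row["class"]:
        return "has_class"
    if any(row[rank] for rank in ("species", "genus", "family", "order")):
        return "gap_below_class"  # a lower rank is filled, yet class is empty
    return "ambiguous"  # the node may itself sit at or above class level


# Row for taxid 106734, rebuilt from the grep output in the question.
turtle = "\t|\t".join(["106734", "Chelonoidis abingdonii", "", "Chelonoidis",
                       "Testudinidae", "Testudines", "", "Chordata",
                       "Metazoa", "Eukaryota"]) + "\t|\n"
# Illustrative row for a phylum-level node (assumed layout).
chordata = "\t|\t".join(["7711", "Chordata", "", "", "", "", "", "",
                         "Metazoa", "Eukaryota"]) + "\t|\n"

for line in (turtle, chordata):
    row = parse_ranked_line(line)
    print(row["tax_id"], class_gap_kind(row))
# 106734 gap_below_class
# 7711 ambiguous
```

Resolving the "ambiguous" rows properly would need each node's own rank, which rankedlineage.dmp does not carry.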

I'm doing an MSA of protein regions from DIAMOND blastp and would like to split the sequences at different levels of classification before running the MSA. About 5% of the sequences (~33k/~700k) are missing a taxonomic classification of some type, most often class, which surprised me because I see no reason why it should be missing. I guess I'll write to NCBI, but if there is another way to fill in this missing data I'd jump on it.
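To make the split concrete, here is a minimal Python sketch of what I mean, assuming each hit is a (sequence ID, taxid) pair and that rank names per taxid have already been loaded into a dict. The function `split_by_rank`, the "unassigned" bucket, and the protein IDs are hypothetical placeholders.

```python
from collections import defaultdict


def split_by_rank(hits, ranks_by_taxid, rank, missing_label="unassigned"):
    """Group sequence IDs by the name found at `rank` for each hit's taxid.

    Hits whose taxid is unknown or has an empty value at `rank` are placed
    in a separate `missing_label` bucket so they are not silently dropped.
    """
    groups = defaultdict(list)
    for seq_id, taxid in hits:
        label = ranks_by_taxid.get(taxid, {}).get(rank) or missing_label
        groups[label].append(seq_id)
    return dict(groups)


# Rank values for two taxids; 106734 has no class in rankedlineage.dmp.
ranks_by_taxid = {
    106734: {"order": "Testudines", "class": "", "phylum": "Chordata"},
    9606: {"order": "Primates", "class": "Mammalia", "phylum": "Chordata"},
}
# Hypothetical protein IDs paired with their taxids.
hits = [("prot_A", 106734), ("prot_B", 9606), ("prot_C", 9606)]

print(split_by_rank(hits, ranks_by_taxid, "class"))
# {'unassigned': ['prot_A'], 'Mammalia': ['prot_B', 'prot_C']}
print(split_by_rank(hits, ranks_by_taxid, "phylum"))
# {'Chordata': ['prot_A', 'prot_B', 'prot_C']}
```

The "unassigned" bucket is where the ~33k sequences end up at the class level, which is the part I'd like to resolve.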
