Hi all,
I have many lists of taxids and I want to get the species names, along with the family if possible. The names.dmp file only shows the scientific name and sometimes there are multiple names for the same taxid, but I noticed that the taxonomy browser (https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi) always gives just one name. So is it using a different names.dmp file, or does it just give the first name hit for that taxid?
Furthermore, is there any file similar to names.dmp that gives family names for a taxid?
I should add that I have used the Python package ete3 to get lineages and scientific names from taxids, but the lineage output is so messy (it is not in neat columns for me to extract from, nor does it have clear headers) that I can't spend hours going through my hundreds of files to figure out which entry is the family name. For example, here is the lineage output I obtained from a small sample of my taxids:
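For what it's worth, the same taxdump archive that contains names.dmp also ships nodes.dmp, which records the parent taxid and the rank of every node, so a family can be found by walking up the parents until a node with rank "family" is reached. In names.dmp each taxid has exactly one row whose name class is "scientific name", which is probably why the browser shows a single name. Below is a minimal sketch under those assumptions; the function names and the toy rows (made-up taxids 1 to 40) are purely illustrative, and in real use the handles would be the opened nodes.dmp and names.dmp files.

```python
import io


def split_dmp(line):
    """Split one taxdump row: fields are separated by <tab>|<tab>, rows end with <tab>|."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_nodes(handle):
    """Read nodes.dmp-style rows into (parent, rank) dicts keyed by taxid."""
    parent, rank = {}, {}
    for line in handle:
        fields = split_dmp(line)
        taxid = int(fields[0])
        parent[taxid] = int(fields[1])
        rank[taxid] = fields[2]
    return parent, rank


def load_scientific_names(handle):
    """Read names.dmp-style rows, keeping only the 'scientific name' class."""
    names = {}
    for line in handle:
        fields = split_dmp(line)
        if fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names


def family_of(taxid, parent, rank, names):
    """Walk up the tree from taxid; return the family name, or None if there is none."""
    seen = set()
    while taxid in parent and taxid not in seen:
        if rank[taxid] == "family":
            return names.get(taxid)
        seen.add(taxid)
        taxid = parent[taxid]  # the root is its own parent, so 'seen' ends the loop
    return None


# Toy data with made-up taxids (NOT real NCBI ids), only the leading columns.
toy_nodes = io.StringIO(
    "1\t|\t1\t|\tno rank\t|\n"
    "10\t|\t1\t|\tfamily\t|\n"
    "20\t|\t10\t|\tgenus\t|\n"
    "30\t|\t20\t|\tspecies\t|\n"
    "40\t|\t1\t|\tspecies\t|\n"
)
toy_names = io.StringIO(
    "10\t|\tApidae\t|\t\t|\tscientific name\t|\n"
    "30\t|\tCeratina calcarata\t|\t\t|\tscientific name\t|\n"
    "30\t|\ta made-up synonym\t|\t\t|\tsynonym\t|\n"
)

parent, rank = load_nodes(toy_nodes)
names = load_scientific_names(toy_names)
for query in (30, 40):
    print(query, names.get(query), family_of(query, parent, rank, names), sep="\t")
```

Taxids with no family-rank ancestor (toy taxid 40 here) simply come back as None, which keeps the output in fixed columns.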
taxid 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
156304 root Eukaryota Eumetazoa Arthropoda Hexapoda Hymenoptera Apocrita Aculeata Apidae Pterygota Opisthokonta Metazoa Bilateria Protostomia Neoptera Endopterygota Apoidea Insecta Xylocopinae Ceratinini Ceratina Dicondylia Panarthropoda cellular organisms Ceratina calcarata Pancrustacea Mandibulata Zadontomerus Ecdysozoa
65598 root Eukaryota Eumetazoa Arthropoda Hexapoda Hymenoptera Apocrita Aculeata Apidae Pterygota Bombus Opisthokonta Metazoa Bilateria Protostomia Neoptera Endopterygota Apoidea Insecta Bombus pascuorum Apinae Bombini Dicondylia Panarthropoda cellular organisms Thoracobombus Pancrustacea Mandibulata Ecdysozoa
938226 root Eukaryota Eumetazoa Arthropoda Hexapoda Lepidoptera Noctuidae Pterygota Opisthokonta Metazoa Bilateria Protostomia Neoptera Endopterygota Ditrysia Noctuoidea Glossata Neolepidoptera Heteroneura Insecta Dicondylia Amphiesmenoptera Panarthropoda Acronictinae Obtectomera cellular organisms Pancrustacea Mandibulata Craniophora Craniophora ligustri Ecdysozoa
156304 root Eukaryota Eumetazoa Arthropoda Hexapoda Hymenoptera Apocrita Aculeata Apidae Pterygota Opisthokonta Metazoa Bilateria Protostomia Neoptera Endopterygota Apoidea Insecta Xylocopinae Ceratinini Ceratina Dicondylia Panarthropoda cellular organisms Ceratina calcarata Pancrustacea Mandibulata Zadontomerus Ecdysozoa
112596 root Viruses Myoviridae Caudovirales Wolbachia phage WO unclassified Myoviridae Duplodnaviria Heunggongvirae Uroviricota Caudoviricetes
65598 root Eukaryota Eumetazoa Arthropoda Hexapoda Hymenoptera Apocrita Aculeata Apidae Pterygota Bombus Opisthokonta Metazoa Bilateria Protostomia Neoptera Endopterygota Apoidea Insecta Bombus pascuorum Apinae Bombini Dicondylia Panarthropoda cellular organisms Thoracobombus Pancrustacea Mandibulata Ecdysozoa
156304 root Eukaryota Eumetazoa Arthropoda Hexapoda Hymenoptera Apocrita Aculeata Apidae Pterygota Opisthokonta Metazoa Bilateria Protostomia Neoptera Endopterygota Apoidea Insecta Xylocopinae Ceratinini Ceratina Dicondylia Panarthropoda cellular organisms Ceratina calcarata Pancrustacea Mandibulata Zadontomerus Ecdysozoa
85660 root Eukaryota Eumetazoa Arthropoda Hexapoda Hymenoptera Apocrita Aculeata Apidae Pterygota Bombus Opisthokonta Metazoa Bilateria Protostomia Neoptera Endopterygota Apoidea Insecta Apinae Bombini Dicondylia Bombus hortorum Panarthropoda cellular organisms Megabombus Pancrustacea Mandibulata Ecdysozoa
170557 root Eukaryota Eumetazoa Arthropoda Hexapoda Phasmatodea Pterygota Opisthokonta Metazoa Bilateria Protostomia Neoptera Polyneoptera Insecta Timema Dicondylia Panarthropoda cellular organisms Timema poppensis Pancrustacea Mandibulata Timematoidea Timematidae Timematodea Ecdysozoa
589865 root Bacteria Proteobacteria Deltaproteobacteria delta/epsilon subdivisions cellular organisms Desulfobacterales Desulfobulbaceae Desulfurivibrio Desulfurivibrio alkaliphilus Desulfurivibrio alkaliphilus AHT 2
7955 root Eukaryota Eumetazoa Chordata Vertebrata Gnathostomata Actinopterygii Cypriniformes Danio Danio rerio Cyprinoidei Teleostei Ostariophysi Opisthokonta Metazoa Bilateria Deuterostomia Neopterygii Craniata Teleostomi Euteleostomi cellular organisms Actinopteri Clupeocephala Otophysi Cypriniphysae Otomorpha Osteoglossocephalai Danionidae Danioninae
As seen above, it is not always aligned (especially for viruses, bacteria, some plants, nematodes, etc.), so it is hard to figure out what the families are. This is just a tiny sample of my files; I have millions of these lineages to go through.
Perhaps someone knows how to just extract family names from taxids using python ete3, which would be great. I've gone through all the commands and honestly I don't see a way to do that.
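One possible approach with ete3, sketched here rather than offered as a definitive recipe: `NCBITaxa.get_lineage` returns the taxids of the lineage, `get_rank` maps those taxids to their ranks, and `get_taxid_translator` maps them to names, so the family is whichever lineage member has rank "family". The helper names `family_from_lineage` and `family_for_taxid` are made up for this example, and the ete3 import is done lazily because the first `NCBITaxa()` call downloads and builds a local taxonomy database.

```python
def family_from_lineage(lineage, ranks, names):
    """Return the name of the lineage member ranked 'family', or None if absent.

    lineage: list of taxids; ranks: {taxid: rank}; names: {taxid: name}.
    """
    for taxid in lineage:
        if ranks.get(taxid) == "family":
            return names.get(taxid)
    return None


def family_for_taxid(taxid, ncbi=None):
    """Look up the family of one taxid using an ete3 NCBITaxa object."""
    if ncbi is None:
        # Imported here so the pure helper above works without ete3 installed.
        from ete3 import NCBITaxa
        ncbi = NCBITaxa()
    lineage = ncbi.get_lineage(taxid)           # list of taxids
    ranks = ncbi.get_rank(lineage)              # {taxid: rank}
    names = ncbi.get_taxid_translator(lineage)  # {taxid: scientific name}
    return family_from_lineage(lineage, ranks, names)
```

Going by the lineage shown above for 156304, `family_for_taxid(156304)` would be expected to give Apidae, though that depends on the local taxonomy database version. When looping over millions of taxids it is worth creating one `NCBITaxa()` object and passing it in as `ncbi`, and writing "NA" (or similar) whenever the function returns None so the columns stay aligned.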
I tried this out but I don't get any output, just a blank. I tried other taxids and got the same thing: it just returns an empty line. Why might that be, I wonder?
There may be no family information (or other bits) for some of the taxIDs, so there is not much you can do about that.
When viewing the data before xtract, the info is all there: I can see the ranks, and family does say Apidae. I think maybe I just need to tweak the xtract part. Here is the tidbit:
I got it to work with my basic dirty skills since I don't know awk well enough:
TaxIDs are present at various levels. My original command only works with root taxIDs like the ones you have in the example above.