Question

Forum: JGI's Taxonomy Server

2

8.2 years ago

Brian Bushnell 20k

JGI's taxonomy server is now public-facing! It translates gi numbers, taxid numbers, organism names, and accessions to either plaintext NCBI taxids or complete JSON-formatted lineage. The main page gives usage information. You can use it from the browser by entering something like:

http://taxonomy.jgi-psf.org/tax/gi/1234

or from the command line like this:

curl "http://taxonomy.jgi-psf.org/tax/name/homo_sapiens"

Plaintext taxid is supported by entering "pt_" in front of the query format, like this:

http://taxonomy.jgi-psf.org/tax/pt_accession/NZ_AAAA01000057.1

Comma-delimited terms may be used for batch processing, and underscores may be substituted for spaces (in names). Please let me know of any features or formats that would be useful, or performance problems!
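For anyone scripting these lookups, the URL patterns above can be assembled programmatically. This is only a minimal sketch: the function name `build_query_url` and its parameters are made up for illustration, and only the query types that appear in the examples above ("gi", "name", "accession") are shown.

```python
from urllib.parse import quote

# Host and path taken from the examples in this post.
BASE_URL = "http://taxonomy.jgi-psf.org/tax"


def build_query_url(kind, terms, plaintext=False):
    """Return a lookup URL for one or more terms (hypothetical helper).

    kind:      query type as used in the post, e.g. "gi", "name", "accession".
    terms:     iterable of query terms; joined with commas for batch lookups.
    plaintext: if True, prepend "pt_" to request bare taxids instead of JSON.
    """
    prefix = "pt_" if plaintext else ""
    # Underscores stand in for spaces in names; quote() percent-encodes
    # any remaining characters that are unsafe in a URL path.
    cleaned = [quote(str(t).strip().replace(" ", "_"), safe="") for t in terms]
    return f"{BASE_URL}/{prefix}{kind}/{','.join(cleaned)}"


print(build_query_url("name", ["homo sapiens"]))
print(build_query_url("gi", [1234, 123, 2345568]))
print(build_query_url("accession", ["NZ_AAAA01000057.1"], plaintext=True))
```

The resulting strings can be passed to curl or to any HTTP client.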

bbtools nt taxonomy bbmap ncbi • 3.8k views

ADD COMMENT • link updated 2.2 years ago by Ram 45k • written 8.2 years ago by Brian Bushnell 20k

0

On the main page, you may want to note that the example paths shown (such as /tax/name/ancestor/homo_sapiens,canis_lupus,bos_taurus) need to be prefixed with https://taxonomy.jgi-psf.org to make a complete URL.

Are you converting the service over to https?

ADD REPLY • link 8.2 years ago by GenoMax 151k

0

https should work also - it works for me. Please let me know if it doesn't for you.

And yes, that's a good point. I'll modify the help message next time I reboot it. I started the current instance before it had a permanent URL, so I didn't know what it would be :)

ADD REPLY • link 8.2 years ago by Brian Bushnell 20k

0

Confirming that https is working. You could set the site up to redirect plain requests over to https.

ADD REPLY • link 8.2 years ago by GenoMax 151k

0

I requested specifically that http be available, because it should be faster for creating new connections compared to https. The difference between 100ms and 200ms, if you're in another country, might be important if you are going to integrate this into your code for automatic queries. This is hard for me to measure though since I'm very close.

ADD REPLY • link 8.2 years ago by Brian Bushnell 20k

0

Multiple entries appear to work as well: https://taxonomy.jgi-psf.org/tax/gi/1234,123,2345568

ADD REPLY • link 8.2 years ago by GenoMax 151k

0

It's not completely clear why one would use this service. Perhaps you could clarify a few things. Is this getting data directly from NCBI through the taxonomy browser or the E-Utilities webservice? If so, why would you not want to use NCBI directly? Is the overall goal just to get JSON/text formats, or are there some novel utilities?

This is just a comment, but there are clearly some assumptions being made about the taxonomic levels because it reports incorrect results for plant species. For example, take a look at the full taxonomic lineage for sunflower. What is reported through this service as the family is the tribe. Resolving these issues in an abstract way will be virtually impossible given the number of subspecies and strains, so I would be cautious and provide disclaimers about what is reported, where it comes from, etc. so people know what they are getting.

ADD REPLY • link 8.2 years ago by SES 8.6k

1

The service is used internally at JGI to automatically look up lineages of organisms from NCBI sequence headers that don't contain taxids, just gi numbers or accessions, which are not very useful. Typically, we BLAST a set of sequences from an assembly and want to know if they are in the expected genus, or similar; or autogenerate a graph indicating what fraction of a metagenome assembly hits which phylum. If NCBI provides a service that allows you to easily send a request from your code and receive the lineage of an organism in a useful format, that would be nice, but I'm not aware of one. I just ran a curl call for sunflower and did not find the results even slightly useful, but perhaps I'm doing it wrong...

But even in a browser, I don't find that result useful. It lists the lineage as:

cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; Mesangiospermae; eudicotyledons; Gunneridae; Pentapetalae; asterids; campanulids; Asterales; Asteraceae; Asteroideae; Heliantheae alliance; Heliantheae; Helianthus

So, if I am doing something bioinformatics-related, and I want to know, "What is the phylum of this BLAST hit?" ...it's not easy to figure that out from the information provided because there are an arbitrary number of unlabelled taxonomic levels. JGI's taxonomy server currently reduces the tree to this schema:

NO_RANK=0, SUBSPECIES=1, SPECIES=2, GENUS=3, FAMILY=4, ORDER=5, CLASS=6, PHYLUM=7, KINGDOM=8, DOMAIN=9, LIFE=10

Non-canonical levels such as tribe, varietas, forma, species subgroup, infraclass, parvorder, and so forth are promoted to the next-higher canonical level (such as tribe -> family) only when that lineage lacks said level, and are otherwise deleted. However, it looks like there's a bug in the tree manipulation, which I'll fix; thanks for bringing it to my attention! That tribe should not have been promoted to family, since it was already a descendant of a family. I will have to reconsider node promotion a bit, and provide an alternate URL for the original unaltered tree.
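To make the intended promotion rule concrete, here is a rough sketch of it. This is not the server's actual code: the function `reduce_lineage`, the `PROMOTE_TO` table, and the tie-breaking choice (prefer the candidate nearest the root) are all assumptions made for illustration. Only the canonical level codes come from the schema above.

```python
# Canonical levels and their numeric codes, copied from the schema above.
CANONICAL = {
    "subspecies": 1, "species": 2, "genus": 3, "family": 4, "order": 5,
    "class": 6, "phylum": 7, "kingdom": 8, "domain": 9, "life": 10,
}

# ASSUMPTION: illustrative mapping from a few non-canonical ranks to the
# next-higher canonical level. The server's real table is not shown here.
PROMOTE_TO = {
    "tribe": "family",
    "subfamily": "family",
    "subclass": "class",
    "infraclass": "class",
    "parvorder": "order",
}


def reduce_lineage(lineage):
    """Collapse a lineage to canonical levels (hypothetical helper).

    lineage: list of (rank, name) pairs ordered from the organism up to
             the root.
    Returns a dict mapping canonical rank -> name, lowest level first.
    """
    # Nodes that already sit at a canonical level are always kept.
    reduced = {rank: name for rank, name in lineage if rank in CANONICAL}
    # Walk from the root downward so that, when several nodes compete for
    # the same empty slot, the one nearest the root is promoted.
    for rank, name in reversed(lineage):
        target = PROMOTE_TO.get(rank)
        if target is not None and target not in reduced:
            reduced[target] = name
    # Everything else (unpromoted non-canonical nodes) is dropped.
    return dict(sorted(reduced.items(), key=lambda kv: CANONICAL[kv[0]]))


# Ranked part of the sunflower lineage, organism first.
sunflower = [
    ("species", "Helianthus annuus"),
    ("genus", "Helianthus"),
    ("tribe", "Heliantheae"),
    ("subfamily", "Asteroideae"),
    ("family", "Asteraceae"),
    ("order", "Asterales"),
    ("subclass", "asterids"),
    ("phylum", "Streptophyta"),
    ("kingdom", "Viridiplantae"),
]
reduced = reduce_lineage(sunflower)
for rank, name in reduced.items():
    print(rank, name)
```

With this rule the tribe and subfamily are dropped because a family node already exists, while the subclass is promoted because this lineage has no class node.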

ADD REPLY • link 8.2 years ago by Brian Bushnell 20k

0

There is a well-documented NCBI webservice called E-utilities that already handles all of this for you. You form a standard request and parse the response with something like libxml. It is therefore trivial to get text output with a shell command or a script. There are also high-level APIs in numerous programming languages (available in BioPerl, Biopython, etc.) that are heavily used, so you don't usually have to even think about parsing the document or any low-level details. Getting BLAST taxon information or PubMed IDs is a very common use case. It sounds like you may be trying to re-invent the wheel, so to speak, unless I'm missing something.
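As a rough illustration of that request-and-parse workflow, the sketch below builds an efetch URL for the taxonomy database and pulls the ranked lineage out of the XML. The helper names are made up, and the sample XML is a hand-written, heavily abbreviated stand-in in the general shape of an efetch taxonomy record (a real response carries many more fields), so treat the element names as something to verify against a live response.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_taxonomy_url(taxid):
    """Build an efetch request URL for one taxid (hypothetical helper)."""
    return EFETCH + "?" + urlencode({"db": "taxonomy", "id": taxid, "retmode": "xml"})


def lineage_from_xml(xml_text):
    """Return [(rank, name, tax_id), ...] from the record's LineageEx element."""
    taxon = ET.fromstring(xml_text).find("Taxon")
    if taxon is None:
        return []
    return [
        (t.findtext("Rank"), t.findtext("ScientificName"), t.findtext("TaxId"))
        for t in taxon.findall("LineageEx/Taxon")
    ]


# ASSUMPTION: abbreviated, hand-written sample, not a captured response.
SAMPLE_XML = """
<TaxaSet><Taxon>
  <TaxId>4232</TaxId>
  <ScientificName>Helianthus annuus</ScientificName>
  <Rank>species</Rank>
  <LineageEx>
    <Taxon><TaxId>4210</TaxId><ScientificName>Asteraceae</ScientificName><Rank>family</Rank></Taxon>
    <Taxon><TaxId>4231</TaxId><ScientificName>Helianthus</ScientificName><Rank>genus</Rank></Taxon>
  </LineageEx>
</Taxon></TaxaSet>
"""

print(efetch_taxonomy_url(4232))
for rank, name, tax_id in lineage_from_xml(SAMPLE_XML):
    print(rank, name, tax_id)
```

Note that this starts from a taxid; going from a gi number or accession to a taxid is a separate lookup.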

ADD REPLY • link 8.2 years ago by SES 8.6k

score 1 · Answer 1 · 2017-02-01

In response to SES's finding regarding sunflower, I fixed a couple of bugs in the taxonomy server code. Also, to prevent confusion, the default behavior is now to report the full lineage of an organism rather than just the traditional levels (kingdom, phylum, etc.). I also added the option to produce semicolon-delimited output with the "sc_" prefix. Semicolon mode:

curl taxonomy.jgi-psf.org/tax/sc_name/Helianthus_annuus
Eukaryota;Viridiplantae;Streptophyta;Streptophytina;Embryophyta;Tracheophyta;Euphyllophyta;Spermatophyta;Magnoliophyta;Mesangiospermae;eudicotyledons;Gunneridae;Pentapetalae;asterids;campanulids;Asterales;Asteraceae;Asteroideae;Heliantheae alliance;Heliantheae;Helianthus;Helianthus annuus

Plaintext mode:

curl taxonomy.jgi-psf.org/tax/pt_name/Helianthus_annuus
4232

JSON mode (the default when no prefix is given):

{"Helianthus_annuus": {
   "name": "Helianthus annuus",
   "tax_id": "4232",
   "level": "species",
   "species": {
      "name": "Helianthus annuus",
      "tax_id": "4232"
   },
   "genus": {
      "name": "Helianthus",
      "tax_id": "4231"
   },
   "tribe": {
      "name": "Heliantheae",
      "tax_id": "102814"
   },
   "no rank": {
      "name": "Heliantheae alliance",
      "tax_id": "911341"
   },
   "subfamily": {
      "name": "Asteroideae",
      "tax_id": "102804"
   },
   "family": {
      "name": "Asteraceae",
      "tax_id": "4210"
   },
   "order": {
      "name": "Asterales",
      "tax_id": "4209"
   },
   "no rank 2": {
      "name": "campanulids",
      "tax_id": "91882"
   },
   "subclass": {
      "name": "asterids",
      "tax_id": "71274"
   },
   "no rank 3": {
      "name": "Pentapetalae",
      "tax_id": "1437201"
   },
   "no rank 4": {
      "name": "Gunneridae",
      "tax_id": "91827"
   },
   "no rank 5": {
      "name": "eudicotyledons",
      "tax_id": "71240"
   },
   "no rank 6": {
      "name": "Mesangiospermae",
      "tax_id": "1437183"
   },
   "no rank 7": {
      "name": "Magnoliophyta",
      "tax_id": "3398"
   },
   "no rank 8": {
      "name": "Spermatophyta",
      "tax_id": "58024"
   },
   "no rank 9": {
      "name": "Euphyllophyta",
      "tax_id": "78536"
   },
   "no rank 10": {
      "name": "Tracheophyta",
      "tax_id": "58023"
   },
   "no rank 11": {
      "name": "Embryophyta",
      "tax_id": "3193"
   },
   "no rank 12": {
      "name": "Streptophytina",
      "tax_id": "131221"
   },
   "phylum": {
      "name": "Streptophyta",
      "tax_id": "35493"
   },
   "kingdom": {
      "name": "Viridiplantae",
      "tax_id": "33090"
   },
   "superkingdom": {
      "name": "Eukaryota",
      "tax_id": "2759"
   }
}}
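If you want to consume this JSON from a script, something along these lines should work. It is only a sketch based on the output shown above: the helper names are invented, and the sample is a shortened copy of the sunflower record. It relies on the lineage entries being the nested objects, listed from the organism up to the root, with repeated unranked levels numbered as "no rank 2", "no rank 3", and so on.

```python
import json
import re

# Shortened copy of the response above, for demonstration only.
SAMPLE = """
{"Helianthus_annuus": {
   "name": "Helianthus annuus", "tax_id": "4232", "level": "species",
   "species": {"name": "Helianthus annuus", "tax_id": "4232"},
   "genus": {"name": "Helianthus", "tax_id": "4231"},
   "no rank": {"name": "Heliantheae alliance", "tax_id": "911341"},
   "family": {"name": "Asteraceae", "tax_id": "4210"},
   "no rank 2": {"name": "campanulids", "tax_id": "91882"},
   "phylum": {"name": "Streptophyta", "tax_id": "35493"}
}}
"""


def parse_lineage(response_text, query_key):
    """Return [(rank, name, tax_id), ...] in response order (organism first)."""
    record = json.loads(response_text)[query_key]
    nodes = []
    for key, value in record.items():
        if isinstance(value, dict):  # skip scalar "name", "tax_id", "level"
            rank = re.sub(r" \d+$", "", key)  # "no rank 2" -> "no rank"
            nodes.append((rank, value["name"], value["tax_id"]))
    return nodes


def name_at(nodes, rank):
    """Return the name of the first node with the given rank, or None."""
    for node_rank, name, _ in nodes:
        if node_rank == rank:
            return name
    return None


def semicolon_lineage(nodes):
    """Join names root-first, mirroring the semicolon-delimited layout."""
    return ";".join(name for _, name, _ in reversed(nodes))


nodes = parse_lineage(SAMPLE, "Helianthus_annuus")
print(name_at(nodes, "phylum"))
print(semicolon_lineage(nodes))
```

This makes a question like "what is the phylum of this hit?" a single lookup.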

Additionally, there is now a "simple mode" that displays only ranked nodes at traditional levels, activated with "simpletax" instead of "tax" (abbreviated to "stax" in the example below):

curl taxonomy.jgi-psf.org/stax/name/Helianthus_annuus
{"Helianthus_annuus": {
   "name": "Helianthus annuus",
   "tax_id": "4232",
   "level": "species",
   "species": {
      "name": "Helianthus annuus",
      "tax_id": "4232"
   },
   "genus": {
      "name": "Helianthus",
      "tax_id": "4231"
   },
   "family": {
      "name": "Asteraceae",
      "tax_id": "4210"
   },
   "order": {
      "name": "Asterales",
      "tax_id": "4209"
   },
   "phylum": {
      "name": "Streptophyta",
      "tax_id": "35493"
   },
   "kingdom": {
      "name": "Viridiplantae",
      "tax_id": "33090"
   },
   "superkingdom": {
      "name": "Eukaryota",
      "tax_id": "2759"
   }
}}