How to obtain taxonomic information from GIs
8.4 years ago
bpz ▴ 60

Hi there

I think this might be an easy one, but I can't find a simple and fast way to do it. I have a FASTA file with more than 1,000 GIs (I am constructing a phylogeny with these protein sequences), and I need to obtain the taxonomic information for each of them. Any ideas on how I can do this?

Thanks in advance.

sequence gis taxonomy • 4.0k views

No guarantee that this will work: Parsing Ncbi Taxonomic Tree?

Please be aware that gi numbers will be going away starting in September 2016.

8.4 years ago

The BBMap package has a couple of tools for doing this, described in bbmap/docs/guides/TaxonomyGuide.txt.

Essentially, after downloading the NCBI taxonomy files (described in the guide), you need to convert them to an efficient format using taxtree.sh and gitable.sh. They can then be used with taxonomy.sh. In your case, you would do this:

1) Download and unzip ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
2) Download ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz
     (note: for nucleotides this would be ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz)
3) Convert them:
     taxtree.sh names.dmp nodes.dmp tree.taxtree.gz
     gitable.sh gi_taxid_prot.dmp.gz gitable.int1d.gz
4) Grab the headers from your fasta:
     reformat.sh in=file.fasta out=names.header
5) Look at the taxonomy:
     taxonomy.sh config=names.header table=gitable.int1d.gz tree=tree.taxtree.gz -da -Xmx1g
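
If you only need the taxID for each GI rather than the full lineage, the steps above can be approximated with standard shell tools. This is a minimal sketch, not part of BBMap. It assumes headers in the legacy ">gi|<number>|..." form and that gi_taxid_prot.dmp.gz holds two tab-separated columns (GI, taxID). The function names extract_gis and lookup_taxids are made up for illustration.

```shell
#!/usr/bin/env bash

# Print the GI number of every FASTA header of the form ">gi|<number>|...".
# Reads FASTA on stdin; prints one GI per line, sorted and de-duplicated.
extract_gis() {
    sed -n 's/^>gi|\([0-9][0-9]*\)|.*/\1/p' | sort -u
}

# Print "GI<TAB>taxID" for every GI in the list that appears in the dump.
#   $1: text file with one GI per line
#   $2: gzipped dump, assumed to hold "GI<TAB>taxID" per line
lookup_taxids() {
    gzip -dc "$2" |
        awk 'NR == FNR { want[$1] = 1; next }
             ($1 in want) { print $1 "\t" $2 }' "$1" -
}

# Usage (gis.txt and gi2taxid.tsv are hypothetical file names):
#   extract_gis < file.fasta > gis.txt
#   lookup_taxids gis.txt gi_taxid_prot.dmp.gz > gi2taxid.tsv
```

This streams the dump once and keeps only the wanted GIs in memory, so it should cope with a list of a thousand or so identifiers.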

This will print one block per header which contains a gi number. Each block will look like this:

species 9606    Homo sapiens
genus   9605    Homo
family  9604    Hominidae
order   9526    Catarrhini
class   314146  Euarchontoglires
phylum  7711    Chordata
kingdom 33208   Metazoa
domain  2759    Eukaryota

...where the number signifies the NCBI taxID.
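
To reduce that output to one line per sequence, the species rows can be filtered out with awk. This is a minimal sketch, assuming the output was redirected to a file and that each row is whitespace-separated as rank, taxID, name, like the block above. species_rows is a made-up helper name, and a lineage with no species-level row simply produces no line.

```shell
#!/usr/bin/env bash

# Read taxonomy blocks on stdin and print "taxID<TAB>name" for each species row.
species_rows() {
    awk '$1 == "species" {
        id = $2
        $1 = ""; $2 = ""          # blank out rank and taxID, keeping the name
        sub(/^ +/, "")            # drop the leading spaces left behind
        print id "\t" $0
    }'
}

# Usage (taxonomy.out is a hypothetical file holding the printed blocks):
#   species_rows < taxonomy.out
```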


Hi Brian

Thanks a lot! It is working all right, except for the last bit: I can't find taxonomy.sh in my bbmap folder :p


That's odd. What version do you have? I only added it fairly recently (in May), but I just downloaded the latest version (36.20) and verified that it is indeed present...


Ahh, yep, I think I have an older version. I am going to download the new one ASAP


I ran it, but I can't find the output. I got this message:

Loading tree.
Exception in thread "main" java.lang.RuntimeException: Can't find file /global/projectb/sandbox/gaag/bbtools/tax/tree.taxtree.gz
    at fileIO.ReadWrite.getRawInputStream(ReadWrite.java:815)
    at fileIO.ReadWrite.getGZipInputStream(ReadWrite.java:908)
    at fileIO.ReadWrite.getInputStream(ReadWrite.java:774)
    at fileIO.ReadWrite.readObject(ReadWrite.java:742)
    at fileIO.ReadWrite.read(ReadWrite.java:1090)
    at tax.PrintTaxonomy.<init>(PrintTaxonomy.java:137)
    at tax.PrintTaxonomy.main(PrintTaxonomy.java:40)

Hmm, can you show me the command line? By default it is supposed to look in that hard-coded location only if you did not specify a path to the taxtree file you generated in step 3.


OK, it is this one:

./taxonomy.sh config=names.header table=gitable.int1d.gz tree=tree.taxtree.gz -da -Xmx1g

Ah, sorry, looks like there's a bug. "config=names.header" is overwriting your other flags. I will fix that. For now, can you edit the config file to add these two lines:

table=gitable.int1d.gz
tree=tree.taxtree.gz

Then run the same command again. Please let me know if that works!
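
The workaround above amounts to appending two lines to names.header. Here is a small sketch of doing that from the shell without adding duplicates if it is run twice. add_setting is a hypothetical helper, not a BBMap command.

```shell
#!/usr/bin/env bash

# Append a line to a file unless an identical line is already there.
#   $1: file to edit, $2: exact line to add (e.g. "tree=tree.taxtree.gz")
add_setting() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Usage, with the file names from this thread:
#   add_setting names.header table=gitable.int1d.gz
#   add_setting names.header tree=tree.taxtree.gz
```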


Sorry, I can't find the config file. Where is it exactly? Do I add the lines at the end?


By "config file" I mean "names.header", sorry :) And yes, just add those lines at the beginning or end.


Hello again

Unfortunately, it didn't work. The error message is:

Loading gi table.
Loading tree.
Exception in thread "main" java.lang.RuntimeException: java.io.InvalidClassException: tax.TaxTree; local class incompatible: stream classdesc serialVersionUID = -6216551978806194900, local class serialVersionUID = 1682832560435175041
    at fileIO.ReadWrite.readObject(ReadWrite.java:751)
    at fileIO.ReadWrite.read(ReadWrite.java:1090)
    at tax.PrintTaxonomy.<init>(PrintTaxonomy.java:137)
    at tax.PrintTaxonomy.main(PrintTaxonomy.java:40)
Caused by: java.io.InvalidClassException: tax.TaxTree; local class incompatible: stream classdesc serialVersionUID = -6216551978806194900, local class serialVersionUID = 1682832560435175041
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
    at fileIO.ReadWrite.readObject(ReadWrite.java:747)
    ... 3 more

Sorry for all of the complication. That's because you processed the initial files with an older version of BBMap and I changed the format of the taxtree file in the newer version; it will need to be regenerated:

taxtree.sh names.dmp nodes.dmp tree.taxtree.gz

Hello again

Sorry for the inconvenience. Now when I run "taxonomy.sh" I get this message:

Loading tree.
Exception in thread "main" java.lang.RuntimeException: Can't find file /global/projectb/sandbox/gaag/bbtools/tax/tree.taxtree.gz
    at fileIO.ReadWrite.getRawInputStream(ReadWrite.java:815)
    at fileIO.ReadWrite.getGZipInputStream(ReadWrite.java:908)
    at fileIO.ReadWrite.getInputStream(ReadWrite.java:774)
    at fileIO.ReadWrite.readObject(ReadWrite.java:742)
    at fileIO.ReadWrite.read(ReadWrite.java:1090)
    at tax.PrintTaxonomy.<init>(PrintTaxonomy.java:137)
    at tax.PrintTaxonomy.main(PrintTaxonomy.java:40)
Can't find file /global/projectb/sandbox/gaag/bbtools/tax/tree.taxtree.gz

That file is not in that directory. Provide an alternate path/dir as needed.
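
Before rerunning, it can help to confirm that the files you pass actually exist where the shell is looking. This is a minimal sketch. check_inputs is a made-up helper that only tests for the files; it does not change how taxonomy.sh resolves its defaults.

```shell
#!/usr/bin/env bash

# Report each path as "ok" or "missing"; return non-zero if any is missing.
check_inputs() {
    local rc=0 f
    for f in "$@"; do
        if [ -f "$f" ]; then
            printf 'ok\t%s\n' "$f"
        else
            printf 'missing\t%s\n' "$f"
            rc=1
        fi
    done
    return "$rc"
}

# Usage, with the file names from this thread:
#   check_inputs names.header gitable.int1d.gz tree.taxtree.gz
```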


In this case, that's the default directory, hard-coded so that people at JGI won't have to specify it. It's not supposed to be used elsewhere, but apparently there's a bug. I'll fix it very soon, possibly tomorrow. Did you get this exception even though you put the lines

table=gitable.int1d.gz
tree=tree.taxtree.gz

in the names.header file?


Yep, I added the lines and still got that exception.


OK! Sorry about that, I will post a fixed version tomorrow.


Hi, was there a solution to this? My attempts also end here: Can't find file /global/projectb/sandbox/gaag/bbtools/tax/tree.taxtree.gz

I am using version 37.78.

Best, Erik

8.4 years ago
natasha.sernova ★ 4.0k

I would recommend a two-step approach:

1a) Convert your GI numbers to accession numbers with this Perl script:

http://bioinformatics.cvr.ac.uk/blog/convert-ncbi-protein-gi-to-genome-accession/

or alternatively:

1b) Start from this site

http://www.ncbi.nlm.nih.gov/books/NBK25501/

and go here:

http://www.ncbi.nlm.nih.gov/books/NBK25498/#chapter3.Application_1_Converting_GI_num

2) Then follow this post (the second step is the same whether you used 1a or 1b):

Automatically Getting The Ncbi Taxonomy Id From The Genbank Identifier

BTW, Neilfws recommended a direct way from GI to taxID, so there are many possible choices.
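
For a direct route, one option is NCBI's E-utilities ESummary endpoint, queried with the identifiers. The sketch below only builds the request URL. esummary_url is a made-up helper name, this is not necessarily the method Neilfws had in mind, and given the note earlier in this thread that GI numbers are being phased out, a given GI may no longer resolve.

```shell
#!/usr/bin/env bash

# Build an ESummary request URL.
#   $1: comma-separated identifiers, $2: Entrez database (default: protein)
esummary_url() {
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=%s&id=%s\n' \
        "${2:-protein}" "$1"
}

# Usage (needs network access; the IDs are placeholders):
#   curl -s "$(esummary_url 111,222)"
```

As far as I know, the returned summary carries a TaxId field for each record, which is the value you are after. For more than a thousand identifiers, send them in batches rather than one request each.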
