Question

Determining Phylum From .Faa Files?

13.3 years ago

Greg ▴ 50

I have a large number of .faa files that I want to be able to organize according to taxonomic group.

Is there a file that maps the GI numbers in .faa files to the taxonomic group (phylum) they belong to?

taxonomy identifiers • 4.1k views

updated 13.3 years ago by Neilfws 49k • written 13.3 years ago by Greg ▴ 50

Duplicate of "Automatically Getting The Ncbi Taxonomy Id From The Genbank Identifier?"

updated 5.1 years ago by Ram 44k • written 13.3 years ago by Pierre Lindenbaum 164k

Almost ... he appears to be asking for the larger taxonomic groups rather than just the organism. Perhaps someone knows a way to go from TaxID to phylum-level information.

13.3 years ago by Will 4.6k

Ram · Answer 1 · 2011-08-30

Using the dumped taxonomy files on the NCBI FTP site is a good suggestion. You can also do this programmatically, with a little work.

Assuming that the *.faa file is from the NCBI, it should contain a standard header which includes the GI identifier. For example, the header for this fasta file looks like this:

>gi|298501435|ref|NC_014250.1| 'Nostoc azollae' 0708 plasmid pAzo02, complete sequence
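Since the header is pipe-delimited, the GI can be pulled out with a regular expression before making any queries. The following is a minimal sketch, not part of the original answer; the helper name gi_from_header is made up for illustration, and it assumes the old NCBI "gi|<digits>|" header style shown above.

```ruby
# Hypothetical helper: extract the GI number from an NCBI-style FASTA header.
# Returns the GI as a string, or nil if the header has no "gi|<digits>|" field.
def gi_from_header(header)
  match = header.match(/\bgi\|(\d+)\|/)
  match && match[1]
end

header = ">gi|298501435|ref|NC_014250.1| 'Nostoc azollae' 0708 plasmid pAzo02, complete sequence"
puts gi_from_header(header)   # prints 298501435
```

Headers without a GI field return nil, so the caller can skip or report them.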

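Extract the GI (298501435) and use it in an EUtils ELink query to find the Taxonomy ID for the sequence. This example is a nucleotide record, hence dbfrom=nucleotide; for protein records, as in a typical .faa file, use dbfrom=protein. For example: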

curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nucleotide&db=taxonomy&id=298501435"

This returns XML, which you can parse for the Taxonomy ID. The relevant part of the XML looks like this:

<LinkSetDb>
    <DbTo>taxonomy</DbTo>
    <LinkName>nuccore_taxonomy</LinkName>
    <Link>
        <Id>551115</Id>
    </Link>
</LinkSetDb>
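As a rough sketch of the parsing step, the Taxonomy ID can be read from the ELink response with Ruby's REXML library (on recent Ruby versions it ships as the bundled rexml gem). The function name taxid_from_elink is an assumption for illustration, and the sample string is a trimmed copy of the response shown above, wrapped in the eLinkResult and LinkSet elements that the real response uses.

```ruby
require "rexml/document"

# Hypothetical helper: return the first taxonomy ID in an ELink XML response,
# or nil if the response contains no link to the taxonomy database.
def taxid_from_elink(xml)
  doc  = REXML::Document.new(xml)
  node = REXML::XPath.first(doc, "//LinkSetDb[DbTo='taxonomy']/Link/Id")
  node && node.text
end

# Trimmed sample response, for illustration only.
sample = <<~XML
  <eLinkResult><LinkSet>
    <LinkSetDb>
      <DbTo>taxonomy</DbTo>
      <LinkName>nuccore_taxonomy</LinkName>
      <Link><Id>551115</Id></Link>
    </LinkSetDb>
  </LinkSet></eLinkResult>
XML

puts taxid_from_elink(sample)   # prints 551115
```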

Now you can use the Taxonomy ID (551115) in an EUtils EFetch query, to return the complete record in XML from the Entrez taxonomy database:

curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=551115&report=xml&mode=text"

Finally, you again need to parse this XML to find the phylum. The relevant part looks like this:

<LineageEx>
  ...
  <Taxon>
    <TaxId>1117</TaxId>
    <ScientificName>Cyanobacteria</ScientificName>
    <Rank>phylum</Rank>
  </Taxon>
  ...
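One way to pick the phylum out of the lineage is to scan the Taxon elements under LineageEx for the one whose Rank is "phylum". The sketch below uses REXML; the function name phylum_from_taxonomy_xml is an assumption, and the sample XML is a heavily trimmed stand-in for a real EFetch record.

```ruby
require "rexml/document"

# Hypothetical helper: return the scientific name of the phylum-rank entry
# in the LineageEx block of a taxonomy EFetch record, or nil if none exists.
def phylum_from_taxonomy_xml(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.each(doc, "//LineageEx/Taxon") do |taxon|
    next unless taxon.elements["Rank"]&.text == "phylum"
    return taxon.elements["ScientificName"]&.text
  end
  nil
end

# Trimmed sample record, for illustration only.
sample = <<~XML
  <TaxaSet><Taxon>
    <TaxId>551115</TaxId>
    <LineageEx>
      <Taxon>
        <TaxId>2</TaxId>
        <ScientificName>Bacteria</ScientificName>
        <Rank>superkingdom</Rank>
      </Taxon>
      <Taxon>
        <TaxId>1117</TaxId>
        <ScientificName>Cyanobacteria</ScientificName>
        <Rank>phylum</Rank>
      </Taxon>
    </LineageEx>
  </Taxon></TaxaSet>
XML

puts phylum_from_taxonomy_xml(sample)   # prints Cyanobacteria
```

Some lineages have no phylum-rank node at all, so callers should be prepared for a nil result.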

Here is a quick and dirty script written in Ruby to illustrate. It uses some libraries: BioRuby to parse the fasta sequence and interact with NCBI, open-uri for the ELink query and Crack to parse the XML. Note that it has only been tested using the example fasta file mentioned previously and comes with no tests, exception handlers or guarantees.

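#!/usr/bin/env ruby
# Print "GI<TAB>phylum" for each sequence in an NCBI FASTA file.

require "bio"
require "crack"
require "open-uri"

Bio::NCBI.default_email = "me@me.com"   # replace with your own address
ncbi  = Bio::NCBI::REST.new
fasta = "nostoc.faa"
ff    = Bio::FlatFile.open(Bio::FastaFormat, fasta)

# Use dbfrom=protein instead if the file contains protein records
elink = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nucleotide&db=taxonomy&id="

while (fe = ff.next_entry)
  gi      = fe.gi
  # ELink: map the GI to its Taxonomy ID (URI.open replaces the bare open, which newer Rubies reject for URLs)
  tax     = Crack::XML.parse(URI.open("#{elink}#{gi}").read)
  taxid   = tax['eLinkResult']['LinkSet']['LinkSetDb']['Link']['Id']
  # EFetch: retrieve the full taxonomy record as XML
  taxdata = ncbi.efetch(taxid, {"report" => "xml", "db" => "taxonomy", "mode" => "text"})
  taxdata = Crack::XML.parse(taxdata)
  # Find the lineage entry ranked as phylum; leave the name empty if there is none
  lineage = taxdata['TaxaSet']['Taxon']['LineageEx']['Taxon']
  hit     = lineage.find { |t| t['Rank'] == "phylum" }
  phylum  = hit ? hit['ScientificName'] : ""
  puts "#{gi}\t#{phylum}"
end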

Result: it prints the sequence GI and the phylum:

298501435       Cyanobacteria

score 2 · Answer 2 · 2011-08-29

The NCBI FTP site has a file, taxdump_readme.txt, which indicates that nodes.dmp is probably the file you need. It can be downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/

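So you can use the .faa files to determine the GI for each entry. Then use the GI-to-TaxID dump (gi_taxid_prot.dmp for protein GIs, gi_taxid_nucl.dmp for nucleotide GIs) to find the TaxID of each entry, which is usually at species level or below. Then follow the parent links in nodes.dmp upward from that TaxID until you reach a node whose rank is phylum; names.dmp gives the scientific name for that TaxID.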
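The walk up the tree can be sketched as follows. This is an illustration under stated assumptions, not a tested pipeline: the helper names parse_nodes and phylum_taxid are made up, the code assumes the nodes.dmp layout of fields separated by tab-pipe-tab with tax_id, parent tax_id and rank as the first three, and the sample lines are a toy, collapsed tree (real lineages contain more intermediate ranks).

```ruby
# Hypothetical helper: parse nodes.dmp-style lines into
# { tax_id => [parent_tax_id, rank] }. Only the first three fields are used.
def parse_nodes(lines)
  nodes = {}
  lines.each do |line|
    tax_id, parent_id, rank = line.chomp.sub(/\t\|\z/, "").split("\t|\t")
    nodes[tax_id] = [parent_id, rank]
  end
  nodes
end

# Hypothetical helper: follow parent links from tax_id until a node with
# rank "phylum" is found. Returns that node's TaxID, or nil if the root
# (which is its own parent) or an unknown ID is reached first.
def phylum_taxid(nodes, tax_id)
  while (entry = nodes[tax_id])
    parent_id, rank = entry
    return tax_id if rank == "phylum"
    return nil if parent_id == tax_id
    tax_id = parent_id
  end
  nil
end

# Toy data in nodes.dmp layout, for illustration only.
sample = [
  "1\t|\t1\t|\tno rank\t|",
  "2\t|\t1\t|\tsuperkingdom\t|",
  "1117\t|\t2\t|\tphylum\t|",
  "551115\t|\t1117\t|\tno rank\t|"
]

nodes = parse_nodes(sample)
puts phylum_taxid(nodes, "551115")   # prints 1117
```

For the real file, something like parse_nodes(File.foreach("nodes.dmp")) should work, since the helper only needs an object that responds to each. The GI-to-TaxID dumps are very large, so it is probably better to stream them and keep only the GIs you need rather than load them whole.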