Determining Phylum From .Faa Files?
13.2 years ago
Greg ▴ 50

I have a large number of .faa files that I want to be able to organize according to taxonomic group.

Is there a file that maps the GI numbers in .faa files to the taxonomic group (phylum) they belong to?

taxonomy identifiers • 4.1k views

Almost... he appears to be asking for larger taxonomic groups than just the organism. Perhaps someone knows a way to go from TaxID to phylum-level information.

13.2 years ago
Neilfws 49k

Using the dumped taxonomy files in the NCBI FTP site is a good suggestion. You can also do this programmatically, with a little work.

Assuming that the *.faa file is from the NCBI, it should contain a standard header which includes the GI identifier. For example, the header for this fasta file looks like this:

>gi|298501435|ref|NC_014250.1| 'Nostoc azollae' 0708 plasmid pAzo02, complete sequence
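
Pulling the GI out of a header in this format can be sketched in a few lines of Ruby. This is a minimal sketch that assumes the header starts with `>gi|<digits>|`; the method name `gi_from_header` is hypothetical and not part of BioRuby.

```ruby
# Extract the GI number from an NCBI-style FASTA header.
# Returns the GI as a string, or nil if the header has no leading gi| field.
def gi_from_header(header)
  match = header.match(/\A>?gi\|(\d+)\|/)
  match && match[1]
end

header = ">gi|298501435|ref|NC_014250.1| 'Nostoc azollae' 0708 plasmid pAzo02, complete sequence"
puts gi_from_header(header)
```

Headers that do not begin with a `gi|` field return nil, so callers can skip them rather than send a bad query.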

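You extract the GI (298501435) and use it in an EUtils ELink query to find the Taxonomy ID for the sequence. The example header is a nucleotide record, so the query uses dbfrom=nucleotide; for protein records, as in a typical .faa file, use dbfrom=protein instead. For example:

curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nucleotide&db=taxonomy&id=298501435"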
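
If you build this query in a script rather than by hand, something like the following sketch may help. It assumes the HTTPS form of the EUtils endpoint; the method name `elink_taxonomy_url` and its `dbfrom` default are hypothetical choices for illustration.

```ruby
require "uri"

# Build the ELink URL that maps a sequence GI to its Taxonomy ID.
# dbfrom is "nucleotide" for nucleotide records and "protein" for protein records.
def elink_taxonomy_url(gi, dbfrom = "nucleotide")
  base  = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
  query = URI.encode_www_form(dbfrom: dbfrom, db: "taxonomy", id: gi)
  "#{base}?#{query}"
end

puts elink_taxonomy_url(298501435)
puts elink_taxonomy_url(298501435, "protein")
```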

This returns XML, which you can parse for the Taxonomy ID. The relevant part of the XML looks like this:

<LinkSetDb>
    <DbTo>taxonomy</DbTo>
    <LinkName>nuccore_taxonomy</LinkName>
    <Link>
        <Id>551115</Id>
    </Link>
</LinkSetDb>
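
One way to parse that fragment is with REXML, which ships with Ruby. This is only a sketch: the method name `taxid_from_elink` is hypothetical, and the sample document is the excerpt above wrapped in the `eLinkResult`/`LinkSet` elements that enclose it in the full response.

```ruby
require "rexml/document"

# Return the first Taxonomy ID found in an ELink XML response, or nil.
def taxid_from_elink(xml)
  doc = REXML::Document.new(xml)
  doc.elements.each("//LinkSetDb") do |link_set|
    db_to = link_set.elements["DbTo"]
    next unless db_to && db_to.text.to_s.strip == "taxonomy"
    id = link_set.elements["Link/Id"]
    return id.text.strip if id && id.text
  end
  nil
end

sample = <<~XML
  <eLinkResult>
    <LinkSet>
      <LinkSetDb>
        <DbTo>taxonomy</DbTo>
        <LinkName>nuccore_taxonomy</LinkName>
        <Link>
          <Id>551115</Id>
        </Link>
      </LinkSetDb>
    </LinkSet>
  </eLinkResult>
XML

puts taxid_from_elink(sample)
```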

Now you can use the Taxonomy ID (551115) in an EUtils EFetch query, to return the complete record in XML from the Entrez taxonomy database:

curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=551115&report=xml&mode=text"

Finally, you again need to parse this XML to find the phylum. The relevant part looks like this:

<LineageEx>
  ...
  <Taxon>
    <TaxId>1117</TaxId>
    <ScientificName>Cyanobacteria</ScientificName>
    <Rank>phylum</Rank>
  </Taxon>
  ...
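
The lineage lookup can be sketched with REXML as well. The method name `phylum_from_taxonomy` is hypothetical, and the sample document is a trimmed stand-in for the real EFetch response, which holds many more elements.

```ruby
require "rexml/document"

# Return the scientific name of the phylum-rank entry in LineageEx, or nil.
def phylum_from_taxonomy(xml)
  doc = REXML::Document.new(xml)
  doc.elements.each("//LineageEx/Taxon") do |taxon|
    rank = taxon.elements["Rank"]
    next unless rank && rank.text.to_s.strip == "phylum"
    name = taxon.elements["ScientificName"]
    return name.text.strip if name && name.text
  end
  nil
end

# Trimmed sample; only the phylum entry is taken from the excerpt above.
sample = <<~XML
  <TaxaSet>
    <Taxon>
      <TaxId>551115</TaxId>
      <LineageEx>
        <Taxon>
          <TaxId>1117</TaxId>
          <ScientificName>Cyanobacteria</ScientificName>
          <Rank>phylum</Rank>
        </Taxon>
      </LineageEx>
    </Taxon>
  </TaxaSet>
XML

puts phylum_from_taxonomy(sample)
```

Returning nil when no phylum-rank entry exists matters in practice, because some lineages in the taxonomy database have no node at that rank.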

Here is a quick and dirty script written in Ruby to illustrate. It uses some libraries: BioRuby to parse the fasta sequence and interact with NCBI, open-uri for the ELink query and Crack to parse the XML. Note that it has only been tested using the example fasta file mentioned previously and comes with no tests, exception handlers or guarantees.

#!/usr/bin/ruby
# Print the GI and phylum of every entry in an NCBI fasta file.

require "rubygems"
require "bio"
require "crack"
require "open-uri"

Bio::NCBI.default_email = "me@me.com"  # replace with your own address
ncbi  = Bio::NCBI::REST.new
fasta = "nostoc.faa"
ff    = Bio::FlatFile.open(Bio::FastaFormat, fasta)

while fe = ff.next_entry
  gi      = fe.gi
  phylum  = ""
  # ELink: map the GI to a Taxonomy ID.
  # URI.open is needed on current Ruby, where a bare open no longer fetches URLs.
  tax     = URI.open("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nucleotide&db=taxonomy&id=#{gi}").read
  tax     = Crack::XML.parse(tax)
  taxid   = tax['eLinkResult']['LinkSet']['LinkSetDb']['Link']['Id']
  # EFetch: retrieve the full taxonomy record for that Taxonomy ID.
  taxdata = ncbi.efetch(taxid, {"report" => "xml", "db" => "taxonomy", "mode" => "text"})
  taxdata = Crack::XML.parse(taxdata)
  # Scan the lineage for the entry whose rank is phylum.
  taxdata['TaxaSet']['Taxon']['LineageEx']['Taxon'].each do |t|
    if t['Rank'] == "phylum"
      phylum = t['ScientificName']
    end
  end
  puts "#{gi}\t#{phylum}"
end

Result: it prints the sequence GI and the phylum:

298501435       Cyanobacteria
13.2 years ago
Will 4.6k

Looking at the NCBI FTP site there is a file taxdump_readme.txt which implies that nodes.dmp is the file that you probably need. It can be downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/

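So you can use the .faa files to determine the GI for each entry. Then use the GI-to-TaxID dump (gi_taxid_prot.dmp for protein sequences, gi_taxid_nucl.dmp for nucleotide sequences) to find the TaxID for each GI. Then follow the parent links in nodes.dmp from that TaxID upward until you reach a node whose rank is phylum; names.dmp gives the name for that node.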
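
The walk through nodes.dmp might look like the Ruby sketch below. It assumes the taxdump layout in which fields are separated by tab, pipe, tab and each line ends with tab, pipe, with the TaxID, parent TaxID and rank as the first three fields of nodes.dmp, and the TaxID, name and name class as the first, second and fourth fields of names.dmp. The method names are hypothetical, and the inline lines are toy data with simplified ranks and parent links, not the real lineage.

```ruby
# Split one taxdump line into its fields.
def dmp_fields(line)
  line.chomp.sub(/\t\|\z/, "").split("\t|\t")
end

# Build { taxid => [parent_taxid, rank] } from nodes.dmp lines.
def parse_nodes(lines)
  lines.each_with_object({}) do |line, nodes|
    taxid, parent, rank = dmp_fields(line)
    nodes[taxid.to_i] = [parent.to_i, rank]
  end
end

# Build { taxid => scientific name } from names.dmp lines.
def parse_names(lines)
  lines.each_with_object({}) do |line, names|
    taxid, name, _unique, name_class = dmp_fields(line)
    names[taxid.to_i] = name if name_class == "scientific name"
  end
end

# Follow parent links upward until a phylum-rank node is found.
# Returns its TaxID, or nil if the root is reached first.
def phylum_taxid(taxid, nodes)
  while (node = nodes[taxid])
    parent, rank = node
    return taxid if rank == "phylum"
    return nil if parent == taxid # the root node is its own parent
    taxid = parent
  end
  nil
end

# Toy data in the dump layout (simplified, for illustration only).
node_lines = [
  "1\t|\t1\t|\tno rank\t|\n",
  "2\t|\t1\t|\tsuperkingdom\t|\n",
  "1117\t|\t2\t|\tphylum\t|\n",
  "551115\t|\t1117\t|\tno rank\t|\n"
]
name_lines = ["1117\t|\tCyanobacteria\t|\t\t|\tscientific name\t|\n"]

nodes = parse_nodes(node_lines)
names = parse_names(name_lines)
phylum = phylum_taxid(551115, nodes)
puts "551115\t#{names[phylum]}"
```

With the real files you would stream them with File.foreach instead of an inline array; loading all of nodes.dmp into a hash is workable, but the GI-to-TaxID dumps are much larger and are better scanned line by line for only the GIs you need.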
