I have a large number of .faa files that I want to be able to organize according to taxonomic group.
Is there a file that maps the GI numbers in .faa files to the taxonomic group (phylum) they belong to?
Using the dumped taxonomy files on the NCBI FTP site is a good suggestion. You can also do this programmatically, with a little work.
Assuming that the *.faa file is from the NCBI, it should contain a standard header which includes the GI identifier. For example, the header for this fasta file looks like this:
>gi|298501435|ref|NC_014250.1| 'Nostoc azollae' 0708 plasmid pAzo02, complete sequence
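As a rough sketch of the extraction step, the GI can be pulled out of such a header with a regular expression. The helper name extract_gi is made up for this illustration, and it assumes the header follows the gi|<number>| convention shown above.

```ruby
# Hypothetical helper: pull the GI number out of an NCBI-style FASTA header.
# Returns the GI as a String, or nil when the header has no "gi|<digits>" field.
def extract_gi(header)
  header[/\bgi\|(\d+)/, 1]
end

header = ">gi|298501435|ref|NC_014250.1| 'Nostoc azollae' 0708 plasmid pAzo02, complete sequence"
puts extract_gi(header)
```

BioRuby's Bio::FastaFormat#gi, used in the script further down, does the same job when you are already parsing the file with BioRuby.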
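You extract the GI (298501435) and use it in an EUtils ELink query to find the Taxonomy ID for the sequence. The example record is a nucleotide sequence, hence dbfrom=nucleotide; for protein GIs, which is what a .faa file normally contains, dbfrom=protein should be the matching database. For example: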
curl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nucleotide&db=taxonomy&id=298501435"
This returns XML, which you can parse for the Taxonomy ID. The relevant part of the XML looks like this:
<LinkSetDb>
<DbTo>taxonomy</DbTo>
<LinkName>nuccore_taxonomy</LinkName>
<Link>
<Id>551115</Id>
</Link>
</LinkSetDb>
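A minimal sketch of that parsing step, using REXML instead of the Crack gem used in the script below (an assumption: REXML ships with standard Ruby installs). The method name taxids_from_elink is invented for this example, and the sample string wraps the fragment above in the eLinkResult and LinkSet elements that the full response uses.

```ruby
require "rexml/document"

# Hypothetical helper: return every taxonomy ID (as a String) found in an
# ELink XML response. An empty array means no taxonomy link was returned.
def taxids_from_elink(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//LinkSetDb[DbTo='taxonomy']/Link/Id").map(&:text)
end

# Sample built from the XML fragment shown above.
sample = <<~XML
  <eLinkResult>
    <LinkSet>
      <LinkSetDb>
        <DbTo>taxonomy</DbTo>
        <LinkName>nuccore_taxonomy</LinkName>
        <Link>
          <Id>551115</Id>
        </Link>
      </LinkSetDb>
    </LinkSet>
  </eLinkResult>
XML

p taxids_from_elink(sample)
```

Returning an array rather than a single value avoids a crash when a GI has no taxonomy link or more than one.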
Now you can use the Taxonomy ID (551115) in an EUtils EFetch query, to return the complete record in XML from the Entrez taxonomy database:
curl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=551115&report=xml&mode=text"
Finally, you again need to parse this XML to find the phylum. The relevant part looks like this:
<LineageEx>
...
<Taxon>
<TaxId>1117</TaxId>
<ScientificName>Cyanobacteria</ScientificName>
<Rank>phylum</Rank>
</Taxon>
...
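The phylum lookup can be sketched the same way. This again assumes REXML rather than Crack, the method name phylum_from_taxonomy is made up, and the sample is a cut-down record containing only the phylum entry shown above inside the TaxaSet and Taxon wrappers.

```ruby
require "rexml/document"

# Hypothetical helper: return the scientific name of the lineage entry whose
# rank is "phylum", or nil when the record has no such entry.
def phylum_from_taxonomy(xml)
  doc = REXML::Document.new(xml)
  node = REXML::XPath.first(doc, "//LineageEx/Taxon[Rank='phylum']/ScientificName")
  node && node.text
end

# Cut-down sample: a real record lists every rank in the lineage.
sample = <<~XML
  <TaxaSet>
    <Taxon>
      <TaxId>551115</TaxId>
      <LineageEx>
        <Taxon>
          <TaxId>1117</TaxId>
          <ScientificName>Cyanobacteria</ScientificName>
          <Rank>phylum</Rank>
        </Taxon>
      </LineageEx>
    </Taxon>
  </TaxaSet>
XML

puts phylum_from_taxonomy(sample)
```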
Here is a quick and dirty script written in Ruby to illustrate. It uses some libraries: BioRuby to parse the fasta sequence and interact with NCBI, open-uri for the ELink query and Crack to parse the XML. Note that it has only been tested using the example fasta file mentioned previously and comes with no tests, exception handlers or guarantees.
#!/usr/bin/ruby
require "rubygems"
require "bio"
require "crack"
require "open-uri"

# NCBI asks for a contact email with EUtils requests; replace with your own.
Bio::NCBI.default_email = "me@me.com"
ncbi = Bio::NCBI::REST.new

fasta = "nostoc.faa"
ff = Bio::FlatFile.open(Bio::FastaFormat, fasta)

while fe = ff.next_entry
  gi = fe.gi
  phylum = ""

  # ELink: GI -> Taxonomy ID.
  # URI.open replaces the bare open(url), which no longer works in Ruby 3.
  tax = URI.open("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nucleotide&db=taxonomy&id=#{gi}").read
  tax = Crack::XML.parse(tax)
  taxid = tax['eLinkResult']['LinkSet']['LinkSetDb']['Link']['Id']

  # EFetch: Taxonomy ID -> full taxonomy record
  taxdata = ncbi.efetch(taxid, {"report" => "xml", "db" => "taxonomy", "mode" => "text"})
  taxdata = Crack::XML.parse(taxdata)

  # Scan the lineage for the entry ranked as phylum
  taxdata['TaxaSet']['Taxon']['LineageEx']['Taxon'].each do |t|
    if t['Rank'] == "phylum"
      phylum = t['ScientificName']
    end
  end

  puts "#{gi}\t#{phylum}"
end
Result: it prints the sequence GI and the phylum:
298501435 Cyanobacteria
Looking at the NCBI FTP site there is a file taxdump_readme.txt which implies that nodes.dmp is the file that you probably need. It can be downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/
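So you can use the .faa files to determine the GI for each entry, then use gi_taxid.dmp (distributed as gi_taxid_prot.dmp for protein GIs and gi_taxid_nucl.dmp for nucleotide GIs) to map each GI to a TaxID, which is usually at species level or below. Finally, follow the parent links in nodes.dmp upward from that TaxID until you reach the node whose rank is phylum.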
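Here is a minimal sketch of the nodes.dmp step, assuming the field layout described in taxdump_readme.txt (tax_id, parent tax_id and rank as the first three fields, separated by tab-pipe-tab). The method names load_nodes and phylum_taxid are invented for illustration, and the toy lines below are a shortened, made-up lineage that borrows the TaxIDs from the example above; the real file has more intermediate nodes and more columns.

```ruby
# Hypothetical helper: build { tax_id => [parent_tax_id, rank] } from
# nodes.dmp lines. Accepts any enumerable of lines.
def load_nodes(lines)
  lines.each_with_object({}) do |line, nodes|
    fields = line.chomp.sub(/\t\|\z/, "").split("\t|\t")
    tax_id, parent_id, rank = fields.first(3)
    nodes[tax_id.to_i] = [parent_id.to_i, rank]
  end
end

# Hypothetical helper: walk up the parent links until a node with rank
# "phylum" is found. Returns that node's TaxID, or nil if the walk reaches
# the root (which is its own parent) or an unknown TaxID first.
def phylum_taxid(nodes, tax_id)
  seen = {}
  while (node = nodes[tax_id]) && !seen[tax_id]
    parent_id, rank = node
    return tax_id if rank == "phylum"
    seen[tax_id] = true
    tax_id = parent_id
  end
  nil
end

# Toy data only: a shortened lineage, not the real contents of nodes.dmp.
toy_lines = [
  "1\t|\t1\t|\tno rank\t|\n",
  "2\t|\t1\t|\tsuperkingdom\t|\n",
  "1117\t|\t2\t|\tphylum\t|\n",
  "551115\t|\t1117\t|\tno rank\t|\n"
]

nodes = load_nodes(toy_lines)
puts phylum_taxid(nodes, 551115)

# With the real file: nodes = load_nodes(File.foreach("nodes.dmp"))
```

This returns the phylum's TaxID rather than its name; the scientific name for that TaxID is listed in names.dmp from the same download.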
duplicate of Automatically Getting The Ncbi Taxonomy Id From The Genbank Identifier ?
Almost ... he appears to be asking for the larger taxonomic groups rather than just the organism. Perhaps someone knows a way to go from TaxID to phylum-level information.