I'm trying to create a FASTA file with all the viral sequences for a particular gene, with taxonomy information in the record description. So far so good, except for one thing: I can see the general host information on the taxonomy page of each virus (for example, http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1221449 has "Host: plants" as part of its entry), but that information is not part of the taxonomy database record I get when I do an efetch query using the taxonomy db ID number. And I really want that host information! It's right there, taunting me. If anyone knows how to get at it, I'd really appreciate it.
Here is my query, in case it matters:
handle2 = Entrez.efetch(db="Taxonomy", id=taxid, retmode="xml")
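For context, here is a rough sketch of how the parsed efetch result could feed the FASTA description, with a slot left open for the host once it is found some other way. It assumes the record is the dict-like object that Biopython's Entrez.read returns for a Taxonomy efetch, with the keys "TaxId", "ScientificName" and "Lineage". The function name fasta_description, the host argument, the output layout and the sample values are all made up for illustration.

```python
def fasta_description(record, host=None):
    """Build a FASTA description string from a parsed taxonomy record.

    `record` is assumed to look like one entry of Entrez.read(handle)
    for db="Taxonomy"; `host` is supplied separately, since the post
    reports that the efetch record does not carry it.
    """
    parts = [
        "taxid=%s" % record["TaxId"],
        "name=%s" % record["ScientificName"],
        "lineage=%s" % record["Lineage"],
    ]
    if host is not None:
        parts.append("host=%s" % host)
    return " | ".join(parts)


# Hypothetical stand-in for a parsed record; the name and lineage are
# placeholders, not the real values for this taxid.
record = {
    "TaxId": "1221449",
    "ScientificName": "Example virus",
    "Lineage": "Viruses; Examplevirus",
}

print(fasta_description(record))
print(fasta_description(record, host="plants"))
```

With a real query, record would come from something like Entrez.read(handle2)[0] instead of the hand-written dict.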
Edit:
Based on what Neilfws wrote, I wrote up some Python to scrape the NCBI taxonomy browser for the virus host name, since Ruby is Greek to me. Here it is for any other poor saps who need to do this. Depending on the tax UID (and, one presumes, how frisky a PI was feeling when they entered their sequence), the taxonomy browser sometimes takes you to a list of species links rather than the taxonomy entry, so this code accounts for that... usually.
# Python 2 (urllib2, print statement)
from bs4 import BeautifulSoup as BS
from urllib2 import urlopen
import re

# Captures the text between "Host: </em>" and the next tag
host_pattern = r'Host:\s<\/em>(.*?)<'

for tax_id in listoftaxids:
    address = 'http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=' + str(tax_id)
    page = urlopen(address)
    soup = BS(page)
    find_string = soup.body.form.find_all('td')
    find = 0
    for cell in find_string:
        for match in re.findall(host_pattern, str(cell)):
            print match
            find += 1
    if find == 0:
        # No host on this page: it may be a list of species links, so follow each one
        spec_link = soup.body.form.find_all('a', attrs={'title': 'species'})
        for link in spec_link:
            newaddress = 'http://www.ncbi.nlm.nih.gov' + link.get('href')
            newpage = urlopen(newaddress)
            soup1 = BS(newpage)
            find_string = soup1.body.form.find_all('td')
            for cell in find_string:
                for match in re.findall(host_pattern, str(cell)):
                    print match
                    find += 1
    if find == 0:
        # Still nothing, even after following the species links
        print 'SERIOUSLY???'
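For anyone on Python 3, where urllib2 and the print statement no longer exist, the host-matching step can be pulled out into a small function and tried offline. This is only a sketch: the name extract_hosts is made up, the regex is a slightly more lenient version of the one above (it allows any amount of whitespace after "Host:"), and the sample HTML is a guess at the markup that regex expects, not a copy of a real NCBI page.

```python
import re

# Text between "Host:</em>" (optional whitespace after the colon)
# and the next tag, as in the scraper above.
HOST_RE = re.compile(r'Host:\s*</em>(.*?)<')


def extract_hosts(html):
    """Return every host string found in a chunk of taxonomy browser HTML."""
    return [match.strip() for match in HOST_RE.findall(html)]


# Hypothetical snippet shaped like what the regex expects; the real
# page markup may differ or may have changed since this was written.
sample = '<td><em>Host: </em>plants<br></td>'
print(extract_hosts(sample))
```

An empty list back from extract_hosts plays the same role as find == 0 above: a cue to follow the species links and try again.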
see also: Finding Main Virus Hosts From The Name Of The Virus