Have you thought about using the KEGG API? See the following links for more information:
Also, BioRuby seems to have a pretty good API implemented:
As does the R Bioconductor KEGGSOAP package:
The following (simple) Python script should work a treat for now though ;)
#!/usr/bin/env python
"""
Python script to retrieve KEGG gene entry for a number of different genes
Coded by Steve Moss (gawbul [at] gmail [dot] com)
http://about.me/gawbul
"""
# import required modules
from SOAPpy import WSDL
# setup kegg wsdl
kegg_wsdl = 'http://soap.genome.jp/KEGG.wsdl'
kegg_service = WSDL.Proxy(kegg_wsdl)
# setup array of gene names
gene_names = ("ALDOA", "BHLHB3", "PKM2", "P4HA1", "EPO")
# iterate over gene_names and retrieve sequences
for gene_name in gene_names:
    # use bfind first to find the list of genes for each query
    # limit to hsa (homo sapiens)
    # bfind returns a str with one entry per line; splitlines() drops the
    # trailing newline and gives an empty list when nothing is found
    gene_entries = kegg_service.bfind("genes " + gene_name + " hsa").splitlines()
    print "Found %d entries for %s" % (len(gene_entries), gene_name)
    # iterate over gene_entries
    for gene_entry in gene_entries:
        # just use the first part of the string (e.g. hsa:226) to retrieve
        # the sequences in fasta format (-f)
        results = kegg_service.bget("-f " + gene_entry.split(" ")[0])
        # print results to screen
        print results
You could modify this to read the gene names from a file, and perhaps also to write the output to a file instead of printing it to STDOUT.
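As a rough sketch of that modification, something like the following could handle the file side. The helper names (read_gene_names, write_results) and the fetch argument are my own inventions for illustration, not part of the KEGG API; fetch stands in for any function that takes a gene name and returns the text to save, such as a wrapper around the bfind/bget calls in the script. It is written to run on Python 3, unlike the SOAPpy script.

```python
def read_gene_names(path):
    """Return the non-empty lines of a text file, one gene name per line."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def write_results(gene_names, fetch, out_path):
    """Call fetch(gene_name) for each name and write the returned text to out_path.

    fetch is a hypothetical callable supplied by the caller, e.g. a function
    wrapping the bfind/bget loop from the script.
    """
    with open(out_path, "w") as out:
        for gene_name in gene_names:
            out.write(fetch(gene_name))
```

You would then call it along the lines of write_results(read_gene_names("genes.txt"), fetch_sequences, "sequences.fasta"), where both file names and fetch_sequences are placeholders for your own.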
Essentially, this uses the SOAP/WSDL framework to expose the same queries as the HTTP URLs, but as a web service that a program can call. You can build queries using the KEGG API just as you would a URL, e.g. a call such as "kegg_service.bget("-f hsa:aldoa")" is the same as requesting http://www.genome.jp/dbget-bin/www_bget?-f+hsa:aldoa, except that the data is returned to the script as XML rather than as HTML, as it would be to a browser.
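To make the correspondence concrete, here is a small sketch that turns a bget query string into the matching browser URL. The function name bget_url is made up for this example, and the only mapping it assumes is the one in the example URL above (spaces in the query become "+" after the "?").

```python
# Base URL taken from the example link above
DBGET_URL = "http://www.genome.jp/dbget-bin/www_bget"


def bget_url(query):
    """Turn a bget query string such as "-f hsa:aldoa" into the matching URL."""
    # join the whitespace-separated query parts with "+", as in the example URL
    return DBGET_URL + "?" + "+".join(query.split())


print(bget_url("-f hsa:aldoa"))
# prints http://www.genome.jp/dbget-bin/www_bget?-f+hsa:aldoa
```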
Hi, Steve. Thank you for providing so many resources. Problem solved.
No problem :) Glad to be of assistance!