Question

Extract Human Gene Sequences Based On KEGG Gene Name

12.7 years ago

Dejian ★ 1.3k

Hi, I have a list of human gene names from the KEGG database, for example ALDOA, BHLHB3, PKM2, P4HA1 and EPO. For each name, I can search the KEGG database to get a list of genes with the same name in several species, click the entry linking to the human gene (e.g. hsa:226), and finally get the amino acid and nucleotide sequences. Since there are hundreds of genes, doing this by hand is clearly not efficient. Is there a more convenient way to do this? Many thanks!

kegg • 2.8k views

updated 12.7 years ago by Steve Moss 2.3k • written 12.7 years ago by Dejian ★ 1.3k

score 2 · Answer 1 · 2012-05-08

Have you thought about using the KEGG API? See the following links for more information:

Also, BioRuby seems to have a pretty good API implemented:

http://bioruby.org/rdoc/Bio/KEGG/API.html

As does the R Bioconductor KEGGSOAP package:

http://www.bioconductor.org/packages/2.10/bioc/html/KEGGSOAP.html

The following (simple) Python script should work a treat for now though ;)

#!/usr/bin/env python
"""
Python script to retrieve KEGG gene entry for a number of different genes
Coded by Steve Moss (gawbul [at] gmail [dot] com)
http://about.me/gawbul
"""

# import required modules
from SOAPpy import WSDL

# setup kegg wsdl
kegg_wsdl = 'http://soap.genome.jp/KEGG.wsdl'
kegg_service = WSDL.Proxy(kegg_wsdl)

# set up tuple of gene names
gene_names = ("ALDOA", "BHLHB3", "PKM2", "P4HA1", "EPO")

# iterate over gene_names and retrieve sequences
for gene_name in gene_names:
    # use bfind first to find the list of genes for each query
    # limit to hsa (homo sapiens)
    # bfind returns a newline-separated str, so split on \n and drop any empty lines
    gene_entries = [e for e in kegg_service.bfind("genes " + gene_name + " hsa").split("\n") if e]
    print "Found %d entries for %s" % (len(gene_entries), gene_name)

    # iterate over gene_entries
    for gene_entry in gene_entries:
        # just use the first part of the string (e.g. hsa:226) to retrieve
        # the sequences in fasta format (-f)
        results = kegg_service.bget("-f " + gene_entry.split(" ")[0])
        # print results to screen
        print results

You could modify this to read the gene names from a file and feed them in that way, and perhaps also write the output to a file instead of printing it to STDOUT.
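A minimal sketch of that modification, assuming the input file holds one gene name per line. The helper names read_gene_names and write_sequences are made up for illustration and are not part of the script above or of any KEGG library. The lookup itself is passed in as a function, so the bfind/bget calls from the script could be wrapped and plugged in unchanged. The sketch avoids print, so it should run under either Python 2 or Python 3.

```python
def read_gene_names(path):
    """Return the non-empty lines of a text file, stripped of whitespace.

    Assumes one gene name per line (an assumption, not a KEGG format).
    """
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def write_sequences(gene_names, fetch, out_path):
    """Write the text returned by fetch(gene_name) for each name to out_path.

    fetch is any callable that takes a gene name and returns FASTA text,
    e.g. a hypothetical wrapper around the bfind/bget calls shown above.
    Returns the number of gene names processed.
    """
    with open(out_path, "w") as out:
        for gene_name in gene_names:
            text = fetch(gene_name)
            # make sure consecutive records do not run together
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)
    return len(gene_names)
```

With a wrapper function around the SOAP calls, usage would look something like write_sequences(read_gene_names("genes.txt"), my_fetch, "sequences.fasta"), where genes.txt, my_fetch and sequences.fasta are placeholder names.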

Essentially, this uses the SOAP/WSDL framework to expose the equivalent of the HTTP URLs as a web service, i.e. in a form readable by a program. You can build queries using the KEGG API just as you would a URL, e.g. a call such as kegg_service.bget("-f hsa:" + gene_name) with gene_name set to "aldoa" is the same as requesting http://www.genome.jp/dbget-bin/www_bget?-f+hsa:aldoa, except that the data is returned to the script in XML rather than in HTML, as it would be to the browser.
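To make the correspondence between a bget query and its URL concrete, here is a small sketch that turns the query string into the browser URL quoted above. The function name bget_url is hypothetical, and the only mapping assumed is the one visible in the example (spaces become "+"); whether the URL still resolves is not checked here.

```python
DBGET_BASE = "http://www.genome.jp/dbget-bin/www_bget"  # base URL quoted in the answer


def bget_url(query):
    """Turn a bget-style query string into the matching browser URL."""
    # split() with no arguments also collapses repeated whitespace
    return DBGET_BASE + "?" + "+".join(query.split())
```

For the query used in the example, bget_url("-f hsa:aldoa") gives http://www.genome.jp/dbget-bin/www_bget?-f+hsa:aldoa.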