Question

Is There Any Way To Retrieve Genes' Sequences In FASTA Format Using The KEGG Orthology Code?

8

14.8 years ago

Luke ▴ 240

With KEGG it's possible to retrieve the amino acid sequence of the protein corresponding to a gene, in FASTA format, as follows:

Retrieve sequence entries in FASTA format:
http://www.genome.jp/dbget-bin/www_bget?-f+db1:entry1+db2:entry2+...
http://www.genome.jp/dbget-bin/www_bget?-f+db+entry1+entry2+...
When the entry contains multiple sequences, specify as follows:
-f+-n+1 first sequence in FASTA format
-f+-n+2 second sequence in FASTA format
-f+-n+a amino acid sequence in FASTA format (KEGG GENES only)
-f+-n+n nucleotide sequence in FASTA format (KEGG GENES only)
(Examples)
http://www.genome.jp/dbget-bin/www_bget?-f+hsa:351
http://www.genome.jp/dbget-bin/www_bget?-f+-n+a+hsa:351
http://www.genome.jp/dbget-bin/www_bget?-f+-n+2+hsa:351

The list of options may be viewed by the -h option:
http://www.genome.jp/dbget-bin/www_bget?-h
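
The URL pattern above can be assembled programmatically. The following is a minimal sketch, not an official client: the helper name `dbget_fasta_url` and its `seq:` keyword are made up for illustration, and only the base URL and the `-f` / `-n` options come from the text above.

```ruby
# Base URL taken from the examples above.
DBGET_BASE = "http://www.genome.jp/dbget-bin/www_bget"

# Hypothetical helper: builds a DBGET FASTA URL for one or more "db:entry" IDs.
# seq may be nil (no -n option), an Integer (n-th sequence),
# "a" (amino acid) or "n" (nucleotide); the last two apply to KEGG GENES only.
def dbget_fasta_url(entries, seq: nil)
  parts = ["-f"]
  parts += ["-n", seq.to_s] unless seq.nil?
  parts += Array(entries)
  "#{DBGET_BASE}?#{parts.join('+')}"
end

puts dbget_fasta_url("hsa:351")
puts dbget_fasta_url("hsa:351", seq: "a")
puts dbget_fasta_url("hsa:351", seq: 2)
```

The three calls reproduce the three example URLs listed above; passing an array of IDs gives the multi-entry form.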

This approach has some limitations: it returns only one copy of a gene (even if there are multiple copies of that gene), and it returns no sequence when a gene is not annotated with the gene name being searched, as in this example:

BAU: BUAPTUC7_480(folD)
WBR: WGLp242(folD)
SGL: SG0706
ENT: Ent638_0986
ENC: ECL_01277
ESA: ESA_02756

where BAU, WBR, etc. are the KEGG organism codes and BUAPTUC7_480, WGLp242, etc. are the gene IDs. As you can see, the folD orthologs in SGL, ENT, ENC and ESA are not annotated with "(folD)", which limits sequence retrieval by gene name.

In the KEGG database each gene also has an orthology ID (K01491 in the following example):

K01491
folD; methylenetetrahydrofolate dehydrogenase (NADP+) / methenyltetrahydrofolate cyclohydrolase [EC:1.5.1.5 3.5.4.9]

Is there any way to retrieve a gene's sequence in FASTA format using the KEGG Orthology code (K01491) instead of the gene name (folD)?

Regards,
Luke

sequence retrieval database kegg fasta • 9.3k views

ADD COMMENT • link updated 14.8 years ago by Skwsm ▴ 40 • written 14.8 years ago by Luke ▴ 240

Ram · Answer 1 · 2010-10-26

7

14.8 years ago

Neilfws 49k

I don't know that there is an easy way via the web. However, if you're prepared to do a little programming that simulates a web query then yes, there is a way.

Most programming languages have a library (often called mechanize) which can automate web queries and parse the results. Here is how you could use the Ruby mechanize library:

#!/usr/bin/ruby

require "rubygems"
require "mechanize"

# fetch gene list for K01491
agent = Mechanize.new
page  = agent.get("http://www.genome.jp/dbget-bin/get_linkdb?-t+genes+ko:K01491")
links = []

# get links to each gene page
page.links.each do |link|
  if link.uri.to_s =~ /dbget-bin/
    links << link
  end
end

# fetch and print out FASTA
links.each do |link|
  url   = "http://www.genome.jp/dbget-bin/www_bget?-f+-n+n+#{link.text}"
  fasta = agent.get(url)
  puts fasta.search("//pre").inner_text
end

There are 3 parts to this. The first fetches a page containing the gene results for the query K01491. The second looks for the links to each gene page and stores them in an array. Finally, the last section fetches the FASTA page for each gene, extracts the sequence from between the <pre> tags and prints it out.

In a "real" script you would want more careful checks at each stage, but that's the basic idea.
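
As a rough idea of what such checks might look like, here is a sketch of two small helpers. Both names (`gene_ids`, `fasta?`) are hypothetical, and the gene ID pattern is an assumption: a lowercase organism code of 3 to 4 letters, a colon, then the gene identifier.

```ruby
# Pattern for a KEGG GENES identifier such as "hsa:351" or "bau:BUAPTUC7_480".
# Assumption: lowercase organism code of 3-4 letters, a colon, then the gene ID.
GENE_ID = /\A[a-z]{3,4}:\S+\z/

# Keep only link texts that look like gene IDs, dropping duplicates.
def gene_ids(link_texts)
  link_texts.map(&:strip).grep(GENE_ID).uniq
end

# Treat text as FASTA only if it starts with ">" and has at least one more line.
def fasta?(text)
  lines = text.to_s.strip.lines
  lines.length > 1 && lines.first.start_with?(">")
end
```

In the script above, `gene_ids(links.map(&:text))` could replace the raw link list, and `fasta?` could guard the final `puts` so that error pages or empty responses are skipped rather than printed.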

ADD COMMENT • link updated 6.0 years ago by Ram 45k • written 14.8 years ago by Neilfws 49k

0

Thank you! It's very interesting! But I don't know Ruby. Is it possible to limit the search to only a few taxa? (3-letter KEGG organism code, e.g. "bgr" for Bartonella grahamii)

ADD REPLY • link 14.8 years ago by Luke ▴ 240

0

Hi, is there a way to get all the protein sequences in FASTA format for one pathway?

I want to get all the sequences from a Nematostella pathway:

nve04068 FoxO signaling pathway

Thanks

ADD REPLY • link updated 6.0 years ago by Ram 45k • written 10.4 years ago by catagui ▴ 40

0

Hi there, is the example written in a specific programming language, e.g. R, Unix shell or Python? It is exactly what I need, but I'm unsure whether I can use it in R or Unix.

ADD REPLY • link 7.5 years ago by joel.white332 ▴ 10

Ram · Answer 2 · 2010-10-27

4

14.8 years ago

Skwsm ▴ 40

An alternative way to retrieve the sequences included in a KEGG Orthology entry is to use BioRuby:

#!/usr/bin/env ruby

require "bio"

orgs = %w[ tcr tbr ] # KEGG organism codes of interest

# create kegg api object
serv = Bio::KEGG::API.new

# create kegg orthology (ko) object
ko = Bio::KEGG::ORTHOLOGY.new(serv.bget("ko:K04283"))

# get kegg genes entry ids of the given kegg orthology
ko.genes.each do |ary|
  orgcode = ary[0]

  # select kegg genes entry ids of given organisms
  if orgs.include?(orgcode)
    ary[1].each do |entry_id|
      # retrieve sequences in fasta format
      puts serv.bget("-f -n 1 #{orgcode}:#{entry_id}")
    end
  end
end
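
The same selection logic can be sketched without BioRuby. This is only an illustration and rests on assumptions not stated in this thread: that the KEGG REST service at rest.kegg.jp is used, that its `link/genes/<KO>` operation returns tab-separated lines of the form `ko:K01491<TAB>org:gene`, and that `get/<gene>/aaseq` (or `ntseq`) returns FASTA. The helper names are hypothetical.

```ruby
# Assumed base URL of the KEGG REST service.
KEGG_REST = "https://rest.kegg.jp"

# Parse the text assumed to come back from "#{KEGG_REST}/link/genes/<KO>"
# (one "ko:Kxxxxx<TAB>org:gene" pair per line) and keep the gene IDs
# whose organism code is listed in orgs.
def genes_for_orgs(link_text, orgs)
  link_text.each_line
           .map { |line| line.chomp.split("\t")[1] }
           .compact
           .select { |gene| orgs.include?(gene.split(":").first) }
end

# URL for one gene's sequence: "aaseq" (protein) or "ntseq" (nucleotide).
def sequence_url(gene_id, type = "aaseq")
  "#{KEGG_REST}/get/#{gene_id}/#{type}"
end

# Sample input built from the gene IDs quoted in the question.
sample = "ko:K01491\tbau:BUAPTUC7_480\nko:K01491\tsgl:SG0706\nko:K01491\tent:Ent638_0986\n"
genes_for_orgs(sample, %w[bau sgl]).each { |gene| puts sequence_url(gene) }
```

Fetching each URL (for example with `Net::HTTP.get`) is left out so that the sketch runs offline.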

ADD COMMENT • link updated 6.0 years ago by Ram 45k • written 14.8 years ago by Skwsm ▴ 40

0

Great! It's the solution I was looking for! How can I modify the script to write the output to a text file?

ADD REPLY • link 14.8 years ago by Luke ▴ 240
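
One possible way to do this, sketched under assumptions: the FASTA strings are collected first, and the helper name `write_fasta`, the placeholder records and the output file name are all hypothetical. Running the original script with shell redirection (`ruby script.rb > sequences.fasta`) would also work without any code change.

```ruby
# Hypothetical helper: write FASTA strings to a file, one record after another.
def write_fasta(records, path)
  File.open(path, "w") do |out|
    # IO#puts appends a newline only if the string does not already end with one.
    records.each { |fasta| out.puts fasta }
  end
end

# Placeholder records standing in for the strings returned by serv.bget above.
records = [">tcr:example_1\nMKV\n", ">tbr:example_2\nMAL\n"]
write_fasta(records, "sequences.fasta")
```

In the BioRuby script, the `puts serv.bget(...)` line could instead append each result to an array that is passed to `write_fasta` at the end.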