Is There Any Way To Retrieve Genes' Sequences In Fasta Format Using The Kegg Orthology Code?
14.1 years ago
Luke ▴ 240

With KEGG it is possible to retrieve the amino acid sequence of the protein corresponding to a gene, in FASTA format, as follows:


Retrieve sequence entries in FASTA format:
http://www.genome.jp/dbget-bin/www_bget?-f+db1:entry1+db2:entry2+...
http://www.genome.jp/dbget-bin/www_bget?-f+db+entry1+entry2+...
When the entry contains multiple sequences, specify as follows:
-f+-n+1 first sequence in FASTA format
-f+-n+2 second sequence in FASTA format
-f+-n+a amino acid sequence in FASTA format (KEGG GENES only)
-f+-n+n nucleotide sequence in FASTA format (KEGG GENES only)
(Examples)
http://www.genome.jp/dbget-bin/www_bget?-f+hsa:351
http://www.genome.jp/dbget-bin/www_bget?-f+-n+a+hsa:351
http://www.genome.jp/dbget-bin/www_bget?-f+-n+2+hsa:351

The list of options may be viewed by the -h option:
http://www.genome.jp/dbget-bin/www_bget?-h
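The URL patterns quoted above can be assembled programmatically. The following is only a minimal Ruby sketch: the helper name `dbget_fasta_url` and its `seq:` keyword are hypothetical, and it does nothing more than join the documented pieces with `+`.

```ruby
# Base URL taken from the DBGET help text quoted above.
DBGET_BASE = "http://www.genome.jp/dbget-bin/www_bget"

# Build a DBGET FASTA URL from one or more "db:entry" identifiers.
# seq is the value passed after -n: "a" (amino acid), "n" (nucleotide),
# or a sequence number such as 1 or 2. When seq is nil, -n is omitted.
# (dbget_fasta_url is a hypothetical helper, not part of any library.)
def dbget_fasta_url(entries, seq: nil)
  parts = ["-f"]
  parts += ["-n", seq.to_s] if seq
  parts += Array(entries)
  "#{DBGET_BASE}?#{parts.join('+')}"
end

puts dbget_fasta_url("hsa:351")
puts dbget_fasta_url("hsa:351", seq: "a")
puts dbget_fasta_url("hsa:351", seq: 2)
```

The three `puts` lines reproduce the three example URLs listed above.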


This approach has some limitations: it returns only one copy of a gene (even when multiple copies exist), and it returns no sequence for a gene that is not annotated with the searched gene name, as in this example:


BAU: BUAPTUC7_480(folD)
WBR: WGLp242(folD)
SGL: SG0706
ENT: Ent638_0986
ENC: ECL_01277
ESA: ESA_02756


Here BAU, WBR, etc. are the KEGG organism IDs, and BUAPTUC7_480, WGLp242, etc. are the gene codes. As you can see, the folD orthologs in SGL, ENT, ENC and ESA are not annotated with "(folD)", which limits sequence retrieval by gene name.
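To make the point concrete, here is a rough Ruby sketch that parses lines in the format shown above and reports which entries carry a gene name. The `GeneEntry` struct and `parse_gene_line` helper are hypothetical names, and lowercasing the organism code to form an `org:gene` ID is an assumption about how the entries are addressed.

```ruby
# One parsed entry: organism code, gene ID, and optional gene name.
GeneEntry = Struct.new(:org, :gene_id, :name)

# Parse a line such as "BAU: BUAPTUC7_480(folD)".
# name is nil when the gene ID has no "(name)" annotation.
def parse_gene_line(line)
  org, rest = line.strip.split(/:\s*/, 2)
  return [] if rest.nil?
  rest.split.map do |token|
    m = token.match(/\A([^()]+)(?:\(([^()]*)\))?\z/)
    m && GeneEntry.new(org.downcase, m[1], m[2])
  end.compact
end

lines = [
  "BAU: BUAPTUC7_480(folD)",
  "WBR: WGLp242(folD)",
  "SGL: SG0706",
  "ENT: Ent638_0986"
]

lines.flat_map { |l| parse_gene_line(l) }.each do |e|
  puts "#{e.org}:#{e.gene_id}\t#{e.name || '(no gene name)'}"
end
```

Entries whose name is `nil` are the ones a search by gene name would miss.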

In the KEGG database each gene also has an orthology ID (K01491 in the following example):


K01491
folD; methylenetetrahydrofolate dehydrogenase (NADP+) / methenyltetrahydrofolate cyclohydrolase [EC:1.5.1.5 3.5.4.9]


Is there any way to retrieve a gene's sequence in FASTA format using the KEGG Orthology code (K01491) instead of the gene name (folD)?

Regards,
Luke

sequence retrieval database kegg fasta • 8.5k views
14.1 years ago
Neilfws 49k

I don't know of an easy way via the web. However, if you're prepared to do a little programming that simulates a web query, then yes, there is a way.

Most programming languages have a library (often called mechanize) which can automate web queries and parse the results. Here is how you could use the Ruby mechanize library:

#!/usr/bin/ruby

require "rubygems"
require "mechanize"

# fetch gene list for K01491
agent = Mechanize.new
page  = agent.get("http://www.genome.jp/dbget-bin/get_linkdb?-t+genes+ko:K01491")
links = []

# get links to each gene page
page.links.each do |link|
  if link.uri.to_s =~ /dbget-bin/
    links << link
  end
end

# fetch and print out FASTA
links.each do |link|
  url   = "http://www.genome.jp/dbget-bin/www_bget?-f+-n+n+#{link.text}"
  fasta = agent.get(url)
  # the sequence sits between the <pre> tags of the returned page
  puts fasta.search("//pre").inner_text
end

There are 3 parts to this. The first fetches a page containing the gene results for the query K01491. The second looks for the links to each gene page and stores them in an array. Finally, the last section fetches the FASTA page for each gene, extracts the sequence from between the <pre> tags and prints it out.

In a "real" script you would want more careful checks at each stage, but that's the basic idea.
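One of the more careful checks could be to keep only link texts that actually look like gene IDs, and optionally only those from a few organisms. The sketch below is an assumption-laden illustration: `select_gene_ids` is a hypothetical helper, and the pattern assumes gene IDs have the form `org:gene` with a three- or four-letter lowercase organism code.

```ruby
# Assumed shape of a KEGG GENES ID: 3-4 lowercase letters, a colon, a gene code.
GENE_ID = /\A([a-z]{3,4}):(\S+)\z/

# Keep only texts that look like gene IDs. When orgs is given,
# keep only IDs whose organism code is in that list.
# (select_gene_ids is a hypothetical helper for illustration.)
def select_gene_ids(link_texts, orgs: nil)
  link_texts.map(&:strip).select do |text|
    m = GENE_ID.match(text)
    m && (orgs.nil? || orgs.include?(m[1]))
  end
end

# Made-up link texts standing in for page.links.map(&:text).
texts = ["ko:K01491", "bgr:Bgr_example", "hsa:4522", "Help"]
puts select_gene_ids(texts).inspect
puts select_gene_ids(texts, orgs: %w[bgr]).inspect
```

In the script above, the result of `select_gene_ids(page.links.map(&:text), orgs: %w[bgr])` could replace the `links` array, with each ID interpolated into the FASTA URL instead of `link.text`.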


Thank you! It's very interesting, but I don't know Ruby. Is it possible to limit the search to only a few taxa (by three-letter KEGG organism code, e.g. "bgr" for Bartonella grahamii)?


Hi, is there a way to get all the FASTA protein sequences from one pathway?

I want to get all the sequences from a Nematostella pathway:

nve04068 FoxO signaling pathway

Thanks


Hi there, is the example written in a specific programming language, e.g. R, Unix shell or Python? It is exactly what I need, but I am unsure whether I can use it in R or Unix.

14.1 years ago
Skwsm ▴ 40

An alternative way to retrieve the sequences included in a KEGG Orthology entry, using BioRuby:

#!/usr/bin/env ruby

require "bio"

orgs = %w[ tcr tbr ] # kegg organism codes of your interest

# create kegg api object
serv = Bio::KEGG::API.new

# create kegg orthology (ko) object
ko = Bio::KEGG::ORTHOLOGY.new(serv.bget("ko:K04283"))

# get kegg genes entry ids of the given kegg orthology
ko.genes.each do |ary|
  orgcode = ary[0]

  # select kegg genes entry ids of given organisms
  if orgs.include?(orgcode)
    ary[1].each do |entry_id|
      # retrieve sequences in fasta format
      puts serv.bget("-f -n 1 #{orgcode}:#{entry_id}")
    end
  end
end
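
The same idea can be sketched with plain HTTP requests instead of the `Bio::KEGG::API` object. Everything about the service here is an assumption that does not come from this thread: the base URL `https://rest.kegg.jp`, a `link/genes/<KO>` operation that returns tab-separated `ko:...` and `org:gene` pairs, and a `get/<gene>/aaseq` operation that returns FASTA. The helper names are hypothetical.

```ruby
require "net/http"
require "uri"

KEGG_REST = "https://rest.kegg.jp" # assumed base URL of a KEGG REST service

# URL listing the genes linked to a KO entry (assumed "link" operation).
def ko_genes_url(ko_id)
  "#{KEGG_REST}/link/genes/#{ko_id}"
end

# URL returning FASTA for one gene; kind is "aaseq" or "ntseq" (assumed options).
def gene_fasta_url(gene_id, kind: "aaseq")
  "#{KEGG_REST}/get/#{gene_id}/#{kind}"
end

# From tab-separated "ko:Kxxxxx<TAB>org:gene" lines, return the gene IDs
# whose organism code is in orgs.
def genes_for_orgs(link_text, orgs)
  link_text.each_line
           .map { |l| l.chomp.split("\t")[1] }
           .compact
           .select { |id| orgs.include?(id.split(":", 2).first) }
end

# Fetch a URL and return the response body as a string.
def fetch(url)
  Net::HTTP.get(URI(url))
end

# Usage (performs network requests, so it is left commented out):
#   ids = genes_for_orgs(fetch(ko_genes_url("K04283")), %w[tcr tbr])
#   ids.each { |id| puts fetch(gene_fasta_url(id)) }
```

Keeping the organism filter in a separate function means it can be checked on a saved response without contacting the server.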

Great! It's the solution I was looking for! How can I modify the script to write the output to a text file?
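
One possible way to do this, sketched under assumptions: collect the FASTA strings and write them with `File.open`. The `write_fasta` helper, the file name `ko_sequences.fasta`, and the placeholder records (standing in for the strings returned by `serv.bget`) are all hypothetical.

```ruby
# Write FASTA records to a file, one after another.
# (write_fasta is a hypothetical helper for illustration.)
def write_fasta(path, records)
  File.open(path, "w") do |out|
    # puts adds a newline only when the record does not already end with one
    records.each { |rec| out.puts rec }
  end
end

# Placeholder records, not real sequences.
records = [">tcr:example1\nMKV\n", ">tbr:example2\nMAL\n"]
write_fasta("ko_sequences.fasta", records)
```

Alternatively, the script can stay unchanged and its output can be redirected from the shell, for example `ruby script.rb > ko_sequences.fasta`.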
