Hello everyone,
I am trying to retrieve FASTA sequences from the KEGG database to build a custom database based on KO numbers. As most KO numbers have many linked sequences, I need an automated way to copy the sequences into a combined FASTA file. To write the Ruby script, I followed this post: Is There Any Way To Retrieve Genes' Sequences In Fasta Format Using The Kegg Orthology Code?
I adapted the code a bit (Ruby):
#!/usr/bin/ruby
require "rubygems"
require "mechanize"
# fetch gene list for K01505
agent = Mechanize.new
page = agent.get("http://www.genome.jp/dbget-bin/get_linkdb?-t+genes+ko:K01505")
#puts page.title
links = []
This part runs, meaning that the code reaches the correct website. However, I am not sure whether the links on the page are correctly stored in the array created with "links = []".
# get links to each gene page
page.links.each do |link|
  if link.uri.to_s =~ /dbget-bin/
    links << link
  end
  #page = link.click
  #puts page.uri
end
Here, I am not 100% sure what the code does, but when I enable "page = link.click" and "puts page.uri", the URL of each link is printed in my shell, which I took to mean that the script actually follows the links.
# fetch and print out FASTA
links.each do |link|
  url = "http://www.genome.jp/dbget-bin/www_bget?-f+-n+n+#{link.text}"
  fasta = agent.get(url)
  puts((fasta/"//pre").inner_text)
end
This part should create a FASTA file with the retrieved sequences. Although I don't get an error message, I also don't get the file. Do I need to add an output folder or something else?
I hope someone can help me out. Thank you!
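On the missing file: "puts" writes to standard output, so the script as posted only prints the sequences to the shell and never creates a file. One option is to redirect the output when running the script (for example "ruby script.rb > K01505.fasta"). Another is to write the file from Ruby. Below is a minimal sketch of the second option using only the standard library; the method name write_fasta and the output file name are hypothetical, and it assumes the text of each "pre" element has already been collected into an array of strings.

```ruby
# Write FASTA records to a file instead of standard output.
# `write_fasta` is a hypothetical helper name, not part of Mechanize.
def write_fasta(path, records)
  File.open(path, "w") do |out|
    records.each do |record|
      text = record.strip
      next if text.empty?  # skip pages that returned no sequence text
      out.puts text        # one record per entry, newline-terminated
    end
  end
  path
end
```

In the loop above, this would mean appending (fasta/"//pre").inner_text to an array instead of printing it, and then calling write_fasta("K01505.fasta", records) once at the end.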
Bulk downloads from the KEGG database require a subscription. While your script may work, it may result in a permanent ban of your IP address if the KEGG folks detect the scraping.
My script uses TogoWS and does not access KEGG. This is what I meant. Please look at the code.
I am not familiar with Ruby or the tool you mention above. I just wanted to point out that any bulk downloads, direct or indirect, would likely be noticed by KEGG.
Thanks. Your point about caution in using KEGG is correct, but since we are only downloading a small portion of the data, I don't think it is a problem. TogoWS is a tool provided by the Database Center for Life Science. TogoWS caches KEGG, so it does not overload KEGG. Besides, frequent access to TogoWS automatically slows it down, so it is impossible to download large amounts of data.
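Independently of any throttling on the server side, a client script can also space out its own requests. Here is a minimal sketch, assuming the actual fetch is supplied as a block; the method name fetch_politely and the delay parameter are hypothetical, and the one-second default is an arbitrary choice rather than a documented limit of either service.

```ruby
# Call the given block once per ID, pausing between consecutive calls.
# `fetch_politely` and `delay` are hypothetical names for illustration.
def fetch_politely(ids, delay: 1.0)
  results = []
  ids.each_with_index do |id, index|
    sleep(delay) if index > 0  # no pause before the first request
    results << yield(id)       # the block performs the actual request
  end
  results
end
```

The block would contain whatever request the script makes for a single gene ID, so the pause applies regardless of which service is queried.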