How to Query the HGMD Database in Batch
13.6 years ago
User 4397 ▴ 10

I am trying to query HGMD (http://www.hgmd.org/) for 9,000 genes using the public access.

Because of the number of genes, I wrote a Python script to query them in batch instead of one at a time. The script worked for the first few thousand genes, but then stopped working properly. When I tried to log in manually afterwards, I got the following message:

We are sorry, but (my email address) has not been recognised as belonging to a registered user. Please check that you have entered your details correctly and remember that access is granted to registered academic or non-profit HGMD users only.

It seems that the server blocks users from querying the database in batch. Is there another way to do it? Thanks.


I would suggest that you write to the HGMD folks. My guess is that you got yourself banned for querying their website in such a programmatic manner.

13.6 years ago

You need to initialize a browser object and then log in.

I wrote some example code at https://bitbucket.org/dalloliogm/query-hgmd/src/tip/query_hgmd.py

#!/usr/bin/env python
"""
Log in to HGMD with mechanize so that later queries can reuse the session.
"""
import mechanize
try:
    import cookielib  # Python 2
except ImportError:
    import http.cookiejar as cookielib  # Python 3
#import html2text
import re
import time
import sys

hgmd_login_url = "http://www.hgmd.cf.ac.uk/docs/login.html"
email_address = ""
password = ""
if email_address == "":
    sys.exit("define your email address and password")



def initialize_browser():
    """
    Initialize a Browser object

    thanks to <http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/>
    """

    br = mechanize.Browser()
    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)

    # Browser options
    br.set_handle_equiv(True)
#    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)

    # Follow refresh 0, but do not hang on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

    # Want debugging messages?
#    br.set_debug_http(True)
#    br.set_debug_redirects(True)
#    br.set_debug_responses(True)

    # User-Agent (this is cheating, ok?)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    br.addheaders.append(('email', email_address))

    return br


def login_hgmd(br):
    """
    login to HGMD

    After calling this function, you will be able to search in HGMD programmatically
    """
    response = br.open(hgmd_login_url)
    html = response.read()

    # print response to STDOUT for debugging purposes
    # (the optional html2text library can format the output in a more readable form)
    print(html)

    # print all the forms in the current page
    print([f for f in br.forms()])

    # select login form
    br.select_form(nr=0)
    print(br.form)

    # print all controls in the current form, for debugging purposes
    print([c.name for c in br.form.controls])

    # set username and password
    br.form['email'] = email_address
    br.form['password'] = password

    # submit form
    response_form = br.submit()

    # You should now be logged in, and the page contents will have changed.
    # Check the output of response_form.read()
    html_response = response_form.read()
    print(html_response)

    # The rest is up to you. I suggest calling
    # br.open("http://www.hgmd.cf.ac.uk/ac/index.php"), selecting the search form,
    # and submitting a query.

    # wait 2 seconds to not overload the server
    time.sleep(2)

    return br


#def pretty_print_page(br):
#    print html2text.html2text(br.response().read())



if __name__ == '__main__':
    br = initialize_browser()
#    resp = browse_dbcline(br, genes = ['GCS1'])
    br = login_hgmd(br)

You can get the mechanize and html2text libraries from PyPI. html2text is optional, but it is useful for debugging.
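The script above only logs in. For the batch part, a loop that pauses between requests and gives up after repeated errors might look like the sketch below. Everything in it is an assumption for illustration: `query_in_batch` and its parameters are hypothetical names, and `fetch` stands for whatever function you write to submit one gene to the HGMD search form and return the result page.

```python
import time


def query_in_batch(genes, fetch, delay_seconds=2.0, max_failures=3):
    """Call fetch(gene) for every gene, pausing between requests.

    fetch is a hypothetical callable that takes a gene symbol and returns
    the result page as text. The loop stops after max_failures consecutive
    errors, so a blocked account does not keep hitting the server.
    """
    results = {}
    failed = []
    consecutive_failures = 0
    for gene in genes:
        try:
            results[gene] = fetch(gene)
            consecutive_failures = 0
        except Exception:
            failed.append(gene)
            consecutive_failures += 1
            if consecutive_failures >= max_failures:
                break
        # wait between requests to avoid overloading the server
        time.sleep(delay_seconds)
    return results, failed
```

With the logged-in browser from `login_hgmd`, `fetch` would wrap the `br.open` / `br.select_form` / `br.submit` calls for one gene. The genes in `failed` can be retried later instead of restarting all 9,000.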


Thanks. I notice that you use "time.sleep(2)" to prevent the script from overloading the server. I will include it in my own script and give it a try. I hope the server won't find out that I am querying thousands of genes in the database.


Also, check that you are including your email in the headers.
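One way to check this before a long run is to inspect `br.addheaders`, which the script fills as a list of (name, value) pairs. A minimal sketch follows; `missing_headers` is a hypothetical helper, and the required header names are simply the ones the script sets, not something confirmed by HGMD.

```python
def missing_headers(addheaders, required=("User-agent", "email")):
    """Return the required header names that are absent or have an empty value.

    addheaders is a list of (name, value) tuples, as in br.addheaders.
    Header names are compared case-insensitively.
    """
    present = set(name.lower() for name, value in addheaders if value)
    return [name for name in required if name.lower() not in present]
```

Calling `missing_headers(br.addheaders)` right after `initialize_browser()` should return an empty list; an entry such as `'email'` means the address was left blank.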
