Question

Find Amino Acid Change For Snp Using Eutils

1

13.3 years ago

Dpsguy ▴ 140

Hi, I am just starting to explore the Entrez E-utilities using Biopython. I need to find the amino acid change for each rsID in a list of missense SNPs, but I cannot figure out how to do that. I guess the answer lies in the XML generated by this query:

handle = Entrez.efetch(db="snp", id="6046", retmode="xml")

But when I try

record = Entrez.read(handle)

It gives me an error like: The Bio.Entrez parser cannot handle XML data that make use of XML namespaces.

I don’t know why this is happening. Maybe I am missing something obvious here…

Is it even possible to get my required information using eutils? If not, can you suggest any other means (except doing it manually for every SNP)?

Thanks in advance.

eutils biopython snp dbsnp • 6.8k views

ADD COMMENT • link updated 11.6 years ago by Daniel E Cook ▴ 280 • written 13.3 years ago by Dpsguy ▴ 140

score 2 · Answer 1 · 2011-09-19

2

13.3 years ago

Martijn Vermaat ▴ 190

This works for me:

from xml.dom import minidom
from Bio import Entrez

response = Entrez.efetch(db='SNP', id='6046', rettype='flt', retmode='xml')
doc = minidom.parseString(response.read())

ADD COMMENT • link 13.3 years ago by Martijn Vermaat ▴ 190

1

There is possibly more than one amino acid change associated with the SNP, but you can get the annotated ones from your response by looking in the RsStruct elements (or from the HGVS descriptions on NP references in the hgvs elements). E.g. calling .getElementsByTagName('hgvs') on the parsed document could be the first step. Consult some general documentation on XML DOM navigation if you need more information.

ADD REPLY • link 13.2 years ago by Martijn Vermaat ▴ 190
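
The suggestion above (call .getElementsByTagName('hgvs') on the parsed document and keep the descriptions on NP references) might look roughly like the sketch below. The sample_xml string is a hypothetical, heavily trimmed stand-in for the efetch response, and its HGVS values and namespace URI are illustrative rather than copied from NCBI; only the minidom calls are standard library API.

```python
from xml.dom import minidom

# Hypothetical sample standing in for the efetch response. The real
# document is much larger; the values below are illustrative only.
sample_xml = """<ExchangeSet xmlns="http://example.org/snp">
  <Rs rsId="6046">
    <hgvs>NM_000131.3:c.1238G&gt;A</hgvs>
    <hgvs>NP_000122.1:p.Arg413Gln</hgvs>
    <hgvs>NP_062562.1:p.Arg391Gln</hgvs>
  </Rs>
</ExchangeSet>"""

doc = minidom.parseString(sample_xml)

# Collect the text of every <hgvs> element, skipping empty ones.
hgvs_values = [
    node.firstChild.nodeValue
    for node in doc.getElementsByTagName("hgvs")
    if node.firstChild is not None
]

# Protein-level descriptions are the ones on NP_ (RefSeq protein) accessions.
protein_changes = [value for value in hgvs_values if value.startswith("NP_")]
print(protein_changes)
```

With a real response, doc would come from minidom.parseString(response.read()) instead of the sample string.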

0

Thanks for the tip! It seems etree can also do the job. But then back to my original question: how do I get the amino acid change from this XML? I am not very familiar with XML and was relying on the Entrez parser to do the job for me. I have no experience with etree or minidom.

ADD REPLY • link 13.3 years ago by Dpsguy ▴ 140

Ram · Answer 2 · 2011-09-17

1

13.3 years ago

Peter 6.0k

Which version of Biopython do you have? Mine is the latest and it says:

NotImplementedError: The Bio.Entrez parser cannot handle XML data that make use of XML namespaces

You can try another Python XML parser instead. For some reason the NCBI give very different XML back for the SNP database than all their other databases, and the Bio.Entrez parser can't cope: https://redmine.open-bio.org/issues/2771

Interestingly, you can try putting http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=6046&retmode=xml into validators like http://www.validome.org/xml/validate/ (which says it might be OK) or http://validator.w3.org/ (which says it is invalid).

ADD COMMENT • link updated 5.3 years ago by Ram 44k • written 13.3 years ago by Peter 6.0k
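
As one possible way to "try another Python XML parser", the standard library's ElementTree accepts namespaced XML; it simply prefixes each tag with the namespace URI in braces. The sketch below strips that prefix before comparing tag names. The sample document, its namespace URI, and the local_name helper are hypothetical and only illustrate the technique, not the exact layout NCBI returns.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaced sample; the URI and values are illustrative.
sample_xml = """<ExchangeSet xmlns="http://example.org/snp">
  <Rs rsId="6046">
    <hgvs>NP_000122.1:p.Arg413Gln</hgvs>
    <hgvs>NP_062562.1:p.Arg391Gln</hgvs>
  </Rs>
</ExchangeSet>"""


def local_name(tag):
    """Drop the '{namespace-uri}' prefix ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(sample_xml)

# ElementTree exposes the root tag as '{http://example.org/snp}ExchangeSet'.
print(root.tag)

# Walk every element and keep the text of those whose local name is 'hgvs'.
hgvs_values = [el.text for el in root.iter() if local_name(el.tag) == "hgvs"]
print(hgvs_values)
```

Comparing local names avoids hard-coding the namespace URI, which is convenient when the URI is not known in advance.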

0

I don’t think using another parser would help. From http://eutils.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html :

eFetch utility generates an invalid XML for SNP, so currently it doesn't work through SOAP. The bug is being fixed.

This page seems to have been last updated in 2009, though. Too long a time to get a bug fixed.

So what other options do I have?

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 13.3 years ago by Dpsguy ▴ 140

0

First of all, tell the NCBI about this. It will help them rank priorities if they know how many people are having trouble with it. Also check what other formats they offer for the SNP database...

ADD REPLY • link 13.3 years ago by Peter 6.0k

0

I wrote to NCBI and the reply was: "SNP data is also available through SOAP web service, which requires this snp specific efetch wsdl:http://eutils.ncbi.nlm.nih.gov/soap/v2.0/efetch_snp.wsdl How the XML object is requested and parsed by the bio.python is more a question for its developers since we do not have resources to trouble shoot third party software."

The best direct query according to them is http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=6046&rettype=xml&retmode=text

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 13.3 years ago by Dpsguy ▴ 140

0

@Peter: Yes, you are right... it does give the error that you mentioned. I have edited my question accordingly.

ADD REPLY • link 13.3 years ago by Dpsguy ▴ 140

0

But all this talk about invalid xml and parsers does nothing to answer my original question that is in the title. Now that I have the parsed xml using minidom (see below), how do I use that to get the amino acid change for a mutation?

ADD REPLY • link 13.3 years ago by Dpsguy ▴ 140

Ram · Answer 3 · 2013-05-17

1

11.6 years ago

Daniel E Cook ▴ 280

I wrote a function to parse the data from flat files. This is a work in progress, but maybe this can be of some help to someone:

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 11.6 years ago by Daniel E Cook ▴ 280

Ram · Answer 4 · 2011-10-09

I guess I found a workable solution using the hints provided by Martijn Vermaat. I reproduce my code below:

import re
from xml.dom import minidom

from Bio import Entrez

rsid = '6046'
found = False
res = minidom.parseString(Entrez.efetch(db='snp', id=rsid, retmode='xml').read())
# Each <hgvs> element holds one HGVS description; the protein-level ones
# reference an NP_ accession.
for node in res.getElementsByTagName('hgvs'):
    val = node.firstChild.nodeValue
    if 'NP_' in val:
        found = True
        aa = re.findall(r'[A-Z][a-z]+', val)  # three-letter residues, e.g. Arg, Gln
        pos = re.findall(r'[0-9]+', val)      # accession number, version, then position
        print(aa[0] + " > " + aa[1] + " Position: " + pos[2])
if not found:
    print("SNP not in coding region")

The output is the following:

Arg > Gln Position: 413
Arg > Leu Position: 413
Arg > Pro Position: 413
Arg > Gln Position: 391
Arg > Leu Position: 391
Arg > Pro Position: 391

If anyone can provide a better method or code, your suggestions are most welcome.
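
One possible refinement, offered as a sketch rather than a definitive answer: the two findall calls rely on the position being the third number in the string (pos[2]), which breaks if the accession carries no version suffix. A single anchored regular expression can pull out the accession, reference residue, position, and new residue together. The function name parse_protein_hgvs and the pattern are my own assumptions; they cover only simple three-letter substitutions such as p.Arg413Gln, not stop codons, synonymous changes, or other HGVS forms.

```python
import re

# Matches simple protein substitutions like 'NP_000122.1:p.Arg413Gln'.
# The version suffix ('.1') is optional.
PROTEIN_HGVS = re.compile(
    r"^(?P<accession>NP_\d+(?:\.\d+)?):p\."
    r"(?P<ref>[A-Z][a-z]{2})(?P<pos>\d+)(?P<alt>[A-Z][a-z]{2})$"
)


def parse_protein_hgvs(hgvs):
    """Return (accession, ref_aa, position, alt_aa), or None if no match."""
    match = PROTEIN_HGVS.match(hgvs)
    if match is None:
        return None
    return (
        match.group("accession"),
        match.group("ref"),
        int(match.group("pos")),
        match.group("alt"),
    )


# Illustrative inputs: one protein-level and one coding-level description.
for value in ["NP_000122.1:p.Arg413Gln", "NM_000131.3:c.1238G>A"]:
    print(value, "->", parse_protein_hgvs(value))
```

Inside the loop over the hgvs nodes, calling parse_protein_hgvs(val) and skipping None results would replace both the 'NP_' check and the index-based lookups.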