Accessing Geoprofiles Data Via Entrez
12.0 years ago
viraptor ▴ 10

How can I get data from the geoprofiles database parsed in some sane way? For example, after a search I get a result with a couple of IDs. Let's say I want to download 64663643 (http://www.ncbi.nlm.nih.gov/geoprofiles/64663643). Specifically, I'd like to get the GDS summary from it.

But after doing the standard:

Bio.Entrez.read(Bio.Entrez.esummary(db='geoprofiles', id='64663643'))

I get DTD errors (missing tag ENTREZ_GENE_ID). If I try without validation, I get a lot of data without a proper structure:

{u'DocumentSummarySet': ListElement([ListElement(['3682', '2896', 'fKTC', 'zFJA', 'Thiamine supplementation effect on non-insulin-dependent diabetes model: liver', 'Rattus norvegicus', '476602p1p1p1', 'Expression profiling by array', 'count', '46103', 'Gja7', 'gap junction membrane channel protein alpha 7', '', '', 'Rattus norvegicus gap junction channel protein connexin 45 mRNA, partial cds', 'AF536559.4', '', '', '', '', '476602p1;476604p1', '9;9', '5.231620', '346.305270', '', '22500', '0', '88', '30'], attributes={u'uid': u'64663643'})], attributes={u'status': u'OK'})}

What should I do differently to get a proper parsed result?

biopython entrez • 2.9k views

I don't have an answer to your question, but may I ask: if you want to retrieve data from a GDS, why don't you use the GDS ID (GDS3682)? You may also find this website and its related SQLite DB useful: http://gbnci.abcc.ncifcrf.gov/geo/index.php. Julien


I'm going to do that, but first I need to get the GDS id from the geoprofiles entry.

12.0 years ago
Peter 6.0k

Regarding the missing tag error message, if you look at the raw XML you'll see the ENTREZ_GENE_ID element is actually empty. Apparently that is counter to what the XML file's DTD says to expect (and if so, that is an NCBI bug):

>>> from Bio import Entrez
>>> print(Entrez.esummary(db='geoprofiles', id='64663643').read())
...
<ENTREZ_GENE_ID></ENTREZ_GENE_ID>
....

What exactly are you trying to get from the XML? You might prefer to use one of the Python XML parsing libraries directly, e.g. ElementTree, which doesn't depend on the DTD file and the XML file actually following it.
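
As a rough illustration of the ElementTree route, here is a minimal sketch that turns each DocumentSummary into a plain dict without consulting any DTD. The sample XML is hand-written and heavily trimmed: the root element name and the overall layout are assumptions based on the element names mentioned in this thread, and the helper name `summaries_to_dicts` is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed stand-in for the ESummary response (assumed layout);
# the real document has many more child elements per DocumentSummary.
SAMPLE_XML = """<eSummaryResult>
  <DocumentSummarySet status="OK">
    <DocumentSummary uid="64663643">
      <ENTREZ_GENE_ID></ENTREZ_GENE_ID>
      <geneDesc>gap junction membrane channel protein alpha 7</geneDesc>
    </DocumentSummary>
  </DocumentSummarySet>
</eSummaryResult>"""


def summaries_to_dicts(xml_text):
    """Map each DocumentSummary uid to a {child tag: text} dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    result = {}
    for doc in root.iter("DocumentSummary"):
        # Empty elements have text None; store them as empty strings.
        result[doc.get("uid")] = {child.tag: (child.text or "") for child in doc}
    return result


parsed = summaries_to_dicts(SAMPLE_XML)
print(parsed["64663643"]["geneDesc"])
```

With a live response you would pass the text returned by the esummary call instead of SAMPLE_XML. Because nothing here checks the DTD, an empty or undeclared element is simply stored as another key.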

12.0 years ago
Chris Maloney ▴ 360

I am not familiar with Biopython, but you can see the raw XML results from ESummary here (just open in your browser): http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?retmode=xml&version=2.0&db=geoprofiles&id=64663643.
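
If you would rather build that URL in a script than paste it into a browser, a small sketch using only the Python standard library could look like the following. The base URL and parameters are taken from the link above; the function name `esummary_url` is made up for the example, and nothing is fetched here.

```python
from urllib.parse import urlencode

# Base URL as it appears in the link above.
ESUMMARY_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(db, uid, version="2.0"):
    """Build an ESummary URL for one record (hypothetical helper)."""
    params = {"retmode": "xml", "version": version, "db": db, "id": uid}
    return ESUMMARY_BASE + "?" + urlencode(params)


url = esummary_url("geoprofiles", "64663643")
print(url)
```

The resulting string can then be handed to whatever HTTP client you prefer, for example urllib.request.urlopen.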

Reverse engineering a bit, it looks like biopython is turning the <DocumentSummary> element into a "ListElement", and then giving an array of string values under that, one for each child element. I think that the geoprofiles esummary always returns every possible child element, so you can access the individual data elements by their numeric position. For example, geneDesc would be string # 11 (zero-based), 'gap junction membrane channel protein alpha 7'.
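
To make the positional access concrete, here is a small sketch using the first twelve values copied from the output in the question. A plain dict of lists stands in for the Biopython record so that it runs without network access. It relies on the assumption above that the child elements always arrive in the same order, and the guess that the first value is the GDS number comes only from it matching the GDS3682 mentioned in the comments.

```python
# First twelve values copied from the record printed in the question.
values = [
    '3682', '2896', 'fKTC', 'zFJA',
    'Thiamine supplementation effect on non-insulin-dependent diabetes model: liver',
    'Rattus norvegicus', '476602p1p1p1', 'Expression profiling by array',
    'count', '46103', 'Gja7',
    'gap junction membrane channel protein alpha 7',
]

# Stand-in for the unvalidated record: one list of strings per DocumentSummary.
record = {'DocumentSummarySet': [values]}

GENE_DESC_INDEX = 11  # zero-based position of geneDesc, assuming a fixed order

summary = record['DocumentSummarySet'][0]
gene_desc = summary[GENE_DESC_INDEX]
gds_id = 'GDS' + summary[0]  # assumption: the first value is the GDS number

print(gds_id, gene_desc)
```

Positional indexing is fragile: if NCBI ever adds, drops, or reorders a child element, the indices shift silently, so reading the values by tag name from the raw XML is the safer choice when that is an option.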

You could also fix the DTD by adding the requisite elements that are missing. Here's a diff between the existing DTD on the NCBI site and a fixed one:

$ diff eSummary_geoprofiles.dtd eSummary_geoprofiles.fixed.dtd 
23a24
> <!ELEMENT ENTREZ_GENE_ID %T_string;>
38a40,41
> <!ELEMENT groups %T_string;>
> <!ELEMENT abscall %T_string;>
62a66
>       | ENTREZ_GENE_ID
77a82,83
>       | groups
>       | abscall

Could you report this XML / DTD mismatch to the NCBI as a bug please? Use eutilities (at) ncbi.nlm.nih.gov for this, as stated at the end of this page: http://www.ncbi.nlm.nih.gov/books/NBK25500/


Hi, Peter, I don't know why I didn't get notification of your comment, but I just noticed it. Yes, I reported it, and we are trying to improve the workflow so hopefully this won't happen in the future.
