I am re-analyzing some microarray datasets where GenBank and RefSeq accession IDs are the best identifiers for mapping probes to genes. My primary source for doing the mappings has been the Org.Hs.eg.db package in R, but this does not contain all GenBank accession numbers, and I have been using S.O.U.R.C.E. as a backup. However, this requires some manual copying-and-pasting, and I also worry that S.O.U.R.C.E. may be outdated. Based on the suggestion from this post, I decided to use the Bio.Unigene parser from Biopython to extract a complete list of GenBank and Refseq identifiers from the Hs.data file. I downloaded Hs.data from the UniGene website: ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/Hs.data.gz
This is my script:
from Bio import UniGene
input = open("Hs.data")
records = UniGene.parse(input)
i = 0
mystring = ""
for record in records:
mystring += "Ug=" + record.ID + ":"
mystring += "Ez=" + record.gene_id + ":"
for j in range(len(record.sequence)):
mystring += record.sequence[j].acc + ";"
mystring += "|"
i += 1
When I run this script, I encounter the following error:
>>>
Traceback (most recent call last):
File "C:\Users\Lukas\Documents\Python projects\Parse Hs.data from UniGene\Parse - error.py", line 10, in <module>
for record in records:
File "C:\Python27\lib\site-packages\Bio\UniGene\__init__.py", line 252, in parse
record = _read(handle)
File "C:\Python27\lib\site-packages\Bio\UniGene\__init__.py", line 314, in _read
sts = STSLine(value)
File "C:\Python27\lib\site-packages\Bio\UniGene\__init__.py", line 187, in __init__
self._init_from_text(text)
File "C:\Python27\lib\site-packages\Bio\UniGene\__init__.py", line 193, in _init_from_text
key, val = part.split("=")
ValueError: too many values to unpack
>>> i
114161
>>>
I get the exact same error when I run this minimal script:
from Bio import UniGene
input = open("Hs.data")
records = UniGene.parse(input)
for record in records:
pass
At this point, there are still about 20,000 UniGene records that I haven’t yet processed. I would like to skip the record that causes an error and move on to the next one. How would I do this? The “records” variable refers to a generator object, and the error seems to be caused by the __init__ function for iteration number 114161.
Thanks for your help!
UPDATE: I used Vim for Windows to correct a duplicated equals sign in the original Hs.data file that was causing the problem:
// ID Hs.716839 TITLE Transcribed locus EXPRESS kidney| normal CHROMOSOME 22 STS ACC=SHGC-3369 UNISTS=33527 STS ACC==D22S272 UNISTS=45191 STS ACC=D22S923 UNISTS=92346 SCOUNT 3 SEQUENCE ACC=AI272020.1; NID=g3891187; CLONE=IMAGE:1866118; END=3'; LID=1045; SEQTYPE=EST SEQUENCE ACC=GD144965.1; NID=g215345870; END=3'; LID=23569; SEQTYPE=EST SEQUENCE ACC=GD144918.1; NID=g215328149; END=3'; LID=23569; SEQTYPE=EST //
After correcting that, the code ran without incident. Thank you everyone!
Sounds like a bug, have you reported it to the Biopython team? e.g. via https://github.com/biopython/biopython/issues or the mailing list
Also which version of Biopython are you using?
Thanks for your suggestion. I have submitted an issue to GitHub.
I'm using Biopython version 1.63 and Python version 2.7.6.