My EMBL annotation file contains the records shown below. I am trying to parse it the same way as is done for GenBank files, but I keep getting an error. How can I read files like this using Biopython?
I am using the following document as my guide: http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html
ID NRP00000001; PRT; NR2; 1 SQ
XX
MF 10830627
PN WO9954462
PR GB19980008350 22-APR-1998
ED 28-OCT-1999 WO9954462 A2
XX
DR EPOP:AX013047;
DE Sequence 74 from Patent WO9954462.
PN WO9954462-A2/74, 28-OCT-1999
XX
FT source 1..358
FT /organism="Mycobacterium leprae"
FT /mol_type="protein"
FT /db_xref="taxon:1769"
XX
SQ Sequence 358 AA; 00001508eba3f78863a4f9cb2463810d; MD5;
//
ID NRP00000002; PRT; NR2; 1 SQ
XX
MF 22767515
PN WO0190366
PR US20000206690P 24-MAY-2000
ED 29-NOV-2001 WO0190366 A2
XX
DR EPOP:AX312021;
DE Sequence 5006 from Patent WO0190366.
PN WO0190366-A2/5006, 29-NOV-2001
XX
FT source 1..65
FT /organism="Homo sapiens"
FT /mol_type="protein"
FT /db_xref="taxon:9606"
XX
SQ Sequence 65 AA; 0000eece8396364fe22b1bdd6821bd63; MD5;
//
ID NRP00210944; PRT; NR2; 2 SQ
XX
MF 9921525
PN WO03020945
PR GB20010021439 05-SEP-2001
ED 13-MAR-2003 WO03020945 A2
XX
DR EPOP:AX716885;
DE Sequence 1 from Patent WO03020945.
PN WO03020945-A2/1, 13-MAR-2003
XX
DR USPOP:ABY00072;
DE Sequence 1 from patent US 7294486.
PN US7294486-A/1, 13-NOV-2007
PN US2005130274 A1 16-JUN-2005
CC First level of publication supplied by the EPO
XX
FT source 1..25
FT /organism="Streptomyces cattleya"
FT /mol_type="protein"
FT /db_xref="taxon:29303"
XX
SQ Sequence 25 AA; 000114cdf14c72e3b188040f9f35f5af; MD5;
//
ID NRP00210945; PRT; NR2; 1 SQ
XX
MF 9954057
PN WO2004078914
PR GB20030004882 04-MAR-2003
ED 16-SEP-2004 WO2004078914 A2
XX
DR EPOP:CQ871087;
DE Sequence 7 from Patent WO2004078914.
PN WO2004078914-A2/7, 16-SEP-2004
XX
FT source 1..25
FT /organism="unidentified"
FT /mol_type="protein"
FT /note="Sequence of unknown origin"
FT /db_xref="taxon:32644"
XX
SQ Sequence 25 AA; 000114cdf14c72e3b188040f9f35f5af; MD5;
//
Reading gives me the following error:
>>> SeqIO.read(emblFile, "embl")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 599, in read
    first = iterator.next()
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 537, in parse
    for r in i:
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 445, in parse_records
    record = self.parse(handle, do_features)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 428, in parse
    if self.feed(handle, consumer, do_features):
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 395, in feed
    self._feed_first_line(consumer, self.line)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 585, in _feed_first_line
    self._feed_first_line_old(consumer, line)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 610, in _feed_first_line_old
    self._feed_seq_length(consumer, fields[4])
IndexError: list index out of range
Parsing gives me the following error:
>>> SeqIO.parse(emblFile, "embl").next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 537, in parse
    for r in i:
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 445, in parse_records
    record = self.parse(handle, do_features)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 428, in parse
    if self.feed(handle, consumer, do_features):
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 395, in feed
    self._feed_first_line(consumer, self.line)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 585, in _feed_first_line
    self._feed_first_line_old(consumer, line)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 610, in _feed_first_line_old
    self._feed_seq_length(consumer, fields[4])
IndexError: list index out of range
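Both tracebacks end at fields[4] inside _feed_first_line_old, which suggests the ID line yields fewer fields than that code path expects. The sketch below is a simplified illustration only, not Biopython's actual code: split_id_line is a hypothetical helper, the splitting rule (first whitespace, then semicolons) is an assumption, and the old-style ID line is a made-up example for comparison.

```python
# Simplified illustration; this is NOT Biopython's implementation.
def split_id_line(line):
    """Split an EMBL-style ID line into fields (hypothetical helper)."""
    body = line[2:].strip()            # drop the leading "ID" line code
    first, rest = body.split(None, 1)  # first token, then the remainder
    fields = [first] + rest.split(";")
    return [f.strip().rstrip(";") for f in fields]

# ID line taken from the file in the question.
patent_id = "ID NRP00000001; PRT; NR2; 1 SQ"
# Made-up old-style ID line, used here only for comparison.
old_style_id = "ID BSUB9999 standard; circular DNA; PRO; 4214630 BP."

patent_fields = split_id_line(patent_id)
old_fields = split_id_line(old_style_id)

print(len(patent_fields), patent_fields)  # only indices 0..3 exist
print(len(old_fields), old_fields)        # index 4 holds the length field
```

Under these assumptions the patent-style line has four fields, so looking up index 4 would raise the same IndexError, independent of what the SQ line contains.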
Maybe Biopython can't parse the sequence properly because it's an MD5 hash instead of the actual amino acid sequence.
I also found that Biopython cannot handle this. Is there any other Python module that can?
What information exactly do you need to extract?
I need most of the information in the above format. I decided to write my own parser to extract these values. Thanks for the comment.
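For anyone taking the same route, here is a minimal sketch of such a hand-written parser. It assumes every line starts with a two-letter line code, that "XX" lines are spacers, and that "//" ends a record, as in the sample above. The names parse_records and qualifiers are hypothetical, and the approach has only been checked against the first record shown in the question.

```python
import re
from collections import defaultdict

# Matches feature qualifiers such as /organism="Homo sapiens"
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')

def parse_records(lines):
    """Yield one dict per record: line code -> list of values."""
    record = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):        # record terminator
            if record:
                yield dict(record)
            record = defaultdict(list)
            continue
        code, _, value = line.partition(" ")
        if not code or code == "XX":     # blank or spacer line
            continue
        record[code].append(value.strip())
    if record:                           # last record without a trailing "//"
        yield dict(record)

def qualifiers(record):
    """Collect /key="value" qualifiers from the FT lines of one record."""
    found = {}
    for value in record.get("FT", []):
        match = QUALIFIER.search(value)
        if match:
            found[match.group(1)] = match.group(2)
    return found

sample = """\
ID NRP00000001; PRT; NR2; 1 SQ
XX
MF 10830627
PN WO9954462
XX
DR EPOP:AX013047;
DE Sequence 74 from Patent WO9954462.
PN WO9954462-A2/74, 28-OCT-1999
XX
FT source 1..358
FT /organism="Mycobacterium leprae"
FT /mol_type="protein"
FT /db_xref="taxon:1769"
XX
SQ Sequence 358 AA; 00001508eba3f78863a4f9cb2463810d; MD5;
//
"""

records = list(parse_records(sample.splitlines()))
first = records[0]
accession = first["ID"][0].split(";")[0]
md5 = first["SQ"][0].split(";")[1].strip()
print(accession, qualifiers(first)["organism"], md5)
```

Values are kept as lists because some line codes, such as PN and DR, can occur more than once per record. With a real file, pass the open file handle to parse_records instead of sample.splitlines().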