I am trying to parse Humicola genome "embl" format, and extract all its gene id. I quite new in this stuff, so may need some help in termology as well ! Seems in the file, there is more than one section for features (meaning it could be more than 1 genome).how do you call these part in flatfile ?!
Anyway, so for this purpose, I wrote a small python script and used BioPython,
embl = dict()
handle = open('MyGenome.embl', "rU")
for cnt, record in enumerate(SeqIO.parse(handle, "embl")) :
print len(record.features)
print cnt
handle.close()
but I am getting the following error in 10th record.features.(could be because its corrupted or there is bug withing my biopython (version 1.58) or may be you know an easier parse!)
here is the error
File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 432, in parse_records
record = self.parse(handle, do_features)
File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 415, in parse
if self.feed(handle, consumer, do_features):
File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 387, in feed
self._feed_feature_table(consumer, self.parse_features(skip=False))
File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 184, in parse_features
features.append(self.parse_feature(feature_key, feature_lines))
File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 281, in parse_feature
assert len(qualifiers) > 0
Where I track genes that I am tracking, the one after last one in embl file is like
FT exon complement(494..1420)
FT /note="exon_id=CADAORAE00020594"
and after that, another section (genome) start,which is look like this
ID XCFFLD351 Unannotated; unspecified; UNC; 494655 BP.
XX
AC Z2573200;
XX
DT 01-OCT-2010 (Rel. 01, Created)
DT 01-OCT-2010 (Rel. 01, Last updated, Version 1)
XX
DE Humicola xisolens
XX
KW genomic.
XX
OS Humicola xisolens O45MC-2
OC Eukaryota; Fungi;
OC .
XX
RN [1]
RP 1-494655
RA MXKA;
RT Humicola xisolens
RL alaki@dolaki.com;
XX
DR seq; PC; O45PC.
CC length=494655
XX
FH Key Location/Qualifiers
FH
FT CDS complement(join(178..370,439..529,588..678,1009..1060,
FT 1584..1773,1831..1948,2838..2879))
FT /protein_id="PE04150001210"
FT /db_xref="PEDANT:PE04150001210"
Also, I am wondering how can I extract whole the part that causing error, and paste here ?!! I know its 10th record.feature! here is the file
Could you provide the identifier of the record you're having trouble with?
I update the question, actually I am wondering what is the best way to extract that part ?! I know its 10th record
Where is 'MyGenome.embl' from? To look at the 10th record, use 'less MyGenome.embl' and then search for the 10th record (with /ID). Then you can manually look at it to see if it's complete; the error seems to indicate something is wrong with the qualifiers in the feature table (FT section).
To look at the 10th record, use 'less MyGenome.embl' and then search for the 10th record (with /ID). Then you can manually examine it to see if it's complete; the error seems to indicate something is wrong with the qualifiers in the feature table (FT section).
the file is almost 110M, but I put new info, which could give you some clue.Thanks
How was this file generated? It looks pretty non-standard, which might be what is causing issues. If you can't spot the problem by looking at the record, you can add a 'print line' to '/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py' in 'parse_feature' after 'for line in iterator:'. It's hard to help without being able to reproduce this; it would help if you could make the file available.
I provide a link to file