I have a huge embl file containing many sequences. I want to print non coding sequence before a CDS. For example I want to print 4670, 5026 given the id:E3HXX6 in an example file: http://www.ebi.ac.uk/ena/data/view/CP002289&display=text
FT gene **4670..5026**
FT /locus_tag="AXYL_06645"
FT CDS 4670..5026
FT /codon_start=1
FT /transl_table=11
FT /locus_tag="AXYL_06645"
FT /product="hypothetical protein"
FT /db_xref="EnsemblGenomes:ADP19930"
FT /db_xref="EnsemblGenomes:AXYL_06645"
FT /db_xref="UniProtKB/TrEMBL:**E3HXX6**"
FT /protein_id="ADP19930.1"
FT /translation="MCYSAQLVADFKRVTRQYGVKIAFEDFVDLYLYHSTDSRPKTPRA
FT LDVAFTAADGPAGAAIQEAVRRWDQLEMQELEDLVFAQRVRLVSAEQALQKKVTKKAEN
FT DQCRRPSTSETCSA"
So I am trying to use Biopython but it is very slow when I have a long list of ids and big EMBL file:
from Bio.SeqIO import parse
id,cds=['E3HXX6'],[]
count,index=1,0
for rec in parse(open('test.embl','r'),'embl'):
for i in rec.features:
index+=1
for j in id:
if j in str(i):
print ' '.join(str(rec.features[index-3].location).split('](')[0].replace('[','').replace('>','').split(':'))
Let me know if you have any suggestions for faster code.
Are you trying to get the gene coordinates?
What is it you are trying to do with the I & j variables? If you are looking for an db_xref, search that directly - right now you are needlessly turning the feature into a string repeatedly (which is partly what is making things slow).
The last line seems needlessly complex, does this not work?
I suppose some kind of indexing will make it faster.
No, you don't need indexing for this. You need to avoid all the string manipulation like 'if f in str(i)' and the print statement and use the provided object attributes directly.
If you want to try indexing the features, read this: www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features
Why do you take
rec.features[index-3]
when your description and the marking in the example suggests you want the previous feature?