Extracting more features from EMBL files with Biopython
9.4 years ago
Lina F ▴ 200

Hi all,

I downloaded .embl files from The SEED and am trying to extract features from them using Biopython.

For example, from the following excerpt of an EMBL file, I'm trying to get the value of the /product qualifier:

ID   unknown; SV 1; linear; unassigned DNA; STD; UNC; 9430 BP.
XX
AC   unknown;
XX
DE   Contig AMTS01000351 from Escherichia coli FDA506
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..9430
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon: 1005474"
FT                   /genome_md5="b6a2d1d1a41be1cf3128536aecba12be"
FT                   /project="mshukla_1005474"
FT                   /genome_id="1005474.3"
FT                   /organism="Escherichia coli FDA506"
FT   CDS             154..432
FT                   /db_xref="SEED:fig|1005474.3.peg.3831"
FT                   /translation="MKTKIVKGKTTKQDVLASFGEPDSRSLIDGEEQWSYTMYNSQSKA
FT                   TSFIPVVGLLAGGADSQTKSLTVSFKGEKVSTYIFNAGTSNVKTGIF"
FT                   /product="hypothetical lipoprotein"
...

I've been using SeqIO.parse to get sequence records and looking at record.features, but that's not giving me the /product string:

import sys
from Bio import SeqIO
for record in SeqIO.parse(sys.argv[1], "embl"):
    print(record.id, record.features)

The output is something like this:

unknown.1 [SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(9430), strand=1), type='source'), SeqFeature(FeatureLocation(ExactPosition(153), ExactPosition(432), strand=1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(507), ExactPosition(1710), strand=-1), type='CDS'), 
...

I think there is a way to do this in BioPerl, but what's the equivalent in Biopython?

Thanks for any advice you might have!

embl biopython • 3.8k views
9.0 years ago
mgalactus ▴ 780

Hi,

The 'product' annotation can be found inside each SeqFeature object, in its 'qualifiers' dictionary. Each value in that dictionary is a list of strings:

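from Bio import SeqIO

# SeqIO.read expects exactly one record; use SeqIO.parse for multi-record files
s = SeqIO.read('input.embl', 'embl')

for feature in s.features:
    # features without a /product qualifier give an empty list
    print(feature.qualifiers.get('product', []))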

You can also filter the SeqFeature objects by type, for example with feature.type == 'CDS'.
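Putting the two suggestions together, here is a minimal sketch that keeps only CDS features and pulls out their product. The embedded record is a made-up, shortened stand-in modeled on the excerpt in the question (toy 60 bp sequence, second CDS deliberately without a /product), and cds_products is just a hypothetical helper name. It uses SeqIO.parse rather than SeqIO.read on the assumption that the file may hold more than one contig.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical, shortened EMBL record for illustration only.
EMBL_TEXT = """\
ID   unknown; SV 1; linear; unassigned DNA; STD; UNC; 60 BP.
XX
AC   unknown;
XX
DE   Toy contig for illustration
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..60
FT                   /mol_type="genomic DNA"
FT   CDS             1..30
FT                   /db_xref="SEED:fig|1005474.3.peg.3831"
FT                   /product="hypothetical lipoprotein"
FT   CDS             complement(31..60)
FT                   /note="toy feature without a product"
XX
SQ   Sequence 60 BP; 15 A; 15 C; 15 G; 15 T; 0 other;
     acgtacgtac gtacgtacgt acgtacgtac gtacgtacgt acgtacgtac gtacgtacgt        60
//
"""


def cds_products(handle):
    """Yield (record id, start, end, product) for every CDS feature."""
    # SeqIO.parse handles files with one or many records.
    for record in SeqIO.parse(handle, "embl"):
        for feature in record.features:
            if feature.type != "CDS":
                continue
            # Qualifier values are lists of strings; take the first one,
            # or a placeholder when the qualifier is absent.
            product = feature.qualifiers.get("product", ["(no product)"])[0]
            # Biopython locations are 0-based, end-exclusive.
            start = int(feature.location.start)
            end = int(feature.location.end)
            yield record.id, start, end, product


rows = list(cds_products(StringIO(EMBL_TEXT)))
for row in rows:
    print(*row, sep="\t")
```

For a real file, pass an open handle or a filename to SeqIO.parse instead of the StringIO wrapper. Other qualifiers such as 'db_xref' or 'translation' can be read from feature.qualifiers in the same way.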

Hope this helps...
