Biopython: work with loops and qualifiers from multiple records
1
1
Entering edit mode
7.6 years ago
dago ★ 2.8k

I have a gbk of a draft bacterial genomes (meaning multiple contigs). I want to extract some info about specific genes, and I get an error I cannot understand.

Parse the gbk

record_dict = SeqIO.to_dict(SeqIO.parse("SpeciesA.gbk", "genbank"))

My file looks:

    for i in record_dict:
    ...     for f in record_dict[i].features:
    ...             print f.qualifiers
{'locus_tag': ['AA_03640'], 'gene': ['ftsH_4']}
{'locus_tag': ['AA_03640'], 'inference': ['ab initio prediction:Prodigal:2.6', 'similar to AA sequence:UniProtKB:P37476'], 'codon_start': ['1'], 'EC_number': ['3.4.24.-'], 'transl_table': ['11'], 'product': ['ATP-dependent zinc metalloprotease FtsH'], ........

Now if I try to get out the contig name where the gene AA_03640 is in, as follow:

for i in record_dict:
...     for f in record_dict[i].features:
...             if f.qualifiers['locus_tag'] == 'AA_03640':
                    print(i)

But I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
KeyError: 'locus_tag'

Any help to understand where I am wrong? Thanks

genome python gene software error biopython • 3.7k views
ADD COMMENT
1
Entering edit mode

The error is quite self-explanatory I would say, and means that f.qualifiers doesn't have a key "locus_tag"...

You can check which keys it does have using f.qualifiers.keys()

ADD REPLY
1
Entering edit mode

I thought the same, but when I check the keys locus_tag is in there:

['locus_tag', 'gene']
['locus_tag', 'inference', 'codon_start', 'EC_number', 'transl_table', 'product', 'translation', 'gene', 'protein_id']
ADD REPLY
1
Entering edit mode

Odd. For a moment I thought this was an easy question ;-)

Maybe use a try-except block:

try:
    if f.qualifiers['locus_tag'] == 'AA_03640':
        #stuff
except KeyError:
    print(f.qualifiers)

An unrelated problem is that you should check for f.qualifiers['locus_tag'] == ['AA_03640'], since it's a list and not a character.

ADD REPLY
0
Entering edit mode

ok, I am officially confused. If I run:

for i in record_dict:
    for f in record_dict[i].features:
        print f.qualifiers.keys()

I get

['locus_tag', 'inference', 'codon_start', 'product', 'transl_table', 'translation', 'protein_id']
['locus_tag']
['locus_tag', 'inference', 'codon_start', 'product', 'transl_table', 'translation', 'protein_id']
['locus_tag']

But If I run

for i in record_dict:
    for f in record_dict[i].features:
        try:
            if f.qualifiers['locus_tag'] == ['AA_03640']:
                print i
        except KeyError:
            print(f.qualifiers.keys())

I get

['estimated_length', 'linkage_evidence', 'gap_type']
['strain', 'mol_type', 'organism']
['strain', 'mol_type', 'organism']

I should have the same keys, shouldn't I?

ADD REPLY
2
Entering edit mode

To answer your question, no, you should not assume all the features have the same keys. It varies dramatically by the feature type - the source feature (usually there is only one) has things like the organism, while while CDS and gene features have things like a locus tag, and the CDS often has a translation. You probably want to add if f.type == "CDS": or if f.type == "gene": to your loop. See also http://www.warwick.ac.uk/go/peter_cock/python/genbank which does something similar to what I think you want to achieve.

(I have expanded on this as a full answer)

ADD REPLY
0
Entering edit mode

I am officially confused

I join being confused.

I don't really have experience with genbank files so I can't really nail this down. Maybe something for Peter, although I'm not sure if he visits biostars often...

ADD REPLY
0
Entering edit mode

Double check your input genbank isn't malformed and that its /locus_tag fields are correct?

If you hit Peter up on twitter he's been really helpful to me in the past (and hes always on twitter ;) )

ADD REPLY
0
Entering edit mode

I checked the gbk, it looks fine. I will try twitter..:)

ADD REPLY
2
Entering edit mode

Someone kindling pinged me on Twitter

ADD REPLY
0
Entering edit mode

They were faster then me!=)

ADD REPLY
2
Entering edit mode
7.6 years ago
Peter 6.0k

Not all feature types will have a locus tag - you probably only care about CDS or gene features. Even then, not genes will have a locus tag (although generally within a record they either all do, or all don't).

Finally if there is a locus tag, then f.qualifiers['locus_tag'] will return a list (since in general the annotation qualifiers can appear more than once), so you'd want to check using 'AA_03640' in f.qualifiers['locus_tag'] or similar, e.g.

for i in record_dict:
    for f in record_dict[i].features:
        if f.type == "CDS":
            if 'AA_03640' in f.qualifiers.get('locus_tag', []):
                print(i)

See also http://www.warwick.ac.uk/go/peter_cock/python/genbank which does something similar to what I think you want to achieve.

ADD COMMENT
0
Entering edit mode

Hi, thank you very much for the answer! You are right! Adding if f.type == "CDS" fixed it!

ADD REPLY

Login before adding your answer.

Traffic: 3983 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6