Question

Retrieve qualifiers plasmid and pathovar

7.0 years ago

felipelira3 ▴ 40

I am trying to extract some information from a .gbk (GenBank) file. I can handle the most common qualifiers, but I run into trouble when I try to extract "plasmid" and "pathovar" from records where they are not present. Both files used for testing are at https://github.com/felipelira/files_to_test.

A simple print of the feature qualifiers from each file gives this: ...

The qualifiers from example 1 show that not all sequences in the file come from the chromosome: there are two plasmids. The problem is that, when I ask for "plasmid", I cannot retrieve it in the same way that I obtain "country", "organism"...

Pseudomonas_syringae_pv._actinidiae_ICMP_9853.gbk
{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:1104678'], 'collection_date': ['1984'], 'country': ['Japan'], 'pathovar': ['actinidiae'], 'strain': ['ICMP 9853'], 'host': ['Actinidia'], 'organism': ['Pseudomonas syringae pv. actinidiae ICMP 9853']}
{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:1104678'], 'collection_date': ['1984'], 'country': ['Japan'], 'pathovar': ['actinidiae'], 'strain': ['ICMP 9853'], 'host': ['Actinidia'], 'plasmid': ['p9853_A'], 'organism': ['Pseudomonas syringae pv. actinidiae ICMP 9853']}
{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:1104678'], 'collection_date': ['1984'], 'country': ['Japan'], 'pathovar': ['actinidiae'], 'strain': ['ICMP 9853'], 'host': ['Actinidia'], 'plasmid': ['p9853_B'], 'organism': ['Pseudomonas syringae pv. actinidiae ICMP 9853']}

Qualifiers from example 2:

{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:317'], 'collection_date': ['2010'], 'country': ['New Zealand'], 'isolation_source': ['cherry'], 'strain': ['ICMP 3690'], 'organism': ['Pseudomonas syringae']}
{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:317'], 'collection_date': ['2010'], 'country': ['New Zealand'], 'isolation_source': ['cherry'], 'strain': ['ICMP 3690'], 'organism': ['Pseudomonas syringae']}

As you can see, file 2 does not have the same set of qualifiers, so the same script, which reads the qualifiers via SeqIO, does not work on it.

...**Example 1**
FEATURES             Location/Qualifiers
     source          1..6439609
                     /organism="Pseudomonas syringae pv. actinidiae ICMP 9853"
                     /mol_type="genomic DNA"
                     /strain="ICMP 9853"
                     /host="Actinidia"
                     /db_xref="taxon:1104678"
                     /country="Japan"
                     /collection_date="1984"
                     /pathovar="actinidiae"

Trying with the file from example 1 (Pseudomonas_syringae_pv._actinidiae_ICMP_9853.gbk), the output of the script is:

... **Example 2**
FEATURES             Location/Qualifiers
     source          1..267979
                     /organism="Pseudomonas syringae"
                     /mol_type="genomic DNA"
                     /strain="ICMP 3690"
                     /isolation_source="cherry"
                     /db_xref="taxon:317"
                     /country="New Zealand"
                     /collection_date="2010"

The files I used for this script are on GitHub, and the script is:

import sys
from Bio import SeqIO
from Bio import GenBank

input_file = open(sys.argv[1], "r")

for seq_record in SeqIO.parse(input_file, "genbank"):
    for seq_feature in seq_record.features:
        if seq_feature.type=="source":
            source = seq_feature.qualifiers['organism'][0].replace(' ','_')
            strain = seq_feature.qualifiers['strain'][0]
            country = seq_feature.qualifiers['country'][0]
            #print seq_feature.qualifiers
            host = seq_feature.qualifiers['host'][0]
            print host

I just included the option 'host' because the problem is the same for 'plasmid' and 'pathovar'.

Can anybody help me? Any suggestions using these same Python modules? I am looking for the most Pythonic way to solve this.

Thank you in advance.

python SeqIO • 1.4k views

updated 7.0 years ago by mobiusklein ▴ 180 • written 7.0 years ago by felipelira3 ▴ 40

score 3 · Accepted Answer · 2017-12-04

If you would like your script to run to completion even when that information is not present, you can wrap each access to seq_feature.qualifiers in a try-except block catching KeyError and IndexError:

import sys
from Bio import SeqIO

# The path to the GenBank file is the first command-line argument.
with open(sys.argv[1], "r") as input_file:
    for seq_record in SeqIO.parse(input_file, "genbank"):
        for seq_feature in seq_record.features:
            if seq_feature.type == "source":
                # Each qualifier maps to a list of strings; fall back to
                # None when the key is missing or the list is empty.
                try:
                    source = seq_feature.qualifiers['organism'][0].replace(' ', '_')
                except (KeyError, IndexError):
                    source = None
                try:
                    strain = seq_feature.qualifiers['strain'][0]
                except (KeyError, IndexError):
                    strain = None
                try:
                    country = seq_feature.qualifiers['country'][0]
                except (KeyError, IndexError):
                    country = None
                try:
                    host = seq_feature.qualifiers['host'][0]
                except (KeyError, IndexError):
                    host = None
                print(host)
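If the repeated try-except blocks feel heavy, one possible alternative is a small helper built on dict.get. This is only a sketch: it assumes seq_feature.qualifiers behaves like an ordinary dict mapping each key to a list of strings, as the printed output in the question suggests. The name first_qualifier is made up for this illustration, and the two dicts are copied from the question rather than parsed from the files.

```python
def first_qualifier(qualifiers, key, default=None):
    """Return the first value stored under key, or default if it is absent."""
    values = qualifiers.get(key)
    if not values:  # covers both a missing key and an empty list
        return default
    return values[0]


# Qualifier dicts copied from the question (plasmid record of example 1,
# and one record of example 2). Stand-ins for seq_feature.qualifiers.
with_plasmid = {
    'mol_type': ['genomic DNA'], 'db_xref': ['taxon:1104678'],
    'collection_date': ['1984'], 'country': ['Japan'],
    'pathovar': ['actinidiae'], 'strain': ['ICMP 9853'],
    'host': ['Actinidia'], 'plasmid': ['p9853_B'],
    'organism': ['Pseudomonas syringae pv. actinidiae ICMP 9853'],
}
without_host = {
    'mol_type': ['genomic DNA'], 'db_xref': ['taxon:317'],
    'collection_date': ['2010'], 'country': ['New Zealand'],
    'isolation_source': ['cherry'], 'strain': ['ICMP 3690'],
    'organism': ['Pseudomonas syringae'],
}

for qualifiers in (with_plasmid, without_host):
    row = [first_qualifier(qualifiers, key)
           for key in ('strain', 'host', 'pathovar', 'plasmid')]
    print(row)
# ['ICMP 9853', 'Actinidia', 'actinidiae', 'p9853_B']
# ['ICMP 3690', None, None, None]
```

Inside the loop over seq_record.features, the same call would presumably look like first_qualifier(seq_feature.qualifiers, 'plasmid').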

It's then up to you to make sure that whatever you do with these values can cope with them being None rather than strings.
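As a rough illustration of that last point, here is one way to write the values out as a tab-separated line while substituting a placeholder for anything missing. The function name format_row and the "NA" placeholder are assumptions chosen for this example, not something from the original script.

```python
def format_row(values, missing="NA"):
    """Join values with tabs, replacing None with a placeholder string."""
    return "\t".join(missing if value is None else str(value)
                     for value in values)


# strain, host, pathovar for a record that lacks host and pathovar
print(format_row(["ICMP 3690", None, None]))
```

Doing the substitution at output time keeps None available earlier in the script, so you can still test for a missing qualifier with "is None" before printing.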