Retrieve qualifiers plasmid and pathovar
7.0 years ago
felipelira3 ▴ 40

I am trying to extract some information from GenBank (.gbk) files. For the most common qualifiers this works fine. The problem arises when I try to extract the "plasmid" and "pathovar" qualifiers from records where they are not present. Both test files are at https://github.com/felipelira/files_to_test.

Printing the qualifiers of the source features in each file gives this: ...

The qualifiers from example 1 show that not all sequences in the file come from the chromosome, because there are two plasmids. The problem is that I cannot retrieve "plasmid" in the same way that I obtain "country", "organism", and so on.

Pseudomonas_syringae_pv._actinidiae_ICMP_9853.gbk
{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:1104678'], 'collection_date': ['1984'], 'country': ['Japan'], 'pathovar': ['actinidiae'], 'strain': ['ICMP 9853'], 'host': ['Actinidia'], 'organism': ['Pseudomonas syringae pv. actinidiae ICMP 9853']}
{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:1104678'], 'collection_date': ['1984'], 'country': ['Japan'], 'pathovar': ['actinidiae'], 'strain': ['ICMP 9853'], 'host': ['Actinidia'], 'plasmid': ['p9853_A'], 'organism': ['Pseudomonas syringae pv. actinidiae ICMP 9853']}
{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:1104678'], 'collection_date': ['1984'], 'country': ['Japan'], 'pathovar': ['actinidiae'], 'strain': ['ICMP 9853'], 'host': ['Actinidia'], 'plasmid': ['p9853_B'], 'organism': ['Pseudomonas syringae pv. actinidiae ICMP 9853']}

Qualifiers from example 2:

{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:317'], 'collection_date': ['2010'], 'country': ['New Zealand'], 'isolation_source': ['cherry'], 'strain': ['ICMP 3690'], 'organism': ['Pseudomonas syringae']}
{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:317'], 'collection_date': ['2010'], 'country': ['New Zealand'], 'isolation_source': ['cherry'], 'strain': ['ICMP 3690'], 'organism': ['Pseudomonas syringae']}

As you can see, file 2 does not have the same set of qualifiers, so the same script, which reads the qualifiers through SeqIO, fails on it.

...**Example 1**
FEATURES             Location/Qualifiers
     source          1..6439609
                     /organism="Pseudomonas syringae pv. actinidiae ICMP 9853"
                     /mol_type="genomic DNA"
                     /strain="ICMP 9853"
                     /host="Actinidia"
                     /db_xref="taxon:1104678"
                     /country="Japan"
                     /collection_date="1984"
                     /pathovar="actinidiae"

Trying with the file from example 1 (Pseudomonas_syringae_pv._actinidiae_ICMP_9853.gbk), the output of the script is:

... **Example 2**
FEATURES             Location/Qualifiers
     source          1..267979
                     /organism="Pseudomonas syringae"
                     /mol_type="genomic DNA"
                     /strain="ICMP 3690"
                     /isolation_source="cherry"
                     /db_xref="taxon:317"
                     /country="New Zealand"
                     /collection_date="2010"

The files that I used for this script are on GitHub, and the script is:

import sys
from Bio import SeqIO
from Bio import GenBank

input_file = open(sys.argv[1], "r")

for seq_record in SeqIO.parse(input_file, "genbank"):
    for seq_feature in seq_record.features:
        if seq_feature.type=="source":
            source = seq_feature.qualifiers['organism'][0].replace(' ','_')
            strain = seq_feature.qualifiers['strain'][0]
            country = seq_feature.qualifiers['country'][0]
            #print seq_feature.qualifiers
            host = seq_feature.qualifiers['host'][0]
            print(host)

I only included 'host' here because the problem is the same for 'plasmid' and 'pathovar'.

Can anybody help me? Any suggestions using these same Python modules? I am looking for the most Pythonic way to solve this.

Thank you in advance.

7.0 years ago
mobiusklein ▴ 180

If you would like your script to run to completion even when that information is not present, you can wrap each access to seq_feature.qualifiers in a try-except block catching KeyError and IndexError:

import sys
from Bio import SeqIO
from Bio import GenBank

input_file = open(sys.argv[1], "r")

for seq_record in SeqIO.parse(input_file, "genbank"):
    for seq_feature in seq_record.features:
        if seq_feature.type=="source":
            try:
                source = seq_feature.qualifiers['organism'][0].replace(' ','_')
            except (KeyError, IndexError):
                source = None
            try:
                strain = seq_feature.qualifiers['strain'][0]
            except (KeyError, IndexError):
                strain = None
            try:
                country = seq_feature.qualifiers['country'][0]
            except (KeyError, IndexError):
                country = None
            #print seq_feature.qualifiers
            try:
                host = seq_feature.qualifiers['host'][0]
            except (KeyError, IndexError):
                host = None
            print(host)

It's then up to you to make sure that whatever you do with these values is aware they may not be strings.
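
If the repeated try-except blocks feel heavy, one possible alternative is a small lookup helper. The sketch below assumes the qualifiers behave like an ordinary dictionary of lists, as the printed output in the question suggests. The name `first_qualifier` and the "NA" placeholder are made up for illustration and are not part of Biopython; the two dictionaries are copied from the question so the example runs without the .gbk files.

```python
def first_qualifier(qualifiers, key, default=None):
    """Return the first value under `key`, or `default` if the key is absent or its list is empty."""
    values = qualifiers.get(key, [])
    return values[0] if values else default


# Qualifier dictionaries copied from the output printed in the question.
plasmid_record = {
    'mol_type': ['genomic DNA'], 'db_xref': ['taxon:1104678'],
    'collection_date': ['1984'], 'country': ['Japan'],
    'pathovar': ['actinidiae'], 'strain': ['ICMP 9853'],
    'host': ['Actinidia'], 'plasmid': ['p9853_A'],
    'organism': ['Pseudomonas syringae pv. actinidiae ICMP 9853'],
}
sparse_record = {
    'mol_type': ['genomic DNA'], 'db_xref': ['taxon:317'],
    'collection_date': ['2010'], 'country': ['New Zealand'],
    'isolation_source': ['cherry'], 'strain': ['ICMP 3690'],
    'organism': ['Pseudomonas syringae'],
}

# "NA" is an arbitrary placeholder chosen for this sketch.
wanted = ("organism", "strain", "country", "host", "pathovar", "plasmid")
for qualifiers in (plasmid_record, sparse_record):
    row = [first_qualifier(qualifiers, key, default="NA") for key in wanted]
    print("\t".join(row))
```

In the original loop, the same helper could be called as `first_qualifier(seq_feature.qualifiers, 'plasmid')`, which would give None for records that lack the qualifier instead of raising KeyError.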
