I am trying to extract some pieces of information from a .gbk file. For the most common fields, this works. The problem arises when I try to extract "plasmid" and "pathovar" from records where they are not present. Both files used for testing are at https://github.com/felipelira/files_to_test.
A simple print of the 'features.qualifiers' from each file gives this: ...
The qualifiers from example 1 show that not all sequences in the file come from the chromosome, because there are two plasmids. The problem is that I cannot retrieve "plasmid" in the same way that I obtain "country", "organism"...
Pseudomonas_syringae_pv._actinidiae_ICMP_9853.gbk
{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:1104678'], 'collection_date': ['1984'], 'country': ['Japan'], 'pathovar': ['actinidiae'], 'strain': ['ICMP 9853'], 'host': ['Actinidia'], 'organism': ['Pseudomonas syringae pv. actinidiae ICMP 9853']}
{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:1104678'], 'collection_date': ['1984'], 'country': ['Japan'], 'pathovar': ['actinidiae'], 'strain': ['ICMP 9853'], 'host': ['Actinidia'], 'plasmid': ['p9853_A'], 'organism': ['Pseudomonas syringae pv. actinidiae ICMP 9853']}
{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:1104678'], 'collection_date': ['1984'], 'country': ['Japan'], 'pathovar': ['actinidiae'], 'strain': ['ICMP 9853'], 'host': ['Actinidia'], 'plasmid': ['p9853_B'], 'organism': ['Pseudomonas syringae pv. actinidiae ICMP 9853']}
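The difference between the three records above comes down to whether the 'plasmid' key exists. A minimal sketch of testing for it, using plain dictionaries trimmed from the output above instead of real Biopython features. The helper name `replicon_label` is hypothetical, and treating a record without a 'plasmid' qualifier as chromosomal is an assumption that may not hold for every file:

```python
# Qualifier dictionaries trimmed from the printed output above; in a real
# script they would come from seq_feature.qualifiers.
records = [
    {'strain': ['ICMP 9853'], 'pathovar': ['actinidiae']},
    {'strain': ['ICMP 9853'], 'pathovar': ['actinidiae'], 'plasmid': ['p9853_A']},
    {'strain': ['ICMP 9853'], 'pathovar': ['actinidiae'], 'plasmid': ['p9853_B']},
]


def replicon_label(qualifiers):
    """Hypothetical helper: plasmid name if present, else 'chromosome'.

    Assumption: a source feature without a 'plasmid' qualifier belongs
    to the chromosome.
    """
    if 'plasmid' in qualifiers:
        return qualifiers['plasmid'][0]
    return 'chromosome'


labels = [replicon_label(q) for q in records]
print(labels)
```

The membership test avoids indexing a key that may be missing, which is the step that differs between the chromosome record and the two plasmid records.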
Qualifiers from example 2:
{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:317'], 'collection_date': ['2010'], 'country': ['New Zealand'], 'isolation_source': ['cherry'], 'strain': ['ICMP 3690'], 'organism': ['Pseudomonas syringae']}
{'mol_type': ['genomic DNA'], 'db_xref': ['taxon:317'], 'collection_date': ['2010'], 'country': ['New Zealand'], 'isolation_source': ['cherry'], 'strain': ['ICMP 3690'], 'organism': ['Pseudomonas syringae']}
As you can see, file 2 does not have the same set of qualifiers, and the same script, which reads the qualifiers through SeqIO, does not work on it.
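Since the qualifiers behave like a dictionary mapping each name to a list of strings, one possible way to cope with missing entries is to look keys up with a fallback value. The sketch below is not tied to Biopython: it uses plain dictionaries trimmed from the output above, and the helper name `first_value` and the 'NA' placeholder are hypothetical choices:

```python
# Trimmed copies of the qualifiers printed above for each example file.
example_1 = {
    'country': ['Japan'], 'pathovar': ['actinidiae'], 'strain': ['ICMP 9853'],
    'host': ['Actinidia'],
    'organism': ['Pseudomonas syringae pv. actinidiae ICMP 9853'],
}
example_2 = {
    'country': ['New Zealand'], 'isolation_source': ['cherry'],
    'strain': ['ICMP 3690'], 'organism': ['Pseudomonas syringae'],
}


def first_value(qualifiers, key, default='NA'):
    """Hypothetical helper: first value stored under key, or default."""
    values = qualifiers.get(key)  # None when the qualifier is absent
    return values[0] if values else default


fields = ('organism', 'strain', 'country', 'host', 'pathovar', 'plasmid')
for qualifiers in (example_1, example_2):
    row = [first_value(qualifiers, field) for field in fields]
    print('\t'.join(row))
```

With this lookup, a record lacking 'host', 'pathovar' or 'plasmid' yields the placeholder instead of stopping the loop, so every record produces a row with the same number of columns.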
...**Example 1**

    FEATURES             Location/Qualifiers
         source          1..6439609
                         /organism="Pseudomonas syringae pv. actinidiae ICMP 9853"
                         /mol_type="genomic DNA"
                         /strain="ICMP 9853"
                         /host="Actinidia"
                         /db_xref="taxon:1104678"
                         /country="Japan"
                         /collection_date="1984"
                         /pathovar="actinidiae"

With the file from example 1 (Pseudomonas_syringae_pv._actinidiae_ICMP_9853.gbk), the output of the script is:
... **Example 2**

    FEATURES             Location/Qualifiers
         source          1..267979
                         /organism="Pseudomonas syringae"
                         /mol_type="genomic DNA"
                         /strain="ICMP 3690"
                         /isolation_source="cherry"
                         /db_xref="taxon:317"
                         /country="New Zealand"
                         /collection_date="2010"

The files that I used for this script are on GitHub, and the script is:

    import sys
    from Bio import SeqIO

    # Parse every record in the GenBank file given on the command line.
    with open(sys.argv[1], "r") as input_file:
        for seq_record in SeqIO.parse(input_file, "genbank"):
            for seq_feature in seq_record.features:
                if seq_feature.type == "source":
                    source = seq_feature.qualifiers['organism'][0].replace(' ', '_')
                    strain = seq_feature.qualifiers['strain'][0]
                    country = seq_feature.qualifiers['country'][0]
                    # print(seq_feature.qualifiers)
                    # Raises KeyError when the record has no 'host' qualifier.
                    host = seq_feature.qualifiers['host'][0]
                    print(host)

I only included 'host' here because the problem is the same for 'plasmid' and 'pathovar'.
Can anybody help me? Any suggestions using these same Python modules? I am looking for the most Pythonic way to solve this.
Thank you in advance.