7.0 years ago
felipelira3
Files to test can be downloaded from https://github.com/felipelira/files_to_test.git
I want to retrieve information from several GenBank files in one folder and store it in a dictionary, so that I can build a table from it afterwards.
#!/usr/bin/env python
import os
import sys
from Bio import SeqIO
from Bio import GenBank

dict1 = {}
input_file = open(sys.argv[1], "r")
for seq_record in SeqIO.parse(input_file, "genbank"):
    for seq_feature in seq_record.features:
        if seq_feature.type == "source":
            # Each qualifier is a list of strings; fall back to 'n/a' when it is missing
            try:
                source = seq_feature.qualifiers['organism'][0]
            except (KeyError, IndexError):
                source = 'n/a'
            try:
                strain = seq_feature.qualifiers['strain'][0]
            except (KeyError, IndexError):
                strain = 'n/a'
            try:
                country = seq_feature.qualifiers['country'][0]
            except (KeyError, IndexError):
                country = 'n/a'
            try:
                host = seq_feature.qualifiers['host'][0]
            except (KeyError, IndexError):
                host = 'n/a'
            try:
                plasmid = seq_feature.qualifiers['plasmid'][0]
            except (KeyError, IndexError):
                plasmid = 'n/a'
            try:
                pathovar = seq_feature.qualifiers['pathovar'][0]
            except (KeyError, IndexError):
                pathovar = 'n/a'
            # Here I have the tuple of values that I need for the table
            value = strain, pathovar, host, plasmid
            # Here is where I want to fill the dictionary, skipping the entry if the key and value are already present.
            if source not in dict1.keys() and value not in dict1.values():
                dict1[source] = value
            else:
                if source in dict1.keys() and value != dict1[source]:
                #if source in dict1.keys() and value not in dict1.values():
                    dict1[source] = value
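As an aside, the repeated try/except lookups could probably be condensed into one small helper. This is only a sketch: `first_qualifier` is a hypothetical name, and the `example` dictionary is a hand-written stand-in for `seq_feature.qualifiers`, assuming it behaves like a plain dictionary that maps each qualifier name to a list of strings.

```python
def first_qualifier(qualifiers, key, default="n/a"):
    """Return the first value stored under key, or default if it is missing or empty."""
    values = qualifiers.get(key) or [default]
    return values[0]


# Hand-written stand-in for seq_feature.qualifiers (assumption: dict of lists)
example = {"organism": ["Pseudomonas syringae"], "strain": ["ICMP 3690"]}

# Same field order as the tuple used for the table
fields = ("strain", "pathovar", "host", "plasmid")
value = tuple(first_qualifier(example, field) for field in fields)
print(value)
```

With this, adding another qualifier to the table would only mean adding its name to `fields`.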
Running my script on the file Pseudomonas_syringae_pv._actinidiae_ICMP_9853.gbk, which contains 3 sequences, I get this:
{'Pseudomonas syringae pv. actinidiae ICMP 9853': ('ICMP 9853', 'actinidiae', 'Actinidia', 'n/a')}
{'Pseudomonas syringae pv. actinidiae ICMP 9853': ('ICMP 9853', 'actinidiae', 'Actinidia', 'p9853_A')}
{'Pseudomonas syringae pv. actinidiae ICMP 9853': ('ICMP 9853', 'actinidiae', 'Actinidia', 'p9853_B')}
For the other file (Pseudomonas_syringae_str.ICMP_3690_scaffold1.gbk), because it is a scaffold, I get this:
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
The expected result is a single key and value for the second file, and three (or more) values for genomes that contain additional sequences such as plasmids.
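To make the expected result concrete, here is a minimal sketch of the behaviour I am after. It assumes the (source, value) pairs have already been extracted from the records, and it keeps a list of distinct value tuples per organism instead of a single tuple. `collect_unique` is a hypothetical name, and the pairs are copied from the outputs above.

```python
from collections import defaultdict


def collect_unique(pairs):
    """Group value tuples by source, keeping each distinct tuple once, in order of appearance."""
    table = defaultdict(list)
    for source, value in pairs:
        if value not in table[source]:
            table[source].append(value)
    return dict(table)


# (source, value) pairs taken from the outputs shown above
pairs = [
    ("Pseudomonas syringae pv. actinidiae ICMP 9853", ("ICMP 9853", "actinidiae", "Actinidia", "n/a")),
    ("Pseudomonas syringae pv. actinidiae ICMP 9853", ("ICMP 9853", "actinidiae", "Actinidia", "p9853_A")),
    ("Pseudomonas syringae pv. actinidiae ICMP 9853", ("ICMP 9853", "actinidiae", "Actinidia", "p9853_B")),
] + [("Pseudomonas syringae", ("ICMP 3690", "n/a", "n/a", "n/a"))] * 7

table = collect_unique(pairs)
for source, values in table.items():
    print(source, len(values))
```

The scaffold file would end up with one entry, while the genome with two plasmids would keep three. A plain dictionary with one tuple per key cannot hold all three, because each new value for the same key replaces the previous one.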