Eliminating duplicates: skipping entries with the same key and value
7.0 years ago
felipelira3 ▴ 40

Files to test can be downloaded from https://github.com/felipelira/files_to_test.git

I want to extract information from several GenBank files in one folder and store it in a dictionary, which I will later use to build a table.

#!/usr/bin/env python

import os
import sys
from Bio import SeqIO
from Bio import GenBank

dict1 = {}

input_file = open(sys.argv[1], "r")

for seq_record in SeqIO.parse(input_file, "genbank"):
    for seq_feature in seq_record.features:
        if seq_feature.type=="source":
            try:
                source = seq_feature.qualifiers['organism'][0]
            except (KeyError, IndexError):
                source = 'n/a'
            try: 
                strain = seq_feature.qualifiers['strain'][0]
            except (KeyError, IndexError):
                strain = 'n/a'
            try:
                country = seq_feature.qualifiers['country'][0]
            except (KeyError, IndexError):
                country = 'n/a'
            try:
                host = seq_feature.qualifiers['host'][0]
            except (KeyError, IndexError):
                host = 'n/a'
            try:
                plasmid = seq_feature.qualifiers['plasmid'][0]
            except (KeyError, IndexError):
                plasmid = 'n/a'
            try:
                pathovar = seq_feature.qualifiers['pathovar'][0]
            except (KeyError, IndexError):
                pathovar = 'n/a'


            # Tuple of the values that I need for the table.
            value = (strain, pathovar, host, plasmid)

            # Here is where I want to feed the dictionary, skipping the entry
            # if the same key and value are already present.
            if source not in dict1 and value not in dict1.values():
                dict1[source] = value
            elif source in dict1 and value != dict1[source]:
                # if source in dict1 and value not in dict1.values():
                dict1[source] = value

For the file Pseudomonas_syringae_pv._actinidiae_ICMP_9853.gbk, which contains 3 sequences, I get this:

{'Pseudomonas syringae pv. actinidiae ICMP 9853': ('ICMP 9853', 'actinidiae', 'Actinidia', 'n/a')}
{'Pseudomonas syringae pv. actinidiae ICMP 9853': ('ICMP 9853', 'actinidiae', 'Actinidia', 'p9853_A')}
{'Pseudomonas syringae pv. actinidiae ICMP 9853': ('ICMP 9853', 'actinidiae', 'Actinidia', 'p9853_B')}

For the other file (Pseudomonas_syringae_str.ICMP_3690_scaffold1.gbk), which is a scaffold with several sequences, I get the same entry repeated:

{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}

The expected result is a single key and value for the second file, and three (or more) entries for genomes that include additional sequences such as plasmids.
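
One possible way to get that result, as a rough sketch only: a dictionary keyed by organism can hold just one value per key, so the plasmid entries overwrite each other. Keeping a set of already-seen (organism, strain, pathovar, host, plasmid) tuples avoids that. The names first_qualifier, collect_rows, NA and FIELDS are hypothetical, and the plain dictionaries below are stand-ins for the qualifiers of the "source" features; they only mimic the two files described above.

```python
NA = "n/a"
FIELDS = ("strain", "pathovar", "host", "plasmid")


def first_qualifier(qualifiers, key, default=NA):
    """Return the first value stored under key, or default if missing or empty."""
    values = qualifiers.get(key) or [default]
    return values[0]


def collect_rows(source_qualifiers):
    """Return unique (organism, strain, pathovar, host, plasmid) tuples,
    in the order they were first seen."""
    seen = set()
    rows = []
    for qualifiers in source_qualifiers:
        row = (first_qualifier(qualifiers, "organism"),) + tuple(
            first_qualifier(qualifiers, name) for name in FIELDS
        )
        if row not in seen:  # skip exact duplicates only
            seen.add(row)
            rows.append(row)
    return rows


# Stand-in data (assumption): one dict per "source" feature.
base = {
    "organism": ["Pseudomonas syringae pv. actinidiae ICMP 9853"],
    "strain": ["ICMP 9853"],
    "pathovar": ["actinidiae"],
    "host": ["Actinidia"],
}
genome = [base, dict(base, plasmid=["p9853_A"]), dict(base, plasmid=["p9853_B"])]
scaffold = [
    {"organism": ["Pseudomonas syringae"], "strain": ["ICMP 3690"]}
    for _ in range(7)
]

rows = collect_rows(genome + scaffold)
for row in rows:
    print("\t".join(row))
```

With real files, the same function could presumably be fed seq_feature.qualifiers for every feature whose type is "source", since those qualifiers behave like a dictionary of lists.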
