Is There A Way To Skip Existing Keys In Seq.Io.To_Dict? Or Is There A Better Way Altogether?
1
3
Entering edit mode
13.3 years ago

Running Biopython 1.57 on Bio-Linux 6. I have a list of names of genes that are conserved across several bacterial genomes. In order to pull these out I'd like to make a dictionary that parses a defline that looks like this:

>lcl|NC_000913.2_cdsid_NP_414542.1 [gene=thrL] [protein=thr operon leader peptide] [protein_id=NP_414542.1] [location=190..255]

and uses the gene name as the key. So I wrote:

def get_gene(identifier):
    seqrline = identifier.description.split(' [')
    seqrline = [x.strip(']') for x in seqrline]
    gene_name = seqrline[1].lstrip('gene=')
    return gene_name

handle = open('NC_000913.faat', 'r')
record_dict = SeqIO.to_dict(SeqIO.parse(handle, 'fasta'), key_function=get_gene)

But there are duplicates, naturally:

>>> record_dict = SeqIO.to_dict(SeqIO.parse(handle, 'fasta'), key_function=get_gene)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/biopython-1.57-py2.6-linux-x86_64.egg/Bio/SeqIO/__init__.py", line 673, in to_dict
    raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key 'insB'

Is there any way to get around this?

EDIT: Ok, so there was an obvious way to get to the same place using the default keys generated by SeqIO.to_dict by means of a grep chain:

grep ">" NC_000913.faa |grep -f sico_names.txt | grep -oh 'lcl[^ ]*' > NC_000913.keys

The question about duplicate keys still stands, though.

biopython • 5.0k views
ADD COMMENT
0
Entering edit mode

Ok, so there was an obvious way to get to the same place using the default keys generated by SeqIO.to_dict grep ">" NC_000913.faa |grep -f sico_names.txt | grep -oh 'lcl[^ ]*' > NC_000913.keys The question about duplicate keys still stands, though.

ADD REPLY
7
Entering edit mode
13.3 years ago

SeqIO.to_dict is meant to handle some sets of standard cases, but here you should parse the file and build up a dictionary correctly handling duplicate genes. Assuming you want to collect all of the records for a gene name together as a list:

import collections

from Bio import SeqIO

record_dict = collections.defaultdict(list)
with open('test.fa', 'r') as in_handle:
    for rec in SeqIO.parse(in_handle, "fasta"):
        cur_id = get_gene(rec)
        record_dict[cur_id].append(rec)

for key, vals in record_dict.iteritems():
    print key, vals

which for a small test file like:

> test [gene=A] 
GATC
> test2 [gene=B] 
GATC
> test3 [gene=A] 
GATC

will generate:

A [SeqRecord(seq=Seq('GATC', SingleLetterAlphabet()), id='test', name='test', description=' test [gene=A]', dbxrefs=[]), 
   SeqRecord(seq=Seq('GATC', SingleLetterAlphabet()), id='test3', name='test3', description=' test3 [gene=A]', dbxrefs=[])]
B [SeqRecord(seq=Seq('GATC', SingleLetterAlphabet()), id='test2', name='test2', description=' test2 [gene=B]', dbxrefs=[])]
ADD COMMENT
0
Entering edit mode

How would the schematics of the code change if James had multiple files? For example, NC_000913.faa thru NC_000923.faa.

ADD REPLY

Login before adding your answer.

Traffic: 2745 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6