Running Biopython 1.57 on Bio-Linux 6. I have a list of names of genes that are conserved across several bacterial genomes. In order to pull these out I'd like to make a dictionary that parses a defline that looks like this:
>lcl|NC_000913.2_cdsid_NP_414542.1 [gene=thrL] [protein=thr operon leader peptide] [protein_id=NP_414542.1] [location=190..255]
and uses the gene name as the key. So I wrote:
def get_gene(identifier):
seqrline = identifier.description.split(' [')
seqrline = [x.strip(']') for x in seqrline]
gene_name = seqrline[1].lstrip('gene=')
return gene_name
handle = open('NC_000913.faat', 'r')
record_dict = SeqIO.to_dict(SeqIO.parse(handle, 'fasta'), key_function=get_gene)
But there are duplicates, naturally:
>>> record_dict = SeqIO.to_dict(SeqIO.parse(handle, 'fasta'), key_function=get_gene)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.6/dist-packages/biopython-1.57-py2.6-linux-x86_64.egg/Bio/SeqIO/__init__.py", line 673, in to_dict
raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key 'insB'
Is there any way to get around this?
EDIT: Ok, so there was an obvious way to get to the same place using the default keys generated by SeqIO.to_dict
by means of a grep chain:
grep ">" NC_000913.faa |grep -f sico_names.txt | grep -oh 'lcl[^ ]*' > NC_000913.keys
The question about duplicate keys still stands, though.
Ok, so there was an obvious way to get to the same place using the default keys generated by
SeqIO.to_dict
grep ">" NC_000913.faa |grep -f sico_names.txt | grep -oh 'lcl[^ ]*' > NC_000913.keys The question about duplicate keys still stands, though.