Hi Biostar,
I have two FASTA files that I have blasted against one another, and I am trying to make a list (dictionary) of the top hits in a simple Query:Hit format
using Biopython. I am running into an error, however, with the strings' format. Here is the script:
from Bio.Blast import NCBIXML

test_dictionary = {}
blast_records = NCBIXML.parse(open(outfile))
for blast_record in blast_records:
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            test_dictionary.update({blast_record.query: alignment.title})
And here is an example dictionary entry, where u'' surrounds the value:
u'HA9WEQA08JTIW5': u'gnl|BL_ORD_ID|100 PhosphataseA'
However, if I use the print command, the values appear correct:
print alignment.title
gnl|BL_ORD_ID|100 PhosphataseA
I am sure this is a simple problem that results from my lack of understanding of precisely how Biopython stores its information, but any suggestions would be appreciated.
Thanks, zach cp
Edit: as per DK's answer, I ended up using this formulation, where I split the output and keep the gene name:
test_dictionary[str(blast_record.query)] = str(alignment.title).split()[1]
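The strange u prefix marks a Unicode string in Python 2. It is part of how the string is displayed inside a container such as a dictionary (its repr), not part of the string itself, which is why print shows the value without it.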
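To make the difference between the two displays concrete, here is a minimal sketch using the values from the question. It is written for Python 3, where the u prefix no longer appears, but the same distinction holds: showing a dictionary uses repr() on each string, while print uses str(). The variable names dict_display and printed_value are only illustrative.

```python
# Values copied from the question; in Python 2 these would be unicode objects.
query = "HA9WEQA08JTIW5"
title = "gnl|BL_ORD_ID|100 PhosphataseA"
test_dictionary = {query: title}

# Displaying the whole dictionary uses repr() on each string, so quotes
# (and, in Python 2, the u prefix) appear around the text.
dict_display = repr(test_dictionary)

# Printing a single string uses str(), which shows only its characters.
printed_value = str(test_dictionary[query])

print(dict_display)
print(printed_value)
```

In both Python versions the stored string is the same; only its displayed form differs, so no conversion is needed just to get rid of the prefix.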
You might not want to split it like that. If the gene name is multiple words, you'll only get the first word. I've edited my post to get just the gene name.
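As a rough sketch of that point, assuming the title always has the form "identifier description" separated by whitespace, splitting only once keeps a multi-word gene name intact. The helper name gene_name_from_title and the second, multi-word title are hypothetical and used only for illustration.

```python
def gene_name_from_title(title):
    """Return everything after the first whitespace-separated token.

    Assumes the title looks like '<identifier> <description>'; returns an
    empty string when there is no description at all.
    """
    parts = title.split(None, 1)  # split at most once, on any whitespace
    return parts[1] if len(parts) > 1 else ""


# Title taken from the question.
single_word = gene_name_from_title("gnl|BL_ORD_ID|100 PhosphataseA")

# Hypothetical multi-word title, to show what split()[1] would lose.
multi_word_title = "gnl|BL_ORD_ID|101 Phosphatase A subunit"
multi_word = gene_name_from_title(multi_word_title)
first_word_only = multi_word_title.split()[1]

print(single_word)
print(multi_word)
print(first_word_only)
```

With a plain split()[1], the hypothetical second title would be reduced to its first word, whereas splitting once keeps the full description.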