Hi all ^_^
I'm working on a script in biopython that extracts SSR's from some plant species , blast their sequences against each other and group the results of similarity based on some coverage value .
here's the part of the code that does the filtering task and save the results to external files :
import sys
from Bio.SeqRecord import SeqRecord
from Bio.Blast import NCBIXML
right_database_names=["Right-Aegilops","Right-Brachypodium","Right-Hordeum"]
left_reverse_names=["RCLeft-Aegilops","RCLeft-Brachypodium","RCLeft-Hordeum"]
coverage=00.00
for indx,query in enumerate(left_reverse__names) :
for indx2,database in enumerate(right_database_names) :
if indx2>indx:
filtered_outname=("%s_%s"%(query,database))
outname=("%s.xml" %filtered_outname)
records=NCBIXML.parse(open(outname))
save_file1 = open("95_"+filtered_outname+".fasta", 'w')
save_file2 = open("85_"+filtered_outname+".fasta", 'w')
save_file3 = open("75_"+filtered_outname+".fasta", 'w')
for record in records :
for alignment in record.alignments:
for hsp in alignment.hsps:
if (int(hsp.align_length)<int(record.query_length)):
coverage=(int(hsp.align_length)/int(record.query_length))
if (coverage*100>=95.00):
save_file1.write('>%s\n' % (alignment.title))
save_file1.write('%s\n'%hsp.query)
elif ((coverage*100)>=85.00 and (coverage*100)<95.00):
save_file2.write('>%s\n' % (alignment.title))
save_file2.write('%s\n'%hsp.query)
elif ((coverage*100)>=75.00 and (coverage*100)<85.00):
save_file3.write('>%s\n' % (alignment.title))
save_file3.write('%s\n'%hsp.query)
else :
coverage=(int(record.query_length)/int(hsp.align_length))
if ((coverage*100)>=95.00 ):
save_file1.write('>%s\n' % (alignment.title))
save_file1.write('%s\n'%hsp.query)
elif ((coverage*100)>=85.00 and (coverage*100)<95.00):
save_file2.write('>%s\n' % (alignment.title))
save_file2.write('%s\n'%hsp.query)
elif ((coverage*100)>=75.00 and (coverage*100)<85.00):
save_file3.write('>%s\n' % (alignment.title))
save_file3.write('%s\n'%hsp.query)
save_file1.close()
save_file2.close()
save_file3.close()
else:
continue
The code above does produce some results , but I'm not sure if they are really what I'm looking for .
I have argued with my thesis supervisor about the coverage equation ... he told me that it's the query seq. length divided by the hit seq. length , while I thought it's the hit alignment seq. length divided by the query seq. length !!
I wonder if I managed to map the tags of the xml report to their variable names in biopython platform correctly?
Does the variable name hsp.align_length
in biopython refers to <Hsp_align_len>
tag in the xml report? If so, then what variable name refers to the <Hit_len>
in the xml report?
What is the difference between hsp.align_length
and alignment.length
?
I have read the biopython builtin dictionary for blast xml format , and there's nothing seems to refer to the Hit length !!
When I searched about this issue, I came across this site and it's been very confusing when I tried to apply variable names and attributes inside biopython 3.4
Also, I need to know if I wrote the equation that calculates the coverage correctly - in the first place.
Please ... correct me if I got anything wrong in this matter , and many thanks in advance
After our next release http://biopython.org/DIST/docs/api/Bio.SearchIO.BlastIO-module.html should look much prettier (colour code samples, proper tables, etc).