Question

Mapping blast xml report elements to their variable names in biopython

0

Entering edit mode

10.0 years ago

Bara'a ▴ 270

Hi all ^_^

I'm working on a script in biopython that extracts SSR's from some plant species , blast their sequences against each other and group the results of similarity based on some coverage value .

here's the part of the code that does the filtering task and save the results to external files :

import sys
from Bio.SeqRecord import SeqRecord
from Bio.Blast import NCBIXML

right_database_names=["Right-Aegilops","Right-Brachypodium","Right-Hordeum"]
left_reverse_names=["RCLeft-Aegilops","RCLeft-Brachypodium","RCLeft-Hordeum"]

coverage=00.00
for indx,query in enumerate(left_reverse__names) :
    for indx2,database in enumerate(right_database_names) :
        if indx2>indx:
            filtered_outname=("%s_%s"%(query,database))
            outname=("%s.xml" %filtered_outname)
            records=NCBIXML.parse(open(outname))    
            save_file1 = open("95_"+filtered_outname+".fasta", 'w')
            save_file2 = open("85_"+filtered_outname+".fasta", 'w')
            save_file3 = open("75_"+filtered_outname+".fasta", 'w')
            for record in records :
                for alignment in record.alignments:
                    for hsp in alignment.hsps:
                        if (int(hsp.align_length)<int(record.query_length)):
                            coverage=(int(hsp.align_length)/int(record.query_length))
                            if (coverage*100>=95.00):
                                save_file1.write('>%s\n' % (alignment.title))
                                save_file1.write('%s\n'%hsp.query)
                            elif ((coverage*100)>=85.00 and (coverage*100)<95.00):
                                save_file2.write('>%s\n' % (alignment.title))
                                save_file2.write('%s\n'%hsp.query)
                            elif ((coverage*100)>=75.00 and (coverage*100)<85.00):
                                save_file3.write('>%s\n' % (alignment.title))
                                save_file3.write('%s\n'%hsp.query)
                        else :
                            coverage=(int(record.query_length)/int(hsp.align_length))
                            if ((coverage*100)>=95.00 ):
                                save_file1.write('>%s\n' % (alignment.title))
                                save_file1.write('%s\n'%hsp.query) 
                            elif ((coverage*100)>=85.00 and (coverage*100)<95.00):
                                save_file2.write('>%s\n' % (alignment.title))
                                save_file2.write('%s\n'%hsp.query)
                            elif ((coverage*100)>=75.00 and (coverage*100)<85.00):
                                save_file3.write('>%s\n' % (alignment.title))
                                save_file3.write('%s\n'%hsp.query)                                            
            save_file1.close()
            save_file2.close()
            save_file3.close()
    else:
        continue

The code above does produce some results , but I'm not sure if they are really what I'm looking for .

I have argued with my thesis supervisor about the coverage equation ... he told me that it's the query seq. length divided by the hit seq. length , while I thought it's the hit alignment seq. length divided by the query seq. length !!

I wonder if I managed to map the tags of the xml report to their variable names in biopython platform correctly?

Does the variable name hsp.align_length in biopython refers to <Hsp_align_len> tag in the xml report? If so, then what variable name refers to the <Hit_len> in the xml report?

What is the difference between hsp.align_length and alignment.length?

I have read the biopython builtin dictionary for blast xml format , and there's nothing seems to refer to the Hit length !!

When I searched about this issue, I came across this site and it's been very confusing when I tried to apply variable names and attributes inside biopython 3.4

Also, I need to know if I wrote the equation that calculates the coverage correctly - in the first place.

Please ... correct me if I got anything wrong in this matter , and many thanks in advance

Biopython xml Blast • 3.1k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Bara'a ▴ 270

1

Entering edit mode

After our next release http://biopython.org/DIST/docs/api/Bio.SearchIO.BlastIO-module.html should look much prettier (colour code samples, proper tables, etc).

ADD REPLY • link 10.0 years ago by Peter 6.0k

Ram · Accepted Answer · 2014-11-17

2

Entering edit mode

10.0 years ago

Peter 6.0k

Yes, hsp.align_length comes from <Hsp_align_len> in the XML, while alignment.length comes from <Hit_len>.

If you want the coverage, you can ask BLAST to give you this in the tabular output (or comma separated output). Personally I find this more comprehensive than the XML output, and easier to parse. See also https://github.com/biopython/biopython/pull/431 which looks at how to calculate these numbers from the BLAST XML output.

Biopython now includes Bio.SearchIO as an experimental module which parses BLAST XML and other pairwise search/alignment file formats. In the long term this may replace Bio.Blast.NCBIXML so you may want to look at that but it has a less direct mapping to the BLAST output.

ADD COMMENT • link 10.0 years ago by Peter 6.0k

0

Entering edit mode

Peter ... you are a life saver , I can't thank you enough ^_^

I think I should switch to the tabular format since it's more comprehensive as you advised .

Thank you very much

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Bara'a ▴ 270

1

Entering edit mode

Like on StackExchange, on BioStars you can mark an answer as "accepted" which is linked to the reputation score of each user. For some people this is motivating.

Anyway, glad I could help.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Peter 6.0k