Mapping blast xml report elements to their variable names in biopython
1
0
Entering edit mode
10.0 years ago
Bara'a ▴ 270

Hi all ^_^

I'm working on a script in biopython that extracts SSR's from some plant species , blast their sequences against each other and group the results of similarity based on some coverage value .

here's the part of the code that does the filtering task and save the results to external files :

import sys
from Bio.SeqRecord import SeqRecord
from Bio.Blast import NCBIXML

right_database_names=["Right-Aegilops","Right-Brachypodium","Right-Hordeum"]
left_reverse_names=["RCLeft-Aegilops","RCLeft-Brachypodium","RCLeft-Hordeum"]

coverage=00.00
for indx,query in enumerate(left_reverse__names) :
    for indx2,database in enumerate(right_database_names) :
        if indx2>indx:
            filtered_outname=("%s_%s"%(query,database))
            outname=("%s.xml" %filtered_outname)
            records=NCBIXML.parse(open(outname))    
            save_file1 = open("95_"+filtered_outname+".fasta", 'w')
            save_file2 = open("85_"+filtered_outname+".fasta", 'w')
            save_file3 = open("75_"+filtered_outname+".fasta", 'w')
            for record in records :
                for alignment in record.alignments:
                    for hsp in alignment.hsps:
                        if (int(hsp.align_length)<int(record.query_length)):
                            coverage=(int(hsp.align_length)/int(record.query_length))
                            if (coverage*100>=95.00):
                                save_file1.write('>%s\n' % (alignment.title))
                                save_file1.write('%s\n'%hsp.query)
                            elif ((coverage*100)>=85.00 and (coverage*100)<95.00):
                                save_file2.write('>%s\n' % (alignment.title))
                                save_file2.write('%s\n'%hsp.query)
                            elif ((coverage*100)>=75.00 and (coverage*100)<85.00):
                                save_file3.write('>%s\n' % (alignment.title))
                                save_file3.write('%s\n'%hsp.query)
                        else :
                            coverage=(int(record.query_length)/int(hsp.align_length))
                            if ((coverage*100)>=95.00 ):
                                save_file1.write('>%s\n' % (alignment.title))
                                save_file1.write('%s\n'%hsp.query) 
                            elif ((coverage*100)>=85.00 and (coverage*100)<95.00):
                                save_file2.write('>%s\n' % (alignment.title))
                                save_file2.write('%s\n'%hsp.query)
                            elif ((coverage*100)>=75.00 and (coverage*100)<85.00):
                                save_file3.write('>%s\n' % (alignment.title))
                                save_file3.write('%s\n'%hsp.query)                                            
            save_file1.close()
            save_file2.close()
            save_file3.close()
    else:
        continue

The code above does produce some results , but I'm not sure if they are really what I'm looking for .

I have argued with my thesis supervisor about the coverage equation ... he told me that it's the query seq. length divided by the hit seq. length , while I thought it's the hit alignment seq. length divided by the query seq. length !!

I wonder if I managed to map the tags of the xml report to their variable names in biopython platform correctly?

Does the variable name hsp.align_length in biopython refers to <Hsp_align_len> tag in the xml report? If so, then what variable name refers to the <Hit_len> in the xml report?

What is the difference between hsp.align_length and alignment.length?

I have read the biopython builtin dictionary for blast xml format , and there's nothing seems to refer to the Hit length !!

When I searched about this issue, I came across this site and it's been very confusing when I tried to apply variable names and attributes inside biopython 3.4

Also, I need to know if I wrote the equation that calculates the coverage correctly - in the first place.

Please ... correct me if I got anything wrong in this matter , and many thanks in advance

Biopython xml Blast • 3.1k views
ADD COMMENT
1
Entering edit mode

After our next release http://biopython.org/DIST/docs/api/Bio.SearchIO.BlastIO-module.html should look much prettier (colour code samples, proper tables, etc).

ADD REPLY
2
Entering edit mode
10.0 years ago
Peter 6.0k

Yes, hsp.align_length comes from <Hsp_align_len> in the XML, while alignment.length comes from <Hit_len>.

If you want the coverage, you can ask BLAST to give you this in the tabular output (or comma separated output). Personally I find this more comprehensive than the XML output, and easier to parse. See also https://github.com/biopython/biopython/pull/431 which looks at how to calculate these numbers from the BLAST XML output.

Biopython now includes Bio.SearchIO as an experimental module which parses BLAST XML and other pairwise search/alignment file formats. In the long term this may replace Bio.Blast.NCBIXML so you may want to look at that but it has a less direct mapping to the BLAST output.

ADD COMMENT
0
Entering edit mode

Peter ... you are a life saver , I can't thank you enough ^_^

I think I should switch to the tabular format since it's more comprehensive as you advised .

Thank you very much

ADD REPLY
1
Entering edit mode

Like on StackExchange, on BioStars you can mark an answer as "accepted" which is linked to the reputation score of each user. For some people this is motivating.

Anyway, glad I could help.

ADD REPLY

Login before adding your answer.

Traffic: 1807 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6