[BioPython] Parse blast tabular output
1
5
Entering edit mode
7.5 years ago
Rox ★ 1.4k

Hi everyone,

I am trying to parse a blast result produced using outfmt 6 option.

I have made several tries, with iterators, without iterators... But each time it fails to parse my file.

Here some code that I try :

parser = argparse.ArgumentParser() parser.add_argument("blast_file", help="The path of the file containing blast result in xml format") args = parser.parse_args()

results = open(args.blast_file, "r")

blast_parser = NCBIStandalone.BlastParser() blast_records = blast_parser.parse(results)

for blast_record in blast_records:
    E_VALUE_THRESH = 0.0004
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < E_VALUE_THRESH:
                print('****Alignment****')
                print('sequence:', alignment.title)
                print('length:', alignment.length)
                print('e value:', hsp.expect)
                if len(hsp.query) > 75:
                    dots = '...'
                else:
                    dots = ''
                print(hsp.query[0:75] + dots)
                print(hsp.match[0:75] + dots)
                print(hsp.sbjct[0:75] + dots)

But then, it showed this error :

python parse_last_hit.py /media/loutre/SUZUKII/assembly/duplication_removal/2017/Blast/Contig_37_orf.txt
/usr/lib/python2.7/dist-packages/Bio/Blast/NCBIStandalone.py:57: BiopythonDeprecationWarning: This module has been deprecated. Consider Bio.SearchIO for parsing BLAST output instead.
  "parsing BLAST output instead.", BiopythonDeprecationWarning)
/usr/lib/python2.7/dist-packages/Bio/ParserSupport.py:29: BiopythonDeprecationWarning: Bio.ParserSupport is now deprecated will be removed in a future release of Biopython.
  "future release of Biopython.", BiopythonDeprecationWarning)
Traceback (most recent call last):
  File "parse_last_hit.py", line 14, in <module>
    blast_records = blast_parser.parse(results)
  File "/usr/lib/python2.7/dist-packages/Bio/Blast/NCBIStandalone.py", line 836, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/lib/python2.7/dist-packages/Bio/Blast/NCBIStandalone.py", line 118, in feed
    read_and_call_until(uhandle, consumer.noevent, contains='BLAST')
  File "/usr/lib/python2.7/dist-packages/Bio/ParserSupport.py", line 320, in read_and_call_until
    line = safe_readline(uhandle)
  File "/usr/lib/python2.7/dist-packages/Bio/ParserSupport.py", line 400, in safe_readline
    raise ValueError("Unexpected end of stream.")
ValueError: Unexpected end of stream.

By googling it, I found that it may be a problem of the blast format issue (Problems With Biopython When Running The Ncbistandalone.Py Program)

I really don't want to go through XML, I can't allow it because I have a lot of Blast with huge sequences to do. Producing xml takes too much time and too much storage.

Does anyone know a way using Biopython to parse through blast result in tabular format ? Thanks a lot for your answers !

python biopython blast • 9.9k views
ADD COMMENT
3
Entering edit mode

Have you tried using SearchIO with the blast tabular output? It's pretty straighforward. e.g:

blast_result = SearchIO.read(resultHandle, 'blast-tab')

print(blast_result[0][0])
start = blast_result[0][0].hit_start
end = blast_result[0][0].hit_end

Also, you can treat the outfmt 6 as just a plain delimited text file, so you don't actually need to use a specialised parser at all necessarily (e.g: https://github.com/MicroInfect/bioinfx/blob/master/blastfilterer.py )

ADD REPLY
0
Entering edit mode

SearchIO is a nice suggestion, though my blast result will contain a lot of raws, and the function read can work only with blast output with exactly one result.

But searchIO has also a parse function, which handle multi result blast file, but when I tried it didn't worked. I'm trying again see if you can understand the error.

ADD REPLY
0
Entering edit mode

Ah right I remember, parse from searchIO support only xml so that's why I'm not happy with it

ADD REPLY
0
Entering edit mode

About doing the parser myself without biopython, I already thought about that, But I was stucked at the part where I had to read powered numbers with python...

Like when you have something like this : 5e-43, how do you store this in python ? It's been a while since I didn't worked in python...

ADD REPLY
2
Entering edit mode

You can just use float("5e-43") to get your values converted.

ADD REPLY
0
Entering edit mode

Really ? Ow damn, I was stucked with this question for so long... Thanks a lot then I'm going to try ! it sounds dumb now

ADD REPLY
9
Entering edit mode
7.5 years ago

A long time ago I wrote a module for handling tabular BLAST output (-outfmt 6 / -m8). Maybe you will find it useful.

\gist 

SAMPLE USAGE:

Filename: test.outfmt6

cgl|CAGL0A00187g    ecy|Ecym_8168   31.11   90  49  3   597 676 95  181 1e-04   42.7
cgl|CAGL0A00187g    ecy|Ecym_8168   36.36   44  28  0   594 637 235 278 7e-04   40.0
cgl|CAGL0A00187g    ecy|Ecym_4273   27.38   84  58  1   19  102 4055    4135    0.64    31.2
cgl|CAGL0A00209g    ecy|Ecym_1435   69.91   751 188 4   13  725 11  761 0.0 1082
cgl|CAGL0A00209g    ecy|Ecym_5414   25.84   89  41  3   431 508 117 191 7.0 27.3
cgl|CAGL0A00231g    ecy|Ecym_1436   28.76   845 398 32  2   789 7   704 4e-56    203
cgl|CAGL0A00231g    ecy|Ecym_6400   34.18   79  48  3   590 666 418 494 0.034   35.0
cgl|CAGL0A00231g    ecy|Ecym_4243   39.02   41  25  0   554 594 222 262 2.1 29.3

Your code:

from blast import parse

fh = open('test.outfmt6')
for blast_record in parse(fh):
    print('query id: {}'.format(blast_record.qid))
    for hit in blast_record.hits:
        for hsp in hit:
            print('****Alignment****')
            print('sequence:', hsp.sid)
            print('length:', hsp.length)
            print('e value:', hsp.evalue)
fh.close()

Output:

query id: cgl|CAGL0A00187g
****Alignment****
('sequence:', u'ecy|Ecym_8168')
('length:', 90)
('e value:', 0.0001)
****Alignment****
('sequence:', u'ecy|Ecym_8168')
('length:', 44)
('e value:', 0.0007)
****Alignment****
('sequence:', u'ecy|Ecym_4273')
('length:', 84)
('e value:', 0.64)
query id: cgl|CAGL0A00209g
****Alignment****
('sequence:', u'ecy|Ecym_1435')
('length:', 751)
('e value:', 0.0)
****Alignment****
('sequence:', u'ecy|Ecym_5414')
('length:', 89)
('e value:', 7.0)
query id: cgl|CAGL0A00231g
****Alignment****
('sequence:', u'ecy|Ecym_1436')
('length:', 845)
('e value:', 4e-56)
****Alignment****
('sequence:', u'ecy|Ecym_6400')
('length:', 79)
('e value:', 0.034)
****Alignment****
('sequence:', u'ecy|Ecym_4243')
('length:', 41)
('e value:', 2.1)
ADD COMMENT
1
Entering edit mode

That is awesome !

I was doing the exact same steps you did before I realised that Biopython may already implemented it... But yours is way better, I think I may use it a lot ! thanks again

ADD REPLY
0
Entering edit mode

Very Helpful! Is there any way we can sort based on query coordinates (i.e hsp.qid, hspp.qstart and hsp.qend) using your this module?

ADD REPLY
0
Entering edit mode

Unfortunately you can't sort the blast records by query coordinate. However, you can sort it via Linux sort. For example, sort -k 1,1 -k 7,7n blast.txt will sort hsps by hsp.qstart.

ADD REPLY

Login before adding your answer.

Traffic: 2735 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6