Hi
I'm trying to parse a blast tabular format file (-outfmt 6
) to access it's individual elements for further manipulation .
I need to retrieve the qseqid, qstart, qend, qlen, qseq, sseqid, sstart, send, slen, sseq
and pident
for each hit and here's the script :
blast_qresult=SearchIO.parse('Aegilops_Brachypodium.txt','blast-tab')
for i,blast_hit in enumerate(blast_qresult) :
for m,blast_hsp in enumerate(blast_hit):
for n in range(len(blast_hsp)):
print("Query ID: %s"%blast_hit.id)
print("Query start: %s"%blast_hsp[n].query_start)
print("Query end: %s"%blast_hsp[n].query_end)
print("Query seq: %s"%blast_hsp[n].query)
print("Hit ID: %s"%blast_hit[m].id)
print("Hit start: %s"%blast_hsp[n].hit_start)
print("Hit end: %s"%blast_hsp[n].hit_end)
print("Hit seq: %s"%blast_hsp[n].hit)
print("Identity: %s"%blast_hsp[n].ident_pct)
Till here it does produce some results , and halts at certain hit throwing this error:
builtins.ValueError: Hit 'BRADI2G54940.1' already present in this QueryResult.
And when I try to access the other individual values like:
print(blast_hsp[n].seq_len)
print(blast_hit[m].seq_len)
They give these errors respectively :
builtins.AttributeError: 'HSP' object has no attribute 'seq_len'
builtins.AttributeError: 'Hit' object has no attribute 'seq_len'
My questions are :
- How can I access the
qlen
andslen
attributes? Why does
print("Query seq: %s"%blast_hsp[n].query) print("Hit seq: %s"%blast_hsp[n].hit)
give this output
Query seq: None Hit seq: None
How can I retrieve the sequences for both query and hit?
Both
print(blast_hit[m].id) print(blast_hit[n].id)
give the sseqid, which one of them is more accurate?
How can I overcome the error that complains about redundant Hit ID's in the tabular report?
- I noticed that the qstart and sstart returns values decremented by one (values -1) indices ... why is that? Is it safe to use them for manipulation as is or shall I increment them by 1?
I would be grateful for any help ... thanks.
What version of Biopython do you have? Can you share the data file causing the problem? e.g. using http://gist.github.com
I'm using Biopython 1.64 .
Here's a portion of the file that's originally contains 13072 hits:
https://gist.github.com/Baraa
Edit : modified the portion of the file where the hit causing the error locates.
That sample does not include the problematic identifier
BRADI2G54940.1
so probably won't help :(Speaking in general ... how can I overcome this problem?
Edit: updated the file that contains the hit in question.
I can now see
BRADI2G54940.1
in the file once. Can you run this at the command line to see where else it appears in the file?:grep "BRADI2G54940.1" your_blast_file
Perhaps also try with some neighbouring lines for context:
grep -C 2 "BRADI2G54940.1" your_blast_file
I didn't have to run that command ... the error was gone - suddenly - for this particular file
Brachyoidium_Japonica.txt
, but not for theHoredum_Japonica.txt
!Previously , the interpreter raised the error complaining about the redundant hit ID BRADI2G54940.1 , but now blast produces the result containing one hit with BRADI2G54940.1 ID !!
BTW ... I haven't changed anything in my script lately !!
I'm really confused now ... what a strange blast behavior .
Would you please explain to me what that command is used for?