So I needed to retrieve the number of hits for each query sequence from tabular blast output (with comments).
I used the script given here but changed a little: A: write a Python program to determine the average number of hits per sequence in
fh = open('BLASTED.txt')
oh = open('BLASTED.txt.out', 'w')
for qid, hsps in groupby(fh, lambda l: l.split()[0]):
if qid.startswith('#'): continue
hits = len(set([l.split()[1] for l in hsps]))
hits_no += hits
queries_no += 1
oh.write('{0}\t{1}\n'.format(qid, hits))
oh.close()
fh.close()
So the relevant part of my script is:
blast = open('testing.txt')
output = open("hitcount.txt", 'w')
queries_no = 0
lowhitqueries_no = 0
for qid, hsps in groupby(blast, lambda l: l.split()[0]):
if qid.startswith('#'): continue
hits = len(set([l.split()[1] for l in hsps]))
queries_no += 1
#Used "print qid, hits" here to check what kind of hits I'm getting
if hits<2:
lowhitqueries_no +=1
output.write('{0}\t{1}\n'.format(qid, hits))
blast.close()
However, I noticed that the value for hits does not match what's actually in testing.txt. There are some query sequences with thousands of hits and print hits will only return a value of maybe 24.
How is this script meant to work and why is it failing?
EDIT: I found this method that I think is much simpler
import re
blast = open("testing.txt")
output = open("hitcount" + str(dataset) + ".txt", 'w')
lines = blast.readlines()
queries_no = 0
lowhitqueries_no = 0
for index, line in enumerate(lines):
if re.match("# Query:", line):
queries_no +=1
hits = int(re.search(r'\d+', lines[index+3]).group())
qid = line[9:].rstrip()
print qid, hits #Just to check it's working
if hits<2:
lowhitqueries_no +=1
output.write('{0}\t{1}\n'.format(qid, hits))
blast.close()
Look at how much less space it uses up
You're just looking for the number of times a query id shows up in your blast report? I'm not sure why you tagged Biopython, I don't see that you're importing anything from the module, unless you ran the blast via biopython.
This will give you the query ids and counts per id:
I think starting off writing your own program, instead of trying to adapt something written by someone else would be a better learning experience, unless this is homework and you have to use pythong/biopython.
I tagged biopython because I wanted to use biopython to do it, but I found a solution anyway using re.match() to find query id lines, and readlines() and index to find the lines which give matching hit counts
It's in the edit
I think I found that same commandline thing written slightly differently but it didn't work. This one does create a file but its format isn't as I'd want it. The biopython script lets me decide what goes in there and in what format because I have a filtering step too
Ok. Instead of changing the question title to 'resolved', provide your own answer with what you did, and why it worked with what you are looking for. Then you accept your answer.
My answer is in the edit. If anyone skips straight to the answers without reading the issue then that's their fault. Accepting my own answer sounds weird but I'll do it this once