Hi guys,
I'm running blastx on the Linux command line with -outfmt 6 because I want to annotate the results. The problem is that in the current release of blastx, queries without hits do not appear in the results table. Does anybody know how to get those sequences into a separate file? I have read the manual and it looks like there is no option for that. Does anybody have a Perl or Python script for this, or any ideas?
Thanks, everybody!
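import sys
from Bio import SeqIO

# IDs of all query sequences in the FASTA file (first argument)
with open(sys.argv[1]) as fa_file:
    queries = set(str(record.id) for record in SeqIO.parse(fa_file, 'fasta'))
# Query IDs (first column) found in the tabular BLAST output (second argument)
with open(sys.argv[2]) as blast_file:
    hits = set(line.split()[0] for line in blast_file if line.strip())
# Queries that never appear in the BLAST output had no hit
no_hits = queries - hits
print('\n'.join(no_hits))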
It might not work because blast may report shorter query sequence IDs than full headers. In this case, I would need to see a few headers and blast output lines so the first process substitution could be modified accordingly.
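For the original question (getting the no-hit sequences themselves into another file, not just their IDs), here is a minimal sketch along the same lines. It uses only the standard library. The function names are made up for this example, and it assumes that the query ID in column 1 of the tabular output equals the first whitespace-delimited word of the FASTA header; if your IDs differ, the matching would need adjusting.

```python
def read_hit_ids(blast_path):
    """Collect query IDs from column 1 of tab-separated BLAST output."""
    hits = set()
    with open(blast_path) as handle:
        for line in handle:
            # Skip blank lines and comment lines (if any)
            if line.strip() and not line.startswith('#'):
                hits.add(line.split('\t')[0].strip())
    return hits


def write_no_hit_records(fasta_path, hit_ids, out_path):
    """Copy FASTA records whose ID is not in hit_ids; return how many were written."""
    written = 0
    keep = False
    with open(fasta_path) as fasta, open(out_path, 'w') as out:
        for line in fasta:
            if line.startswith('>'):
                # Assumption: the ID is the first word of the header line
                fields = line[1:].split()
                seq_id = fields[0] if fields else ''
                keep = seq_id not in hit_ids
                if keep:
                    written += 1
            if keep:
                out.write(line)
    return written
```

With hypothetical file names, this would be called as write_no_hit_records('queries.fasta', read_hit_ids('blast.tsv'), 'no_hits.fasta').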
Usage: python python_script.py file1 file2
import sys

# Read the first list, one entry per line
file1 = open(sys.argv[1], 'r')
lista1 = []
for line in file1:
    line = line.rstrip('\n')
    lista1.append(line)
file1.close()

# Read the second list, one entry per line
file2 = open(sys.argv[2], 'r')
lista2 = []
for line in file2:
    line = line.rstrip('\n')
    lista2.append(line)
file2.close()

# Entries present in both files
same = set(lista1).intersection(set(lista2))
file_out = open('some_output_file.txt', 'w')
for line in same:
    # Put back the newline stripped above so each entry gets its own line
    file_out.write(line + '\n')
file_out.close()
Use comm with your input and output to find the missing queries.
Mmm... I'm not sure that works with the tabular format :-S
Extract the query names from both files, sort them, and compute the intersection with comm. And yes, it works.
A more complicated solution, but it works perfectly too :-) I posted a Python script in another comment to do that. Thanks, Pierre :-)
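As a rough illustration of what comm reports, here is a Python sketch that splits two lists of IDs into the same three groups: only in the first list, only in the second, and in both. The function name and the sample IDs are made up for this example, and unlike comm it works on sets, so duplicate lines collapse. The IDs found only in the query list are the queries without hits.

```python
def comm_like(first_ids, second_ids):
    """Return (only_first, only_second, both) as sorted lists."""
    first, second = set(first_ids), set(second_ids)
    return (sorted(first - second), sorted(second - first), sorted(first & second))


# Hypothetical IDs: all queries, and the query column of the BLAST table
queries = ['q3', 'q1', 'q2']
hits = ['q1', 'q3', 'q1']
only_queries, only_hits, both = comm_like(queries, hits)
print(only_queries)  # ['q2']
```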