I have a BLAST tabular output with millions of hits. The query is my sequence and the subject is a protein hit. I am interested in finding the subjects corresponding to the same query that do not overlap. If I know the subject start and end sites (S1, E1 for one hit and S2, E2 for another), this becomes possible to do: if S1 < E2 < S2 and E1 < S2 < E2 OR S2 - E1 > 0. Basically, since there are many hits and the number of subjects varies, I may understand the algorithm, but I find it difficult to implement in code. For example, my input file:
query subject start end
cont20 EMT34567 2 115
cont20 EMT28057 238 345
cont31 EMT45002 112 980
cont31 EMT45002 333 567
Desired output (I want the program to print only the query and subject names that do not overlap)
cont20 EMT28057
cont20 EMT34567
I have started the script using regex, but I am not sure how to continue or whether this is the right way:
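import re

output = open('result.txt', 'w')
f = open('file.txt', 'r')
lines = f.readlines()
for line in lines:
    # split each tab-delimited BLAST line into its columns
    new_list = re.split(r'\t+', line.strip())
    query = new_list[0]
    subject = new_list[1]
    # subject start and end (columns 9 and 10 of the default BLAST tabular layout)
    s_start = new_list[8]
    s_end = new_list[9]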
So what you want is: for every query (cont...), get the non-overlapping subjects (EMT...)?
Yes, exactly.
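One possible way to do this, as a rough sketch rather than a definitive solution: group the hits by query, sort each group by start position, and keep a hit only if it starts after every earlier hit has ended and ends before the next hit starts. Sorting avoids comparing every pair of hits, which matters with millions of lines. The function names `read_hits` and `non_overlapping_hits` are made up for this example. The sketch assumes coordinates are inclusive (so two hits that share a position count as overlapping), that a query with a single hit is reported, and that a subject is reported if at least one of its hits is free of overlaps; adjust these if your definition differs.

```python
from collections import defaultdict


def read_hits(lines, start_col=2, end_col=3):
    """Yield (query, subject, start, end) from whitespace-delimited lines.

    start_col / end_col are 0-based column indices (hypothetical parameters):
    2 and 3 fit the four-column sample, 8 and 9 fit the script in the question.
    """
    for line in lines:
        fields = line.split()
        if len(fields) <= max(start_col, end_col):
            continue  # blank or truncated line
        if not (fields[start_col].isdigit() and fields[end_col].isdigit()):
            continue  # header row or non-numeric coordinates
        yield fields[0], fields[1], int(fields[start_col]), int(fields[end_col])


def non_overlapping_hits(rows):
    """Return sorted (query, subject) pairs whose hit overlaps no other hit
    of the same query."""
    by_query = defaultdict(list)
    for query, subject, start, end in rows:
        lo, hi = sorted((int(start), int(end)))  # tolerate start > end
        by_query[query].append((lo, hi, subject))

    kept = set()
    for query, hits in by_query.items():
        hits.sort()  # by start, then end
        max_end_so_far = None  # furthest end among earlier hits
        for i, (lo, hi, subject) in enumerate(hits):
            clear_of_earlier = max_end_so_far is None or lo > max_end_so_far
            # the next hit has the smallest later start, so checking it is enough
            clear_of_later = i + 1 == len(hits) or hits[i + 1][0] > hi
            if clear_of_earlier and clear_of_later:
                kept.add((query, subject))
            max_end_so_far = hi if max_end_so_far is None else max(max_end_so_far, hi)
    return sorted(kept)


sample = """query subject start end
cont20 EMT34567 2 115
cont20 EMT28057 238 345
cont31 EMT45002 112 980
cont31 EMT45002 333 567"""

for query, subject in non_overlapping_hits(read_hits(sample.splitlines())):
    print(query, subject)
# prints:
# cont20 EMT28057
# cont20 EMT34567
```

For a real file with the column layout used in the question's script, something like `non_overlapping_hits(read_hits(open('file.txt'), start_col=8, end_col=9))` should work, since `read_hits` only needs an iterable of lines.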