Entering edit mode
10.7 years ago
User000
▴
710
I have a BLAST tabular output with millions of hits.Con is my contig and P is a protein hit. I am interested in differentiating, for every contig, hits corresponding to the 3 cases illustrated below:
- Contigs with both overlapping and non overlapping hits
- only overlapping hits
- only non overlapping hits.
They should all be printed in 3 separate new files, and contigs that are in file 1 should not be in file 2,3 and etc. How to do that?
con1 ----------------------- (Contigs with both overlapping and non overlapping hits)
p1---- p2 ------ p4---
p3-----
con2 --------------------- (only overlapping) con3 ----------------(only non overlp)
p1 ----- p1 ---- p2 -----
p2 -------
p3 -----
If I know the Protein start and end sites it becomes possible to identify overlap or non overlap; if S1 < E2 < S2 and E1 < S2 < E2 OR S2 - E1 > 0.
My input file, i.e.
contig protein start end
con1 P1 481 931
con1 P2 140 602
con1 P3 232 548
con2 P4 335 406
con2 P5 642 732
con2 P6 2282 2433
con2 P7 729 812
con3 P8 17 148
con3 P9 289 45
The script (this prints me only the hits that do not overlap)
from itertools import groupby
def nonoverlapping(hits):
"""Returns a list of non-overlapping hits."""
nonover = []
overst = False
for I in range(1,len(hits)):
(p, c) = hits[i-1], hits[i]
if c[2] > p[3]:
if not overst: nonover.append(p)
nonover.append(c)
overst = True
return nonover
fh = open('file.txt')
oh = open('result.txt', 'w')
for qid, grp in groupby(fh, lambda l: l.split()[0]):
hits = []
for line in grp:
hsp = line.split()
hsp[2], hsp[3] = int(hsp[2]), int(hsp[3])
hits.append(hsp)
if len(hits) > 1:
hits.sort(key=lambda x: x[2])
for hit in nonoverlapping(hits):
oh.write('\t'.join([str(f) for f in hit])+'\n')
Hi Andrzej! you are becoming my favourite user-tutor! Thank you, great answer, but, basically, may be I didnt explain well, I need to differentiate my file containing all contigs into 3 different files: 1. that contains mixed overlap and non-overlap 2. contains exclusively overlapping hits 3. contains only non overlapping hits. Also, if contig 1 is already present in file 1, it should not appear anymore in any other files. For example, in my example, all con1 should go to overlapping file, all con 2 to mixed since it has both overlaps and non overlaps, and con3 to non overlapping file..
Hi there:) I'm glad to be helpful. I updated my answer.
thank you a lot! I wish I could put +100 upvotes