Finding Overlapping, Non Overlapping Hits
1
0
Entering edit mode
10.7 years ago
User000 ▴ 710

I have a BLAST tabular output with millions of hits.Con is my contig and P is a protein hit. I am interested in differentiating, for every contig, hits corresponding to the 3 cases illustrated below:

  1. Contigs with both overlapping and non overlapping hits
  2. only overlapping hits
  3. only non overlapping hits.

They should all be printed in 3 separate new files, and contigs that are in file 1 should not be in file 2,3 and etc. How to do that?

con1 ----------------------- (Contigs with both overlapping and non overlapping hits)
   p1---- p2 ------    p4---
               p3-----
con2 --------------------- (only overlapping)  con3 ----------------(only non overlp)
   p1 -----                                       p1 ----  p2 -----
      p2 -------
          p3 -----

If I know the Protein start and end sites it becomes possible to identify overlap or non overlap; if S1 < E2 < S2 and E1 < S2 < E2 OR S2 - E1 > 0.

My input file, i.e.

contig  protein start end
con1    P1    481    931
con1    P2    140    602
con1    P3    232    548
con2    P4    335    406
con2    P5    642    732
con2    P6    2282    2433
con2    P7    729    812
con3    P8    17    148
con3    P9    289    45

The script (this prints me only the hits that do not overlap)

from itertools import groupby

def nonoverlapping(hits):
    """Returns a list of non-overlapping hits."""
    nonover = []
    overst =  False
    for I in range(1,len(hits)):
        (p, c) = hits[i-1], hits[i]
        if c[2] > p[3]:
            if not overst: nonover.append(p)
            nonover.append(c)
            overst = True  
    return nonover

fh = open('file.txt')
oh = open('result.txt', 'w')
for qid, grp in groupby(fh, lambda l: l.split()[0]):
    hits = []
    for line in grp:
        hsp = line.split()
        hsp[2], hsp[3] = int(hsp[2]), int(hsp[3])
        hits.append(hsp)
    if len(hits) > 1: 
        hits.sort(key=lambda x: x[2])
        for hit in nonoverlapping(hits):
            oh.write('\t'.join([str(f) for f in hit])+'\n')
python • 4.2k views
ADD COMMENT
4
Entering edit mode
10.7 years ago

UPDATED:

from itertools import groupby

def iter_hits(hits):
    for i in range(1,len(hits)):
        (p, c) = hits[i-1], hits[i]
        yield p, c

def is_overlap(hits):
    for p, c in iter_hits(hits):
        if c[2] < p[3]:
            return True

def is_nonoverlap(hits):
    for p, c in iter_hits(hits):
        if c[2] > p[3]:
            return True



fh = open('file.txt')
fh.next()
oh1 = open('mixed.txt', 'w')
oh2 = open('overlaps.txt', 'w')
oh3 = open('nonoverlaps.txt', 'w')

for qid, grp in groupby(fh, lambda l: l.split()[0]):
    hits = []
    for line in grp:
        hsp = line.split()
        hsp[2], hsp[3] = int(hsp[2]), int(hsp[3])
        hits.append(hsp)
    hits.sort(key=lambda x: x[2])
    if len(hits)==1:
        oh = oh3
    elif is_nonoverlap(hits):
        oh = oh3
        if is_overlap(hits):
            oh = oh1
    else:
        oh = oh2

    for hit in hits:
        oh.write('\t'.join([str(f) for f in hit])+'\n')

nonoverlaps.txt:

con3    P8    17    148
con3    P9    289    45

overlaps.txt:

con1    P2    140    602
con1    P3    232    548
con1    P1    481    931

mixed.txt:

con2    P4    335    406
con2    P5    642    732
con2    P7    729    812
con2    P6    2282    2433
ADD COMMENT
0
Entering edit mode

Hi Andrzej! you are becoming my favourite user-tutor! Thank you, great answer, but, basically, may be I didnt explain well, I need to differentiate my file containing all contigs into 3 different files: 1. that contains mixed overlap and non-overlap 2. contains exclusively overlapping hits 3. contains only non overlapping hits. Also, if contig 1 is already present in file 1, it should not appear anymore in any other files. For example, in my example, all con1 should go to overlapping file, all con 2 to mixed since it has both overlaps and non overlaps, and con3 to non overlapping file..

ADD REPLY
1
Entering edit mode

Hi there:) I'm glad to be helpful. I updated my answer.

ADD REPLY
1
Entering edit mode

thank you a lot! I wish I could put +100 upvotes

ADD REPLY

Login before adding your answer.

Traffic: 1533 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6