Blast Result: Sorting non overlapping hits and choosing the extremes for overlapping ones
2
0
Entering edit mode
2.8 years ago
Sasha ▴ 10

I have a huge Blast output of the results like this:

1.83155e-63 24679461 24680333

7.37596e-64 24679455 24680312

9.14344e-65 24679455 24680333

6.86568e-72 24679455 24680333

6.45089e-69 24679455 24680333

4.49086e-56 24679455 24680312

1.78896e-52 35167152 35167547

2.57611e-51 35167209 35167547

The first column is e-value, second one is start of the match and third one is the end. Now I need to only choose the ones that don't overlap and if some are overlapping I need to choose only extremes. For example:

8:25

10:20

7:24

30:50

The results would be:

7:25

30:50

Thank you for any help!

Blast Biopython Python • 1.1k views
ADD COMMENT
1
Entering edit mode
2.7 years ago
Sasha ▴ 10

Used a little different approach, which also seems to be working.

input = [
    (24679461, 24680333),
    (24679455, 24680312),
    (24679455, 24680333),
    (24679455, 24680333),
    (24679455, 24680333),
    (24679455, 24680312),
    (24679464, 24680333),
    (24679452, 24680333),
    (24679464, 24680339),
    (24679461, 24680333),
    (24679461, 24680333),
    (24679449, 24680333),
    (24679461, 24680333),
    (24679464, 24680333),
    (24679452, 24680333),
    (35167152, 35167547),
    (35167209, 35167547),
]

delta = 100
output = {}

# 24679449 - 24680333
# 35167152 - 35167547

for r in input:
    start = r[0]
    end = r[1]

    # if already have a range with the same start, just extend if new range is longer
    if start in output:
        print(f"Extending range: start={start}, end={max(output[start], end)}")
        output[start] = max(output[start], end)
    else:
        merged = False

        # if we don't have a range starting at the same location, look for a range that is 'delta' distance away in either direction
        for i in range(start - delta, start + delta):
            if i in output:

                # if we found a range with lower starting position, extend it if the new one is longer
                if i < start:
                    print(f"Extending upper range: start={i}, end={max(output[i], end)}")
                    output[i] = max(output[i], end)
                else:
                    # otherwise if we found a range with higher starting position we need to replace it with the starting position of the new one
                    # and keep the "longer" end (and also delete the old range which has now been "consumed")
                    print(f"Extending lower and upper range: start={start}, end={max(output[i], end)}")
                    output[start] = max(output[i], end)
                    del output[i]

                merged = True
                break

        # if we haven't merged the new range with anything just save it
        if not merged:
            print(f"Adding new range: start={start}, end={end}")
            output[start] = end

print(output)
ADD COMMENT
0
Entering edit mode

An FYI..I don't know if Biostars messed it up; however your code doesn't work as posted. Your for r in input: for loop isn't indented correctly. And even when that is fixed, the output printed is just the last range.

Adding new range: start=35167209, end=35167547
{35167209: 35167547}

Might be fine on your end and this was all caused by formatting.

ADD REPLY
0
Entering edit mode

I think it should be working now.

ADD REPLY
1
Entering edit mode

I see it was definitely mis-indented earlier. Thought it was worth pointing out as I didn't want you to come back here later and find it not working.

ADD REPLY
2
Entering edit mode
2.8 years ago
Wayne ★ 2.1k

This should provide what you've asked for. It could be a little less clunky with more refining / updating to use .format() or f-strings .

import pandas as pd
df = pd.read_csv("data.txt",sep=" ",names=["e-value","start","end"])
def range_extract(lst):
    'Yield 2-tuple ranges or 1-tuple single elements from list of increasing'
    'ints; interval making code modified from'  
    'https://www.rosettacode.org/wiki/Range_extraction#Python'
    lenlst = len(lst)
    i = 0
    while i< lenlst:
        low = lst[i]
        while i <lenlst-1 and lst[i]+1 == lst[i+1]: i +=1
        hi = lst[i]
        if hi - low >= 1:    #<---MAIN DIFFERENCE
            yield (low, hi)
        else:
            yield (low,)
        i += 1

def printr(ranges):
    print( '\n'.join( (('%i:%i' % r) if len(r) == 2 else '%i' % r)
                     for r in ranges ) )
def expand_all_ranges_to_each_position(the_min,the_max):
    '''
    Takes the minimum and the max position and returns a list of all the positions
    in between as well as both the boundaries.
    '''
    return list(range(the_min,the_max+1))
all_positions = []
for row in df.itertuples():
    all_positions.extend(expand_all_ranges_to_each_position(min(row.start,row.end),max(row.start,row.end)))
all_positions = sorted(set(all_positions)) #get unique individual positions sorted; `set()` insures unique positions
for lst in [all_positions]:
    #print(list(range_extract(lst)))
    printr(range_extract(sorted(lst)))

A Jupyter notebook with related code run with some of the supplied data & example is here.
Click here to launch that notebook in an active, temporary Jupyter session served via MyBinder.org with the environment already having Python and the module Pandas, for panel-type data/dataframe-handling installed and working.

ADD COMMENT

Login before adding your answer.

Traffic: 1593 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6