I have start and end position of sequences, want to remove redundancy
2
0
Entering edit mode
8.2 years ago
liningouc • 0

I have a list of start and end position information from protein database, and some sequences are overlapped with each other because of the existence of isoforms. I want to remove overlapped sequences and keep the longest one. How could I achieve this?

Data is like:

KN150702.1  512 66743 
KN150702.1  4526    75660 
KN150702.1  51685   52551 
KN150702.1  75503   111816 
KN150702.1  126256  146772 
KN150702.1  155049  175903 
KN150702.1  177161  211884 
KN150703.1  4605    14526 
KN150703.1  16536   18921 
KN150703.1  16536   18879 
KN150703.1  23158   47525 
KN150703.1  36969   40261 
KN150703.1  42415   46815

And the results should be:

KN150702.1  4526    75660
KN150702.1  126256  146772
KN150702.1  155049  175903
KN150702.1  177161  211884
KN150703.1  4605    14526
KN150703.1  16536   18921
KN150703.1  23158   47525
genome sequence • 1.4k views
ADD COMMENT
4
Entering edit mode
8.2 years ago
cat in.bed | sort -k1,1 -k2,2n | mergeBed -i - > out.bed

Makes sure that its a tab delimited file.

Update: I do not want to merge them. I only need to keep the longest one and remove the others

clusterBed -i test.bed | \
awk -v OFS="\t" '{ print $0,$3-$2}' | \
sort -k4,4 -k5,5nr | \
groupBy -full -i - -g 4 -c 5 -o max | \
cut -f1,2,3

output:

KN150702.1  4526    75660
KN150702.1  126256  146772
KN150702.1  155049  175903
KN150702.1  177161  211884
KN150703.1  4605    14526
KN150703.1  16536   18921
KN150703.1  23158   47525
ADD COMMENT
1
Entering edit mode

Very minor nitpick: the pipe character can finish the line, that is, it does not need to be followed by an escaped newline.

ADD REPLY
0
Entering edit mode

Thanks for your answer, but actually I do not want to merge them. I only need to keep the longest one and remove the others.

ADD REPLY
0
Entering edit mode

I update the answer. Accept the answer if it works so that it won't bump again in future.

ADD REPLY
0
Entering edit mode
8.2 years ago
second_exon ▴ 210

If you are still looking for a solution, here is my solution based on Python 3.x Pandas

Your data file with headers (dat.txt)

name    start   end
KN150702.1  512 66743 
KN150702.1  4526    75660 
KN150702.1  51685   52551 
KN150702.1  75503   111816 
KN150702.1  126256  146772 
KN150702.1  155049  175903 
KN150702.1  177161  211884 
KN150703.1  4605    14526 
KN150703.1  16536   18921 
KN150703.1  16536   18879 
KN150703.1  23158   47525 
KN150703.1  36969   40261 
KN150703.1  42415   46815

Python Script

import pandas as pd
df = pd.read_table('dat.txt')
df = df.sort_values('name', ascending=1)
df['diff'] = df['end']-df['start']
final = pd.DataFrame(columns=list(df.columns.values))
while True:
    mask = df['name'] == df.iloc[0, 0]
    df1, df = df[mask], df[~mask]
    df1 = df1.sort_values('start', ascending=1)
    i = 0
    while True:
        if i == len(df1)-1: break
        if int(df1.iloc[i+1, 1]) <= int(df1.iloc[i,2]):
            if int(df1.iloc[i+1, 3]) < int(df1.iloc[i,3]):
                df1.drop(df1.index[[i+1]], inplace=True)
            elif int(df1.iloc[i+1, 3]) > int(df1.iloc[i,3]):
                df1.drop(df1.index[[i]], inplace=True)
        else:
            i+=1
    final = final.append(df1)
    if len(df)==0: break
del final['diff']
print(final)
final.to_csv("non_overlapping.csv", index=False)

This saves the output to a csv file (non_overlapping.csv) in the same working directory.

Output:

name,start,end
KN150702.1,4526.0,75660.0
KN150702.1,126256.0,146772.0
KN150702.1,155049.0,175903.0
KN150702.1,177161.0,211884.0
KN150703.1,4605.0,14526.0
KN150703.1,16536.0,18921.0
KN150703.1,23158.0,47525.0
ADD COMMENT

Login before adding your answer.

Traffic: 3025 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6