I have a list of start and end positions from a protein database, and some of the sequences overlap each other because of isoforms. I want to remove the overlapping sequences and keep only the longest one from each group of overlapping entries. How can I achieve this?
The data looks like this (sequence ID, start, end):
KN150702.1 512 66743
KN150702.1 4526 75660
KN150702.1 51685 52551
KN150702.1 75503 111816
KN150702.1 126256 146772
KN150702.1 155049 175903
KN150702.1 177161 211884
KN150703.1 4605 14526
KN150703.1 16536 18921
KN150703.1 16536 18879
KN150703.1 23158 47525
KN150703.1 36969 40261
KN150703.1 42415 46815
And the results should be:
KN150702.1 4526 75660
KN150702.1 126256 146772
KN150702.1 155049 175903
KN150702.1 177161 211884
KN150703.1 4605 14526
KN150703.1 16536 18921
KN150703.1 23158 47525
Very minor nitpick: the pipe character can finish the line, that is, it does not need to be followed by an escaped newline.
Thanks for your answer, but actually I do not want to merge them. I only need to keep the longest one and remove the others.
I have updated the answer. Please accept it if it works, so that the question won't get bumped again in the future.
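For anyone who wants to see the "keep the longest, do not merge" behaviour spelled out, here is a minimal Python sketch. It is not necessarily the approach used in the updated answer. The function name `keep_longest_per_cluster` is hypothetical, and it makes two assumptions that the question does not state explicitly: overlaps are chained transitively (if A overlaps B and B overlaps C, all three form one group), and intervals that merely share a coordinate count as overlapping. Under those assumptions it reproduces the expected result from the question.

```python
from itertools import groupby

def keep_longest_per_cluster(records):
    """Return the longest (seq_id, start, end) record from each overlap cluster.

    Assumption: clusters are transitive, and touching intervals overlap.
    """
    kept = []
    # Sort by ID, then start, so that overlapping intervals are adjacent.
    ordered = sorted(records, key=lambda r: (r[0], r[1], r[2]))
    for _, group in groupby(ordered, key=lambda r: r[0]):
        best = None         # longest record in the current cluster
        cluster_end = None  # furthest end coordinate seen in the cluster
        for rec in group:
            _, start, end = rec
            if best is None or start > cluster_end:
                # No overlap with the current cluster: close it, start a new one.
                if best is not None:
                    kept.append(best)
                best, cluster_end = rec, end
            else:
                # Overlap: extend the cluster and keep the longer record.
                cluster_end = max(cluster_end, end)
                if end - start > best[2] - best[1]:
                    best = rec
        if best is not None:
            kept.append(best)
    return kept

# The sample data from the question.
records = [
    ("KN150702.1", 512, 66743),
    ("KN150702.1", 4526, 75660),
    ("KN150702.1", 51685, 52551),
    ("KN150702.1", 75503, 111816),
    ("KN150702.1", 126256, 146772),
    ("KN150702.1", 155049, 175903),
    ("KN150702.1", 177161, 211884),
    ("KN150703.1", 4605, 14526),
    ("KN150703.1", 16536, 18921),
    ("KN150703.1", 16536, 18879),
    ("KN150703.1", 23158, 47525),
    ("KN150703.1", 36969, 40261),
    ("KN150703.1", 42415, 46815),
]

result = keep_longest_per_cluster(records)
for seq_id, start, end in result:
    print(seq_id, start, end)
```

Note that 75503-111816 is dropped only because it overlaps 4526-75660 by a few hundred bases. If overlaps should instead be judged pairwise against the records that are kept, a greedy pass from longest to shortest would be the alternative; on this sample both give the same result.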