Hi all, I am sorry if this question is trivial. I have a list of coordinates for a multi-sequence FASTA file. For example:
ID    Start position  End position
1398  4               8
1398  5               10
1398  12              15
1756  25              30
1756  28              35
Is it possible to convert the table above into the one given below?
ID    Start position  End position
1398  4               10
1398  12              15
1756  25              35
Here is part of my actual raw data
So basically, I am looking for a way to get non-overlapping coordinates from the overlapping ones. I saw a very similar question at this link: http://www.biostars.org/post/show/7825/how-to-get-non-overlapping-coordinates-from-a-list-that-contains-overlapping-coordinates/ but it does not help in my case, as I have unique IDs instead of chromosome numbers. I tried to convert it to BED format using sortBed, but it gives me this error: "Error: malformed GFF entry at line 1. Start was greater than end. Exiting"
I got these columns from BLAST results, which are based on the contig assembly. Now I want to build a sequence based on those BLAST results. I hope this question is clear.
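For readers following along, the transformation asked for above can be sketched in Python. This is only a minimal illustration, not the code posted by Benm or anyone else in this thread. The function name merge_intervals is hypothetical, and the sketch assumes each row is an (ID, start, end) tuple with start <= end. Intervals that merely touch (for example 4-8 and 9-12) are left separate.

```python
def merge_intervals(rows):
    """Merge overlapping (ID, start, end) rows within each ID.

    Assumes start <= end in every row (an assumption of this sketch).
    """
    merged = []
    # Sorting puts rows with the same ID together, ordered by start.
    for seq_id, start, end in sorted(rows):
        last = merged[-1] if merged else None
        if last is not None and last[0] == seq_id and start <= last[2]:
            # Overlap with the previous interval: keep the larger end,
            # so an interval nested inside it does not shrink it.
            last[2] = max(last[2], end)
        else:
            merged.append([seq_id, start, end])
    return [tuple(m) for m in merged]


rows = [(1398, 4, 8), (1398, 5, 10), (1398, 12, 15),
        (1756, 25, 30), (1756, 28, 35)]
for seq_id, start, end in merge_intervals(rows):
    print(seq_id, start, end)
# 1398 4 10
# 1398 12 15
# 1756 25 35
```

Because the ID is part of the sort key and of the overlap test, it plays the same role a chromosome name would play in the linked question.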
Benm's solution from http://www.biostars.org/post/show/7825/how-to-get-non-overlapping-coordinates-from-a-list-that-contains-overlapping-coordinates/ works perfectly on your dataset. Did you try it?
I tried it, but it should not report a sub-region that lies within another region. That is, we still get region 5-10 even though it is covered by region 4-12. His solution should take that into account, but it does not seem to. I might be missing a trick.
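One way to check an output file for this kind of nested region is sketched below in Python. It is a diagnostic only, not part of Benm's solution. The function name find_contained is hypothetical, and rows are assumed to be (ID, start, end) tuples with start <= end.

```python
def find_contained(rows):
    """Return rows that lie entirely inside another row with the same ID."""
    contained = []
    furthest_end = {}  # ID -> largest end seen so far for that ID
    # Sort by ID, then start ascending, then end descending, so that an
    # enclosing interval is always visited before the ones it encloses.
    ordered = sorted(rows, key=lambda r: (r[0], r[1], -r[2]))
    for seq_id, start, end in ordered:
        if seq_id in furthest_end and end <= furthest_end[seq_id]:
            # An earlier row starts no later and ends no earlier.
            contained.append((seq_id, start, end))
        else:
            furthest_end[seq_id] = end
    return contained


print(find_contained([("1398", 4, 12), ("1398", 5, 10), ("1398", 11, 15)]))
# [('1398', 5, 10)]
```

An empty result means no region in the output is covered by another region with the same ID. Exact duplicate rows are reported as contained.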
Could you give an example (editing your question for instance) of what output you get with Benm's solution? I get exactly the solution you describe in your question...
Benm's solution works for some coordinates, but not for others. Here is the input file: http://pastie.org/4221091 Here is the output file: http://pastie.org/4221095
I think I understand now. I edited my code and added the check
if ($data[2] < $new[2])
I think it should now work on your sorted file and will not report the so-called underlaps. Please have a look.
Vikas, your solution works for some coordinates, just like Benm's solution. Please see my input and output files in the previous reply. Thanks again.
There are two problems in your input file that keep you from getting the desired results with my code. First, the file is not sorted; as I already mentioned, use the sort command on the first two columns. Second, you also have negative-strand hits. Do you want to treat them separately, or do you want to merge them as well?
E.g., if you have
Do you want to merge them or keep them separately?
I want to merge them.
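Merging both strands together comes down to two preparation steps before any overlap merging: put every row into start <= end order, then sort by ID and start. The Python sketch below illustrates this; it is not Vikas's edited code. It assumes that negative-strand hits show up as rows whose start is greater than their end (which would also be consistent with the sortBed error in the question), that the file is whitespace-separated with no header line, and the function name read_rows is hypothetical.

```python
def read_rows(lines):
    """Parse 'ID start end' lines, normalise orientation, and sort."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        seq_id, a, b = fields[0], int(fields[1]), int(fields[2])
        # Swap reversed coordinates so start <= end; this discards
        # the strand, which is what merging both strands implies.
        rows.append((seq_id, min(a, b), max(a, b)))
    # Sort by ID (as text), then by numeric start position.
    rows.sort(key=lambda r: (r[0], r[1]))
    return rows


print(read_rows(["1756 35 28", "1398 12 15", "1398 10 5"]))
# [('1398', 5, 10), ('1398', 12, 15), ('1756', 28, 35)]
```

The sorted, normalised rows can then be passed to whichever overlap-merging step is used.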
Please see my edit and let me know if it works.
It works pretty well. Thank you.
My text file does not contain a header line; I only added one to illustrate my question. Sorry about the confusion.