I have files with data like this for various chromosomes (coordinates in bp):
File 1:
24328000-29166946
25388351-27114603
22310186-25239677
28511024-29638159
23729632-26385029
Solution in Perl code: my code reports no common region in this section.
File 2:
35388351-55114603
37310186-52396773
38511024-48638500
32729632-48638502
17446360-51119526
Solution using Perl code (as of now it reports only the common region):
38511024 - 48638500
I am looking for a solution, in Perl or another language, that finds both the unique and the common regions.
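The two results quoted above can be reproduced with a short Python sketch. It assumes that "common region" means the stretch shared by every interval in a file, which is the largest start paired with the smallest end. The original Perl code is not shown, so this is not a reconstruction of it; the function name `common_region` is made up for illustration, and a shared endpoint alone is not counted as a common region.

```python
def common_region(intervals):
    """Return the (start, end) shared by every interval, or None.

    Assumption: a region only counts if it has non-zero length,
    so intervals that merely touch at one coordinate share nothing.
    """
    start = max(s for s, _ in intervals)  # latest start
    end = min(e for _, e in intervals)    # earliest end
    return (start, end) if start < end else None


# Intervals copied from File 1 and File 2 in the question.
file1 = [(24328000, 29166946), (25388351, 27114603), (22310186, 25239677),
         (28511024, 29638159), (23729632, 26385029)]
file2 = [(35388351, 55114603), (37310186, 52396773), (38511024, 48638500),
         (32729632, 48638502), (17446360, 51119526)]

print(common_region(file1))  # None
print(common_region(file2))  # (38511024, 48638500)
```

This only answers the "common to all" half of the question; the unique regions need a different approach.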
@Khader, could you fix your indentation?
Are you looking for exact matches between the 2 files? If so, use a set.
Are you looking for overlaps? If so, try this in bx-python and see Istvan's examples in this thread.
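For the exact-match case, a minimal Python sketch of the set idea might look like the following. The helper name `parse_intervals` and the two small input lists are hypothetical, made up only to show the mechanics; the lines reuse coordinates from the question.

```python
def parse_intervals(lines):
    """Parse 'start-end' lines into a set of (start, end) tuples."""
    intervals = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        start, end = line.split("-")
        intervals.add((int(start), int(end)))
    return intervals


# Hypothetical inputs: one interval appears in both lists.
lines_a = ["24328000-29166946", "25388351-27114603"]
lines_b = ["25388351-27114603", "35388351-55114603"]

# Set intersection keeps only intervals with identical start and end.
exact_matches = parse_intervals(lines_a) & parse_intervals(lines_b)
print(sorted(exact_matches))  # [(25388351, 27114603)]
```

This finds identical intervals only; partially overlapping intervals are not reported.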
Why don't you write a doctest to show how the program should be called, and which outputs are expected for a given input? http://docs.python.org/library/doctest.html
Thanks. The links look interesting; I will check them. I moved the code to pastebin. I am not looking for exact matches between files. I have various files like 1, 2, ... and I am trying to identify the common and unique regions among the different segments within a given file (one file at a time).
Example of a doctest: http://pastebin.com/s5knZ9VR (is it correct?)
@giovanni: thanks for the suggestion. Do you think we need a doctest here? It is such a simple problem. I have many input files, and I just showed two random ones. I am looking for a general solution that can report unique and common regions, if there are any. It looks like this is not a quick challenge; should I change the title of my question?
The terms of this problem should be a bit better specified. I am unsure of what you mean by unique and common. Could you redefine the problem in terms of overlap of the intervals?
I don't understand the question. What is the 'common' region? If no segments overlap, what should the result be (all unique?)? What if all segments overlap but one? What if A overlaps B and B overlaps C, but C doesn't overlap A? Etc.
@Khader: Is it correct that for 2 segments (let's say 1-5 and 3-7), the common segment is 3-5, and the unique ones are 1-3 on the 1st segment and 5-7 on the 2nd?
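The two-segment case described here can be sketched in Python as follows. The function names `subtract` and `split_pair` are made up for illustration, and the coordinates are treated as boundaries, as in the 1-5 / 3-7 example (so 3-5 and 5-7 share the point 5). With inclusive base positions the unique parts would instead be 1-2 and 6-7; that choice is an assumption the asker would need to settle.

```python
def subtract(interval, common):
    """Return the pieces of `interval` that lie outside `common`."""
    start, end = interval
    c_start, c_end = common
    pieces = []
    if start < c_start:
        pieces.append((start, c_start))  # part before the common region
    if c_end < end:
        pieces.append((c_end, end))      # part after the common region
    return pieces


def split_pair(a, b):
    """Return (common, unique pieces of a, unique pieces of b)."""
    common_start = max(a[0], b[0])
    common_end = min(a[1], b[1])
    if common_start >= common_end:
        # No overlap: each segment is unique in full.
        return None, [a], [b]
    common = (common_start, common_end)
    return common, subtract(a, common), subtract(b, common)


common, unique_a, unique_b = split_pair((1, 5), (3, 7))
print(common, unique_a, unique_b)  # (3, 5) [(1, 3)] [(5, 7)]
```

When one segment fully contains the other, the containing segment gets two unique pieces and the contained one gets none.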
@Istvan: Thanks for the suggestion. The input file is a set of chromosome intervals. Output 1: I need to find the overlap, in bp, between the different intervals. Output 2: report the unique regions. I have not implemented the second part in my code yet.
@Pierre: If there is no overlap, all are unique. If there is partial overlap (thanks for pointing this out), report as unique 1, unique 2, etc.
@Yuri: that's exactly what I am looking for, but things get a bit complex when there is partial overlap, as pointed out by Pierre.
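One way to handle partial overlap among many intervals is to cut the covered range at every start and end coordinate and count how many intervals cover each piece. The Python sketch below does that. It is one possible reading of "unique" (covered by exactly one interval) and "common" (covered by all of them), not a definition agreed in this thread, and the name `coverage_segments` is hypothetical. The example intervals follow the case raised above, where A overlaps B and B overlaps C, but C does not overlap A.

```python
def coverage_segments(intervals):
    """Cut at every start/end and return (start, end, depth) pieces.

    `depth` is the number of input intervals covering that piece.
    Coordinates are treated as boundaries (an assumption).
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))   # an interval opens
        events.append((end, -1))    # an interval closes
    events.sort()

    segments = []
    depth = 0
    prev = None
    for pos, delta in events:
        # Emit the piece since the previous coordinate if it is covered.
        if prev is not None and pos > prev and depth > 0:
            segments.append((prev, pos, depth))
        depth += delta
        prev = pos
    return segments


intervals = [(1, 5), (3, 7), (6, 9)]  # A, B, C
segments = coverage_segments(intervals)
unique = [(s, e) for s, e, d in segments if d == 1]
common = [(s, e) for s, e, d in segments if d == len(intervals)]

print(segments)  # [(1, 3, 1), (3, 5, 2), (5, 6, 1), (6, 7, 2), (7, 9, 1)]
print(unique)    # [(1, 3), (5, 6), (7, 9)]
print(common)    # []
```

Pieces with a depth between 1 and the number of intervals are the partially shared regions, which could be reported separately.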
@khader: a doctest would be just a way to show what is the expected output, what is the solution you expect from an answer. Moreover, I would simplify the input files that you provide: instead of 24328000-29166946 I would use 10000-20000, which is a lot easier to read and thus easier to test. Have a look at the doctest I posted, there are three cases that you should define (e.g. are two segments overlapping when one begins on the same base where the other ends?)
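To illustrate the kind of doctest meant here, a minimal Python sketch with simplified coordinates could look like this. It is not the content of the pastebin link; the function `overlaps` and the three cases are made up for illustration. The third case encodes one possible answer to the boundary question (segments that only touch do not overlap), which is an assumption; with inclusive base positions the expected value would be `True` and the comparisons would use `<=`.

```python
import doctest


def overlaps(a, b):
    """Return True if segments a and b share a stretch of non-zero length.

    Clear overlap:
    >>> overlaps((10000, 20000), (15000, 25000))
    True

    No overlap:
    >>> overlaps((10000, 20000), (30000, 40000))
    False

    One segment begins where the other ends (assumed: no overlap):
    >>> overlaps((10000, 20000), (20000, 30000))
    False
    """
    return a[0] < b[1] and b[0] < a[1]


if __name__ == "__main__":
    # Runs the examples in the docstrings; silent when all pass.
    doctest.testmod()
```

Writing the expected outputs down first forces a decision on the boundary case before any code is compared.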
Let's see if I can answer this later today.