Quick Programming Challenge: Calculate Common And Unique Regions From A List Of Chromosome Segments
2
6
Entering edit mode
14.7 years ago

I have a files with data like this for various chromosomes in bp

File 1: 
    24328000-29166946
    25388351-27114603
    22310186-25239677
    28511024-29638159
    23729632-26385029

Solution in Perl code: My code reports no common region in this sections

File 2: 
    35388351-55114603
    37310186-52396773
    38511024-48638500 
    32729632-48638502
    17446360-51119526

Solution using Perl code: (as of now it reports only common region):

38511024 -  48638500

Looking for a solution to find the unique and common regions in Perl or other languages.

programming • 5.3k views
ADD COMMENT
2
Entering edit mode

@Khader, could you fix your indentation?

Are you looking for exact matches between the 2 files? if so, use a set.

Are you looking for overlaps? if so try this in bx-python and see and see Istvan's examples in this thread.

ADD REPLY
1
Entering edit mode

why don't you write a doctest to show how the program should be called, and which outputs are expected for a certain input? http://docs.python.org/library/doctest.html

ADD REPLY
0
Entering edit mode

Thanks. The links looks interesting. I will check them. I moved the code to pastebin. I am not looking for exact match between files. I have various files like 1, 2...I am trying to identify the common regions and unique regions among different segments in a given file (one file at a time).

ADD REPLY
0
Entering edit mode

example of doctest: http://pastebin.com/s5knZ9VR is it correct?

ADD REPLY
0
Entering edit mode

@giovanni : thanks for the suggestion. You think we need a doctest here ? It is such a simple problem. I have many input files and I just showed two random input files that I have. I am looking for a general solution that can report unique and common region - if they are available. Looks like it is not a quick challenge, should I change the title of my question ?

ADD REPLY
0
Entering edit mode

The terms of this problem should be a bit better specified. I am unsure of what you mean by unique and common. Could you redefine the problem in terms of overlap of the intervals?

ADD REPLY
0
Entering edit mode

I don't understand the question. What is the 'common' region ? if no segments overlap what sould be the result (all unique ?) ? ... if all segments overlap but one ?, if A overlap B and B overlap C but C doesn't overlap A ? etc...

ADD REPLY
0
Entering edit mode

@Khader: Is it correct for 2 segments (lets say 1-5 and 3-7), that common segment is 3-5, and uniques are 1-3 on the 1st segment, and 5-7 on the 2nd?

ADD REPLY
0
Entering edit mode

@Istavan Thanks for the suggestion. Input file is a set of chromosome intervals. Output 1 : I need to find the overlap of intervals in bp between the different intervals. Output 2 :Report unique regions. Second part i have not implemented in my code yet.

ADD REPLY
0
Entering edit mode

@Pierre : If no overlap - all are unique. If there are partial overlap(thanks for pointing this) report as unique 1, unique 2 etc.

ADD REPLY
0
Entering edit mode

@Yuri : that's exactly what am looking for - but things get a bit complex when there is partial overlap as pointed by Pierre.

ADD REPLY
0
Entering edit mode

@khader: a doctest would be just a way to show what is the expected output, what is the solution you expect from an answer. Moreover, I would simplify the input files that you provide: instead of 24328000-29166946 I would use 10000-20000, which is a lot easier to read and thus easier to test. Have a look at the doctest I posted, there are three cases that you should define (e.g. are two segments overlapping when one begins on the same base where the other ends?)

ADD REPLY
0
Entering edit mode

let's see if later today I can answer this..

ADD REPLY
5
Entering edit mode
14.7 years ago
Casbon ★ 3.3k
import string, sys
data = map(string.split, sys.stdin)
starts = sorted(map(lambda x: int(x[0]), data))
ends = sorted(map(lambda x: int(x[-1]), data))
borders = sorted(starts + ends)

for (left, right) in zip(borders, borders[1:]):
    covering_count = len([x for x in starts if x<= left]) - len([x for x in ends if x<= left])
    print left, right, covering_count

$ python so.py < input 
22310186 23729632 1
23729632 24328000 2
24328000 25239677 3
25239677 25388351 2
25388351 26385029 3
26385029 27114603 2
27114603 28511024 1
28511024 29166946 2
29166946 29638159 1
ADD COMMENT
3
Entering edit mode

maybe you should just type out what the answer should be for this example

ADD REPLY
0
Entering edit mode

Your code works, but am afraid the output is not the way that I expected.

I am looking for output in 2 levels : Unique segment (if any) among the 5 regions in bp Commmon region (if any) among the 5 regions in bp

ADD REPLY
0
Entering edit mode

Added my current code to the question. It reports no common region, so i have added another example.

ADD REPLY
0
Entering edit mode

@Jeremy : Any interesting solution via R or any BioC modules that can do this ?

ADD REPLY
0
Entering edit mode

P.S. I made a small change to input file after @Casbon provided his solution.

ADD REPLY
5
Entering edit mode
14.7 years ago

Here is my interpretation in R

library("IRanges")
mytab<-read.table("file2.txt",sep="-",col.names=c("s","e"))
myrange<-IRanges(start=mytab$s,end=mytab$e)

#take whatever depth is given and print ranges with that depth
printRangesByDepth<-function(d){
pos<-which(coverage(myrange)==d)
  write.table(as.data.frame(reduce(IRanges(pos,pos)))[,c(1,2)],row.names=FALSE,col.names=FALSE,sep="-")}

#which locations are common to all, i.e. have depth equal to sum of all ranges
cat("Common:\n")
printRangesByDepth(length(myrange))

#which locations are unique? depth of one
cat("Unique:\n")
printRangesByDepth(1)

file1:

Common:
Unique:
22310186-23729631
27114604-28511023
29166947-29638159

file2:

Common:
38511024-48638500
Unique:
17446360-32729631
52396774-55114603
ADD COMMENT
0
Entering edit mode

Awesome ! I didn't know about IRanges. Thanks. (Please, Can I vote it up twice :) ?)

ADD REPLY

Login before adding your answer.

Traffic: 1871 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6