Question

Quick Programming Challenge: Calculate Common And Unique Regions From A List Of Chromosome Segments

6

Entering edit mode

15.3 years ago

Khader Shameer 18k

I have a files with data like this for various chromosomes in bp

File 1: 
    24328000-29166946
    25388351-27114603
    22310186-25239677
    28511024-29638159
    23729632-26385029

Solution in Perl code: My code reports no common region in this sections

File 2: 
    35388351-55114603
    37310186-52396773
    38511024-48638500 
    32729632-48638502
    17446360-51119526

Solution using Perl code: (as of now it reports only common region):

38511024 -  48638500

Looking for a solution to find the unique and common regions in Perl or other languages.

programming • 6.2k views

ADD COMMENT • link updated 20 months ago by Ram 45k • written 15.3 years ago by Khader Shameer 18k

2

Entering edit mode

@Khader, could you fix your indentation?

Are you looking for exact matches between the 2 files? if so, use a set.

Are you looking for overlaps? if so try this in bx-python and see and see Istvan's examples in this thread.

ADD REPLY • link updated 5.9 years ago by Ram 45k • written 15.3 years ago by brentp 24k

1

Entering edit mode

why don't you write a doctest to show how the program should be called, and which outputs are expected for a certain input? http://docs.python.org/library/doctest.html

ADD REPLY • link 15.3 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Thanks. The links looks interesting. I will check them. I moved the code to pastebin. I am not looking for exact match between files. I have various files like 1, 2...I am trying to identify the common regions and unique regions among different segments in a given file (one file at a time).

ADD REPLY • link 15.3 years ago by Khader Shameer 18k

0

Entering edit mode

example of doctest: http://pastebin.com/s5knZ9VR is it correct?

ADD REPLY • link 15.3 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

@giovanni : thanks for the suggestion. You think we need a doctest here ? It is such a simple problem. I have many input files and I just showed two random input files that I have. I am looking for a general solution that can report unique and common region - if they are available. Looks like it is not a quick challenge, should I change the title of my question ?

ADD REPLY • link 15.3 years ago by Khader Shameer 18k

0

Entering edit mode

The terms of this problem should be a bit better specified. I am unsure of what you mean by unique and common. Could you redefine the problem in terms of overlap of the intervals?

ADD REPLY • link 15.3 years ago by Istvan Albert 102k

0

Entering edit mode

I don't understand the question. What is the 'common' region ? if no segments overlap what sould be the result (all unique ?) ? ... if all segments overlap but one ?, if A overlap B and B overlap C but C doesn't overlap A ? etc...

ADD REPLY • link 15.3 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

@Khader: Is it correct for 2 segments (lets say 1-5 and 3-7), that common segment is 3-5, and uniques are 1-3 on the 1st segment, and 5-7 on the 2nd?

ADD REPLY • link 15.3 years ago by Yuri ★ 1.7k

0

Entering edit mode

@Istavan Thanks for the suggestion. Input file is a set of chromosome intervals. Output 1 : I need to find the overlap of intervals in bp between the different intervals. Output 2 :Report unique regions. Second part i have not implemented in my code yet.

ADD REPLY • link 15.3 years ago by Khader Shameer 18k

0

Entering edit mode

@Pierre : If no overlap - all are unique. If there are partial overlap(thanks for pointing this) report as unique 1, unique 2 etc.

ADD REPLY • link 15.3 years ago by Khader Shameer 18k

0

Entering edit mode

@Yuri : that's exactly what am looking for - but things get a bit complex when there is partial overlap as pointed by Pierre.

ADD REPLY • link 15.3 years ago by Khader Shameer 18k

0

Entering edit mode

@khader: a doctest would be just a way to show what is the expected output, what is the solution you expect from an answer. Moreover, I would simplify the input files that you provide: instead of 24328000-29166946 I would use 10000-20000, which is a lot easier to read and thus easier to test. Have a look at the doctest I posted, there are three cases that you should define (e.g. are two segments overlapping when one begins on the same base where the other ends?)

ADD REPLY • link 15.3 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

let's see if later today I can answer this..

ADD REPLY • link 15.3 years ago by Giovanni M Dall'Olio 28k

Ram · Answer 1 · 2010-04-27

5

Entering edit mode

15.3 years ago

Casbon ★ 3.3k

import string, sys
data = map(string.split, sys.stdin)
starts = sorted(map(lambda x: int(x[0]), data))
ends = sorted(map(lambda x: int(x[-1]), data))
borders = sorted(starts + ends)

for (left, right) in zip(borders, borders[1:]):
    covering_count = len([x for x in starts if x<= left]) - len([x for x in ends if x<= left])
    print left, right, covering_count

$ python so.py < input 
22310186 23729632 1
23729632 24328000 2
24328000 25239677 3
25239677 25388351 2
25388351 26385029 3
26385029 27114603 2
27114603 28511024 1
28511024 29166946 2
29166946 29638159 1

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 15.3 years ago by Casbon ★ 3.3k

3

Entering edit mode

maybe you should just type out what the answer should be for this example

ADD REPLY • link 15.3 years ago by Jeremy Leipzig 23k

0

Entering edit mode

Your code works, but am afraid the output is not the way that I expected.

I am looking for output in 2 levels : Unique segment (if any) among the 5 regions in bp Commmon region (if any) among the 5 regions in bp

ADD REPLY • link 15.3 years ago by Khader Shameer 18k

0

Entering edit mode

Added my current code to the question. It reports no common region, so i have added another example.

ADD REPLY • link 15.3 years ago by Khader Shameer 18k

0

Entering edit mode

@Jeremy : Any interesting solution via R or any BioC modules that can do this ?

ADD REPLY • link 15.3 years ago by Khader Shameer 18k

0

Entering edit mode

P.S. I made a small change to input file after @Casbon provided his solution.

ADD REPLY • link 15.3 years ago by Khader Shameer 18k

Ram · Answer 2 · 2010-04-27

Here is my interpretation in R

library("IRanges")
mytab<-read.table("file2.txt",sep="-",col.names=c("s","e"))
myrange<-IRanges(start=mytab$s,end=mytab$e)

#take whatever depth is given and print ranges with that depth
printRangesByDepth<-function(d){
pos<-which(coverage(myrange)==d)
  write.table(as.data.frame(reduce(IRanges(pos,pos)))[,c(1,2)],row.names=FALSE,col.names=FALSE,sep="-")}

#which locations are common to all, i.e. have depth equal to sum of all ranges
cat("Common:\n")
printRangesByDepth(length(myrange))

#which locations are unique? depth of one
cat("Unique:\n")
printRangesByDepth(1)

file1:

Common:
Unique:
22310186-23729631
27114604-28511023
29166947-29638159

file2:

Common:
38511024-48638500
Unique:
17446360-32729631
52396774-55114603