Question

Help Me To Count! With Perl

13.4 years ago

Ar Es ▴ 20

I guess the easiest way is to show you a simplified version of my data (bp = base pair). I have data for various chromosomes, with the start and end points of the tag clusters on each chromosome. It looks like this:

Columns: 1: chr, 2: start, 3: end, 4: info (X), 5: info (X), 6: strand

    chr1    101  105   X     X     -
    chr1    101  105   X     X     -
    chr1    101  105   X     X     -
    chr1    102  108   X     X     -
    chr1    102  108   X     X     -
    chr1    102  108   X     X     -
    chr1    106  111   X     X     -
    chr1    112  113   X     X     -
    chr1    112  113   X     X     -
    chr1    112  113   X     X     -
    chr1    112  113   X     X     -
    chr1    113  115   X     X     -
    chr2    114  118   X     X     -
    chr2    119  121   X     X     -
    chr2    120  123   X     X     -
    chr3    125  130   X     X     -
    chr3    131  132   X     X     -

I need columns 1, 2, 3 and 6.

With the mergeBed command, the output looks like this:

TSSD_ID        chr       start        end      strand      count
  ID_1         1          101          111       -           7
  ID_2         1          112          113       -           4
  ID_3         1          113          115       -           1
  ID_1         2          114          118       -           1   
  ID_2         2          119          123       -           2 
  ID_1         3          125          132       -           1

But I want this output instead, with more detail for identical coordinates:

TSSD_ID        chr       start        end      strand      count
  ID_1         1          101          105       -           3
  ID_1         1          102          108       -           3
  ID_1         1          106          111       -           1
  ID_2         1          112          113       -           4
  ID_3         1          113          115       -           1
  ID_1         2          114          118       -           1   
  ID_2         2          119          123       -           2 
  ID_1         3          125          132       -           1

As you can see, I want a more detailed count for each set of merged coordinates. For example, for ID_1 on chromosome 1, instead of a single count of 7, I want the individual coordinates that were merged, with their counts (3, 3, 1).

counts perl overlap • 4.7k views

ADD COMMENT • link updated 13.4 years ago by Eric Fournier ★ 1.4k • written 13.4 years ago by Ar Es ▴ 20

Post your script and we'll try to help.

ADD REPLY • link 13.4 years ago by Michael 56k

I edited what I could understand. You really need to be clearer on what you are trying to do with the overlaps. Maybe also post what you have of the code so far.

ADD REPLY • link 13.4 years ago by Damian Kao 16k

I think I understand now. You want to consolidate overlapping tags into one coordinate group and output the number of 2bp overlaps within the consolidated tags? Are your coordinates inclusive? 112-113 and 113-115 do overlap by 1 base if you consider 113 to be inclusive of the coordinate.

ADD REPLY • link 13.4 years ago by Damian Kao 16k

Please be more specific and correct your formatting and typos; it is simply hard to read. Please rewrite the part where you describe how you want to group the overlaps. It's hard to grasp what you want, and you might not want a solution that is based on guessing. Further, do you mean overlaps of length exactly 2 (== 2) or at least 2 (>= 2)? Last point: it makes no sense to do it in Perl; it is much easier to solve such problems in R.

ADD REPLY • link 13.4 years ago by Michael 56k

Hi, I have never used R before. I need to count overlaps of >= 2 bp and ignore anything shorter. As DK said, 112-113 and 113-115 do overlap, but by less than 2, so I don't want to count them. I hope that makes sense!

ADD REPLY • link 13.4 years ago by Ar Es ▴ 20

I think I got it, simply speaking, you want to: 1. collapse your original intervals to a set of unique coordinate intervals, 2. for each unique interval you count the number of other intervals overlapping with overlap size of >=2. (This is like 2-5 lines in R plus reading the tables.) Please say yes if my interpretation is correct.
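
To make this interpretation concrete, here is a minimal sketch in Python (chosen only for brevity, not Perl or R). It assumes inclusive coordinates and a list of intervals from a single chromosome; the names overlap_len and count_overlaps are made up for this illustration. It implements the interpretation stated above, which is not necessarily the output the question asks for.

```python
from collections import Counter


def overlap_len(a, b):
    """Number of bases shared by two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def count_overlaps(intervals, min_overlap=2):
    """For each unique interval, count the other input intervals
    (including duplicate copies of itself) that overlap it by at
    least min_overlap bases."""
    copies = Counter(intervals)  # step 1: collapse to unique intervals
    result = {}
    for iv in sorted(copies):
        total = 0
        for other, n in copies.items():
            if overlap_len(iv, other) >= min_overlap:
                # Do not count the interval as overlapping itself.
                total += n - 1 if other == iv else n
        result[iv] = total
    return result
```

With inclusive coordinates, overlap_len((112, 113), (113, 115)) is 1, so that pair falls below the 2 bp threshold and is not counted.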

ADD REPLY • link 13.4 years ago by Michael 56k

Yes Michael, that's it, but I don't know how to use R to do that, and I'm also stuck in Perl.

ADD REPLY • link 13.4 years ago by Ar Es ▴ 20

I will decline your request to write a Perl program for you, because it is much more difficult and time-consuming to do it in Perl and you will end up with an inferior solution. Instead I will provide an example in R. If you are a beginner in Perl as well, it will definitely pay off to learn the superior toolset for this class of problems, if you share my intention to work in a solution-oriented rather than tool-oriented way. Sorry if that sounds patronizing, but simply trust me, because I have good experience with both R and Perl.

ADD REPLY • link 13.4 years ago by Michael 56k

While trying to do so, I noticed that your description is too fuzzy and contradicts your example. Please clean up your question or it cannot be answered.

ADD REPLY • link 13.4 years ago by Michael 56k

Thanks Michael, I will try to find out how to solve it. I really appreciate your kind help.

ADD REPLY • link 13.4 years ago by Ar Es ▴ 20

score 1 · Answer 1 · 2012-03-22

13.4 years ago

Eric Fournier ★ 1.4k

Here's how I would suggest going about solving the problem, in 4 steps:

1. Use bedtools merge -n -d -2 to merge overlapping features which have at least a 2 base overlap and get a count.
2. Use awk to rename entries in the merged bed files with unique identifiers.
3. Use bedtools intersect to get the new identifier and the original entries on the same lines.
4. Use a simple Perl script to go over the intersected output line by line. Go over lines while chr, start and end are the same, incrementing a counter. When coordinates change, output the line (with only the fields that interest you) along with the counter.
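
The steps above rely on bedtools and awk. As a rough, self-contained illustration of the same idea, here is a minimal Python sketch (Python rather than Perl, purely for brevity). It assumes inclusive coordinates, features at least 2 bp long, and that strand can be ignored when merging. The function name cluster_counts and the ID_n labels are made up for this example, and it does not call or reproduce bedtools itself.

```python
from collections import Counter
from itertools import groupby


def cluster_counts(rows, min_overlap=2):
    """rows: iterable of (chrom, start, end, strand), inclusive coordinates.

    Returns a list of (cluster_id, chrom, start, end, strand, count):
    one entry per distinct coordinate set, labelled with the merged
    cluster it belongs to and how many times it occurs in the input.
    """
    counts = Counter(rows)  # identical rows collapse into one key
    keys = sorted(counts, key=lambda r: (r[0], r[1], r[2], r[3]))
    out = []
    for chrom, group in groupby(keys, key=lambda r: r[0]):
        cluster_no = 0      # cluster IDs restart on every chromosome
        cluster_end = None  # right edge of the current merged cluster
        for _, start, end, strand in group:
            if cluster_end is None:
                shared = 0
            else:
                # Bases shared with the current cluster; rows are sorted
                # by start, so start >= the cluster's start.
                shared = min(cluster_end, end) - start + 1
            if shared < min_overlap:
                cluster_no += 1          # open a new cluster
                cluster_end = end
            else:
                cluster_end = max(cluster_end, end)  # extend the cluster
            out.append((f"ID_{cluster_no}", chrom, start, end, strand,
                        counts[(chrom, start, end, strand)]))
    return out
```

On the chr1 rows from the question, this gives ID_1 with counts 3, 3 and 1, ID_2 with count 4, and ID_3 with count 1.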

ADD COMMENT • link 13.4 years ago by Eric Fournier ★ 1.4k

I used merge before, but it counts the total number of overlapped TSSD, and I want to count the overlap of tag clusters which are not merged.

ADD REPLY • link 13.4 years ago by Ar Es ▴ 20

"I want to count the overlap of tag clusters which are not merged"

I'm sorry, I do not understand what you mean by that.

ADD REPLY • link 13.4 years ago by Eric Fournier ★ 1.4k

With merge and the -2 bp overlap setting, all of the overlapping entries are counted together. For example, given these entries:

    101  102
    101  102
    101  102
    102  103
    103  104
    103  104

All of them fall in one range, 101-104, so merge reports them as a single entry with a count of 6:

    start  end  count
    101    104  6

But I want the count for each range within that area (101-104) individually:

    start  end  count
    101    102  3
    102    103  1
    103    104  2

The counts add up to the same total of 6. I hope I explained it correctly!
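
The per-range tally described here can be sketched in a few lines of Python (used only for illustration). The sketch assumes each input line holds a start and an end separated by whitespace; the function name tally_ranges is hypothetical.

```python
from collections import Counter


def tally_ranges(lines):
    """Count how many times each (start, end) pair occurs."""
    pairs = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pairs[(int(fields[0]), int(fields[1]))] += 1
    return sorted(pairs.items())


example = ["101 102", "101 102", "101 102", "102 103", "103 104", "103 104"]
for (start, end), n in tally_ranges(example):
    print(start, end, n)
```

The individual counts always sum to the count that merging reports for the whole range, which gives a quick sanity check.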

ADD REPLY • link 13.4 years ago by Ar Es ▴ 20

Let me edit my post again, and then I will explain it there.

ADD REPLY • link 13.4 years ago by Ar Es ▴ 20

Could you please check my post again?

ADD REPLY • link 13.4 years ago by Ar Es ▴ 20

I would strongly advise against answering the question before it is possible to understand it without having to guess.

ADD REPLY • link 13.4 years ago by Michael 56k

I'm sorry, I thought I had a better grasp of the question than I did. I edited my answer to account for the changes in the question.

ADD REPLY • link 13.4 years ago by Eric Fournier ★ 1.4k

Thank you so much Eric, and I am sorry for my bad English.

Best,

ADD REPLY • link 13.4 years ago by Ar Es ▴ 20