I guess the easiest way is to show you a simplified version of my data (bp = base pair). I have data for various chromosomes, with the start and end points of the tag clusters on each chromosome. It looks like this:
Columns: 1: chr, 2: start, 3: end, 4: info (X), 5: info (X), 6: strand
chr1 101 105 X X -
chr1 101 105 X X -
chr1 101 105 X X -
chr1 102 108 X X -
chr1 102 108 X X -
chr1 102 108 X X -
chr1 106 111 X X -
chr1 112 113 X X -
chr1 112 113 X X -
chr1 112 113 X X -
chr1 112 113 X X -
chr1 113 115 X X -
chr2 114 118 X X -
chr2 119 121 X X -
chr2 120 123 X X -
chr3 125 130 X X -
chr3 131 132 X X -
I need columns 1, 2, 3 and 6.
With the mergeBed command, the output looks like this:
TSSD_ID chr start end strand count
ID_1 1 101 111 - 7
ID_2 1 112 113 - 4
ID_3 1 113 115 - 1
ID_1 2 114 118 - 1
ID_2 2 119 123 - 2
ID_1 3 125 132 - 1
But I want to get this output, with more detail for identical coordinates:
TSSD_ID chr start end strand count
ID_1 1 101 105 - 3
ID_1 1 102 108 - 3
ID_1 1 106 111 - 1
ID_2 1 112 113 - 4
ID_3 1 113 115 - 1
ID_1 2 114 118 - 1
ID_2 2 119 123 - 2
ID_1 3 125 132 - 1
As you can see, I want a more detailed count for each set of merged coordinates. For example, for ID_1 on chromosome 1, instead of a single count of 7 I want the coordinates of each merged interval with their own counts (3, 3, 1).
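One way to read the request above is "count how many times each identical (chr, start, end, strand) row occurs". The following is a minimal Python sketch of that reading only. The names `SAMPLE`, `parse_tags` and `count_identical` are made up for illustration, the input is assumed to be whitespace-separated with the six columns described in the question, and only the chr1 rows are included. It does not merge overlapping but non-identical intervals, so it would not reproduce the chr2 and chr3 rows of the desired table.

```python
from collections import Counter

# chr1 rows copied from the question (whitespace-separated, six columns).
SAMPLE = """\
chr1 101 105 X X -
chr1 101 105 X X -
chr1 101 105 X X -
chr1 102 108 X X -
chr1 102 108 X X -
chr1 102 108 X X -
chr1 106 111 X X -
chr1 112 113 X X -
chr1 112 113 X X -
chr1 112 113 X X -
chr1 112 113 X X -
chr1 113 115 X X -
"""


def parse_tags(text):
    """Yield (chrom, start, end, strand), i.e. columns 1, 2, 3 and 6."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        chrom, start, end, _info1, _info2, strand = fields[:6]
        yield chrom, int(start), int(end), strand


def count_identical(text):
    """Map each distinct (chrom, start, end, strand) to its number of copies."""
    return Counter(parse_tags(text))


counts = count_identical(SAMPLE)
for (chrom, start, end, strand), n in sorted(counts.items()):
    print(chrom, start, end, strand, n)
```

For the chr1 rows this gives counts of 3, 3, 1, 4 and 1, which matches the chr1 part of the desired output; the ID labels are left out because the rule for assigning them is not fully specified.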
Post your script and we'll try to help.
I edited what I could understand. You really need to be clearer on what you are trying to do with the overlaps. Maybe also post what you have of the code so far.
I think I understand now. You want to consolidate overlapping tags into one coordinate group and output the number of 2 bp overlaps within the consolidated tags? Are your coordinates inclusive? 112-113 and 113-115 do overlap by 1 base if you consider position 113 to be included in both intervals.
Please be more specific and correct your formatting and typos; the question is simply hard to read. Please rewrite the part where you describe how you want to group the overlaps. It is hard to grasp what you want, and you probably do not want a solution that is based on guessing. Further, do you mean overlaps of length exactly 2 (== 2) or at least 2 (>= 2)? Last point: it makes no sense to do this in Perl; such problems are much easier to solve in R.
Hi, I have never used R before. I need to count overlaps >= 2 and not count anything shorter. As DK said, 112-113 and 113-115 overlap, but by less than 2, so I don't want to count them. I hope that makes sense!
I think I got it. Simply put, you want to: 1. collapse your original intervals to a set of unique coordinate intervals, 2. for each unique interval, count the number of other intervals that overlap it by >= 2. (This is about 2-5 lines in R, plus reading the tables.) Please say yes if my interpretation is correct.
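The two steps in this interpretation can be sketched as follows. This is an illustration in Python rather than the R suggested in the thread, and it rests on several assumptions that the thread does not settle: coordinates are treated as inclusive, "other intervals" is taken to mean original records with different coordinates on the same chromosome and strand, and the helper names `overlap_len` and `count_overlaps` are hypothetical.

```python
from collections import Counter

# chr1 records from the question as (chrom, start, end, strand) tuples.
records = (
    [("chr1", 101, 105, "-")] * 3
    + [("chr1", 102, 108, "-")] * 3
    + [("chr1", 106, 111, "-")]
    + [("chr1", 112, 113, "-")] * 4
    + [("chr1", 113, 115, "-")]
)


def overlap_len(a_start, a_end, b_start, b_end):
    """Number of shared positions, assuming inclusive start and end."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def count_overlaps(records, min_overlap=2):
    """For each unique interval, count other records overlapping by >= min_overlap."""
    unique = Counter(records)  # step 1: collapse to unique intervals
    result = {}
    for key in unique:
        chrom, start, end, strand = key
        n = 0
        for other, copies in unique.items():
            o_chrom, o_start, o_end, o_strand = other
            if other == key or o_chrom != chrom or o_strand != strand:
                continue  # skip identical copies, other chromosomes, other strand
            if overlap_len(start, end, o_start, o_end) >= min_overlap:
                n += copies  # step 2: every copy of the neighbour counts
        result[key] = n
    return result


overlaps = count_overlaps(records)
for key in sorted(overlaps):
    print(*key, overlaps[key])
```

Under these assumptions 112-113 and 113-115 share only one position, so neither counts the other. The quadratic loop is fine for a small table; a real data set would call for a sorted sweep or an interval index. Because this counts neighbouring records rather than identical copies, its numbers are not the (3, 3, 1) shown in the question's desired table.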
Yes Michael, that's it, but I don't know how to use R to do that, and I am also stuck in Perl.
I will decline your request to write a Perl program for you, because it is much more difficult and time-consuming to do this in Perl and you would end up with an inferior solution. Instead I will provide an example in R. If you are a beginner in Perl as well, it will definitely pay off to learn the better toolset for this class of problems, provided you share my aim of working in a solution-oriented rather than tool-oriented way. Sorry if that sounds patronizing, but simply trust me, because I have good experience with both R and Perl.
While trying to do so, I noticed that your description is too fuzzy and contradicts your example. Please clean up your question or it cannot be answered.
Thanks Michael, I will try to find out how to solve it, and I really appreciate your kind help.