Splitting a dataframe into overlapping groups of equal sizes
6.6 years ago
user_g ▴ 20

Hello, I am looking for a way to split my data into overlapping groups, where each group spans a window size that I define.

Chrom     Start   End        
 chr1       1    10      
 chr1       11   20      
 chr1       21   30     
 chr1       31   40

For example, if I want a window size of 20, then the groups would be: 1-20, 11-30, 21-40. Rows keep being added to a group as long as the size of the group does not exceed 20.

I tried using the split function but couldn't get it to group the rows this way. Is there a way around this?

r GenomicRange IRange GRange • 2.4k views

Some questions:

Do you have a data frame or a GRanges object? Your example data looks like a data frame, but you mentioned GRanges.

Can the same row go into different groups?

Also, does each start value automatically open a new group? For example, if you had a row c("chr1","16","25"), this would create a group from 16 to 35. In that case you will have as many groups as rows...

What do you want to achieve after that splitting?


I am alternating between data frames and GRanges to find the best way to achieve this, so if I could find a way to do this with GRanges, I would convert my data frame into a GRanges object.

Yes the same row can be in another group.

Yes, that's true, I will end up with the same number of clusters as rows, but each cluster should be a row in a data frame or GRanges object, not an independent data frame per row.

I need these clusters to study them further in the next stage.


Why not use a for loop over your data frame and do your processing inside the loop?

Something like this:

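# toGRanges() is assumed to come from the regioneR package; the post does not name its source
library(regioneR)

df <- data.frame(chrom = c("chr1", "chr1", "chr1", "chr1"),
                 start = c(1, 11, 21, 31),
                 end   = c(10, 20, 30, 40))
window_size <- 20

for (row in seq_len(nrow(df))) {
    # Keep the rows whose start lies within window_size of the current row's start
    win_start <- df[row, "start"]
    df_cluster <- df[df$start >= win_start & df$start < win_start + window_size, ]
    ### Here you can process each cluster
    ### Create your GRanges
    my_GRange <- toGRanges(df_cluster)
}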

If you really need all your GRanges at the same time, you can create a list before the loop, add each GRanges object to it inside the loop, and use the list after the loop.
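
A minimal sketch of that list-based variant, in base R only so it runs without Bioconductor. The name `clusters` and the extra check on `chrom` are my own additions (the check simply keeps a window from mixing rows of different chromosomes), so treat them as assumptions rather than part of the original suggestion:

```r
df <- data.frame(
  chrom = c("chr1", "chr1", "chr1", "chr1"),
  start = c(1, 11, 21, 31),
  end   = c(10, 20, 30, 40)
)
window_size <- 20

# Pre-allocate one list slot per row instead of growing the list in the loop
clusters <- vector("list", nrow(df))

for (row in seq_len(nrow(df))) {
  win_start <- df$start[row]
  # Rows on the same chromosome whose start falls inside the current window
  in_window <- df$chrom == df$chrom[row] &
    df$start >= win_start &
    df$start < win_start + window_size
  clusters[[row]] <- df[in_window, ]
}

# Label each cluster by the row that opened it (hypothetical naming scheme)
names(clusters) <- paste0(df$chrom, ":", df$start)
```

If GRanges objects are needed afterwards, something like `lapply(clusters, GenomicRanges::makeGRangesFromDataFrame)` should convert each element, assuming the GenomicRanges package is installed.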


Hello, yes, I tried using a for loop, but it became very slow on large data, which is why I am looking for another way.
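
One possible way to avoid the row-by-row loop is sketched below, not as a definitive answer. It assumes the intervals within a chromosome do not overlap each other, so that after sorting by start the last row of a window also carries its largest end. The helper `window_index` and the output column names are hypothetical. It uses base R's `findInterval()` to locate the first and last row of every window in one vectorized pass, and returns one row per group:

```r
df <- data.frame(
  chrom = c("chr1", "chr1", "chr1", "chr1"),
  start = c(1, 11, 21, 31),
  end   = c(10, 20, 30, 40)
)
window_size <- 20

# Hypothetical helper: summarise the windows of a single chromosome
window_index <- function(chr_df, window_size) {
  chr_df <- chr_df[order(chr_df$start), ]
  s <- chr_df$start
  # First row whose start is >= the current start (handles tied starts)
  first <- findInterval(s, s, left.open = TRUE) + 1L
  # Last row whose start is strictly below start + window_size
  last <- findInterval(s + window_size, s, left.open = TRUE)
  data.frame(
    chrom       = chr_df$chrom,
    group_start = s[first],
    group_end   = chr_df$end[last],
    n_rows      = last - first + 1L
  )
}

# Apply the helper per chromosome and stack the results
groups <- do.call(
  rbind,
  lapply(split(df, df$chrom), window_index, window_size = window_size)
)
rownames(groups) <- NULL
```

On the example data this gives the groups 1-20, 11-30 and 21-40 from the question, plus a final shorter group 31-40 opened by the last row. Whether it is fast enough on your real data would need to be checked.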

