Question

R Help: For Loop Over Range Of Number And Calculate Average Using An If Statement

4

Entering edit mode

10.8 years ago

714 ▴ 110

Hi guys,

I have a file (named DP.2L) which looks like this:

            CHROM        POS       SAMPLE_1     
            1            1168      47
            1            1197      40
            1            1202      45

POS ranges from 1168 to 49359284. I generated a sequence vector containing so that I get bin sizes starting from 1168, to 49359284 by 10,000 steps:

            > seqWrapper <- function(lb, ub, by=1) {
            +  s <- c()
            + if(!ub <= lb) s <- seq(lb,ub, by=by)
            + return(s)
            + }
            > x<-seqWrapper(lb=1168, ub=49359284,by=10000)

I want to sequentially calculate the average values of SAMPLE_1 if the value of x is less than or equal to the value of POS but more than the value of the next value of x. For example if x values are :1168 11168 21168 , then I want to first look at the first x value (1168) and compare it with the POS column. I want it to calculate the average of the corresponding rows in SAMPLE_1 if the value of its corresponding POS is less than or equal to the first values of x (1168). But for the next x value (11168), I want the average should only be calculated x is <= POS but > the first x value (1168).

I got as far as this but its not right....

            t<- for (i in x) { if (DP.2L<=x & DP.2L > x[i]) {(mean(DP.2L$SAMPLE_1))} }

r • 56k views

ADD COMMENT • link updated 8.9 years ago by Biostar 20 • written 10.8 years ago by 714 ▴ 110

1

Entering edit mode

Your analysis is not really 'sliding-window' but 'consecutive windows' because in sliding-windows are overlapping in my understanding, but according to your definition your regions do not overlap. I suggest to consider whether this is really what you want and if it is appropriate at all. I can just guess that you have data from a a tiling array or ChIP-chip experiment, but in this case for plotting, a real sliding window approach might be preferable.

ADD REPLY • link 10.8 years ago by Michael 55k

0

Entering edit mode

Yes, Michael what I actually meant to say was a non-overlapping window. As I understood a sliding window can be overlapping or non-overlapping, because the term "sliding" refers to the analysis sliding across the data.

ADD REPLY • link 10.8 years ago by 714 ▴ 110

score 6 · Answer 1 · 2014-02-03

In short, if you use a for loop in R, you are most likely doing it wrong.

Convert your data into a Rle object (run-length encoded data) and use IRanges Views to accomplish a more efficient solution of the naive implementation (corrected your approach, didn't test it though, because it is inefficient to uselessness for larger datasets).

Now let's take this implementation as a definition of what you want to accomplish:

    res = numeric();
    for (i in 1:length(x)-1) { 
      tmp <- DP.2L[DP.2L$POS<=x[i+1] & DP.2L$POS > x[i], ];
      res = c(res,(mean(tmp$SAMPLE_1)); 
    }

The IRanges Views and RLE - classes provides a toolset to approach this most efficiently and elegantly:

library(IRanges)
## first create some dummy data
rle <- Rle(NA,lengths=49359284) # Rle-vector
## you'd use:
# ind <- DP.2L$pos
ind <- seq(1, 49359284, by=10) # insert data at every 10. pos
rle[ind] <- rpois(length(ind), 40) # let data be poisson distributed around 40, as in your example ;) 
## ofc you'd use instead:
# rle[ind] <- DP.2L$SAMPLE_1
## now create all windows of size 10000 in one go, starting from position 1168 as required:
my.views <- successiveViews(rle, width=rep(10000, 49359284/10000), from=1168)
my.views
## ... 
## voila, compute the means of all your windows
viewMeans(rle.views, na.rm=TRUE)

Now, you can compare the performance to any other solution on your data.

see also here: https://stat.ethz.ch/pipermail/bioconductor/2010-September/035532.html

score 4 · Answer 2 · 2014-02-03

4

Entering edit mode

10.8 years ago

zx8754 12k

It would help if you clarify the "why", the reason for doing what you are trying to do, there might be a better way.

Anyway, this example should help with your question:

#dummy data
df <- data.frame(CHROM=rep(1,100),
                 POS=1:100,
                 SAMPLE_1=round(runif(100,40,80)))
#create bins
bin <- seq(1,100,5)

#get mean per bin
sapply(1:(length(bin)-1),
       function(i)
         mean(df[ df$POS>=bin[i] &
                    df$POS < bin[i+1], "SAMPLE_1"]))

ADD COMMENT • link 10.8 years ago by zx8754 12k

0

Entering edit mode

Thanks so much! Sorry I forgot to mention that I am trying to perform a sliding window analysis using genomic data

ADD REPLY • link 10.8 years ago by 714 ▴ 110