R Help: For Loop Over Range Of Number And Calculate Average Using An If Statement
2
4
Entering edit mode
10.8 years ago
714 ▴ 110

Hi guys,

I have a file (named DP.2L) which looks like this:

            CHROM        POS       SAMPLE_1     
            1            1168      47
            1            1197      40
            1            1202      45

POS ranges from 1168 to 49359284. I generated a sequence vector containing so that I get bin sizes starting from 1168, to 49359284 by 10,000 steps:

            > seqWrapper <- function(lb, ub, by=1) {
            +  s <- c()
            + if(!ub <= lb) s <- seq(lb,ub, by=by)
            + return(s)
            + }
            > x<-seqWrapper(lb=1168, ub=49359284,by=10000)

I want to sequentially calculate the average values of SAMPLE_1 if the value of x is less than or equal to the value of POS but more than the value of the next value of x. For example if x values are :1168 11168 21168 , then I want to first look at the first x value (1168) and compare it with the POS column. I want it to calculate the average of the corresponding rows in SAMPLE_1 if the value of its corresponding POS is less than or equal to the first values of x (1168). But for the next x value (11168), I want the average should only be calculated x is <= POS but > the first x value (1168).

I got as far as this but its not right....

            t<- for (i in x) { if (DP.2L<=x & DP.2L > x[i]) {(mean(DP.2L$SAMPLE_1))} }
r • 56k views
ADD COMMENT
1
Entering edit mode

Your analysis is not really 'sliding-window' but 'consecutive windows' because in sliding-windows are overlapping in my understanding, but according to your definition your regions do not overlap. I suggest to consider whether this is really what you want and if it is appropriate at all. I can just guess that you have data from a a tiling array or ChIP-chip experiment, but in this case for plotting, a real sliding window approach might be preferable.

ADD REPLY
0
Entering edit mode

Yes, Michael what I actually meant to say was a non-overlapping window. As I understood a sliding window can be overlapping or non-overlapping, because the term "sliding" refers to the analysis sliding across the data.

ADD REPLY
6
Entering edit mode
10.8 years ago
Michael 55k

In short, if you use a for loop in R, you are most likely doing it wrong.

Convert your data into a Rle object (run-length encoded data) and use IRanges Views to accomplish a more efficient solution of the naive implementation (corrected your approach, didn't test it though, because it is inefficient to uselessness for larger datasets).

Now let's take this implementation as a definition of what you want to accomplish:

    res = numeric();
    for (i in 1:length(x)-1) { 
      tmp <- DP.2L[DP.2L$POS<=x[i+1] & DP.2L$POS > x[i], ];
      res = c(res,(mean(tmp$SAMPLE_1)); 
    }

The IRanges Views and RLE - classes provides a toolset to approach this most efficiently and elegantly:

library(IRanges)
## first create some dummy data
rle <- Rle(NA,lengths=49359284) # Rle-vector
## you'd use:
# ind <- DP.2L$pos
ind <- seq(1, 49359284, by=10) # insert data at every 10. pos
rle[ind] <- rpois(length(ind), 40) # let data be poisson distributed around 40, as in your example ;) 
## ofc you'd use instead:
# rle[ind] <- DP.2L$SAMPLE_1
## now create all windows of size 10000 in one go, starting from position 1168 as required:
my.views <- successiveViews(rle, width=rep(10000, 49359284/10000), from=1168)
my.views
## ... 
## voila, compute the means of all your windows
viewMeans(rle.views, na.rm=TRUE)

Now, you can compare the performance to any other solution on your data.

see also here: https://stat.ethz.ch/pipermail/bioconductor/2010-September/035532.html

ADD COMMENT
4
Entering edit mode
10.8 years ago
zx8754 12k

It would help if you clarify the "why", the reason for doing what you are trying to do, there might be a better way.

Anyway, this example should help with your question:

#dummy data
df <- data.frame(CHROM=rep(1,100),
                 POS=1:100,
                 SAMPLE_1=round(runif(100,40,80)))
#create bins
bin <- seq(1,100,5)

#get mean per bin
sapply(1:(length(bin)-1),
       function(i)
         mean(df[ df$POS>=bin[i] &
                    df$POS < bin[i+1], "SAMPLE_1"]))
ADD COMMENT
0
Entering edit mode

Thanks so much! Sorry I forgot to mention that I am trying to perform a sliding window analysis using genomic data

ADD REPLY

Login before adding your answer.

Traffic: 1770 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6