Question

improve my slow R function

1

Entering edit mode

10.3 years ago

lilepisorus ▴ 40

I am trying to calculate the percentage of heterozygous sites for each sample. I wrote a R script using stringr to count the number of heterozygous sites and divided by the number of known sites (the number of all sites minus the missing sites "N"). Here is my function:

The column of my data frame is for each sample and the row is for each SNP site.

PercentPolymorphism <- function(df) {
countN=c(); 
knownSitesN=c();
heteroSitesN=c();
PercPolym=c()
for (i in 1:ncol(df)) {
    countN[i]= sum(str_count(df[, i], "N"));
    knownSitesN[i]=nrow(df) - countN[i]
    heteroSitesN[i] = sum(str_count(df[, i], c("R", "Y", "M", "K", "S", "W")));
    PercPolym[i] = heteroSitesN[i]/knownSitesN[i]
}#for

df.new=as.data.frame(rbind(knownSitesN, heteroSitesN, PercPolym, df))
return(df.new)
} #function

I have around 5000 samples and merely 1Mb SNP data. This function works with the data without throwing any error, but it takes very long to complete.

Any one could suggest to modify my function in order to improve the computation efficiency?

Thanks in advance!

Li

R • 6.9k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by lilepisorus ▴ 40

0

Entering edit mode

Please post example data.

ADD REPLY • link 10.3 years ago by PoGibas 5.1k

0

Entering edit mode

An example is as follows:

   sample1   sample2   sample3  sample4   sample5  
Site1 A       T       C       T       W      W
Site2 A       W       M       N       N       N
Site3 R       A       A       A       N       A
Site4 A       T       C       W       T       T
Site5 G       A       N       N       A       A

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by lilepisorus ▴ 40

0

Entering edit mode

This contains six sample columns, not five.

ADD REPLY • link 10.3 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

use a character matrix instead of a data frame

ADD REPLY • link 10.3 years ago by Jeremy Leipzig 23k

Ram · Answer 1 · 2015-04-20

4

Entering edit mode

10.3 years ago

Devon Ryan 105k

PercentPolymorphism <- function(df) {
countN=c(); 
knownSitesN=c();
heteroSitesN=c();
PercPolym=c()
countN = lapply(df, function(x) sum(str_count(x, "N")))
knownSitesN = nrow(df)-c(1:ncol(df))

and so on. You can also use mclapply() rather than just lapply(), to take advantage of multiple cores. Avoid for loops in R.

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Devon Ryan 105k

3

Entering edit mode

While it's true that for-loops in R are slow, the use of *apply is not necessarily better. I think *apply functions use for loops anyway. Here's an example:

system.time({
## FOR LOOP
ilog<- rep(NA, 1000000)
for(i in 1:1000000){
    ilog[i]<- log(i)
}
})
   user  system elapsed
  1.401   0.004   1.403

system.time({
## SAPPLY
ilog<- sapply(1:1000000, log)
})
   user  system elapsed
  3.394   0.011   3.403

system.time({
+ ## LAPPLY
+ ilog<- lapply(1:1000000, log)
+ })
   user  system elapsed
  2.564   0.004   2.565

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by dariober 15k

1

Entering edit mode

I borrowed ideas from this thread by using foreach and multicores:

library(foreach)
library(doMC)
registerDoMC()

#change the previous: for (i in 1:ncol(df)) {         
foreach(i=1:ncol(df)) %dopar% {

I am expecting it will largely reduce the computing time, as I am using a 32 core server.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by lilepisorus ▴ 40

0

Entering edit mode

You can use snow, which is a parallelization library built off of MPI. Really easy to use, you add a few lines of set up and then any place you have an apply, snow provides a parallel version you can just drop in.

ADD REPLY • link 10.3 years ago by pld 5.1k

0

Entering edit mode

Glad they've gotten faster, it's often annoying to write everything in a functional programming style.

ADD REPLY • link 10.3 years ago by Devon Ryan 105k

Ram · Answer 2 · 2015-04-20

To solve this problem you must first convert your data frame to a long format.

Your file:

site sample1 sample2 sample3 sample4 sample5 sample6
Site1 A T C T W W
Site2 A W M N N N
Site3 R A A A N A
Site4 A T C W T T
Site5 G A N N A A

Code:

> library(tidyr)
> library(dplyr)
> mydf = read.table('biostar.csv', header=T)
> mydf.long = mydf %>% gather(sample, genotype, -site)
> mydf.long %>% head
    site  sample genotype
1  Site1 sample1        A
2  Site2 sample1        A
3  Site3 sample1        R
4  Site4 sample1        A
5  Site5 sample1        G
6  Site1 sample2        T
7  Site2 sample2        W
8  Site3 sample2        A

EDIT: if you prefer the reshape2 library instead of tidyr:

> library(reshape2)
> mydf.long = mydf %>% 
   melt(id.vars=c('site'), variable.name='sample') %>% 
   rename(genotype=value)

Now I am not really sure about what your function must do, but the principle is to use group_by and summarise from dplyr.

> mysummary = mydf.long %>%
    group_by(sample) %>%
    summarise(
        totN         = sum(genotype == 'N', na.rm=T),  # in alternative length(site[genotype=='N'])
        knownSitesN  = sum(genotype != 'N', na.rm=T),
        heteroSitesN = sum (genotype %in% c("R", "Y", "M", "K", "S", "W"), na.rm=T),
        percPolyM    = heteroSitesN/knownSitesN
    )

> mysummary
   sample totN knownSitesN heteroSitesN percPolyM
1 sample1    0           5            1 0.2000000
2 sample2    0           5            1 0.2000000
3 sample3    1           4            1 0.2500000
4 sample4    2           3            1 0.3333333
5 sample5    2           3            1 0.3333333
6 sample6    1           4            1 0.2500000

Again, instead of doing operations on columns, you must reformat the dataframe to a long format, so that the sites are on the rows instead. Then you can use the standard group_by operations in dplyr to calculate your summaries. You never need to iterate on columns in R.

Ram · Answer 3 · 2015-04-20

1

Entering edit mode

10.3 years ago

Jeremy Leipzig 23k

foo<-paste(sample(c("A","C","G","T","R","Y","M","K","S","W","N"),2500,rep=T),sep='',collapse='')
bar<-matrix(unlist(strsplit(foo,'')),ncol=500)

useful_stats<-function(x){
  countN<-sum(x=='N')
  known<-sum(x!='N')
  hetero<-sum(x %in% c("R", "Y", "M", "K", "S", "W"))
  percpoly<-as.numeric(hetero)/as.numeric(known)
  list("countN"=countN,"known"=known,"hetero"=hetero,"percpoly"=percpoly)
}

seems pretty fast

> system.time(apply(bar,2,useful_stats))
   user  system elapsed 
  0.018   0.000   0.019

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Jeremy Leipzig 23k

0

Entering edit mode

It is good for only few columns, but when it extend to the whole dataset. I still got a memory issue:

Error in aperm.default(X, c(s.call, s.ans)) :
  long vectors not supported yet: memory.c:1636
Calls: apply -> aperm -> aperm.default
Execution halted

I will try to truncate my dataset into several pieces and see if it goes through.

Thanks!

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by lilepisorus ▴ 40

Ram · Answer 4 · 2015-04-20

Pre-allocate space for your vectors; otherwise, you force R to reallocate space as the vectors grow. This can really slow down R loops (cf. http://musicallyut.blogspot.com/2012/07/pre-allocate-your-vectors.html):

> some_vector <- vector("numeric", ncol(some_data_frame))
> # do stuff with some_vector...

This behaves with somewhat similar performance as dariober's recommendation to use rep():

> system.time({
+ ilog <- rep(NA, 1000000)
+ for (i in 1:1000000){
+     ilog[i] <- log(i)
+ }
+ })
   user  system elapsed 
  1.495   0.020   1.522 

> system.time({
+ ilog <- vector("numeric", 1000000)
+ for (i in 1:1000000) {
+     ilog[i] <- log(i)
+ }
+ })
   user  system elapsed 
  1.505   0.016   1.528

Use rbindlist() instead of rbind() to make your data frame (cf. http://stackoverflow.com/questions/15673550/why-is-rbindlist-better-than-rbind).

See The R Inferno for a larger discussion on why growing objects in R is not ideal. It is a very useful guide on navigating some of R's pitfalls.

Ram · Answer 5 · 2015-04-20

0

Entering edit mode

10.3 years ago

komal.rathi ★ 4.1k

Not sure what your function is doing but you can do something like below by changing your function to work on vectors instead of dataframes/matrices. Use aaply to apply a function across columns in a parallel manner:

# load libraries
library(plyr)
library(doMC) 

# register 8 cores (you can give more depending on the access)
registerDoMC(8) 

# function definition
# not really sure what your function is doing 
# modify function to work on vectors instead of dataframe/matrix
# i.e. treat each column as vector
PercentPolymorphism <- function(x){
  countN = sum(x %in% "N")
  knownSitesN = length(x) - countN
  heteroSitesN = sum(x %in% c("R", "Y", "M", "K", "S", "W"))
  PercPolym = heteroSitesN/knownSitesN
  df = data.frame(knownSitesN,heteroSitesN,PercPolym)
}

# use parallel = TRUE
# margins = 2 to apply function on columns
res <- data.frame()
res <- rbind(res, aaply(.data = df, .margins = 2, .fun = PercentPolymorphism, .parallel = TRUE))

ADD COMMENT • link 10.3 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

Thanks!

I bet to write function on vectors is the most efficient way to do it.

However, when go to step apply

res <- rbind(res, aaply(.data = df, .margins = 2, .fun = PercentPolymorphism, .parallel = TRUE))

I still get the memory errors:

Error in aperm.default(X, c(s.call, s.ans)) :
  long vectors not supported yet: memory.c:1636
Calls: as.data.frame -> rbind -> apply -> aperm -> aperm.default
Execution halted

I ended up by subset my big dataframe into several small pieces. I have around 5000 columns, and I do 1000 each time, and it can go through taking around 15-20 minutes. At last, I put the results file together.

Thanks, everyone! If someone comes up with the idea, what is behind the limitation, I would love to hear.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by lilepisorus ▴ 40