Hello,
I am trying to test different parameters of the gbm function in R in order to make predictions with my data. I have a huge table of 79866 rows and 1586 columns, where the columns are counts of motifs in the DNA and the rows indicate different regions/positions in the DNA and the organism to which the counts belong. There are only 3 organisms, but the counts are separated by position (peak id).
The data looks like this:
chrII:11889760_11890077 worm 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...
Since I have memory problems that I don't know how to solve yet (because of the size of the table), I am using a subset of the data:
library(rsample)  # provides initial_split() and training()
motifs.table.sub <- motifs.table[1:1000, 1:1000]
set.seed(123)
motifs_split.sub <- initial_split(motifs.table.sub, prop = .7)
motifs_train.sub <- training(motifs_split.sub)
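On the memory side, one option would be to read only a slice of the table straight from disk instead of loading all 79866 x 1586 values first and subsetting afterwards. A minimal sketch with data.table::fread (the file name is hypothetical, and it assumes the counts are stored as a delimited text file):
library(data.table)
# hypothetical file name; reads only the first 1000 rows and first 1000 columns from disk
motifs.table.sub <- as.data.frame(fread("motif_counts.txt", nrows = 1000, select = 1:1000))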
I create a table with different parameters to test
hyper_grid <- expand.grid(
  shrinkage         = c(.01, .1, .3),
  interaction.depth = c(1, 3, 5),
  n.minobsinnode    = c(5, 10, 15),
  bag.fraction      = c(.65, .8, 1),
  optimal_trees     = 0,  # placeholder columns for the tuning results
  min_RMSE          = 0)
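For reference, expand.grid() enumerates every combination of the four tuning parameters, so there are 3 x 3 x 3 x 3 combinations to loop over:
nrow(hyper_grid)
# [1] 81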
Then I randomize the training data:
random_index.sub <- sample(1:nrow(motifs_train.sub), nrow(motifs_train.sub))
random_motifs_train.sub <- motifs_train.sub[random_index.sub, ]
Then I test the different parameter combinations with 1000 trees:
library(gbm)

for(i in 1:nrow(hyper_grid)) {
  set.seed(123)
  gbm.tune <- gbm(
    formula = organism ~ .,
    distribution = "gaussian",  # default
    data = random_motifs_train.sub,
    n.trees = 1000,
    interaction.depth = hyper_grid$interaction.depth[i],
    shrinkage = hyper_grid$shrinkage[i],
    n.minobsinnode = hyper_grid$n.minobsinnode[i],
    bag.fraction = hyper_grid$bag.fraction[i],
    train.fraction = 0.70,
    n.cores = NULL,
    verbose = TRUE)
  print(head(gbm.tune$valid.error))
}
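For completeness, the tutorial linked below also records, for each grid row, the iteration with the lowest validation error and the corresponding RMSE. A hedged sketch of that bookkeeping, placed just before the closing brace of the loop above (it assumes valid.error contains finite values, which is exactly what is failing here):
  # inside the loop, after fitting gbm.tune (following the linked tutorial)
  hyper_grid$optimal_trees[i] <- which.min(gbm.tune$valid.error)
  hyper_grid$min_RMSE[i] <- sqrt(min(gbm.tune$valid.error))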
The problem is that the model never improves:
Iter TrainDeviance ValidDeviance StepSize Improve
1 nan nan 0.0100 nan
2 nan nan 0.0100 nan
3 nan nan 0.0100 nan
4 nan nan 0.0100 nan
5 nan nan 0.0100 nan
6 nan nan 0.0100 nan
7 nan nan 0.0100 nan
8 nan nan 0.0100 nan
9 nan nan 0.0100 nan
10 nan nan 0.0100 nan
And values like valid.error are not calculated; they remain 'NA'. I tried changing the size of the data subset and the same happens with all the parameter combinations I am testing. My data table is huge and has a lot of zeros. I thought of removing motifs with low counts, but I don't think that would help, since only 127 motifs out of 1586 have fewer than 10 counts. Any idea what I am doing wrong?
thanks!
PS: I am following this tutorial: http://uc-r.github.io/gbm_regression
Edit: apparently TrainDeviance is nan if train.fraction is not < 1, but that is not my case: https://stackoverflow.com/questions/23530165/gradient-boosting-using-gbm-in-r-with-distribution-bernoulli
Don't know if the same is true for R, but sklearn's GBM classifier in Python is extremely slow (mostly because it is single-threaded). There are faster and in general better GBM implementations, such as LightGBM. What's pertinent to your problem is that LightGBM works with sparse matrices, which would probably cut your memory needs by at least 70-80% if the row you showed above is representative of your data.

Thank you! I am gonna try with LightGBM