Hello,
I am trying to test different parameters of the gbm function in R in order to make predictions with my data. I have a huge table of 79866 rows and 1586 columns, where the columns are counts of motifs in the DNA and the rows indicate different regions/positions in the DNA and the organism to which the counts belong. There are only 3 organisms, but the counts are separated by position (peak id).
The data looks like this:
chrII:11889760_11890077 worm 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...
Since I have memory problems that I don't know how to solve yet (because of the size of the table), I am using a subset of the data:
library(rsample)  # provides initial_split() and training()
motifs.table.sub <- motifs.table[1:1000, 1:1000]
set.seed(123)
motifs_split.sub <- initial_split(motifs.table.sub, prop = .7)
motifs_train.sub <- training(motifs_split.sub)
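On the memory side, one option would be to read only a slice of the table straight from disk instead of loading all 79866 x 1586 values first and subsetting afterwards. A minimal sketch with data.table::fread (the file name is hypothetical, and it assumes the counts are stored as a delimited text file):
library(data.table)
# hypothetical file name; reads only the first 1000 rows and first 1000 columns from disk
motifs.table.sub <- as.data.frame(fread("motif_counts.txt", nrows = 1000, select = 1:1000))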
I create a table with different parameters to test
hyper_grid <- expand.grid(
  shrinkage         = c(.01, .1, .3),
  interaction.depth = c(1, 3, 5),
  n.minobsinnode    = c(5, 10, 15),
  bag.fraction      = c(.65, .8, 1),
  optimal_trees     = 0,  # placeholder columns for the tuning results
  min_RMSE          = 0)
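For reference, expand.grid() enumerates every combination of the four tuning parameters, so there are 3 x 3 x 3 x 3 combinations to loop over:
nrow(hyper_grid)
# [1] 81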
Then I randomize the training data:
random_index.sub <- sample(1:nrow(motifs_train.sub), nrow(motifs_train.sub))
random_motifs_train.sub <- motifs_train.sub[random_index.sub, ]
Then I test the different parameter combinations with 1000 trees:
library(gbm)

for(i in 1:nrow(hyper_grid)) {
  set.seed(123)
  gbm.tune <- gbm(
    formula = organism ~ .,
    distribution = "gaussian",  # default
    data = random_motifs_train.sub,
    n.trees = 1000,
    interaction.depth = hyper_grid$interaction.depth[i],
    shrinkage = hyper_grid$shrinkage[i],
    n.minobsinnode = hyper_grid$n.minobsinnode[i],
    bag.fraction = hyper_grid$bag.fraction[i],
    train.fraction = 0.70,
    n.cores = NULL,
    verbose = TRUE)
  print(head(gbm.tune$valid.error))
}
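For completeness, the tutorial linked below also records, for each grid row, the iteration with the lowest validation error and the corresponding RMSE. A hedged sketch of that bookkeeping, placed just before the closing brace of the loop above (it assumes valid.error contains finite values, which is exactly what is failing here):
  # inside the loop, after fitting gbm.tune (following the linked tutorial)
  hyper_grid$optimal_trees[i] <- which.min(gbm.tune$valid.error)
  hyper_grid$min_RMSE[i] <- sqrt(min(gbm.tune$valid.error))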
The problem is that the model never improves:
Iter TrainDeviance ValidDeviance StepSize Improve
1 nan nan 0.0100 nan
2 nan nan 0.0100 nan
3 nan nan 0.0100 nan
4 nan nan 0.0100 nan
5 nan nan 0.0100 nan
6 nan nan 0.0100 nan
7 nan nan 0.0100 nan
8 nan nan 0.0100 nan
9 nan nan 0.0100 nan
10 nan nan 0.0100 nan
And values like valid.error are not calculated; they remain 'NA'. I tried changing the size of the data subset and the same happens with all the parameter combinations I am testing. My data table is huge and has a lot of zeros. I thought of removing motifs with low counts, but I don't think that would help, since only 127 motifs out of 1586 have fewer than 10 counts. Any idea what I am doing wrong?
thanks!
PS: I am following this tutorial: http://uc-r.github.io/gbm_regression
Edit: apparently TrainDeviance is nan if train.fraction is not < 1, but that is not my case: https://stackoverflow.com/questions/23530165/gradient-boosting-using-gbm-in-r-with-distribution-bernoulli
Don't know if the same is true for R, but sklearn's GBM classifier in Python is extremely slow (mostly because it is single-threaded). There are faster and in general better GBM implementations, such as LightGBM. What's pertinent to your problem is that LightGBM works with sparse matrices, which would probably cut your memory needs by at least 70-80% if the row you showed above is representative of your data.

Thank you! I am gonna try with LightGBM