Hi,
I'm working on building a predictive model for breast cancer data using R. After performing GCRMA normalization, I generated the potential predictor variables. Now, when I run the random forest (RF) algorithm, I get the following error:
rf_output=randomForest(x=pred.data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)
Error in randomForest.default(x = pred.data, y = target, importance = TRUE, : Can not handle categorical predictors with more than 53 categories.
As I'm new to machine learning, I'm unable to proceed. Any help would be appreciated.
> clindata=clin_data_import[clincaldata_order,]
> data_order=order(colnames(data_import)[4:length(colnames(data_import))])+3 #Order data without first three columns, then add 3 to get correct index in original file
> rawdata=data_import[,c(1:3,data_order)] #grab first three columns, and then remaining columns in order determined above
> header=colnames(rawdata)
> X=rawdata[,4:length(header)]
> ffun=filterfun(pOverA(p = 0.2, A = 100), cv(a = 0.7, b = 10))
> filt=genefilter(2^X,ffun)
> filt_Data=rawdata[filt,]
> write.table(filt_Data,file="filt_Data_new.txt")
> predictor_data=t(filt_Data[,4:length(header)])
> predictor_names=c(as.vector(filt_Data[,3])) #gene symbol
> colnames(predictor_data)=predictor_names
> target= clindata[,"relapse"]
> target[target==0]="NoRelapse"
> target[target==1]="Relapse"
> target=as.factor(target)
> tmp = as.vector(table(target))
> num_classes = length(tmp)
> min_size = tmp[order(tmp,decreasing=FALSE)[1]]
> sampsizes = rep(min_size,num_classes)
> rf_output=randomForest(x=pred.data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)
Error in randomForest.default(x = pred.data, y = target, importance = TRUE, : Can not handle categorical predictors with more than 53 categories.
Thanks in advance.
Your code is unreadable. Please provide some sample data with the code.
I found this error message inside the randomForest package (randomForest.default.R, line 88).
If this is the case, you have to reduce the number of categories. (Very rough guess.)
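A rough way to check this guess, assuming the predictors can be coerced to a data frame: list the columns that are factors or character vectors and count their distinct values. The helper name find_wide_categoricals and the toy data below are invented for illustration; the limit of 53 is taken from the error message itself. This is only a sketch using base R, not the package's own check.

```r
# Hypothetical helper (not part of randomForest): report non-numeric columns
# that have more distinct values than max_levels.
find_wide_categoricals <- function(x, max_levels = 53) {
  df <- as.data.frame(x, stringsAsFactors = FALSE)
  # TRUE for columns that would be treated as categorical
  is_cat <- vapply(df, function(col) is.factor(col) || is.character(col), logical(1))
  # Number of distinct values in each categorical column
  n_levels <- vapply(df[is_cat], function(col) length(unique(col)), integer(1))
  n_levels[n_levels > max_levels]
}

# Toy data (invented): one numeric column, one character column with 60 distinct values
toy <- data.frame(
  expr = rnorm(60),
  gene_id = paste0("g", 1:60),
  stringsAsFactors = FALSE
)
wide <- find_wide_categoricals(toy)
print(wide)
```

If a call such as find_wide_categoricals(pred.data) returns anything, those columns are the likely culprits. Note also that the posted code builds predictor_data but passes pred.data to randomForest, so it may be worth confirming that both names refer to the object you intend.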
Yeah, I think there are now more than 53 categories defined somewhere, but if I understand it right, Kavya Krishnamurthy only wants the "NoRelapse" and "Relapse" groups. So my guess is that something goes wrong in defining num_classes, sampsizes, or target... Gee, your code is unreadable for humans...
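To rule out the outcome variable, one could inspect the factor and the derived sample sizes directly. The sketch below uses an invented toy vector in place of clindata[,"relapse"] and assumes the column is coded 0/1; it builds the factor in one step instead of overwriting values in place, which can silently produce NAs if the column is already a factor.

```r
# Toy outcome vector (invented): 0 = no relapse, 1 = relapse
relapse <- c(0, 0, 1, 0, 1, 0, 0, 1, 0, 0)

# Map the codes to labels in a single step
target <- factor(relapse, levels = c(0, 1), labels = c("NoRelapse", "Relapse"))

class_counts <- table(target)            # observations per class
num_classes <- nlevels(target)           # should be 2 here
min_size <- min(class_counts)            # size of the smallest class
sampsizes <- rep(min_size, num_classes)  # balanced per-class sample size

print(class_counts)
print(sampsizes)
```

If nlevels(target) is 2 and table(target) shows only the two expected groups, the target is probably fine and the problem is more likely in the predictor matrix.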
Fixed partially. Here are a few non-authoritative hints, or cultural conventions, for making R code more readable. Note: this is not criticism of you, your coding style, or anyone else; it is just meant to get more readable code and better help! I mention it only as a response to the comment above.
- Prefer "." over "_" in variable names. This is maybe because "_" was once an assignment operator in S-plus. In ESS Emacs mode, "_" still is a macro expanding to "<-".
- Use the "<-" and "->" assignment operators; they make very readable code.
- Omit the ">" prompt in the code, because that makes it easier to copy, paste, and run selected statements. Also, the Biostars formatter now turns these into 'blockquotes'.

This is my honest opinion; if you don't like it, I can give you another one :)
Dear all, thanks for your replies, and apologies for the illegible code. Is there any way I can attach sample data for the code so that you can help me more easily? (I'm new to Biostar.)
code:
Thank you.
Hi Kavya, could you please run your code from R --vanilla and report the output?