Cancer prediction using machine learning
8.7 years ago

Hi,

I'm working on building a predictive model for breast cancer data using R. After performing gcrma normalization, I generated the potential predictor variables. Now, when I run the random forest (RF) algorithm, I get the following error:

rf_output=randomForest(x=pred.data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)
Error in randomForest.default(x = pred.data, y = target, importance = TRUE, : Can not handle categorical predictors with more than 53 categories.

As I'm new to machine learning, I'm unable to proceed. Any help would be appreciated.

>clindata=clin_data_import[clincaldata_order,]
> data_order=order(colnames(data_import)[4:length(colnames(data_import))])+3 #Order data without first three columns, then add 3 to get correct index in original file
> rawdata=data_import[,c(1:3,data_order)] #grab first three columns, and then remaining columns in order determined above
> header=colnames(rawdata)
> X=rawdata[,4:length(header)]
> ffun=filterfun(pOverA(p = 0.2, A = 100), cv(a = 0.7, b = 10))
> filt=genefilter(2^X,ffun)
> filt_Data=rawdata[filt,]
> write.table(filt_Data,file="filt_Data_new.txt")
> predictor_data=t(filt_Data[,4:length(header)])
> predictor_names=c(as.vector(filt_Data[,3])) #gene symbol
> colnames(predictor_data)=predictor_names
> target= clindata[,"relapse"]
> target[target==0]="NoRelapse"
> target[target==1]="Relapse"
> target=as.factor(target)
> tmp = as.vector(table(target))
> num_classes = length(tmp)
> min_size = tmp[order(tmp,decreasing=FALSE)[1]]
> sampsizes = rep(min_size,num_classes)
> rf_output=randomForest(x=pred.data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)
Error in randomForest.default(x = pred.data, y = target, importance = TRUE,  :
  Can not handle categorical predictors with more than 53 categories.

Thanks in advance.


Your code is unreadable. Please provide some sample data with the code.

I found this error message in the randomForest package source (randomForest.default.R, line 88):

maxcat <- max(ncat)
if (maxcat > 53)
    stop("Can not handle categorical predictors with more than 53 categories.")
  

If this is the case, you have to reduce the number of categories. (Very rough guess.)
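One way to see whether that check is what fires is to count factor levels per predictor column. The following is only a minimal sketch on a made-up data frame (`pred`, `gene_a`, and `gene_b` are hypothetical names, not the poster's data); it assumes the predictors are held in a data frame where some columns may have been read in as text and turned into factors.

```r
# Hypothetical toy predictors: gene_b contains a non-numeric entry,
# so it is stored as a factor instead of a numeric column.
pred <- data.frame(
  gene_a = c(1.2, 3.4, 5.6),
  gene_b = c("7.1", "8.2", "n/a"),
  stringsAsFactors = TRUE
)

# Number of levels per column (0 for columns that are not factors).
n.levels <- sapply(pred, function(col) if (is.factor(col)) nlevels(col) else 0L)

# Columns that would trip the 53-category check quoted above.
too.many <- names(n.levels)[n.levels > 53]

# Columns that are factors at all; for expression values these deserve a look.
factor.cols <- names(n.levels)[n.levels > 0]
```

For expression data, any entry in `factor.cols` is probably a sign that a column was parsed as text rather than numbers.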


Yeah, I think there are now more than 53 categories defined, but if I understand correctly, Kavya Krishnamurthy only wants the "NoRelapse" and "Relapse" groups. So my guess is that something goes wrong when defining num_classes, sampsizes, or target...
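A quick way to test that guess is to build the response explicitly and count its classes. This is a rough sketch with made-up values (`relapse` here is a hypothetical toy vector, not the clinical file), assuming the column is coded 0/1 as in the posted code.

```r
relapse <- c(0, 1, 0, 0, 1)   # hypothetical toy values, coded 0/1

# Map the codes to labels in one step, so no other level can sneak in.
target <- factor(relapse, levels = c(0, 1), labels = c("NoRelapse", "Relapse"))

# Class counts and the balanced sample sizes derived from them.
class.counts <- table(target)
sampsizes <- rep(min(class.counts), nlevels(target))
```

If `nlevels(target)` is 2 on the real data as well, the response is probably not the source of the 53-category message.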


Gee, your code is unreadable for humans...


Partially fixed. Here are a few non-authoritative hints, or cultural conventions, for writing readable R code. Note: this is not criticism of you, your coding style, or anyone else; it is just meant to get more readable code and better help! I am only mentioning it in response to the comment above.

  • Most R code I have seen uses . rather than _ in variable names; this is maybe because _ was once an assignment operator in S-PLUS. In ESS (Emacs) mode, _ is still a macro expanding to <-.
  • R has the nice left and right assignment operators, <- and ->; they make very readable code.
  • Using consistent spacing around assignments also helps make code more readable.
  • I personally prefer not to have the initial > in the code, because that makes it easier to copy, paste, and run selected statements. Also, the Biostars formatter now turns these into 'blockquotes'.
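As a small illustration of these conventions, here is a sketch with made-up values (`expr.values` and the other names are hypothetical, chosen only to show the style):

```r
# '.' in names, '<-' with spaces around it, no leading '>' prompt.
expr.values <- c(5.1, 7.3, 6.8)   # hypothetical toy values
expr.mean <- mean(expr.values)

# The right-assignment form reads left to right.
range(expr.values) -> expr.range
```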

This is my honest opinion, if you don't like it, I can give you another one :)


Dear all, thanks for your replies, and apologies for the illegible code. Is there any way I can attach sample data for the code so that you can help me more easily? (I'm new to Biostar.)

Code:

library(randomForest)
library(ROCR)
library(Hmisc)
library(genefilter)

datafile <- "trainset_gcrma.txt"
clindatafile <- read.csv("mod clinical_details.csv")
outfile <- "trainset_RFoutput.txt"
varimp_pdffile <- "trainset_varImps.pdf"
MDS_pdffile <- "trainset_MDS.pdf"
ROC_pdffile <- "trainset_ROC.pdf"
case_pred_outfile <- "trainset_CasePredictions.txt"
vote_dist_pdffile <- "trainset_vote_dist.pdf"

data_import <- read.table(datafile, header = TRUE, na.strings = "NA", sep = "\t")
clin_data_import <- clindatafile
clincaldata_order <- order(clin_data_import[, "GEO.asscession.number"])
clindata <- clin_data_import[clincaldata_order, ]
data_order <- order(colnames(data_import)[4:length(colnames(data_import))]) + 3
rawdata <- data_import[, c(1:3, data_order)]
header <- colnames(rawdata)
X <- rawdata[, 4:length(header)]

ffun <- filterfun(pOverA(p = 0.2, A = 100), cv(a = 0.7, b = 10))
filt <- genefilter(2^X, ffun)
filt_Data <- rawdata[filt, ]
predictor_data <- t(filt_Data[, 4:length(header)])
predictor_names <- c(as.vector(filt_Data[, 3]))
colnames(predictor_data) <- predictor_names

target <- clindata[, "relapse"]
target[target == 0] <- "NoRelapse"
target[target == 1] <- "Relapse"
target <- as.factor(target)
tmp <- as.vector(table(target))
num_classes <- length(tmp)
min_size <- tmp[order(tmp, decreasing = FALSE)[1]]
sampsizes <- rep(min_size, num_classes)

rf_output <- randomForest(x = pred.data, y = target, importance = TRUE, ntree = 25001, proximity = TRUE, sampsize = sampsizes)

Error in randomForest.default(x = pred.data, y = target, importance = TRUE, : Can not handle categorical predictors with more than 53 categories.

Thank you.


Hi kavya, could you please run your code from R --vanilla and report the output.

8.7 years ago
Michael 55k

Hi,

I am not completely sure, but it might be a typo only, or the code example is incomplete:

rf_output=randomForest(x=**pred.data**, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)

but before you had defined a variable predictor_data and I don't see any mention of pred.data above, so maybe you would be fine with

rf_output=randomForest(x=predictor_data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)

R has very little protection against using the wrong data (for example, objects restored from a previous session). If you want to be on the safe side for an important analysis, you can run the analysis from a script or in a fresh session started with, e.g., R --vanilla, to avoid a polluted workspace.
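To make the point about stale objects concrete, here is a minimal sketch that checks whether a name is actually defined before it is used. A scratch environment (`workspace`, a hypothetical name) stands in for the real session so the example is self-contained.

```r
# A scratch environment standing in for the session workspace (illustration only).
workspace <- new.env()
assign("predictor_data", matrix(0, nrow = 2, ncol = 2), envir = workspace)

# exists() reports whether a name is defined; inherits = FALSE restricts the
# lookup to this environment instead of also searching its parents.
has.predictor.data <- exists("predictor_data", envir = workspace, inherits = FALSE)
has.pred.data <- exists("pred.data", envir = workspace, inherits = FALSE)

# In an interactive session, ls() lists what is currently defined.
# From a shell, a script can be run in a clean session with:
#   Rscript --vanilla my_analysis.R   (my_analysis.R is a hypothetical file name)
```

In a fresh session the call with x=pred.data would stop with an "object not found" error instead of silently picking up an old object.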


Hi Michael,

At first, with the code below, I came across the following error:

rf_output=randomForest(x=predictor_data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)
Error: Error in randomForest.default(x = predictor_data, y = target, importance = TRUE,  : 
  length of response must be the same as predictors

dim(predictor_data)
[1]  285 2246

> length(target)
[1] 286

So I tried writing predictor_data to "predictor_data.csv" and reading it back in as pred.data:

predictor_data=t(filt_Data[,4:length(header)])
write.table(predictor_data,file="predictor_data.csv")
pred.data<-read.table(file="predictor_data.csv")
rf_output=randomForest(x=pred.data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)
Error in randomForest.default(x = pred.data, y = as.factor(target), importance = TRUE,  : 
Can not handle categorical predictors with more than 53 categories.
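The first error above (285 predictor rows against 286 target values) suggests that one sample appears in only one of the two tables. The sketch below shows one way to find it and line the two up by sample ID. All data and names are made up: it assumes the row names of the predictor matrix are sample IDs and that the clinical table has a matching ID column (`sample_id` is a hypothetical column name).

```r
# Hypothetical toy data: rows of predictor_data are samples, named by sample ID.
predictor_data <- matrix(
  c(5.1, 6.2, 7.3, 4.4, 5.5, 6.6), nrow = 3,
  dimnames = list(c("GSM1", "GSM2", "GSM3"), c("geneA", "geneB"))
)
clindata <- data.frame(
  sample_id = c("GSM1", "GSM2", "GSM3", "GSM4"),  # hypothetical column name
  relapse = c(0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Samples that appear in the clinical table but not in the predictors.
only.clinical <- setdiff(clindata$sample_id, rownames(predictor_data))

# Keep the shared samples, in the same order on both sides.
common <- intersect(rownames(predictor_data), clindata$sample_id)
x <- predictor_data[common, , drop = FALSE]
y <- factor(clindata$relapse[match(common, clindata$sample_id)],
            levels = c(0, 1), labels = c("NoRelapse", "Relapse"))
```

After this step `nrow(x)` and `length(y)` agree, which is the condition the "length of response must be the same as predictors" message is checking.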

Hi again,

I'm trying to run the code that's given in this Biostar tutorial: Machine Learning For Cancer Classification
