CPAT: error on coding probability cutoff
2.7 years ago

Dear all,

I am trying to identify lncRNAs in dairy cows using CPAT. To determine the coding probability cutoff, I followed the "How to choose cutoff" instructions to generate the training dataset.

Here is what I did:

Step 1:

    make_hexamer_tab.py -c /storage/users/xdai/ref/cow/Bos_taurus.ARS-UCD1.2.cds.all.fa -n /storage/users/xdai/ref/cow/Bos_taurus.ARS-UCD1.2.ncrna.fa > Bos_taurus_Hexamer.tsv

Step 2:

    make_logitModel.py -x Bos_taurus_Hexamer.tsv -c /storage/users/xdai/ref/cow/Bos_taurus.ARS-UCD1.2.cdna.all.fa.gz -n /storage/users/xdai/ref/cow/Bos_taurus.ARS-UCD1.2.ncrna.fa -o Bos_taurus

I downloaded the CDS file from Ensembl, along with the known protein-coding (cdna) and non-coding (ncrna) sequences. The previous step produced the required training dataset, for which names(data) returns "ID" "mRNA" "ORF" "Fickett" "Hexamer" "Label" (the same as shown on the website).

Then I used "10Fold_CrossValidation.r", which I downloaded from the CPAT website, to generate figure 3 and decide the coding potential cutoff. At the step "pred <- prediction(ROCR_data$predictions, ROCR_data$Labels)", it gave the following error:

Error in prediction(ROCR_data$predictions, ROCR_data$Labels): Number of classes is not equal to 2.
ROCR currently supports only evaluation of binary classification tasks.

I opened the generated "test1.xls" and found that the labels are all "1", although the originally loaded training dataset has both "0" and "1". I did not change the code in "10Fold_CrossValidation.r", and I have no idea what is going on. Could anyone please advise what is wrong with my steps and suggest how to fix this problem?
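One way to check whether row order explains a single-class test set is to count the labels in each consecutive tenth of the training table. The sketch below is plain Python and not part of CPAT. It assumes, without having confirmed it from the script, that the folds are taken as consecutive blocks of rows. The function name `label_counts_per_block` and the toy labels are made up for illustration.

```python
from collections import Counter


def label_counts_per_block(labels, n_blocks=10):
    """Count labels in each of n_blocks consecutive, near-equal blocks."""
    n = len(labels)
    counts = []
    for i in range(n_blocks):
        start = i * n // n_blocks
        end = (i + 1) * n // n_blocks
        counts.append(Counter(labels[start:end]))
    return counts


# Toy labels (hypothetical): coding rows (1) grouped first, non-coding (0) last,
# mimicking a table where each class is clustered together.
toy_labels = [1] * 80 + [0] * 20

for i, c in enumerate(label_counts_per_block(toy_labels), start=1):
    print(f"block {i}: coding={c[1]} noncoding={c[0]}")
```

If any block reports zero rows for one of the two classes, a test set drawn from that block alone cannot be evaluated as a binary classification task.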

Many thanks.

lncRNA RNAseq CPAT • 970 views
2.7 years ago

Update: I got great help from Dr Wang. My training data is not balanced between coding and non-coding (4562 0s and 37988 1s), and all the coding genes are clustered together. Following Dr Wang's suggestion, I shuffled the coding and non-coding rows before running the R script. In addition, I have more than 20,000 genes (the total number of genes in "10Fold_CrossValidation.r"), so before running the R script I also needed to split my data into 10 equal sets. After this step, the errors at the prediction step were gone.
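For anyone preparing the table outside R, here is a minimal Python sketch of the shuffle-then-split idea. It is not Dr Wang's code or part of CPAT. It shuffles within each class and deals rows round-robin into 10 folds, so that every fold keeps roughly the same coding to non-coding ratio. The function name `stratified_folds`, the fixed seed, and the toy (ID, Label) rows are assumptions for illustration; only the class counts are taken from the post.

```python
import random


def stratified_folds(rows, label_of, n_folds=10, seed=0):
    """Shuffle rows within each class, then deal them round-robin into folds."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)

    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for i, row in enumerate(members):
            folds[i % n_folds].append(row)

    for fold in folds:
        rng.shuffle(fold)  # mix the two classes inside each fold
    return folds


# Toy table (hypothetical IDs) with the class counts reported above:
# 37988 coding (1) rows followed by 4562 non-coding (0) rows.
rows = [(f"coding_{i}", 1) for i in range(37988)]
rows += [(f"noncoding_{i}", 0) for i in range(4562)]

folds = stratified_folds(rows, label_of=lambda r: r[1])
for i, fold in enumerate(folds, start=1):
    n_noncoding = sum(1 for r in fold if r[1] == 0)
    print(f"fold {i}: rows={len(fold)} noncoding={n_noncoding}")
```

A plain shuffle of all rows would usually also put both classes in every fold at these sizes. Dealing per class simply makes that guaranteed rather than likely.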

However, I am facing other problems:

1. perf <- performance(pred,"tpr","fpr")
   Error in stats::approxfun(x.values.1, y.values.1, method = "constant", : zero non-NA points
2. d=performance(pred,measure="prec", x.measure="rec")
   Error in stats::approxfun(x.values.1, y.values.1, method = "constant", : zero non-NA points
3. plot(S,lwd=2,avg="vertical",add=TRUE,col="blue")
   Error in stats::approxfun(perf@x.values[[i]], perf@y.values[[i]], ties = mean, : need at least two non-NA values to interpolate
4. plot(P,lwd=2,avg="vertical",add=TRUE,col="red")
   Error in stats::approxfun(perf@x.values[[i]], perf@y.values[[i]], ties = mean, : need at least two non-NA values to interpolate

I am still looking for suggestions on how to solve this. Cheers.


Hi,

Have you found a solution for this problem?

Error in stats::approxfun(x.values.1, y.values.1, method = "constant", : zero non-NA points
