4.5 years ago
Hann
Hello,
I am trying to write R code to get, for each gene, the GO label with the highest confidence score (the number that comes after the "|" symbol).
For each gene ID (each row), there are many GO labels (columns), up to 400. The GO term with the highest confidence score can be in any column.
See example:
GeneID GO_01 GO_02 GO_03 GO_04
exi2A01G0001540.1 GO:0005575|0.853 GO:0005622|0.705 GO:0005623|0.846 GO:0005634|0.531
exi2A01G0001560.1 GO:0005575|0.324 GO:0044699|0.319 GO:0044464|0.324 GO:0005623|0.524
exi9A01G0045270.1 GO:0003674|0.356 GO:0005575|0.679 GO:0005622|0.539
I think it should be possible to retain only the GO label that has the highest confidence score.
So for example results would be like this:
GeneID GO-term
exi2A01G0001540.1 GO:0005575|0.853
exi2A01G0001560.1 GO:0005623|0.524
exi9A01G0045270.1 GO:0005575|0.679
I started with this R code:
GO_1 <- read.table("proteinGO-term_0.3.txt", header=T, sep="\t", fill=T)
#have gene ID as a row name:
GO_2 <- GO_1[,-1]
rownames(GO_2) <- GO_1[,1]
#
#I tried this, but it doesn't do what I want:
test <- apply(GO_2,1,function(x) which(x==max(x)))
Thanks !!!
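One possible way to get the result described above, sketched in base R. This is a minimal sketch, not necessarily the code that was posted in this thread. It assumes the table is tab-separated, that empty cells come in as "" or NA, and that every non-empty cell has the form `GO:xxxxxxx|score`. The helper name `best_go` is made up for illustration, and the example rows are read from inline text rather than from `proteinGO-term_0.3.txt`.

```r
# Example rows from the question, supplied inline instead of read from a file.
lines <- c(
  "GeneID\tGO_01\tGO_02\tGO_03\tGO_04",
  "exi2A01G0001540.1\tGO:0005575|0.853\tGO:0005622|0.705\tGO:0005623|0.846\tGO:0005634|0.531",
  "exi2A01G0001560.1\tGO:0005575|0.324\tGO:0044699|0.319\tGO:0044464|0.324\tGO:0005623|0.524",
  "exi9A01G0045270.1\tGO:0003674|0.356\tGO:0005575|0.679\tGO:0005622|0.539"
)
GO_1 <- read.table(text = lines, header = TRUE, sep = "\t", fill = TRUE,
                   stringsAsFactors = FALSE)

# Hypothetical helper: return the "GO:...|score" cell with the largest score.
best_go <- function(x) {
  x <- x[!is.na(x) & x != ""]               # drop empty cells
  if (length(x) == 0) return(NA_character_) # gene with no GO term at all
  score <- as.numeric(sub(".*\\|", "", x))  # keep only the text after "|"
  x[which.max(score)]                       # first cell with the highest score
}

# Apply the helper to every row, excluding the GeneID column.
result <- data.frame(
  GeneID  = GO_1[[1]],
  GO_term = unname(apply(GO_1[, -1, drop = FALSE], 1, best_go)),
  stringsAsFactors = FALSE
)
print(result)
```

The original attempt compares the whole strings with `max(x)`, which sorts them alphabetically rather than by score. Extracting the number after "|" first and using `which.max` avoids that, and keeping `GeneID` as a regular column means the gene ID always appears in the output.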
This code doesn't output the gene ID. It only retains the GO terms.
The output looks like:
Please post the example data. Code works on the data furnished in OP. If you are concerned about column names, you can do this:
I made the first column the row names. It's working now. Thanks a lot!!!!!!
Glad that it worked. But it is not supposed to work that way given the data you posted in OP.
Yeah you are right
head(test[,1])
showed me the gene ID, but running the whole line doesn't show it, and I have no idea why. Anyway, the good thing is that it works now that the first column is used as row names.
However, there is another problem. When a gene has only one GO term, the result is left empty. But it should retain that single GO label for the corresponding gene, right?
Sure, it should do so. It also depends on how you are reading the input (txt) file into R. I have created an example file where, for a few genes, the GO term is present in only one column and absent from all the other columns. Follow the code below:
Gene 4 has an entry in the third column and all other columns are empty. Gene 5 has an entry in the 2nd column and all other columns are empty. Gene 6 has an entry in the 1st column and all other columns are empty. Here is the R code:
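A minimal sketch of this single-entry case, not necessarily the code originally posted. The gene names, GO IDs, and scores below are made up for illustration, and the data frame is built in place rather than read from a file. It assumes empty cells appear as "" or NA after reading.

```r
# Hypothetical rows: each gene has exactly one GO term, in a different column.
GO_1 <- data.frame(
  GeneID = c("gene4", "gene5", "gene6"),
  GO_01  = c("", "", "GO:0005575|0.412"),
  GO_02  = c("", "GO:0005622|0.388", ""),
  GO_03  = c("GO:0005623|0.501", NA, ""),
  stringsAsFactors = FALSE
)

# Character matrix of GO cells; treat empty strings as missing.
go <- as.matrix(GO_1[, -1])
go[!is.na(go) & go == ""] <- NA

# Numeric matrix of scores (the text after "|"); missing cells stay NA.
scores <- as.numeric(sub(".*\\|", "", go))
dim(scores) <- dim(go)

# Column index of the best score per row; which.max() skips NA values.
best_col <- apply(scores, 1, function(s) {
  if (all(is.na(s))) NA_integer_ else which.max(s)
})

# Pick the winning cell of each row by (row, column) index.
result <- data.frame(
  GeneID  = GO_1$GeneID,
  GO_term = go[cbind(seq_len(nrow(go)), best_col)],
  stringsAsFactors = FALSE
)
print(result)
```

Because `which.max` ignores missing values, a gene with a single GO term keeps that term regardless of which column holds it. If the result still comes out empty for such genes, it is worth checking how the empty cells were read in, for example with `str(GO_1)`.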
The code is perfectly working. It was something to do with the input file.
Thank you very much! I really appreciate your help :)
I moved @cpad112's comment to an answer. Please accept (green checkmark) to provide closure to this thread.
No problem. Keep visiting and contributing to Biostars.