For each row retain the cell with maximum value in R
4.5 years ago
Hann ▴ 110

Hello,

I am trying to write R code that, for each gene, gets the GO label with the highest confidence score. The score is the number that comes after the "|" symbol.

For each gene ID (each row) there are many GO labels (columns), up to 400. The GO term with the highest confidence score can be in any column.

See the example:

GeneID         GO_01          GO_02           GO_03          GO_04
exi2A01G0001540.1      GO:0005575|0.853        GO:0005622|0.705        GO:0005623|0.846        GO:0005634|0.531
exi2A01G0001560.1      GO:0005575|0.324        GO:0044699|0.319        GO:0044464|0.324        GO:0005623|0.524
exi9A01G0045270.1      GO:0003674|0.356        GO:0005575|0.679        GO:0005622|0.539

I think it should be possible to retain, for each gene, the GO label that has the highest confidence score.

For example, the result would look like this:

GeneID      GO-term
exi2A01G0001540.1   GO:0005575|0.853
exi2A01G0001560.1   GO:0005623|0.524
exi9A01G0045270.1   GO:0005575|0.679

I started with this R code:

GO_1 <- read.table("proteinGO-term_0.3.txt", header=T, sep="\t", fill=T)
#have gene ID as a row name:
GO_2 <- GO_1[,-1]
rownames(GO_2) <- GO_1[,1]
#
#I tried this, but it doesn't do what I want:
test <- apply(GO_2,1,function(x) which(x==max(x)))

Thanks !!!
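
A note on why the attempt above does not give the intended result: the cells are strings, so max(x) compares them alphabetically rather than by the score after the "|". A minimal sketch using the first example row (the names `row1` and `score` are illustrative and not from the original post):

```r
# First row of the example, as the character values R actually sees.
row1 <- c("GO:0005575|0.853", "GO:0005622|0.705",
          "GO:0005623|0.846", "GO:0005634|0.531")

# String comparison: the "largest" string is the one that sorts last,
# here "GO:0005634|0.531", which has the lowest score.
by_string <- which(row1 == max(row1))

# Numeric comparison: strip everything up to and including the "|",
# convert to numbers, then take the position of the maximum.
score <- as.numeric(sub("^.*\\|", "", row1))
by_score <- which.max(score)

print(row1[by_string])  # "GO:0005634|0.531"
print(row1[by_score])   # "GO:0005575|0.853"
```

The score has to be extracted and converted to a number before comparing, which is what the answers below do.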

4.5 years ago
> cbind(test[,1],(t(apply(test[,-1],1, function(x) x[order(as.numeric(sub("\\w*:[0-9]*\\|","",x)),decreasing = T)]))))[,1:2]
     [,1]                [,2]              
[1,] "exi2A01G0001540.1" "GO:0005575|0.853"
[2,] "exi2A01G0001560.1" "GO:0005623|0.524"
[3,] "exi9A01G0045270.1" "GO:0005575|0.679"
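
The one-liner above can also be unpacked into separate steps, which may be easier to read and debug. This is a minimal sketch, assuming a data frame `test` laid out like the example in the question (built inline here so the block runs on its own); the names `go`, `scores`, `best_col`, and `result` are illustrative and not part of the original answer.

```r
# Example data from the question, built inline.
# Assumption: the real file has the same layout, with NA for missing cells.
test <- data.frame(
  GeneID = c("exi2A01G0001540.1", "exi2A01G0001560.1", "exi9A01G0045270.1"),
  GO_01  = c("GO:0005575|0.853", "GO:0005575|0.324", "GO:0003674|0.356"),
  GO_02  = c("GO:0005622|0.705", "GO:0044699|0.319", "GO:0005575|0.679"),
  GO_03  = c("GO:0005623|0.846", "GO:0044464|0.324", "GO:0005622|0.539"),
  GO_04  = c("GO:0005634|0.531", "GO:0005623|0.524", NA),
  stringsAsFactors = FALSE
)

# 1. Character matrix holding only the GO cells.
go <- as.matrix(test[, -1])

# 2. Numeric matrix of scores: keep the text after the "|"; NA stays NA.
scores <- matrix(as.numeric(sub("^.*\\|", "", go)), nrow = nrow(go))

# 3. Column index of the highest score in each row (which.max skips NA).
#    Assumes every row has at least one score.
best_col <- apply(scores, 1, which.max)

# 4. Pick the winning cell of each row via (row, column) index pairs.
result <- data.frame(
  GeneID  = test$GeneID,
  GO_term = go[cbind(seq_len(nrow(go)), best_col)],
  stringsAsFactors = FALSE
)
print(result)
```

Unlike sorting each row with order(), which.max only scans the row once, which may matter little here but keeps the intent explicit.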

This code doesn't output the gene ID; it only retains the GO terms.

The output looks like:

"6937"  "GO:0005575|0.868" 
"6938"  "GO:0005575|0.876"
"6939"  "GO:0005575|0.399"
"6941"  "GO:0005575|0.345"

Please post the example data. The code works on the data furnished in the OP. If you are concerned about column names, you can do this:

> cbind("GeneID"=test[,1],"GO-term"=apply(test[,-1],1, function(x) x[order(as.numeric(sub("\\D+\\d+\\D","",x)),decreasing = T)][1]))

     GeneID              GO-term           
[1,] "exi2A01G0001540.1" "GO:0005575|0.853"
[2,] "exi2A01G0001560.1" "GO:0005623|0.524"
[3,] "exi9A01G0045270.1" "GO:0005575|0.679"

I made the first column the row names. It's working now. Thanks a lot!


Glad that it worked. But it should not have needed that change, given the data you posted in the OP.


Yeah, you are right. head(test[,1]) showed me the gene IDs, but for some reason running the whole line doesn't show them. Anyway, the good thing is that it works now that the first column is used as row names.

However, there is another problem. When a gene has only one GO term, the result for that gene is empty. It should retain that single GO label for the corresponding gene, right?


Sure, it should do so. It also depends on how you are reading the input (txt) file into R. I have created an example file in which some genes have a GO term in only one column, with all their other GO columns empty. Follow the code below:

$ cat GO_test.txt 
GeneID  GO_01   GO_02   GO_03   GO_04
gene1   GO:0005575|0.853    GO:0005622|0.705    GO:0005623|0.846    GO:0005634|0.531
gene2   GO:0005575|0.324    GO:0044699|0.319    GO:0044464|0.324    GO:0005623|0.524
gene3   GO:0003674|0.356    GO:0005575|0.679    GO:0005622|0.539
gene4           GO:0005622|0.539
gene5       GO:0005575|0.679
gene6   GO:0005575|0.679

Gene 4 has an entry only in the third GO column, gene 5 only in the second, and gene 6 only in the first; all their other columns are empty. Here is the R code:

> test=read.csv("GO_test.txt", header = T, sep = "\t", strip.white = T, na.strings = "")
> test
  GeneID            GO_01            GO_02            GO_03            GO_04
1  gene1 GO:0005575|0.853 GO:0005622|0.705 GO:0005623|0.846 GO:0005634|0.531
2  gene2 GO:0005575|0.324 GO:0044699|0.319 GO:0044464|0.324 GO:0005623|0.524
3  gene3 GO:0003674|0.356 GO:0005575|0.679 GO:0005622|0.539             <NA>
4  gene4             <NA>             <NA> GO:0005622|0.539             <NA>
5  gene5             <NA> GO:0005575|0.679             <NA>             <NA>
6  gene6 GO:0005575|0.679             <NA>             <NA>             <NA>
> cbind("GeneID"=test[,1],"GO-term"=apply(test[,-1],1, function(x) x[order(as.numeric(sub("\\D+\\d+\\D","",x)),decreasing = T)][1]))
     GeneID  GO-term           
[1,] "gene1" "GO:0005575|0.853"
[2,] "gene2" "GO:0005623|0.524"
[3,] "gene3" "GO:0005575|0.679"
[4,] "gene4" "GO:0005622|0.539"
[5,] "gene5" "GO:0005575|0.679"
[6,] "gene6" "GO:0005575|0.679"
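
For input that may contain blank cells as well as NA, or genes with no GO term at all, the same idea can be wrapped in a small function. This is a sketch under stated assumptions, not the method used above: `best_go` is a hypothetical helper name, and `gene7` is an invented row added only to show the all-empty case.

```r
# Hypothetical helper (not from the thread): return the cell with the
# highest score, or NA when the row holds no usable GO term.
best_go <- function(cells) {
  # Drop NA and blank cells, so "" and NA are treated the same way.
  cells <- cells[!is.na(cells) & nzchar(trimws(cells))]
  if (length(cells) == 0) return(NA_character_)
  # The score is the number after the "|".
  score <- suppressWarnings(as.numeric(sub("^.*\\|", "", cells)))
  i <- which.max(score)            # the first cell wins on ties
  if (length(i) == 0) return(NA_character_)
  cells[i]
}

# Small inline data set mixing NA and "" for missing cells.
# gene7 is an invented row with no GO term at all.
test <- data.frame(
  GeneID = c("gene1", "gene4", "gene6", "gene7"),
  GO_01  = c("GO:0005575|0.853", NA, "GO:0005575|0.679", NA),
  GO_02  = c("GO:0005622|0.705", "", NA, NA),
  GO_03  = c("GO:0005623|0.846", "GO:0005622|0.539", NA, ""),
  stringsAsFactors = FALSE
)

result <- data.frame(
  GeneID  = test$GeneID,
  GO_term = unname(apply(test[, -1], 1, best_go)),
  stringsAsFactors = FALSE
)
print(result)
```

Because the function filters out empty cells before comparing, the result does not depend on whether the file was read with na.strings = "" or not.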

The code works perfectly. The problem had something to do with the input file.

Thank you very much! I really appreciate your help :)


I moved @cpad112's comment to an answer. Please accept (green checkmark) to provide closure to this thread.


No problem. Keep visiting and contributing to Biostars.
