Question

gene - GO file, select most detailed annotation

7.6 years ago

boczniak767 ▴ 870

Hi,

I have a file with two columns: gene ID and GO-slim ID. Each gene has a variable number of GO terms, one per line, so every line holds one gene ID and one GO-slim ID.

How can I select only the most detailed GO term for each gene, and split the file into three according to category (cellular component, biological process, molecular function)?

GO • 1.7k views

1) Are you able to assign each GO-slim ID to CC, BP and MF? This would be a first step. 2) In which programming language would you like to work?

7.6 years ago by Fabio Marroni ★ 3.0k

I only know basic bash commands, plus awk and sed. I don't have the CC/MF/BP assignment right now.

7.6 years ago by boczniak767 ▴ 870

My file actually looks like this:

gene1 GO1
gene1 GO2
gene1 GO3
gene2 GO1

I need to split it into classes to summarize the function of my set of genes, so I am trying to get the most detailed category to which a given gene is annotated.

I forgot to add that I use GO-slims, but that probably doesn't make a difference.

7.6 years ago by boczniak767 ▴ 870

score 1 · Answer 1 · 2017-07-03

I'm posting an R script I had for performing the splitting you require. However, be advised that: 1) I never tried to remove GO_IDs as you want; still, it should be easy for you to find the GO_IDs you want to remove (or keep) and do that with grep. 2) I never tried to work separately on CC, BP and MF; again, creating three separate files should be easy once you have the list of GO_IDs belonging to each category. Are you sure you want to do that? Many enrichment analysis tools can easily separate IDs by category for you.

Imagine you have a file like the one you described and it's called test.go.txt with the following format:

Gene    GO_ID
gene1   go1;go2;go3;g04
gene2   go6;g07;g08
gene3   go89;go1

You just need to run the script below, which turns the semicolon-separated lists into one gene/GO_ID pair per row:

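# Read the two-column table: Gene, and a semicolon-separated list of GO IDs
pp <- read.table("test.go.txt", header = TRUE, stringsAsFactors = FALSE)
# Split each gene's GO string into a character vector, named by gene
gino <- strsplit(pp$GO_ID, ";")
names(gino) <- pp$Gene
# Repeat each gene name once per GO ID, then flatten to one pair per row
pino <- rep(names(gino), lengths(gino))
rino <- data.frame(Gene = pino, GO_ID = unlist(gino, use.names = FALSE))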

And you get the following result:

> rino
   Gene GO_ID
1 gene1   go1
2 gene1   go2
3 gene1   go3
4 gene1   g04
5 gene2   go6
6 gene2   g07
7 gene2   g08
8 gene3  go89
9 gene3   go1
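Once the table is in this one-pair-per-row form, the two steps from the question (keeping the most detailed term per gene and splitting by category) could be sketched roughly as follows. This is only a minimal sketch under some loud assumptions: `go_info` is a hypothetical lookup table that you would have to build yourself (for instance from the ontology file), its `namespace` and `depth` values here are made up for illustration, and "most detailed" is approximated as "largest depth within the same gene and category". Since GO is a graph rather than a simple tree, depth is only a proxy; a stricter alternative would be to drop any term that is an ancestor of another term annotated to the same gene.

```r
# Gene/GO pairs in long format, like 'rino' above (toy data).
rino <- data.frame(
  Gene  = c("gene1", "gene1", "gene1", "gene2", "gene2", "gene3"),
  GO_ID = c("go1", "go2", "go3", "go1", "go6", "go3"),
  stringsAsFactors = FALSE
)

# HYPOTHETICAL lookup table: category and depth of each GO ID.
# These values are invented; real ones must come from the ontology.
go_info <- data.frame(
  GO_ID     = c("go1", "go2", "go3", "go6"),
  namespace = c("BP", "BP", "MF", "CC"),
  depth     = c(1, 3, 2, 1),
  stringsAsFactors = FALSE
)

# Attach category and depth to every gene/term pair.
ann <- merge(rino, go_info, by = "GO_ID")

# Within each gene and category, keep the row(s) with the largest depth.
# A single pasted key avoids empty gene-by-category groups in ave().
key <- paste(ann$Gene, ann$namespace)
max_depth <- ave(ann$depth, key, FUN = max)
best <- ann[ann$depth == max_depth, c("Gene", "GO_ID", "namespace")]
best <- best[order(best$Gene, best$namespace), ]

# One table per category, each written to its own tab-separated file.
by_ns <- split(best[, c("Gene", "GO_ID")], best$namespace)
for (ns in names(by_ns)) {
  out_file <- file.path(tempdir(), paste0("genes_", ns, ".txt"))
  write.table(by_ns[[ns]], file = out_file,
              sep = "\t", quote = FALSE, row.names = FALSE)
}
```

Ties are kept on purpose: if a gene has two terms at the same depth in one category, both rows survive. Replace `tempdir()` with your own output folder to keep the three files.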

Hope this helps!