gene - GO file, select most detailed annotation
7.4 years ago
boczniak767 ▴ 870

Hi,

I have a file with two columns: gene ID and GO-slim ID. Each gene has a variable number of GO terms, each on its own line, so every line holds one gene ID and one GO-slim ID.

How could I, for each gene, select only the most detailed GO term and split the file into three according to category (cellular component, biological process, molecular function)?

GO • 1.6k views

1) Are you able to assign each GO-slim ID to CC, BP and MF? This would be a first step. 2) In which programming language would you like to work?


I know only the basics of bash, plus awk and sed. I don't have the CC/MF/BP assignment right now.


My file actually looks like this:

gene1 GO1
gene1 GO2
gene1 GO3
gene2 GO1

I need to split it into classes to summarize the function of my set of genes, so I am trying to get the most detailed category to which a given gene is annotated.

I forgot to add that I use GO-slims, but that probably doesn't make a difference.

7.4 years ago
Fabio Marroni ★ 3.0k

I'm posting an R script I had for the kind of splitting you need. However, be advised that: 1) I never tried to remove GO_IDs as you want. It should be easy, though, to find the GO_IDs you want to remove (or keep) and do that with grep. 2) I never tried to work separately on CC, BP and MF. Again, creating three separate files should be easy once you have the list of GO_IDs belonging to each category. Are you sure you want to do that? Many enrichment analysis tools can easily separate IDs by category for you.
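For point 2, here is a minimal awk sketch of how the split by category might look, since awk was mentioned above. It assumes the category of each term is taken from the `id:` and `namespace:` fields of an OBO file. The file names `toy.obo` and `gene2go.txt`, the gene names, and the three-term OBO excerpt are made up for illustration; a real run would point at a downloaded GO or GO-slim OBO file. Terms missing from the OBO file are silently skipped.

```shell
#!/bin/sh
# Toy OBO excerpt (assumption: a real run would use a full GO or GO-slim OBO file).
cat > toy.obo <<'EOF'
[Term]
id: GO:0005634
name: nucleus
namespace: cellular_component

[Term]
id: GO:0006412
name: translation
namespace: biological_process

[Term]
id: GO:0003677
name: DNA binding
namespace: molecular_function
EOF

# Toy gene-to-GO table, one pair per line (gene names are hypothetical).
cat > gene2go.txt <<'EOF'
gene1 GO:0005634
gene1 GO:0006412
gene2 GO:0003677
gene2 GO:0005634
EOF

awk '
    # First file (toy.obo): remember the namespace of each GO ID.
    NR == FNR {
        if ($1 == "id:")        id = $2
        if ($1 == "namespace:") ns[id] = $2
        next
    }
    # Second file (gene2go.txt): write each pair to <namespace>.txt.
    ($2 in ns) { print > (ns[$2] ".txt") }
' toy.obo gene2go.txt
```

This should leave three files named cellular_component.txt, biological_process.txt and molecular_function.txt, each holding the gene/GO pairs of that category.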

Imagine you have a file like the one you described, called test.go.txt, with the following format:

Gene    GO_ID
gene1   go1;go2;go3;g04
gene2   go6;g07;g08
gene3   go89;go1

You just need to run the script below:

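# Read the table; GO_ID holds semicolon-separated GO terms
pp <- read.table("test.go.txt", header=TRUE, stringsAsFactors=FALSE)
# Split each GO_ID string into single terms, named by gene
gino <- strsplit(pp$GO_ID, ";")
names(gino) <- pp$Gene
# Repeat each gene name once per term it carries
pino <- rep(names(gino), lengths(gino))
# Build the long table: one row per gene/term pair
rino <- data.frame(Gene=pino, GO_ID=unlist(gino, use.names=FALSE))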

And you get the following result:

> rino
   Gene GO_ID
1 gene1   go1
2 gene1   go2
3 gene1   go3
4 gene1   g04
5 gene2   go6
6 gene2   g07
7 gene2   g08
8 gene3  go89
9 gene3   go1
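With one gene/term pair per line, keeping only the most detailed terms could be sketched in awk as follows. This is only an illustration under a strong assumption: that you already have a table listing, for every term, all of its ancestors in the ontology (here a hypothetical `ancestors.txt` with columns term and ancestor). The IDs and the ancestor relations below are invented. The idea is to drop a term whenever the same gene also carries one of its descendants.

```shell
#!/bin/sh
# Hypothetical ancestor table: column 1 = term, column 2 = one of its ancestors.
cat > ancestors.txt <<'EOF'
GO3 GO2
GO3 GO1
GO2 GO1
GO5 GO4
EOF

# Toy gene-to-GO table, one pair per line.
cat > gene2go.txt <<'EOF'
gene1 GO1
gene1 GO2
gene1 GO3
gene2 GO1
gene2 GO5
gene3 GO2
gene3 GO4
EOF

awk '
    # First file: anc[term, ancestor] marks every known ancestor relation.
    NR == FNR { anc[$1, $2] = 1; next }
    # Second file: collect the terms of each gene, keeping input order.
    {
        if (!($1 in n)) order[++g] = $1
        n[$1]++
        terms[$1, n[$1]] = $2
    }
    END {
        for (i = 1; i <= g; i++) {
            gene = order[i]
            for (j = 1; j <= n[gene]; j++) {
                t = terms[gene, j]
                keep = 1
                # Drop t if another term of the same gene descends from it.
                for (k = 1; k <= n[gene]; k++) {
                    u = terms[gene, k]
                    if ((u, t) in anc) keep = 0
                }
                if (keep) print gene, t
            }
        }
    }
' ancestors.txt gene2go.txt > most_detailed.txt
```

Note that a gene can still end up with more than one term when its annotations sit on different branches, so "most detailed" here means "has no annotated descendant for that gene" rather than a single deepest term.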

Hope this helps!
