Here is an R script I had for performing the splitting you require.
However, be advised that:
1) I have never tried to remove GO_IDs the way you want. However, it should be easy to list the GO_IDs you want to remove (or keep) and filter on them with grep.
2) I have never tried to work separately on CC, BP and MF. Again, creating three separate files should be easy once you have the list of GO_IDs belonging to each category. Are you sure you want to do that? Many enrichment analysis tools can easily separate IDs by category for you.
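As a rough sketch of the grep approach from point 1, assuming the table is already in long format (one gene/GO pair per line) and the IDs to drop are listed one per line in a file. The file names genes2go.txt, remove_ids.txt and filtered.txt and the GO IDs are made up for illustration:

```shell
# Hypothetical long-format table: one "gene GO_ID" pair per line
printf 'gene1 GO:0000001\ngene1 GO:0000002\ngene2 GO:0000001\ngene3 GO:0000003\n' > genes2go.txt
# Hypothetical list of GO IDs to remove, one per line
printf 'GO:0000001\n' > remove_ids.txt

# -F: treat the IDs as fixed strings, not regular expressions
# -w: match whole words only, so a listed ID does not match a longer ID that contains it
# -f: read the patterns from a file
# -v: keep only the lines that do NOT match
grep -v -w -F -f remove_ids.txt genes2go.txt > filtered.txt
cat filtered.txt
```

Dropping -v should keep only the listed IDs instead of removing them.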
Imagine you have a file like the one you described and it's called test.go.txt with the following format:
Gene GO_ID
gene1 go1;go2;go3;g04
gene2 go6;g07;g08
gene3 go89;go1
You just need to run the script below:
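# Read the two-column table (Gene, GO_ID), keeping strings as characters
pp <- read.table("test.go.txt", header = TRUE, stringsAsFactors = FALSE)
# Split each semicolon-separated GO_ID string into a vector of IDs
gino <- strsplit(pp$GO_ID, ";")
names(gino) <- pp$Gene
# Repeat each gene name once per GO ID annotated to it
pino <- rep(names(gino), lengths(gino))
# Build the long table: one row per gene/GO ID pair
rino <- data.frame(Gene = pino, GO_ID = unlist(gino, use.names = FALSE))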
And you get the following result:
> rino
Gene GO_ID
1 gene1 go1
2 gene1 go2
3 gene1 go3
4 gene1 g04
5 gene2 go6
6 gene2 g07
7 gene2 g08
8 gene3 go89
9 gene3 go1
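For anyone who prefers the shell, the same reshaping can probably be done with awk. This is only a sketch, assuming whitespace-separated columns and a header line as in the example; the output name test.go.long.txt is made up:

```shell
# Recreate the example input from this answer
printf 'Gene GO_ID\ngene1 go1;go2;go3;g04\ngene2 go6;g07;g08\ngene3 go89;go1\n' > test.go.txt

# Print the header unchanged, then split column 2 on ";" and
# emit one "gene GO_ID" line per element
awk 'NR == 1 { print; next }
     { n = split($2, ids, ";")
       for (i = 1; i <= n; i++) print $1, ids[i] }' test.go.txt > test.go.long.txt
cat test.go.long.txt
```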
Hope this helps!
1) Are you able to assign each GO-slim ID to CC, BP and MF? This would be a first step. 2) In which programming language would you like to work?
I only know the basics of bash commands, plus awk and sed. I don't have the CC/MF/BP assignment right now. My file is rather in this form:
gene1 GO1
gene1 GO2
gene1 GO3
gene2 GO1
I need to split it into classes to summarize the functions of my set of genes, so I am trying to get the most detailed category to which a given gene is annotated.
I forgot to add that I use GO-slims, but it probably doesn't make a difference.
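Once a table assigning each GO ID to BP, MF or CC is available, splitting a file in the form above into one file per category could look roughly like this with awk. The file names and the category assignments below are invented purely for illustration:

```shell
# Hypothetical long-format annotation file, in the form shown above
printf 'gene1 GO1\ngene1 GO2\ngene1 GO3\ngene2 GO1\n' > genes2go.txt
# Hypothetical mapping of GO ID to category; these are not real assignments
printf 'GO1 BP\nGO2 MF\nGO3 CC\n' > go2cat.txt

# First file (NR == FNR): remember the category of each GO ID.
# Second file: write each line whose GO ID has a known category
# to a per-category output file.
awk 'NR == FNR { category[$1] = $2; next }
     ($2 in category) { print > ("genes2go." category[$2] ".txt") }' go2cat.txt genes2go.txt
cat genes2go.BP.txt
```

Lines whose GO ID is missing from the mapping are silently skipped in this sketch, so it may be worth comparing line counts afterwards.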