I have a set of genes, each with GO terms assigned to it. For each gene I want only the terminal GO terms, i.e. the assigned terms that are terminal nodes within the GO DAG. For example, the gene "PFC0155c" has the following GO terms assigned to it:
GO:0003677 F DNA binding
GO:0003899 F DNA-directed RNA polymerase activity
GO:0005198 F structural molecule activity
GO:0005622 C intracellular
GO:0005623 C cell
GO:0005634 C nucleus
GO:0005665 C DNA-directed RNA polymerase II, core complex
GO:0005730 C nucleolus
GO:0006139 P nucleobase-containing compound metabolic process
GO:0006351 P transcription, DNA-templated
GO:0006366 P transcription from RNA polymerase II promoter
GO:0006725 P cellular aromatic compound metabolic process
GO:0006807 P nitrogen compound metabolic process
GO:0008152 P metabolic process
GO:0009058 P biosynthetic process
GO:0009059 P macromolecule biosynthetic process
GO:0009987 P cellular process
GO:0010467 P gene expression
GO:0016070 P RNA metabolic process
GO:0018130 P heterocycle biosynthetic process
GO:0019438 P aromatic compound biosynthetic process
GO:0031974 C membrane-enclosed lumen
GO:0031981 C nuclear lumen
GO:0032774 P RNA biosynthetic process
GO:0032991 C macromolecular complex
GO:0034641 P cellular nitrogen compound metabolic process
GO:0034645 P cellular macromolecule biosynthetic process
GO:0034654 P nucleobase-containing compound biosynthetic process
GO:0043170 P macromolecule metabolic process
GO:0043226 C organelle
GO:0043227 C membrane-bounded organelle
GO:0043229 C intracellular organelle
GO:0043231 C intracellular membrane-bounded organelle
GO:0043233 C organelle lumen
GO:0043234 C protein complex
GO:0044237 P cellular metabolic process
GO:0044238 P primary metabolic process
GO:0044249 P cellular biosynthetic process
GO:0044260 P cellular macromolecule metabolic process
GO:0044271 P cellular nitrogen compound biosynthetic process
GO:0044422 C organelle part
GO:0044424 C intracellular part
GO:0044428 C nuclear part
GO:0044446 C intracellular organelle part
GO:0044464 C cell part
GO:0046483 P heterocycle metabolic process
GO:0070013 C intracellular organelle lumen
GO:0071704 P organic substance metabolic process
GO:0090304 P nucleic acid metabolic process
GO:1901360 P organic cyclic compound metabolic process
GO:1901362 P organic cyclic compound biosynthetic process
GO:1901576 P organic substance biosynthetic process
GO:1902494 C catalytic complex
GO:1990234 C transferase complex
When the assigned GO terms above are visualized in the GO DAG (see picture below; colored boxes represent GO terms assigned to gene "PFC0155c"), there are 6 terminal GO terms (reddish ovals in the picture) across the Biological Process, Cellular Component, and Molecular Function ontologies:
GO:0003677 F DNA binding
GO:0005198 F structural molecule activity
GO:0003899 F DNA-directed RNA polymerase activity
GO:0006366 P transcription from RNA polymerase II promoter
GO:0005730 C nucleolus
GO:0005665 C DNA-directed RNA polymerase II, core complex
The 6 terminal GO terms are the ones I want from the initial assigned list. I have 1600+ genes for which I want terminal GO terms from assigned GO terms.
Is there a way to automate this? Any ideas are welcome. I would prefer to keep the gene-to-GO-term association.
Thanks.
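To make the request concrete, here is a minimal sketch of the filtering described above: a term is kept for a gene only if it is not an ancestor of any other term assigned to the same gene. Everything in it is an assumption for illustration: the parent map `PARENTS`, the term IDs (`T:root`, `T:a`, ...), and the gene names are toy placeholders, not real GO data. In practice the parent map would come from the ontology itself.

```python
def ancestors(term, parents, cache):
    """Return the set of all ancestors of `term`, walking the parent map upward."""
    if term in cache:
        return cache[term]
    found = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(parents.get(node, ()))
    cache[term] = found
    return found


def terminal_terms(assigned, parents, cache=None):
    """Keep only assigned terms that are not an ancestor of another assigned term."""
    cache = {} if cache is None else cache
    assigned = set(assigned)
    covered = set()
    for term in assigned:
        covered |= ancestors(term, parents, cache)
    return sorted(assigned - covered)


# Hypothetical toy DAG: child term -> list of direct parents (not real GO terms).
PARENTS = {
    "T:root": [],
    "T:a": ["T:root"],
    "T:b": ["T:root"],
    "T:c": ["T:a", "T:b"],
    "T:d": ["T:a"],
    "T:e": ["T:d"],
}

# Hypothetical gene -> assigned terms table, keeping the gene association.
GENE2TERMS = {
    "geneA": ["T:root", "T:a", "T:b", "T:c", "T:d"],
    "geneB": ["T:root", "T:a", "T:d", "T:e"],
}

cache = {}  # shared across genes so each term's ancestors are computed once
terminal = {
    gene: terminal_terms(terms, PARENTS, cache)
    for gene, terms in GENE2TERMS.items()
}

for gene, terms in terminal.items():
    print(gene, terms)
```

Note that "terminal" here is relative to each gene's own assigned list, not to the leaves of the whole ontology, which matches the example above where the kept terms are simply the most specific ones assigned to the gene.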
Hi Martin,
This is helpful; it should work. For CC=GOCCPARENTS, BP=GOBPPARENTS, MF=GOMFPARENTS: does each of these map every GO term to its ancestral/parent GO terms? If so, can I get this information in a file? Thanks.
How did you create the object "db"?