Question

Where can I download GO terms and their associated E. coli genes?

0

5.8 years ago

O.rka ▴ 740

I'm trying to download a flat file that has the following info:

GOTERM
GOTERM DESCRIPTION
GOTERM SET (biological process, molecular function, cellular component)
GENE LIST (as EcoCyc (e.g. EG10894), UniProt (e.g. P0A8V2), or Blattner (e.g. b3987) IDs)

Preferably this would be a flat file that I could download from a website, but I'm open to Python or R solutions as well.

I have access to the EcoCyc flat files, but I can't find anything about GO terms in them, even though the terms are shown on the website.

Does anyone know where/how I can do this?

gene • 3.9k views

ADD COMMENT • link updated 5.8 years ago by EagleEye 7.6k • written 5.8 years ago by O.rka ▴ 740

0

5.8 years ago

AK ★ 2.2k

A slightly roundabout approach, but it works well in R:

> library(tidyverse)
> library(GO.db)
> uniprot_id2go <-
+   read_tsv("https://www.uniprot.org/uniprot/?query=organism:83333&format=tab&columns=id,go-id") %>%
+   separate_rows(., `Gene ontology IDs`, sep = "; ") %>%
+   as.data.frame()
Parsed with column specification:
cols(
  Entry = col_character(),
  `Gene ontology IDs` = col_character()
)
> uniprot_id2go$Desc <- Term(uniprot_id2go$`Gene ontology IDs`)
> uniprot_id2go$Ontology <- Ontology(uniprot_id2go$`Gene ontology IDs`)
> str(uniprot_id2go)
'data.frame':   21436 obs. of  4 variables:
 $ Entry            : chr  "P07813" "P07813" "P07813" "P07813" ...
 $ Gene ontology IDs: chr  "GO:0002161" "GO:0004823" "GO:0005524" "GO:0005829" ...
 $ Desc             : chr  "aminoacyl-tRNA editing activity" "leucine-tRNA ligase activity" "ATP binding" "cytosol" ...
 $ Ontology         : chr  "MF" "MF" "MF" "CC" ...
> uniprot_id2go %>% filter(Entry == "P0A8V2")
   Entry Gene ontology IDs                                       Desc Ontology
1 P0A8V2        GO:0003677                                DNA binding       MF
2 P0A8V2        GO:0003899 DNA-directed 5'-3' RNA polymerase activity       MF
3 P0A8V2        GO:0005737                                  cytoplasm       CC
4 P0A8V2        GO:0005829                                    cytosol       CC
5 P0A8V2        GO:0006351               transcription, DNA-templated       BP
6 P0A8V2        GO:0016020                                   membrane       CC
7 P0A8V2        GO:0032549                     ribonucleoside binding       MF
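
Since the question asks for a gene list per GO term, here is a minimal Python sketch of the same "split the GO column into rows" idea, followed by inverting the table into GO ID -> entries. It is only an illustration: `SAMPLE_TSV` and `go_to_entries` are made-up names, and the two sample rows are truncated copies of the rows shown in the output above, not a real download.

```python
import csv
import io
from collections import defaultdict

# Inline sample in the two-column layout shown above (Entry, Gene ontology IDs).
# The rows are truncated copies of the output above; a real run would read the
# downloaded tab-separated file instead of this string (assumption).
SAMPLE_TSV = (
    "Entry\tGene ontology IDs\n"
    "P07813\tGO:0002161; GO:0004823; GO:0005524; GO:0005829\n"
    "P0A8V2\tGO:0003677; GO:0003899; GO:0005737; GO:0005829\n"
)


def go_to_entries(handle):
    """Map each GO ID to the sorted list of entries annotated with it."""
    mapping = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        ids = row["Gene ontology IDs"] or ""  # tolerate entries without GO IDs
        for go_id in ids.split(";"):
            go_id = go_id.strip()
            if go_id:
                mapping[go_id].add(row["Entry"])
    return {go_id: sorted(entries) for go_id, entries in mapping.items()}


go2entries = go_to_entries(io.StringIO(SAMPLE_TSV))
for go_id in sorted(go2entries):
    # One line per GO term: GO ID, tab, comma-separated entries.
    print(go_id, ",".join(go2entries[go_id]), sep="\t")
```

The term description and ontology columns are not reproduced here; in the R session they come from GO.db.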

Hope it helps.

ADD COMMENT • link 5.8 years ago by AK ★ 2.2k

score 2 · Accepted Answer · 2019-05-25

2

5.8 years ago

EagleEye 7.6k

Have a look at these posts:

A: How To Get Gene List From Each Gene Ontology Term?

A: How to look up GO terms associated to a certain organism?

ADD COMMENT • link 5.8 years ago by EagleEye 7.6k

0

Thanks for this. GeneSCF looks like a good tool, but the formatting is extremely unusual. For example: GO:0000049tyajQ,tsaC,trmA,selB,truA,trmO,rlmN,dusC,tmcA,truB,arfA,thiI,rplP,epmA,lysS,lysU,tmolecular_function~ There seem to be multiple delimiters, such as t and ,. I could write a parser for this, but I don't want to introduce errors by not knowing all of the rules (this is only a single case). Is there a way to get this into a more consistent format that I could load into a dataframe?

ADD REPLY • link 5.8 years ago by O.rka ▴ 740

0

Hi,

Glad that GeneSCF was helpful. The file downloaded with 'prepare_database' follows this format:

GOID1~GONAME1<TAB>Gene1,Gene2
GOID2~GONAME2<TAB>Gene1,Gene2
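
Given that layout, a small parser could look like the sketch below. It assumes the file really is `GOID~GONAME<TAB>Gene1,Gene2` with a real tab character; the function names `parse_genescf_line` and `parse_genescf` are made up for this example and are not part of GeneSCF.

```python
def parse_genescf_line(line):
    """Split one 'GOID~GONAME<TAB>Gene1,Gene2' line into (go_id, go_name, genes)."""
    term, _, gene_field = line.rstrip("\n").partition("\t")
    go_id, _, go_name = term.partition("~")
    genes = [gene for gene in gene_field.split(",") if gene]
    return go_id, go_name, genes


def parse_genescf(lines):
    """Return one dict per (GO term, gene) pair, ready to load into a dataframe."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        go_id, go_name, genes = parse_genescf_line(line)
        for gene in genes:
            rows.append({"go_id": go_id, "go_name": go_name, "gene": gene})
    return rows


# The two placeholder lines from the format description above.
example = [
    "GOID1~GONAME1\tGene1,Gene2\n",
    "GOID2~GONAME2\tGene1,Gene2\n",
]
rows = parse_genescf(example)
for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed directly to a dataframe constructor; to parse a real file, pass the open file handle to `parse_genescf` instead of `example`.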

ADD REPLY • link 5.8 years ago by EagleEye 7.6k

0

There is a literal t instead of a tab character in my download, but I think it should be OK. Are GO IDs always 7 digits?

ADD REPLY • link 5.8 years ago by O.rka ▴ 740

1

"Are GO characters always 7 digits?" Yes: a GO ID is always "GO:" followed by 7 digits.
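
Because the numeric part has a fixed width, the ID can be validated, or peeled off the front of a line even when the delimiter after it is missing. A rough Python sketch (the helper names `is_go_id` and `split_leading_go_id` are made up for illustration):

```python
import re

# "GO:" followed by exactly seven digits.
GO_ID_PATTERN = re.compile(r"GO:\d{7}")


def is_go_id(text):
    """True if the whole string is a well-formed GO ID."""
    return GO_ID_PATTERN.fullmatch(text) is not None


def split_leading_go_id(text):
    """Return (go_id, rest) if text starts with a GO ID, otherwise None."""
    match = GO_ID_PATTERN.match(text)
    if match is None:
        return None
    return match.group(0), text[match.end():]


print(is_go_id("GO:0000049"))
print(split_leading_go_id("GO:0000049tyajQ,tsaC"))
```

This only recovers the ID itself; whatever follows it still has to be interpreted according to the file's actual delimiters.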

"There is a t instead of tab character on my download but I think it should be ok."

Warning: make sure to check the system requirements for GeneSCF. It only works on Linux systems and has been successfully tested on Ubuntu, Mint, CentOS, and Windows 10 bash (version 1607 and above). Other Linux distributions might work as well.

I just downloaded a fresh copy of GeneSCF to check for the problem you mentioned, and I was not able to reproduce the error or the formatting issue. I am attaching a screenshot of sample results downloaded by GeneSCF for ecocyc.

./prepare_database -db=GO_all -org=ecocyc
Downloading GO database....
Extracting ecocyc information...
Updating gene information...
Do not panic. The processing is going on...
Database retreived..You are now ready to use geneSCF with organism ecocyc from --database GO
Done....Mon May 27 22:52:47 CEST 2019

ADD REPLY • link 5.8 years ago by EagleEye 7.6k

1

Thanks again, this is extremely helpful. I ran it on OS X, but I will run it again on my Linux machine when I get to the lab tomorrow.