Where can I download GO terms and their associated E. coli genes?
5.6 years ago
O.rka ▴ 740

I'm trying to download a flat file that has the following info:

  1. GOTERM
  2. GOTERM DESCRIPTION
  3. GOTERM SET (Biological process, molecular functions, cellular components)
  4. GENE LIST (either EcoCyc (e.g. EG10894), UniProt (e.g. P0A8V2), or Blattner (e.g. b3987) identifiers).

Preferably this would be a flat file that I could download from a website, but I'm open to Python or R solutions as well.

I have access to the EcoCyc flat files, but I can't find anything about GO terms in them, even though the terms are shown on the website.

Does anyone know where/how I can do this?

gene • 3.7k views

Thanks for this. GeneSCF looks good, but the formatting is extremely unusual. For example: GO:0000049tyajQ,tsaC,trmA,selB,truA,trmO,rlmN,dusC,tmcA,truB,arfA,thiI,rplP,epmA,lysS,lysU,tmolecular_function~. There seem to be multiple delimiters, such as t and ,. I could write a parser for this, but I don't want to introduce errors without knowing all of the rules (this is only a single case). Is there a way to get this into a more consistent format that I could load into a dataframe?


Hi,

Glad that GeneSCF was helpful. The file downloaded with 'prepare_database' follows this format:

GOID1~GONAME1<TAB>Gene1,Gene2
GOID2~GONAME2<TAB>Gene1,Gene2
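
For loading this into a dataframe, the format above can be parsed with a few string splits. The following is a minimal Python sketch, assuming the file really is tab-delimited as described; the function names are made up for illustration and are not part of GeneSCF, and the sample lines are simply the two template lines above.

```python
def parse_genescf_line(line):
    """Split one 'GOID~GONAME<TAB>Gene1,Gene2' line into its three parts."""
    term, sep, gene_field = line.rstrip("\r\n").partition("\t")
    if not sep:
        # No tab found: the line does not follow the documented format.
        raise ValueError(f"no tab delimiter found in line: {line!r}")
    go_id, _, go_name = term.partition("~")
    genes = [g.strip() for g in gene_field.split(",") if g.strip()]
    return {"go_id": go_id, "go_name": go_name, "genes": genes}


def parse_genescf(lines):
    """Parse an iterable of lines, skipping blank ones."""
    return [parse_genescf_line(line) for line in lines if line.strip()]


# Placeholder input taken from the format description, not real output.
sample = [
    "GOID1~GONAME1\tGene1,Gene2\n",
    "GOID2~GONAME2\tGene1,Gene2\n",
]
records = parse_genescf(sample)
for rec in records:
    print(rec["go_id"], rec["go_name"], ",".join(rec["genes"]), sep=" | ")
```

The resulting list of dicts can be passed to pandas.DataFrame(records) if pandas is installed. A line without a tab raises an error instead of being parsed by guesswork.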

There is a t instead of tab character on my download but I think it should be ok. Are GO characters always 7 digits?


Are GO characters always 7 digits? Yes: a GO ID is the prefix GO: followed by exactly seven digits.
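
Because the ID has a fixed shape, it can be checked with a regular expression before the data are trusted. This is only a small Python sketch; the helper name is_go_id is hypothetical.

```python
import re

# "GO:" followed by exactly seven digits, e.g. GO:0000049.
GO_ID_PATTERN = re.compile(r"GO:\d{7}")


def is_go_id(text):
    """Return True only if the whole string is a well-formed GO ID."""
    return GO_ID_PATTERN.fullmatch(text) is not None


print(is_go_id("GO:0000049"))  # True
print(is_go_id("GO:49"))       # False
```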

There is a t instead of tab character on my download but I think it should be ok.

Warning: Make sure to check the system requirements for running GeneSCF. GeneSCF only works on Linux systems; it has been successfully tested on Ubuntu, Mint, CentOS, and Windows 10 bash (version 1607 and above). Other Linux distributions might work as well.

I just downloaded a fresh version of GeneSCF to check for the problem you mentioned, and I am not able to reproduce the error or the misformatting. I am attaching a screenshot of the downloaded sample results from GeneSCF for ecocyc.

./prepare_database -db=GO_all -org=ecocyc
Downloading GO database....
Extracting ecocyc information...
Updating gene information...
Do not panic. The processing is going on...
Database retreived..You are now ready to use geneSCF with organism ecocyc from --database GO
Done....Mon May 27 22:52:47 CEST 2019

[Screenshot: downloaded GeneSCF sample results for ecocyc]


Thanks again, this is extremely helpful. I ran it on OS X, but I will run it again on my Linux machine when I get to the lab tomorrow.


Yes, that will solve the issue.

5.6 years ago
AK ★ 2.2k

A slightly roundabout way that works well, using R:

> library(tidyverse)
> library(GO.db)
> uniprot_id2go <-
+   read_tsv("https://www.uniprot.org/uniprot/?query=organism:83333&format=tab&columns=id,go-id") %>%
+   separate_rows(., `Gene ontology IDs`, sep = "; ") %>%
+   as.data.frame()
Parsed with column specification:
cols(
  Entry = col_character(),
  `Gene ontology IDs` = col_character()
)
> uniprot_id2go$Desc <- Term(uniprot_id2go$`Gene ontology IDs`)
> uniprot_id2go$Ontology <- Ontology(uniprot_id2go$`Gene ontology IDs`)
> str(uniprot_id2go)
'data.frame':   21436 obs. of  4 variables:
 $ Entry            : chr  "P07813" "P07813" "P07813" "P07813" ...
 $ Gene ontology IDs: chr  "GO:0002161" "GO:0004823" "GO:0005524" "GO:0005829" ...
 $ Desc             : chr  "aminoacyl-tRNA editing activity" "leucine-tRNA ligase activity" "ATP binding" "cytosol" ...
 $ Ontology         : chr  "MF" "MF" "MF" "CC" ...
> uniprot_id2go %>% filter(Entry == "P0A8V2")
   Entry Gene ontology IDs                                       Desc Ontology
1 P0A8V2        GO:0003677                                DNA binding       MF
2 P0A8V2        GO:0003899 DNA-directed 5'-3' RNA polymerase activity       MF
3 P0A8V2        GO:0005737                                  cytoplasm       CC
4 P0A8V2        GO:0005829                                    cytosol       CC
5 P0A8V2        GO:0006351               transcription, DNA-templated       BP
6 P0A8V2        GO:0016020                                   membrane       CC
7 P0A8V2        GO:0032549                     ribonucleoside binding       MF
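
Since the question also allows Python, the separate_rows step, plus the inversion to one gene list per GO term, could be sketched as below. This is a minimal sketch using only the standard library. It assumes the table has already been downloaded with the two columns shown above, and the inline text is a small stand-in built from IDs that appear in the output above. The function names are hypothetical, and term descriptions and ontologies are left out because they need a separate GO term table (the role GO.db plays in the R code).

```python
import csv
import io
from collections import defaultdict

# Inline stand-in for the downloaded table: a subset of the GO IDs shown
# above for P07813 and P0A8V2.
tsv_text = (
    "Entry\tGene ontology IDs\n"
    "P07813\tGO:0002161; GO:0004823; GO:0005524; GO:0005829\n"
    "P0A8V2\tGO:0003677; GO:0003899; GO:0005737; GO:0005829\n"
)


def to_long_rows(text):
    """Return one (entry, go_id) pair per GO annotation."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for record in reader:
        for go_id in record["Gene ontology IDs"].split("; "):
            if go_id:
                rows.append((record["Entry"], go_id))
    return rows


def group_by_go(rows):
    """Invert the long table into GO ID -> list of UniProt entries."""
    grouped = defaultdict(list)
    for entry, go_id in rows:
        grouped[go_id].append(entry)
    return dict(grouped)


rows = to_long_rows(tsv_text)
go_to_entries = group_by_go(rows)
for go_id in sorted(go_to_entries):
    print(go_id, ",".join(go_to_entries[go_id]), sep="\t")
```

Each printed line is one GO term followed by its comma-separated UniProt entries, which is close to the flat file asked for in the question.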

Hope it helps.
