5.5 years ago
O.rka
I'm trying to download a flat file that has the following info:
- GOTERM
- GOTERM DESCRIPTION
- GOTERM SET (Biological process, molecular functions, cellular components)
- GENE LIST (either in EcoCyc (e.g. EG10894), Uniprot (e.g. P0A8V2), or Blattner (e.g. b3987))
Preferably a flat file that I could download from a website, but I'm open to Python or R as well.
I have access to EcoCyc flat files but I can't find anything about GO terms in there; though, they are on the website.
Does anyone know where/how I can do this?
Thanks for this.
GeneSCF looks like a good option, but the formatting is extremely unusual. For example:
GO:0000049tyajQ,tsaC,trmA,selB,truA,trmO,rlmN,dusC,tmcA,truB,arfA,thiI,rplP,epmA,lysS,lysU,tmolecular_function~
There seem to be multiple delimiters, like "t" and ",". I could create a parser for this, but I don't want to introduce an error by not knowing all of the rules (as this is only a single case). Is there a way to get this into a more consistent format that I could load into a dataframe?
Hi,
Glad that GeneSCF was helpful. The downloaded file in 'prepare_database' format follows these rules:
There is a "t" instead of a tab character in my download, but I think it should be OK. Are GO characters always 7 digits?
Are GO characters always 7 digits? YES
There is a t instead of tab character on my download but I think it should be ok.
I just downloaded a fresh version of GeneSCF and checked for the problem you mentioned. I am not able to reproduce the error or the formatting issues. I am attaching a screenshot of the sample results downloaded from GeneSCF for EcoCyc.
Thanks again, this is extremely helpful. I ran it on OSX, but I will run it again when I get to the lab tomorrow on my Linux machine.
Yes, that will solve the issue.
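For anyone who still needs to read the file as it came out on OSX (with a literal "t" where the tab should be), here is a minimal Python sketch. It is based only on the single example line quoted in this thread, so treat the layout as an assumption: a 7-digit GO ID, a separator (tab or literal "t"), a comma-separated gene list with an optional trailing comma, another separator, then the GO set name followed by "~". Since gene names can themselves start with "t" (tsaC, trmA, ...), the sketch anchors on the 7-digit ID and the three known set names instead of splitting on "t". The names `parse_line` and `to_rows` are hypothetical helpers, not part of GeneSCF.

```python
import re

# Assumed layout (inferred from one example line in this thread, not from
# GeneSCF documentation):
#   <GO id><sep><gene>,<gene>,...,<sep><go set>~<anything>
# where <sep> is a real tab or the literal "t" seen in the OSX download.
GO_SETS = "biological_process|molecular_function|cellular_component"
LINE_RE = re.compile(
    r"^(GO:\d{7})(?:\t|t)(.*?),?(?:\t|t)(" + GO_SETS + r")~(.*)$"
)


def parse_line(line):
    """Parse one line into a dict; raise ValueError if it does not fit."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("line does not match the assumed layout: %r" % line)
    go_id, genes, go_set, extra = match.groups()
    return {
        "go_id": go_id,
        "go_set": go_set,
        "extra": extra,  # whatever follows "~" (empty in the example)
        "genes": [g for g in genes.split(",") if g],
    }


def to_rows(lines):
    """Flatten parsed lines into one (go_id, go_set, gene) dict per gene."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = parse_line(line)
        for gene in record["genes"]:
            rows.append(
                {"go_id": record["go_id"],
                 "go_set": record["go_set"],
                 "gene": gene}
            )
    return rows


example = (
    "GO:0000049tyajQ,tsaC,trmA,selB,truA,trmO,rlmN,dusC,tmcA,truB,"
    "arfA,thiI,rplP,epmA,lysS,lysU,tmolecular_function~"
)
record = parse_line(example)
rows = to_rows([example])
```

The list returned by `to_rows` can be passed straight to `pandas.DataFrame(rows)` to get one row per GO term and gene pair. Lines that do not fit the assumed layout raise an error rather than being silently mis-split, which should make any rule not covered by this single example show up quickly.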