Extract information from UniProt
6.0 years ago
Learner ▴ 280

Does anyone know of a program or script that can retrieve information for more than 100 genes? Basically, I want the annotations for "Biological process", "Molecular function" and "Cellular component".

Thanks a bunch

Can you explain what your input is? It may be a grep on a file in ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN but I can't tell from your question.

@Alex Reynolds the input can be either protein names or gene names. For instance, let's use a list of 7 human genes:

ERVMER34-1
BMP4 
DNAJA1
ELANE
GZMB
RACK1
DNAJB1

@genomax this requires going through UniProt one entry at a time and copying and pasting the info from there. That is impractical when you have 100 or more genes. Do you know a better way?

These queries can be constructed programmatically. You will find help from UniProt here. They may also have a downloadable file on their FTP site that could be queried. As Alex said, other resources may have this information more readily available.
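
As a rough sketch of what a programmatically constructed query might look like with UniProt's present-day REST service: the endpoint (rest.uniprot.org/uniprotkb/search), the gene_exact, organism_id and reviewed query fields, and the go_p, go_f and go_c return fields are all assumptions about the current API, not something given in this thread, so check them against UniProt's own documentation. The build_url helper is hypothetical and only prints URLs; nothing is fetched.

```shell
# Assumed endpoint and field names for the current UniProt REST API; verify before use.
base='https://rest.uniprot.org/uniprotkb/search'
fields='accession,gene_primary,go_p,go_f,go_c'

# Hypothetical helper: print a search URL for one gene symbol,
# restricted to reviewed human entries (taxon 9606), returned as TSV.
build_url() {
  printf '%s?query=gene_exact:%s+AND+organism_id:9606+AND+reviewed:true&fields=%s&format=tsv\n' \
    "$base" "$1" "$fields"
}

for gene in BMP4 ELANE GZMB; do
  build_url "$gene"
done
```

Each printed URL could then be passed to a downloader such as curl or wget, one request per gene.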

Google: retrieve uniprot mapping. Any luck?

Tell us what identifiers and file formats you have. Print the head of your list/file.

@Biogeek I gave an example above: a list of genes. And of course I could not find anything on Google. Please use the following gene names as an example:

ERVMER34-1
BMP4 
DNAJA1
ELANE
GZMB
RACK1
DNAJB1

The format can be txt, xls, or whatever else is needed.

6.0 years ago

Given a list of IDs:

$ cat /tmp/list.txt 
ERVMER34-1
BMP4 
DNAJA1
ELANE
GZMB
RACK1
DNAJB1

Grab the GAF file of UniProt id-to-GO mappings:

$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human.gaf.gz | gunzip -c > /tmp/goa_human.gaf

Query your list of identifiers:

$ grep -wf /tmp/list.txt /tmp/goa_human.gaf > /tmp/query_results.txt
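
One caveat with the grep step: grep -wf matches a name anywhere on the line (synonyms and protein descriptions included), and any stray whitespace in list.txt, such as a trailing space after a gene name, becomes part of the pattern. A possible alternative is to compare each name only against the gene symbol, which the GAF specification puts in column 3. The sketch below uses toy rows; the accessions, GO IDs and annotations are made up for illustration and are not real GOA data.

```shell
# Toy data only: accessions, GO IDs and annotations are made up.
workdir=$(mktemp -d)

# Gene list with a stray trailing space after BMP4.
printf 'BMP4 \nELANE\n' > "$workdir/list.txt"

# One header comment plus three toy GAF rows (truncated to 10 columns).
{
  printf '!gaf-version: 2.2\n'
  printf 'UniProtKB\tX00001\tBMP4\tenables\tGO:1111111\tPMID:0\tIDA\t\tF\ttoy protein one\n'
  printf 'UniProtKB\tX00002\tELANE\tinvolved_in\tGO:2222222\tPMID:0\tIDA\t\tP\ttoy protein two\n'
  printf 'UniProtKB\tX00003\tOTHER1\tlocated_in\tGO:3333333\tPMID:0\tIDA\t\tC\ttoy protein that mentions ELANE\n'
} > "$workdir/toy.gaf"

# First file (NR == FNR): load the gene names, trimmed of surrounding whitespace.
# Second file: skip "!" header lines, keep rows whose symbol ($3) is in the list.
awk -F '\t' '
  NR == FNR { gsub(/^[ \t\r]+|[ \t\r]+$/, ""); if ($0 != "") wanted[$0] = 1; next }
  /^!/      { next }
  $3 in wanted
' "$workdir/list.txt" "$workdir/toy.gaf" > "$workdir/query_results.txt"

# Show symbol, GO ID and aspect of the rows that were kept.
cut -f3,5,9 "$workdir/query_results.txt"
```

Here the OTHER1 row is dropped even though its description mentions ELANE, and BMP4 is still found despite the trailing space in the list.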

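Use GO.db in R to read in GO data, and read your query results into a data frame to get mapped GO terms. GAF is tab-separated, with the GO ID in column 5:

> library("GO.db")
> go_term_table <- toTable(GOTERM)
> # Split on tabs only; quote="" keeps apostrophes in protein names from breaking the parse
> df <- read.delim("/tmp/query_results.txt", header=FALSE, quote="", stringsAsFactors=FALSE)
> ids <- unique(df$V5)
> unique_go_ids <- ids[grepl("^GO:", ids)]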

You can then query the GO term table against your identifiers; for example, for the Biological Process ontology:

> biological_process <- go_term_table[go_term_table$Ontology == "BP" & go_term_table$go_id %in% unique_go_ids, ]

Repeat as needed for the other ontologies. Use write.table and similar to write R results to a file, if needed.

See: http://bioconductor.org/packages/release/data/annotation/html/GO.db.html for information on how to install GO.db.

This was exceptionally helpful, and I appreciate you taking the time to write this out. I'll add, for future readers who have gene lists similar to the OP's: using fgrep (that is, grep -F) instead of grep can give a substantial speedup when the list.txt file is long.

https://stackoverflow.com/questions/13913014/grepping-a-huge-file-80gb-any-way-to-speed-it-up

@Alex Reynolds what about "Molecular function" and "Cellular component"? I think I should use MF and CC.

Seems reasonable to use.
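
For what it's worth, the GAF rows themselves also carry an aspect code in column 9 (P, F or C per the GAF specification, corresponding to BP, MF and CC), so the three categories can be split straight from query_results.txt without going through GO.db. A minimal sketch with toy rows; the gene symbols, accessions and GO IDs are placeholders, not real annotations, and the output file names are just a choice made here.

```shell
# Toy stand-in for query_results.txt (first 9 GAF columns only).
workdir=$(mktemp -d)
{
  printf 'UniProtKB\tX00001\tGENE_A\tenables\tGO:1111111\tPMID:0\tIDA\t\tF\n'
  printf 'UniProtKB\tX00001\tGENE_A\tinvolved_in\tGO:2222222\tPMID:0\tIDA\t\tP\n'
  printf 'UniProtKB\tX00002\tGENE_B\tlocated_in\tGO:3333333\tPMID:0\tIDA\t\tC\n'
} > "$workdir/query_results.txt"

# Map the GAF aspect code to the ontology abbreviation and write
# "symbol <TAB> GO ID" to one file per ontology.
awk -F '\t' -v OFS='\t' -v out="$workdir" '
  BEGIN { label["P"] = "BP"; label["F"] = "MF"; label["C"] = "CC" }
  ($9 in label) { print $3, $5 > (out "/" label[$9] ".tsv") }
' "$workdir/query_results.txt"

for onto in BP MF CC; do
  printf '== %s ==\n' "$onto"
  cat "$workdir/$onto.tsv"
done
```

This keeps the gene symbol next to each GO ID, but it does not add term names or definitions; those still have to come from a GO term table.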

@Alex Reynolds do you know how to find out which info I can extract from go_term_table? I tried to list it using ?go_term_table or help, but that does not show anything. I also googled it with no success. I would appreciate it if you could direct me to some info. Basically, I want to add the gene name to the gene ID, definition, etc.

go_term_table is the name of a variable, so you're not going to get anything out of R from running ?go_term_table.

Run ?toTable if you want to learn about that command, but maybe start with the vignette and then read documentation about specific commands:

• https://www.bioconductor.org/packages/release/bioc/vignettes/annotate/inst/doc/GOusage.pdf

• http://bioconductor.org/packages/release/data/annotation/manuals/GO.db/man/GO.db.pdf

@Alex Reynolds thanks for the link. Is it possible to keep the information from "query_results" merged with the GO data, or at least to see the gene name? I think what you get from the first part is the GO IDs, and then you extract the data from GO.db.

Maybe use join functions to connect the go_term_table lookup with results from df (query_results.txt): https://dplyr.tidyverse.org/reference/join.html

I'd think you could join on the GO:xyz identifier, for instance.
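
The same kind of join on the GO identifier can also be sketched outside R with the standard sort and join utilities. This assumes two tab-separated files that are not part of the thread: pairs.tsv (gene symbol and GO ID, as could be cut from GAF columns 3 and 5) and go_terms.tsv (a hypothetical export of go_term_table with GO ID, term and ontology). All rows below are toy placeholders.

```shell
workdir=$(mktemp -d)
TAB=$(printf '\t')

# Toy gene-to-GO pairs: symbol, GO ID.
printf 'GENE_B\tGO:1111111\nGENE_A\tGO:2222222\nGENE_A\tGO:1111111\n' > "$workdir/pairs.tsv"

# Hypothetical export of go_term_table: GO ID, term, ontology.
printf 'GO:1111111\ttoy term one\tBP\nGO:2222222\ttoy term two\tMF\n' > "$workdir/go_terms.tsv"

# join needs both inputs sorted on the join field under the same collation.
LC_ALL=C sort -u -t "$TAB" -k2,2 -k1,1 "$workdir/pairs.tsv" > "$workdir/pairs.sorted.tsv"
LC_ALL=C sort -t "$TAB" -k1,1 "$workdir/go_terms.tsv" > "$workdir/go_terms.sorted.tsv"

# Join field 2 of the pairs file with field 1 of the term table.
# Output columns: GO ID, gene symbol, term, ontology.
joined=$(LC_ALL=C join -t "$TAB" -1 2 -2 1 \
  "$workdir/pairs.sorted.tsv" "$workdir/go_terms.sorted.tsv")
printf '%s\n' "$joined"
```

Every gene keeps its own row, so a GO term shared by several genes simply appears once per gene.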

@Alex Reynolds I think many genes are assigned to one GO term. Do you think it is possible to do the join before the ids <- unique(...) step?

6.0 years ago

You can use UniProt directly for a list: click on Retrieve/ID mapping (https://www.uniprot.org/uploadlists/).

1- Enter your list as a file or as pasted text.
2- Specify the identifier type of your list. For gene names, you can optionally specify a species; otherwise, every species that contains these gene names will be included in your results.

You can control what appears in your results table. You need BP, MF, and CC, so edit the columns and tick them under the Gene Ontology (GO) tab.

https://www.uniprot.org/uniprot/?query=yourlist:M201812066746803381A1F0E0DB47453E0216320D06CFD34&sort=yourlist:M201812066746803381A1F0E0DB47453E0216320D06CFD34
