I am wondering if anyone knows of a program or script that can retrieve information for more than 100 genes. Basically, I want the Gene Ontology annotations under "Biological process", "Molecular function" and "Cellular component".
Thanks a bunch
Given a list of IDs:
$ cat /tmp/list.txt
ERVMER34-1
BMP4
DNAJA1
ELANE
GZMB
RACK1
DNAJB1
Grab the GAF file of UniProt id-to-GO mappings:
$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human.gaf.gz | gunzip -c > /tmp/goa_human.gaf
Query your list of identifiers:
$ grep -wf /tmp/list.txt /tmp/goa_human.gaf > /tmp/query_results.txt
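Note that `grep -w` matches a symbol as a whole word anywhere on the line, so a gene name that appears only in another row's description or synonym column is also reported. A stricter variant compares just the symbol column. This is a minimal sketch, assuming the standard GAF layout in which column 3 holds the gene symbol; the `filter_by_symbol` name and the two-row sample (placeholder accessions and GO IDs) are made up for illustration:

```shell
# Hypothetical helper: keep GAF rows whose gene symbol (column 3) is in the list.
# $1 = file with one symbol per line, $2 = GAF file
filter_by_symbol() {
    awk -F'\t' '
        NR == FNR { wanted[$1] = 1; next }  # first file: remember each symbol
        /^!/      { next }                  # skip GAF header lines
        ($3 in wanted)                      # print rows with a listed symbol
    ' "$1" "$2"
}

# Made-up sample (placeholder accessions and GO IDs) so the sketch runs alone.
demo_dir=$(mktemp -d)
printf 'BMP4\n' > "$demo_dir/list.txt"
{
    printf '!gaf-version: 2.2\n'
    printf 'UniProtKB\tACC0001\tBMP4\tinvolved_in\tGO:9999991\n'
    printf 'UniProtKB\tACC0002\tOTHER1\tenables\tGO:9999992\tnote mentioning BMP4\n'
} > "$demo_dir/sample.gaf"

symbol_hits=$(filter_by_symbol "$demo_dir/list.txt" "$demo_dir/sample.gaf")
printf '%s\n' "$symbol_hits"
```

On the real data the call would be `filter_by_symbol /tmp/list.txt /tmp/goa_human.gaf > /tmp/query_results.txt`. The second sample row is dropped here because BMP4 only appears in its free-text column.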
Use GO.db in R to read in GO data, and read your query results into a data frame to get mapped GO terms:
> library("GO.db")
> go_term_table <- toTable(GOTERM)
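> # GAF is tab-separated; the GO ID is in column 5 (column 4 is the qualifier)
> df <- read.delim("/tmp/query_results.txt", header=FALSE, sep="\t", quote="", comment.char="", stringsAsFactors=FALSE)
> ids <- unique(df$V5)
> unique_go_ids <- ids[grepl("^GO:", ids)]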
You can then query the GO term table against your identifiers; for example, for the Biological Process ontology:
> biological_process <- go_term_table[go_term_table$Ontology == "BP" & go_term_table$go_id %in% unique_go_ids, ]
Repeat as needed for the other ontologies ("MF" and "CC"). Use write.table and similar to write R results to a file, if needed.
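If all you need is which of the three ontologies each annotation belongs to, the GAF rows already carry that as a one-letter aspect code (P, F or C), so a quick summary is possible without R. A rough sketch, assuming the standard GAF layout (column 3 = symbol, column 5 = GO ID, column 9 = aspect); the `summarise_aspects` name and the sample rows are invented placeholders:

```shell
# Hypothetical helper: print symbol, GO ID and ontology name for each GAF row.
# Assumes the GAF layout: column 3 = symbol, column 5 = GO ID, column 9 = aspect.
summarise_aspects() {
    awk -F'\t' -v OFS='\t' '
        BEGIN {
            name["P"] = "Biological process"
            name["F"] = "Molecular function"
            name["C"] = "Cellular component"
        }
        ($9 in name) { print $3, $5, name[$9] }
    ' "$1" | sort -u
}

# Made-up rows (placeholder accession, GO IDs and references) for a standalone run.
aspect_dir=$(mktemp -d)
{
    printf 'UniProtKB\tACC0001\tBMP4\tinvolved_in\tGO:9999991\tREF:1\tIEA\t\tP\n'
    printf 'UniProtKB\tACC0001\tBMP4\tenables\tGO:9999992\tREF:2\tIEA\t\tF\n'
    printf 'UniProtKB\tACC0001\tBMP4\tlocated_in\tGO:9999993\tREF:3\tIEA\t\tC\n'
} > "$aspect_dir/query_results.txt"

aspect_summary=$(summarise_aspects "$aspect_dir/query_results.txt")
printf '%s\n' "$aspect_summary"
```

This gives GO IDs only; the human-readable term names still have to come from a term table such as the one GO.db provides.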
See http://bioconductor.org/packages/release/data/annotation/html/GO.db.html for information on how to install GO.db.
This was exceptionally helpful, and I appreciate you taking the time to write this out. I'll add, for future readers who have gene lists similar to the OP's: using fgrep (equivalent to grep -F) instead of grep can be substantially faster when the list.txt file is long.
https://stackoverflow.com/questions/13913014/grepping-a-huge-file-80gb-any-way-to-speed-it-up
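For reference, the fixed-string form of the query looks like the following. It is a small sketch on made-up sample files, only meant to show that `-F` returns the same rows as the regex form when the identifiers contain no regex metacharacters; any speed difference would have to be measured on the real files:

```shell
# Made-up sample files so the comparison runs on its own.
grep_dir=$(mktemp -d)
printf 'BMP4\nELANE\n' > "$grep_dir/list.txt"
printf 'row1 BMP4 x\nrow2 GZMB y\nrow3 ELANE z\n' > "$grep_dir/data.txt"

# Original form: each list line is treated as a regular expression.
regex_hits=$(grep -w -f "$grep_dir/list.txt" "$grep_dir/data.txt")

# Fixed-string form: -F is the modern spelling of fgrep.
fixed_hits=$(grep -F -w -f "$grep_dir/list.txt" "$grep_dir/data.txt")

printf '%s\n' "$fixed_hits"
```

Applied to the files in the answer, that would be `grep -F -w -f /tmp/list.txt /tmp/goa_human.gaf > /tmp/query_results.txt`.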
@Alex Reynolds do you know how I can find out which information can be extracted from go_term_table? I tried ?go_term_table and help, but they do not show anything. I also googled it with no success. I would appreciate it if you could point me to some documentation. Basically, I want to add the gene name, gene ID, definition, etc.
go_term_table is the name of a variable, so you're not going to get anything out of R from running ?go_term_table. To see which columns it holds, run colnames(go_term_table) or head(go_term_table).
Run ?toTable if you want to learn about that command, but maybe start with the vignette and then read documentation about specific commands:
• https://www.bioconductor.org/packages/release/bioc/vignettes/annotate/inst/doc/GOusage.pdf
• http://bioconductor.org/packages/release/data/annotation/manuals/GO.db/man/GO.db.pdf
Maybe use join functions to connect the go_term_table lookup with results from df (query_results.txt): https://dplyr.tidyverse.org/reference/join.html
I'd think you could join on the GO:xyz identifier, for instance.
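The same join idea can also be tried on the command line. The sketch below assumes the term table has been written out from R as a tab-separated file with the GO ID, term name and ontology in columns 1 to 3 (that file layout, the `annotate_with_terms` name and all sample values are assumptions for illustration), and that the GO ID sits in column 5 of the GAF rows:

```shell
# Hypothetical helper: attach term name and ontology to each GAF hit by GO ID.
# $1 = term table (GO ID <TAB> term name <TAB> ontology), $2 = GAF rows
annotate_with_terms() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR { term[$1] = $2; onto[$1] = $3; next }   # load the term table
        ($5 in term) { print $3, $5, onto[$5], term[$5] }  # inner join on GO ID
    ' "$1" "$2"
}

# Made-up inputs (placeholder GO IDs, accessions and term names).
join_dir=$(mktemp -d)
printf 'GO:9999991\texample term A\tBP\n' > "$join_dir/terms.tsv"
{
    printf 'UniProtKB\tACC0001\tBMP4\tinvolved_in\tGO:9999991\n'
    printf 'UniProtKB\tACC0003\tELANE\tenables\tGO:9999999\n'
} > "$join_dir/query_results.txt"

joined_rows=$(annotate_with_terms "$join_dir/terms.tsv" "$join_dir/query_results.txt")
printf '%s\n' "$joined_rows"
```

Like an inner join, this drops hits whose GO ID is missing from the term table (the ELANE row in the sample), so the gene symbol ends up next to the term name and ontology only for matched IDs.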
You can use UniProt for a list: click on Retrieve/ID mapping, https://www.uniprot.org/uploadlists/
1. Enter your list as a file or as pasted text.
2. Specify the identifier type of your list. For gene names you can optionally specify a species; otherwise every species that has these gene names will be included in your results.
You can control which columns appear in the results table. You need BP, MF and CC, so edit the columns and tick them under the Gene Ontology (GO) tab.
Can you explain what your input is? It may be a grep on a file in ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN but I can't tell from your question.
@Alex Reynolds the input can be either protein names or gene names. For instance, let's use a list of 7 human genes.
https://www.uniprot.org/uniprot/?query=gene:BMP4+AND+reviewed:yes+AND+organism:9606#goViewBy
https://www.uniprot.org/uniprot/?query=gene:ELANE+AND+reviewed:yes+AND+organism:9606#goViewBy
Construct others as needed.
@genomax this requires going through UniProt one gene at a time and copying and pasting the info from there. That is impractical when you have 100 or more genes. Do you know a better way?
These queries can be programmatically constructed. You will find help from UniProt here. They may also have a downloadable file on their FTP site that could be queried. As Alex said, other resources may have this information more readily available.
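As a rough illustration of building such queries from a list, the loop below fills each gene symbol into the URL pattern used in the two example links. The `build_uniprot_urls` name and the sample list are made up, the sketch only prints URLs without fetching them, and whether UniProt still serves this URL scheme is not checked here:

```shell
# Hypothetical helper: print one UniProt query URL per gene symbol in a file.
build_uniprot_urls() {
    while IFS= read -r gene; do
        [ -n "$gene" ] || continue   # skip blank lines
        printf 'https://www.uniprot.org/uniprot/?query=gene:%s+AND+reviewed:yes+AND+organism:9606#goViewBy\n' "$gene"
    done < "$1"
}

# Made-up two-gene list so the sketch runs on its own.
url_dir=$(mktemp -d)
printf 'BMP4\nELANE\n' > "$url_dir/list.txt"

gene_urls=$(build_uniprot_urls "$url_dir/list.txt")
printf '%s\n' "$gene_urls"
```

Symbols containing characters that are special in URLs would need encoding first; plain names such as BMP4 or ELANE do not.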
Google: retrieve uniprot mapping. Any luck?
Tell us which identifiers and file formats you have. Print the head of your list/file.
@Biogeek I gave an example above: a list of genes. And of course I could not find anything on Google. Please use the following gene names as an example.
The format can be txt, xls or whatever else is needed.