I am trying to get a gene summary to annotate variants in exome sequencing data. My gene IDs are based on HGNC nomenclature. I am looking for a gene summary similar to what is available at www.genecards.org (example: http://www.genecards.org/cgi-bin/carddisp.pl?gene=MAGI2#sum).
The GeneCards database uses Entrez Gene and other sources to populate its summary section. So I decided to go with Entrez Gene and found this file on their FTP site (ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz). However, the file contains only a very short description for any given gene compared to what is available on the GeneCards website.
I tried BioMart for Ensembl but had the same issue (using the Ensembl definition, GO terms and various other attributes didn't yield much information).
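For reference, the gene_info file mentioned above can still be useful as a lookup from gene symbol to Entrez GeneID, even though its description column is short. Below is a minimal sketch, assuming the file is tab-delimited with a header line starting with "#" and columns named GeneID, Symbol and description (check the header of your copy; the function names are my own):

```python
import csv
import gzip


def parse_gene_info(lines):
    """Map gene symbol -> (Entrez GeneID, description) from gene_info lines."""
    lines = iter(lines)
    # Assumption: the first line is a header such as "#tax_id<TAB>GeneID<TAB>Symbol...".
    header = next(lines).lstrip("#").rstrip("\n").split("\t")
    reader = csv.DictReader(
        lines, fieldnames=header, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return {row["Symbol"]: (row["GeneID"], row["description"]) for row in reader}


def load_gene_info(path):
    """Parse a local gzip-compressed copy of Homo_sapiens.gene_info.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return parse_gene_info(handle)
```

Something like load_gene_info("Homo_sapiens.gene_info.gz")["MAGI2"] would then give the GeneID and the short description for that symbol.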
Gene Ontology is probably the industry standard. There you can get structured and well-annotated functions for just about any gene. These are perfect for programmatically determining over-represented functions in a list of genes.
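To illustrate what "over-represented" means here, a common choice is a one-sided hypergeometric test per GO term. This is only a sketch of that statistic, not the method of any particular tool; the function and parameter names are hypothetical, and a real analysis would also correct for multiple testing:

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass for every outcome at least as extreme as `hits`.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total
```

For example, hypergeom_enrichment_p(20000, 150, 300, 12) would ask how surprising it is to see 12 genes with a given term in a list of 300, when 150 of 20000 background genes carry that term.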
If you're looking for more of a "free-form" description I would suggest WikiGenes. It's not nearly as complete, but it does have quite a bit of information for most genes.
Thanks Will. I will use GO for now. Just out of curiosity, any ideas how to get the gene summary from the RefSeq or Entrez Gene databases, either via SQL, E-utilities or directly from FTP?
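On the E-utilities route, one possible approach is the ESummary endpoint with db=gene, queried in batches of Entrez GeneIDs. The sketch below assumes the JSON response has a "result" object with a "uids" list and one record per UID carrying "name" and "summary" fields; please verify that layout against a live response before relying on it. The function names are my own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(gene_ids):
    """Build an ESummary URL for a batch of Entrez GeneIDs."""
    query = urlencode(
        {"db": "gene", "id": ",".join(map(str, gene_ids)), "retmode": "json"}
    )
    return f"{ESUMMARY}?{query}"


def extract_summaries(payload):
    """Return {uid: (symbol, summary)} from a decoded ESummary JSON payload.

    Assumption: records live under payload["result"][uid].
    """
    result = payload.get("result", {})
    return {
        uid: (result[uid].get("name", ""), result[uid].get("summary", ""))
        for uid in result.get("uids", [])
        if uid in result
    }


def fetch_summaries(gene_ids):
    """Fetch and parse summaries for one batch (needs network access)."""
    with urlopen(esummary_url(gene_ids)) as response:
        return extract_summaries(json.load(response))
```

For thousands of genes, it would be sensible to send the IDs in batches and pause between requests, since NCBI limits request rates for clients without an API key.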
RefSeq, or resources like GeneCards that depend on RefSeq, are not always the best or most up-to-date source for gene function. I would recommend a combination of resources such as AmiGO, NCBI-Gene (see the Related articles in PubMed, GeneRIFs, Phenotypes and Interactions sections), GeneWiki, BioGPS, iHOP etc. for a better understanding of the function.
Here is an example:
Take a look at the RefSeq annotation for the TRIM38 gene in NCBI / GeneCards.
See RefSeq Summary:
The protein encoded by this gene is a member of the tripartite motif (TRIM) family. The TRIM motif includes three zinc-binding domains, a RING, a B-box type 1 and a B-box type 2, and a coiled-coil region. The function of this protein has not been identified. [provided by RefSeq]
But there are several experimentally verified functions ascribed to this gene in GOA.
Here, too, you should be aware that GO annotation is rapidly evolving and may not explain the complete functional spectrum of a given gene. It is always good to check the Related articles in PubMed, GeneRIFs, Phenotypes and Interactions sections of the NCBI-Gene page for functional aspects not captured by GO.
Until there is a community-wide agreement or standard on reporting biological function in manuscripts, the best bet will be consulting various resources to get a cohesive view of functions.
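If consulting GeneRIFs gene by gene is impractical, they can also be collected in bulk. The sketch below assumes a local copy of NCBI's bulk GeneRIF file (generifs_basic) with five tab-separated columns in this order: taxonomy ID, GeneID, comma-separated PubMed IDs, timestamp, GeneRIF text. That column order is an assumption to check against the file's header line, and the function name is hypothetical.

```python
from collections import defaultdict


def parse_generifs(lines, tax_id="9606"):
    """Group GeneRIF statements by Entrez GeneID for one species."""
    rifs = defaultdict(list)
    for line in lines:
        if line.startswith("#"):  # skip the header line
            continue
        fields = line.rstrip("\n").split("\t")
        # Assumed columns: tax ID, GeneID, PMIDs, timestamp, GeneRIF text.
        if len(fields) < 5 or fields[0] != tax_id:
            continue
        gene_id, pmids, text = fields[1], fields[2], fields[4]
        rifs[gene_id].append((pmids.split(","), text))
    return dict(rifs)
```

The result maps each GeneID to a list of (PubMed IDs, statement) pairs, which can be joined to a variant table by GeneID.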
Many thanks for the links, Khader. This approach, unlike using GO, seems difficult in practice for annotating genes in whole-exome sequencing data. However, I can see how it can be very useful when there is a compelling candidate gene (or a few genes) to investigate in more detail.
Actually, UniProt is a combination of the high-quality Swiss-Prot and PIR data and the low-quality TrEMBL data. For purposes like this you will want to check the source of each entry.
You could try GoGene, which takes a gene name as input and then categorizes (e.g. by biological process) and summarizes (e.g. number of abstracts per category term) the abstracts in PubMed associated with that gene name.
I have no idea what MAGI2 does, but GoGene says it is likely to have guanylate kinase activity and PDZ domain binding, to be involved in phosphorylation, and to be found at synapses, the cell membrane, and intercellular junctions.
If you look at the bottom of a GoGene result page, there are a few links (to SIF, GML, GraphML and PubMed IDs). These link out to URLs that appear to call a RESTful-style API, e.g. for MAGI2 in SIF format: http://projects.biotec.tu-dresden.de/gogene/gogene/Search/SIF?q=magi2&type=SIFExportAll. Therefore, you should be able to access GoGene data programmatically via wget.
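As a rough illustration of scripting that export instead of using wget, the sketch below builds the same URL pattern for any gene symbol and parses a SIF response into edges. It assumes the service is still reachable, that the URL pattern generalizes beyond the one example above, and that the export follows the usual SIF layout (source, relation, one or more targets); none of that is confirmed here, and the function names are my own.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

GOGENE_SIF = "http://projects.biotec.tu-dresden.de/gogene/gogene/Search/SIF"


def gogene_sif_url(gene):
    """Build the SIF export URL, following the pattern of the MAGI2 example."""
    query = urlencode({"q": gene.lower(), "type": "SIFExportAll"})
    return f"{GOGENE_SIF}?{query}"


def parse_sif(text):
    """Parse SIF text into (source, relation, target) triples."""
    edges = []
    for line in text.splitlines():
        # SIF files are tab-delimited if any tab is present, else space-delimited.
        fields = line.split("\t") if "\t" in line else line.split()
        if len(fields) < 3:
            continue  # isolated nodes or blank lines carry no edge
        source, relation, *targets = fields
        edges.extend((source, relation, target) for target in targets)
    return edges


def fetch_gogene_edges(gene):
    """Download and parse the SIF export for one gene (needs network access)."""
    with urlopen(gogene_sif_url(gene)) as response:
        return parse_sif(response.read().decode("utf-8", errors="replace"))
```

Looping fetch_gogene_edges over a gene list, with a pause between requests, would be the batch equivalent of the manual search.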
Thanks Casey. However, I'm looking for resources that can be accessed programmatically to annotate thousands of genes rather than searching manually one by one. GoGene doesn't seem to have an API or downloadable files.
Don't forget the genetics perspective. The earlier responses are all valid and useful to predict function or transfer function from a known or tested gene to one that is highly similar in primary sequence. Genetics - via knockout (KO) or knockdown (with interfering RNA), or over-expression - can also reveal phenotypes and other functional characteristics.
Knowing that gene YFG encodes an enzyme that converts A + B to C + H2O is one important aspect of function. Being able to say that YFG also has a role in muscle cell development, as revealed by RNAi experiments, adds another dimension to function. Just as we can transfer GO annotation from a gene whose product was tested in a lab to another (highly) similar gene, we can do the same with phenotypes from genetics experiments. So, one can mine mouse KO data to gain insight into human gene function.