Finding Information About Hypothetical Genes
12.4 years ago
Rubal7 ▴ 850

Hello All,

Does anyone have any advice on how to gather information about hypothetical genes, in the sense of predicted genes in the genome that are of unknown function, e.g. LOC100363218? Is there a way to find the names of the known genes with the highest percentage of sequence identity? I am aware that this is no guarantee of similar function, but it would provide at least some speculative information.

Thanks in advance for your help!

gene function genome sequence • 4.4k views
12.4 years ago
Michael 55k

You can always run these again:

  • BLAST (especially blastx of the DNA sequence against NR)
  • InterProScan

The annotation as a hypothetical gene might indicate that this has been tried already and no convincing hits were found. However, the annotation may not have been updated recently, while the databases are updated more often, so a similar sequence may have been added to NR just recently (hopefully not yet another hypothetical protein).
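
Once a BLAST run has finished, picking out the most similar named hits can be scripted. The following is a minimal sketch, assuming the search was saved as tab-separated output with the BLAST+ option `-outfmt "6 std stitle"` (the twelve standard columns plus the subject title). The function names, the keyword list, the E-value cutoff and the example rows are all made up for illustration and are not part of any tool mentioned in this thread.

```python
# Column order written by BLAST+ when run with: -outfmt "6 std stitle"
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore", "stitle",
]

# Title keywords treated as uninformative (an assumption; extend as needed).
UNINFORMATIVE = ("hypothetical", "uncharacterized", "unnamed protein product")


def parse_hits(text):
    """Turn tab-separated BLAST output into a list of dicts with numeric fields."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, line.split("\t")))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


def informative_hits(hits, max_evalue=1e-5):
    """Keep significant hits with a descriptive title, best bit score first."""
    kept = [
        h for h in hits
        if h["evalue"] <= max_evalue
        and not any(word in h["stitle"].lower() for word in UNINFORMATIVE)
    ]
    return sorted(kept, key=lambda h: h["bitscore"], reverse=True)


# Made-up rows for illustration only; accessions and titles are placeholders.
EXAMPLE = "\n".join([
    "query1\tACC_A\t98.5\t200\t3\t0\t1\t600\t1\t200\t1e-120\t410\thypothetical protein",
    "query1\tACC_B\t71.0\t195\t55\t2\t4\t588\t3\t197\t2e-80\t250\texample kinase-like protein",
    "query1\tACC_C\t35.0\t60\t39\t1\t10\t189\t5\t64\t0.5\t30.1\texample transporter",
])

if __name__ == "__main__":
    for hit in informative_hits(parse_hits(EXAMPLE)):
        print(hit["sseqid"], hit["pident"], hit["evalue"], hit["stitle"])
```

Filtering on the title is crude, but it addresses the "yet another hypothetical protein" problem: the top hit is often an equally unannotated sequence, so the first named hit further down the list tends to be more useful, with the usual caveat that similarity does not guarantee shared function.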

CG-Pipeline has several modules for annotation including specific modules for BLASTing to Uniprot and InterProScan. The BLAST module can be customized for another protein database such as NR.

This wouldn't be useful for just one specific protein (you should use the web interfaces if you are only performing a few queries) or if you are not familiar with Linux (in that case I'd use CloVR or RAST if you have several queries). However, it is useful on a large scale, such as whole-genome annotation. CG-Pipeline as a whole is optimized for prokaryotes, but for just getting an idea of a gene's function, these modules should work well.

http://sourceforge.net/projects/cg-pipeline/

12.4 years ago
cdsouthan ★ 1.9k

The use of LOC numbers, "hypothetical" and "model" can be confusing. You can see the criteria for generating LOC numbers in the Entrez Gene guide, but most of them are not proteins, and this one is labeled as a pseudogene (http://www.ncbi.nlm.nih.gov/gene/100363218). Thus the number of human Entrez Gene entries is roughly twice the number of protein-coding loci. Some protein records are also labeled "hypothetical" even when the ORFs are strongly supported by many mRNA reads from large-scale cDNA projects; it may just mean they have never been curated by RefSeq or Swiss-Prot. As Michael says, BLAST and InterProScan are key steps to discern whether you have any protein similarity. Perhaps you could expand on exactly what you want to do and whether you want to be gene-centric or protein-centric.

Thanks for the explanation. I'm trying to find the potential functions of genes identified through whole-genome scans, such as GWAS, to understand whether the function of an identified gene would make biological sense for a candidate gene. Obviously, when LOC numbers are hypothetical or model, this is more challenging.

12.4 years ago
Ashwin ▴ 110

One more thing you can do is get the location information for all the hypothetical genes; Ensembl BioMart has an interface for retrieving all overlapping genes. Ensembl is known to have more annotated genes than RefSeq. The approach is trivial and may give you false positives, but it's worth trying.
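
The overlap check itself is simple enough to run locally on exported coordinates. This is only a sketch, assuming you already have the gene locations as a table (for example downloaded from BioMart); the `Gene` tuple, the function names and every coordinate below are placeholders invented for the example, not real annotation.

```python
from collections import namedtuple

# One genomic feature; coordinates are 1-based and inclusive.
Gene = namedtuple("Gene", "name chrom start end")


def overlaps(a, b):
    """True when two features share a chromosome and at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def overlapping_genes(query, annotated):
    """Names of all annotated genes that overlap the query locus."""
    return [g.name for g in annotated if overlaps(query, g)]


# Placeholder coordinates for illustration only.
query = Gene("LOC_example", "chr1", 1000, 5000)
annotated = [
    Gene("geneA", "chr1", 4500, 9000),  # shares bases 4500-5000 with the query
    Gene("geneB", "chr1", 6000, 7000),  # same chromosome, no overlap
    Gene("geneC", "chr2", 1000, 5000),  # same coordinates, other chromosome
]

print(overlapping_genes(query, annotated))  # prints ['geneA']
```

An overlap on its own says nothing about strand or shared exons, which is one source of the false positives mentioned above; both coordinate sets also need to come from the same genome assembly.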

12.4 years ago
cdsouthan ★ 1.9k

I would suggest that your functional/mechanistic follow-ups of GWAS results should be hypothesis-neutral. Most of the associations scored for marker SNPs and/or haplotype blocks will not be located within gene loci anyway, or may act remotely in cis/trans even if they are. Most GWAS results are reported gene-centrically because genes are just the genomic signposts we happen to know about. You will have to start bottom-up, with conserved patches being one of the key starting points.

I completely agree. We are currently looking for changes in conserved positions. However, once a locus is identified, we believe it is still worth understanding the function of hypothetical genes in these regions, in order to generate new hypotheses that can then be tested.
