Question

Gene Position Info From Encode Project

1

11.9 years ago

J.F.Jiang ▴ 930

Hi All,

I am working on SNP annotation, which needs a gene annotation file containing the position information of each gene region.

Previously, people typically used the UCSC hg18 refGene file to extract this information.

However, as mentioned in many papers, the ENCODE GENCODE V7 gene set is a better resource for gene annotation.

On the website http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGencodeV7/ several files are offered. Is wgEncodeGencodeBasicV7 the one I am looking for?

wgEncodeGencodeBasicV7.gp.gz project=wgEncode; grant=Hubbard; lab=Sanger; composite=wgEncodeGencodeV7; dataType=Gencode; dccAccession=wgEncodeEH001881; dateSubmitted=2011-05-01; subId=4347; labExpId=V7; labVersion=Basic Gene Annotation Set; tableName=wgEncodeGencodeBasicV7; type=gp; md5sum=ee1cdaa985ca47337dff5efb0cafb3ed; size=4.5M

BTW:

There are also other annotation versions on the website:

  wgEncodeGencodeV4/         05-Jul-2012 06:57    -   
  wgEncodeGencodeV7/         05-Jul-2012 06:57    -   
  wgEncodeGencodeV10/        22-Feb-2012 13:40    -   
  wgEncodeGencodeV11/        08-Mar-2012 09:44    -   
  wgEncodeGencodeV12/        20-Jun-2012 16:40    -

So what's the difference between these datasets? Almost all publications seem to consider only the V7 set.

Thanks anyway!

Best

Sorry, I forgot to ask this: in the data file, many genes have several transcripts, which may have different txStart positions. So, basically, how should the gene region be defined?

gene annotation encode position • 2.2k views

ADD COMMENT • link 11.9 years ago by J.F.Jiang ▴ 930

score 2 · Answer 1 · 2013-01-12

2

11.9 years ago

PoGibas 5.1k

The current release is GENCODE v14.
You should also check the release statistics to see how each version differs from the others.

Hope this helps.

UPDATE:
You can try gencode.v14.annotation.gtf.gz. It includes annotations for various gene types: protein coding, lncRNA, miRNA, pseudogenes, etc.
If, for example, you're interested in protein-coding genes, you can extract coordinates for the whole gene region or only for its exons, transcripts, or UTRs (usually the defined gene region is the same as or larger than its transcripts).
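To make the gene-region idea concrete, here is a minimal sketch in Python that collapses all transcript rows of a gene into one span (smallest start, largest end). It assumes the standard nine-column, tab-separated GTF layout with `gene_id` and `gene_type` attributes; the function names `gene_regions` and `read_gtf` are made up for illustration and are not part of any GENCODE tool.

```python
import gzip
import re

# Attribute patterns assumed from the GENCODE GTF convention: key "value";
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')
GENE_TYPE_RE = re.compile(r'gene_type "([^"]+)"')


def read_gtf(path):
    """Yield lines from a plain or gzip-compressed GTF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        yield from handle


def gene_regions(lines, gene_type=None):
    """Return {gene_id: (chrom, start, end, strand)} built from transcript rows.

    The span of a gene is taken as the minimum start and maximum end over
    all of its transcripts. If gene_type is given, other types are skipped.
    """
    regions = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript rows contribute to the span
        chrom, strand, attrs = fields[0], fields[6], fields[8]
        start, end = int(fields[3]), int(fields[4])
        gene_match = GENE_ID_RE.search(attrs)
        if gene_match is None:
            continue
        if gene_type is not None:
            type_match = GENE_TYPE_RE.search(attrs)
            if type_match is None or type_match.group(1) != gene_type:
                continue
        gene_id = gene_match.group(1)
        if gene_id in regions:
            _, old_start, old_end, _ = regions[gene_id]
            start, end = min(old_start, start), max(old_end, end)
        regions[gene_id] = (chrom, start, end, strand)
    return regions
```

Usage would be something like `gene_regions(read_gtf("gencode.v14.annotation.gtf.gz"), gene_type="protein_coding")`. Note that GTF coordinates are 1-based and inclusive, whereas txStart in the UCSC genePred tables is 0-based, so the two are off by one at the start position.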

ADD COMMENT • link 11.9 years ago by PoGibas 5.1k

0

Yes, thanks, that is really helpful.

However, what criteria should be used to choose the version of the dataset? Could you give me some insights?

ADD REPLY • link 11.9 years ago by J.F.Jiang ▴ 930

0

The annotations change frequently. Pick one stable release and freeze it, then do all of your analysis with that version.
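One simple way to freeze a release, sketched here in Python, is to record the checksum of the annotation file you downloaded and check it again before each analysis run. The download listing quoted in the question includes an md5sum for each file, so that value could serve as the expected checksum. The helper names `file_md5` and `check_frozen` are hypothetical.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_frozen(path, expected_md5):
    """Return True if the file on disk still matches the recorded checksum."""
    return file_md5(path) == expected_md5.strip().lower()
```

For example, `check_frozen("wgEncodeGencodeBasicV7.gp.gz", "ee1cdaa985ca47337dff5efb0cafb3ed")` would confirm that a local copy matches the file listed in the question.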