Hi All,
I am working on SNP annotation, which needs a gene annotation file containing the position information of each gene region.
Previously, people usually used the UCSC hg18 refGene file to extract this information.
However, as mentioned in many papers, the GENCODE V7 gene set from ENCODE is a better resource for gene annotation.
On this page:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGencodeV7/ several files are offered. Is wgEncodeGencodeBasicV7 the one I am looking for?
wgEncodeGencodeBasicV7.gp.gz project=wgEncode; grant=Hubbard; lab=Sanger; composite=wgEncodeGencodeV7; dataType=Gencode; dccAccession=wgEncodeEH001881; dateSubmitted=2011-05-01; subId=4347; labExpId=V7; labVersion=Basic Gene Annotation Set; tableName=wgEncodeGencodeBasicV7; type=gp; md5sum=ee1cdaa985ca47337dff5efb0cafb3ed; size=4.5M
BTW:
There are also other annotation versions on the site:
wgEncodeGencodeV4/ 05-Jul-2012 06:57 -
wgEncodeGencodeV7/ 05-Jul-2012 06:57 -
wgEncodeGencodeV10/ 22-Feb-2012 13:40 -
wgEncodeGencodeV11/ 08-Mar-2012 09:44 -
wgEncodeGencodeV12/ 20-Jun-2012 16:40 -
So what is the difference between these datasets, given that almost all publications only consider the V7 set?
Thanks anyway!
Best
<h6>#########Added on 1/12</h6>Sorry, I forgot to ask this: in the data file, many genes have several transcripts, which may have different txStart positions. So, basically, how should the gene region be defined?
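One common convention, offered here as an assumption rather than something settled in this thread, is to take the gene region as the span from the smallest txStart to the largest txEnd over all transcripts of the same gene. Below is a minimal Python sketch of that idea. It assumes the .gp file is tab-separated in the genePred-extended layout (name, chrom, strand, txStart, txEnd, ..., with the gene name in the 12th column); please check this against the actual file before relying on it. The function name `gene_regions`, the default column indices, and the toy rows are all hypothetical.

```python
import csv


def gene_regions(lines, chrom_col=1, strand_col=2, start_col=3, end_col=4, gene_col=11):
    """Collapse transcripts into one region per (gene, chrom, strand).

    Column indices are 0-based and ASSUME a genePred-extended layout;
    adjust them if the real file differs.
    """
    regions = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        key = (row[gene_col], row[chrom_col], row[strand_col])
        start, end = int(row[start_col]), int(row[end_col])
        if key in regions:
            old_start, old_end = regions[key]
            # widen the region so it covers every transcript seen so far
            regions[key] = (min(old_start, start), max(old_end, end))
        else:
            regions[key] = (start, end)
    return regions


# Toy rows (made up for illustration, not taken from the real file).
toy_rows = [
    "tx1\tchr1\t+\t100\t500\t150\t450\t1\t100,\t500,\t0\tGENEA\tcmpl\tcmpl\t0,",
    "tx2\tchr1\t+\t80\t400\t150\t350\t1\t80,\t400,\t0\tGENEA\tcmpl\tcmpl\t0,",
    "tx3\tchr2\t-\t1000\t2000\t1100\t1900\t1\t1000,\t2000,\t0\tGENEB\tcmpl\tcmpl\t0,",
]
toy_regions = gene_regions(toy_rows)

# For the real file, something like this should work:
#   import gzip
#   with gzip.open("wgEncodeGencodeBasicV7.gp.gz", "rt") as fh:
#       regions = gene_regions(fh)
```

Keying on chromosome and strand as well as gene name avoids merging same-named genes that sit at different loci. Whether the widest span is the right definition depends on the analysis; some people use only the longest transcript instead.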
Yes, thanks, that is really helpful
However, what criteria should be used to choose the version of the dataset? Could you give me some insights?
The annotations change frequently. You want to pick a stable release and freeze it. Then you can do all of your analysis with that version.
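One way to make the "freeze" concrete is to record the checksum of the file you downloaded and re-check it before each analysis. The download listing quoted in the question includes an md5sum for wgEncodeGencodeBasicV7.gp.gz. The sketch below is only an illustration: the helper name `md5_hex` is hypothetical, and it assumes the listed md5sum refers to the compressed file as downloaded.

```python
import hashlib
import io


def md5_hex(stream, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary file-like object, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Self-check on an in-memory stream: the MD5 of empty input is a well-known constant.
empty_digest = md5_hex(io.BytesIO(b""))

# For the frozen annotation file, something like this should work:
#   expected = "ee1cdaa985ca47337dff5efb0cafb3ed"  # md5sum from the listing above
#   with open("wgEncodeGencodeBasicV7.gp.gz", "rb") as fh:
#       assert md5_hex(fh) == expected, "annotation file differs from the frozen copy"
```

Keeping the version name and checksum in your analysis notes makes it easy to confirm later that every result came from the same annotation release.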
I updated my previous answer.