Human Genome Annotations
13.6 years ago

I am trying to compare gene annotations for the latest release of the human genome with annotations from the previous release. I have rarely worked with human data before, so I wasn't sure where to start. I found this thread, which provides a link for downloading a custom-generated GFF3 file for the hg19 release (this seems to be the latest "official" release).

Getting data for the hg18 release hasn't been so easy. I checked out UCSC's download site, but found it very difficult to navigate. So then I tried Ensembl's FTP site and found the data ordered by date and organism (not labels like "hg18" or "hg19"). UCSC's site lists dates of the human genome releases, so I guess I could just download the annotations for the closest following Ensembl release...but then again, the dates on UCSC's site aren't exact and I'm not sure how quickly these data are integrated into the Ensembl data bank.

Does anyone have any tips for obtaining gene annotations for different releases of the human genome? Is there some simple documentation I'm missing, or is everything really as complicated as it seems?

human gff annotation gene genome • 7.7k views

What source/format are your current annotations in for comparison?


@pi All I currently have is the GFF3 file of the hg19 release.

13.6 years ago
Bert Overduin ★ 3.7k

A few pointers:

  • hg18 = NCBI36, hg19 = GRCh37
  • You can find which Ensembl release is based on which genome assembly by clicking the 'View in Archive site' link at the bottom of this page
  • Note that a new Ensembl genebuild is done for human regularly (so, not only when there is a new assembly!) and that even between genebuilds the gene set is updated / patched. Therefore, almost every release has a different gene set.
  • Note also that the way Ensembl annotates genes is different from UCSC and that the Ensembl automatic annotation is merged with manual annotation from the Havana group at the Sanger Institute. You can find a basic outline of the annotation process here. There are separate annotation strategies for immunoglobulin and T-cell receptor genes and for non-coding RNA genes.

So, I am afraid that things are probably more complicated than you had hoped for...

Hope this helps.
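
Because UCSC and Ensembl label the same assemblies differently, it can help to normalise the names before matching files up. A minimal sketch in Python, using only the two name pairs given above; the dictionary and the function name `same_assembly` are illustrative, not part of any UCSC or Ensembl tool:

```python
# UCSC name -> NCBI/GRC name. Only the two pairs stated in this answer
# are listed; extend the table yourself for other assemblies.
ASSEMBLY_ALIASES = {
    "hg18": "NCBI36",
    "hg19": "GRCh37",
}


def canonical_assembly(name):
    """Return the NCBI/GRC name for a UCSC label, or the name unchanged."""
    return ASSEMBLY_ALIASES.get(name, name)


def same_assembly(a, b):
    """True if two labels refer to the same assembly under the table above."""
    return canonical_assembly(a) == canonical_assembly(b)
```

For example, `same_assembly("hg19", "GRCh37")` is true, while `same_assembly("hg18", "GRCh37")` is false. Note that this only tells you the assembly matches; as pointed out above, the gene set can still differ between Ensembl releases built on the same assembly.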

13.6 years ago
brentp 24k

In the UCSC Table Browser you can download the various human releases in BED format, with the exons either included in the extended BED columns or each as a separate row. I think this would be the easiest place to start. GFF makes things more complicated.
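
To illustrate the "extended BED columns" option: in the 12-column BED layout, exons are stored as comma-separated block sizes (column 11) and block starts relative to chromStart (column 12). A rough sketch in Python of expanding one such line into one row per exon; the function name `bed12_to_exons` and the example record are made up for illustration:

```python
def bed12_to_exons(line):
    """Expand one BED12 line into (chrom, start, end, name, strand) exon tuples.

    Coordinates stay in BED convention (0-based start, end exclusive).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-separated BED columns")
    chrom, name, strand = fields[0], fields[3], fields[5]
    chrom_start = int(fields[1])
    # UCSC output often ends these lists with a trailing comma, so skip empties.
    sizes = [int(x) for x in fields[10].split(",") if x]
    offsets = [int(x) for x in fields[11].split(",") if x]
    return [
        (chrom, chrom_start + off, chrom_start + off + size, name, strand)
        for size, off in zip(sizes, offsets)
    ]


# Made-up record with two exons, not real annotation data.
example = "chr1\t1000\t5000\tgeneA\t0\t+\t1200\t4800\t0\t2\t500,1000,\t0,3000,"
exons = bed12_to_exons(example)
```

Here `exons` holds two tuples, one spanning 1000-1500 and one spanning 4000-5000 on chr1.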


@brentp GFF makes it more complicated...as in it's a more complicated format (than BED) or it's more complicated to obtain GFF3 for human (than BED)?


@Daniel, both. Though you can get a nice GTF file from Ensembl for GRCh37. If you download the whole-gene BED from UCSC, it will likely have everything you need.
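
Once you have a whole-gene BED file for each release, a crude first pass at the comparison could be a set comparison on the name column (column 4). The sketch below assumes tab-separated BED lines and that names are comparable between the two files; the function names are illustrative, and any file names you pass in are your own. Coordinates are deliberately ignored, since positions are not directly comparable across different assemblies.

```python
def gene_names(bed_lines):
    """Collect the name column (column 4) from an iterable of BED lines."""
    names = set()
    for line in bed_lines:
        # Skip blank lines and the header lines UCSC downloads may include.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:
            names.add(fields[3])
    return names


def compare_gene_sets(old_lines, new_lines):
    """Return names shared by both releases and names unique to each."""
    old, new = gene_names(old_lines), gene_names(new_lines)
    return {
        "shared": old & new,
        "only_old": old - new,
        "only_new": new - old,
    }


# Made-up records standing in for two releases.
old_release = ["chr1\t100\t200\tgeneA", "chr1\t300\t400\tgeneB"]
new_release = ["#header", "chr1\t150\t250\tgeneA", "chr2\t10\t90\tgeneC"]
result = compare_gene_sets(old_release, new_release)
```

With real data you would pass open file handles instead of the lists, e.g. `compare_gene_sets(open(old_path), open(new_path))`.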
