Which source of annotation files to use, Ensembl or UCSC?
8.3 years ago
epigene ▴ 590

I'm getting really confused by the different annotation files from UCSC and Ensembl and their gene/exon IDs. Is there a good tutorial or paper that explains best practice for using them? Specifically, I'm interested in analyzing RNA-seq data from zebrafish and human. Which source would be better to use?

Thanks.

RNA-Seq ucsc ensembl • 6.3k views

Use either, but stick with the sequence and annotation from whichever one you select until the end of the analysis.
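One practical reason for not mixing sources is visible in the GTF excerpts quoted in this thread: UCSC names the chromosome "chr1" while Ensembl names it "1". The following is a minimal Python sketch, not an established tool, for checking that every sequence name in an annotation also occurs in the genome FASTA. The function names and the toy input lines are made up for illustration.

```python
def fasta_names(lines):
    """Collect sequence names (first word after '>') from FASTA header lines."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.add(parts[0])
    return names


def gtf_seqnames(lines):
    """Collect the first column of every non-comment, non-empty GTF line."""
    names = set()
    for line in lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t", 1)[0])
    return names


def missing_from_genome(gtf_lines, fasta_lines):
    """Sequence names used in the annotation but absent from the genome."""
    return gtf_seqnames(gtf_lines) - fasta_names(fasta_lines)


# Toy input: an Ensembl-style genome header paired with a UCSC-style GTF line.
genome = [">1 dna:chromosome", "ACGT"]
annotation = ["chr1\thg19_knownGene\texon\t11874\t12227\t0.000000\t+\t.\t"
              'gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";']

mismatched = missing_from_genome(annotation, genome)
print(sorted(mismatched))
```

On real files you would pass open file handles instead of lists. A non-empty result suggests the genome and the annotation came from different sources.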


Hi, there is an issue with the UCSC human gene annotation if you want to use their GTF file (downloadable from the Table Browser) as input to an RNA-seq aligner. The UCSC GTF has the gene_id set to the same value as the transcript_id. This might be ignorable, but not if you decide to do transcript isoform-level quantification (using Cufflinks, StringTie, etc.) in addition to gene-level quantification. These transcript assemblers expect hierarchical information (which I think is the idea of a GTF): multiple transcript isoforms (if present), each with a unique transcript_id, assigned to a shared gene_id. This is absent in the UCSC (Table Browser) GTF, at least for human. I can't comment on the zebrafish annotation. Here are example lines from the UCSC GTF and the Ensembl GTF (both GRCh37/hg19):

$ head UCSC_hg19
chr1    hg19_knownGene  exon    11874   12227   0.000000    +   .   gene_id "uc001aaa.3"; transcript_id "uc001aaa.3"; 
chr1    hg19_knownGene  exon    12613   12721   0.000000    +   .   gene_id "uc001aaa.3"; transcript_id "uc001aaa.3"; 
chr1    hg19_knownGene  exon    13221   14409   0.000000    +   .   gene_id "uc001aaa.3"; transcript_id "uc001aaa.3"; 
chr1    hg19_knownGene  exon    11874   12227   0.000000    +   .   gene_id "uc010nxr.1"; transcript_id "uc010nxr.1"; 
chr1    hg19_knownGene  exon    12646   12697   0.000000    +   .   gene_id "uc010nxr.1"; transcript_id "uc010nxr.1"; 
chr1    hg19_knownGene  exon    13221   14409   0.000000    +   .   gene_id "uc010nxr.1"; transcript_id "uc010nxr.1"; 
chr1    hg19_knownGene  start_codon 12190   12192   0.000000    +   .   gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; 
chr1    hg19_knownGene  CDS 12190   12227   0.000000    +   0   gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; 
chr1    hg19_knownGene  exon    11874   12227   0.000000    +   .   gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; 
chr1    hg19_knownGene  CDS 12595   12721   0.000000    +   1   gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; 


$ head Ensembl.GRCh37.gtf 
1   processed_transcript    exon    11869   12227   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002234944";
1   processed_transcript    exon    12613   12721   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00003582793";
1   processed_transcript    exon    13221   14409   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "3"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002312635";
1   unprocessed_pseudogene  exon    11872   12227   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000515242"; exon_number "1"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-201"; exon_id "ENSE00002234632";
1   unprocessed_pseudogene  exon    12613   12721   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000515242"; exon_number "2"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-201"; exon_id "ENSE00003608237";
1   unprocessed_pseudogene  exon    13225   14412   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000515242"; exon_number "3"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-201"; exon_id "ENSE00002306041";
1   unprocessed_pseudogene  exon    11874   12227   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000518655"; exon_number "1"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-202"; exon_id "ENSE00002269724";
1   unprocessed_pseudogene  exon    12595   12721   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000518655"; exon_number "2"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-202"; exon_id "ENSE00002270865";
1   unprocessed_pseudogene  exon    13403   13655   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000518655"; exon_number "3"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-202"; exon_id "ENSE00002216795";
1   unprocessed_pseudogene  exon    13661   14409   .   +   .   gene_id "ENSG00000223972"; transcript_id "ENST00000518655"; exon_number "4"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-202"; exon_id "ENSE00002303382";

I still have not understood why UCSC decided to create a GTF like that. I can confirm that the same style is maintained in the latest GRCh38 GTF as well.
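To check whether a given GTF has this flat structure before handing it to an isoform-level tool, something like the Python sketch below could be used. It is only an illustration under simple assumptions (tab-separated columns, attributes written as key "value";). The function names are made up, and the sample rows are abridged from the excerpts above.

```python
import re
from collections import defaultdict

# Matches attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def transcripts_per_gene(lines):
    """Map each gene_id to the set of transcript_ids listed under it."""
    genes = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTR_RE.findall(cols[8]))
        if "gene_id" in attrs and "transcript_id" in attrs:
            genes[attrs["gene_id"]].add(attrs["transcript_id"])
    return genes


def is_flat(genes):
    """True when every gene_id is just a copy of its single transcript_id."""
    return bool(genes) and all(tx == {gene} for gene, tx in genes.items())


def row(seqname, source, start, end, attrs):
    """Build one tab-separated exon record for the toy examples."""
    return "\t".join([seqname, source, "exon", start, end, ".", "+", ".", attrs])


ucsc = [
    row("chr1", "hg19_knownGene", "11874", "12227",
        'gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";'),
    row("chr1", "hg19_knownGene", "11874", "12227",
        'gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";'),
]
ensembl = [
    row("1", "processed_transcript", "11869", "12227",
        'gene_id "ENSG00000223972"; transcript_id "ENST00000456328";'),
    row("1", "unprocessed_pseudogene", "11872", "12227",
        'gene_id "ENSG00000223972"; transcript_id "ENST00000515242";'),
]

ucsc_flat = is_flat(transcripts_per_gene(ucsc))
ensembl_flat = is_flat(transcripts_per_gene(ensembl))
print(ucsc_flat, ensembl_flat)
```

A flat result means the file carries no gene-to-isoform grouping, so gene-level summaries from it would really be per-transcript summaries.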


Thanks for the pointer. It's just so confusing that they use slightly different formats. There should be a standard format, at least internationally.


there should be a standard format at least internationally.

Good luck with that - I think this commentary says it all...


there should be a standard format at least internationally

Yes, there is one: GFF3 => https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

Personally, I wrote a universal converter that converts all these GTF/GFF flavours to the well-defined GFF3 format, which can then be used with all the GFF3 tools.

7.5 years ago
sandybioteck ▴ 20

Ensembl annotates more genes than RefGene and UCSC (https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1308-8)


It does (and I preferentially use Ensembl as well), but it should be pointed out that many of its transcripts in particular have very weak evidence and can, in some cases, interfere with the proper assignment of specific base pairs to the canonical gene. That said, I still prefer it because it is more comprehensive.
