I'm getting really confused by the different annotation files from UCSC and Ensembl and their gene/exon IDs. Is there a good tutorial or paper explaining best practice for using them? Specifically, I'm analyzing RNA-seq data from zebrafish and human; which source would be better to use?
Thanks.
Use either, but stick with the sequence and annotation from the one you select until the end.
Related thread: which annotation should I use for RNA-seq? Ensembl, UCSC or RefSeq?
Hi, there is an issue with the UCSC human gene annotation if you want to pass their GTF file (downloadable from the Table Browser) to an RNA-seq aligner. The UCSC GTF has the gene_id set to the same value as the transcript_id. This may be ignorable, but not if you decide to do transcript isoform-level quantification (using Cufflinks, StringTie, etc.) in addition to gene-level quantification. These transcript assemblers expect hierarchical information (which I think is the idea of a GTF): multiple transcript isoforms, if present, each with a unique transcript_id, assigned to a shared gene_id. This is absent in the UCSC (Table Browser) GTF, at least for human. I can't comment on the zebrafish annotation. Compare, e.g., lines from the UCSC GTF and the Ensembl GTF (both GRCh37/hg19 versions). I still have not understood why UCSC decides to create a GTF like that. I can confirm that the same style is maintained for the latest GRCh38 GTF as well.
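One way to check whether a given GTF has this problem is to collect the transcript_id values seen under each gene_id. The following is only a minimal Python sketch, not something from the tools mentioned above. The function names are hypothetical, the sample records use made-up IDs (they are not real UCSC or Ensembl lines), and it assumes the usual GTF attribute syntax of key "value"; pairs in column 9.

```python
import re
from collections import defaultdict

# Matches GTF attributes such as: gene_id "X"; transcript_id "Y";
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_attributes(column9):
    """Return the quoted key/value pairs of a GTF attribute column as a dict."""
    return dict(ATTR_RE.findall(column9))


def transcripts_per_gene(lines):
    """Map each gene_id to the set of transcript_ids that appear with it."""
    genes = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(fields[8])
        if "gene_id" in attrs and "transcript_id" in attrs:
            genes[attrs["gene_id"]].add(attrs["transcript_id"])
    return genes


def gene_id_mirrors_transcript_id(lines):
    """True if every gene_id carries exactly one transcript_id equal to itself."""
    genes = transcripts_per_gene(lines)
    return bool(genes) and all(tx == {gene} for gene, tx in genes.items())


# Hypothetical records with made-up IDs, for illustration only.
flat = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "tx1"; transcript_id "tx1";',
    'chr1\tsrc\texon\t100\t300\t.\t+\t.\tgene_id "tx2"; transcript_id "tx2";',
]
hierarchical = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "tx1";',
    'chr1\tsrc\texon\t100\t300\t.\t+\t.\tgene_id "geneA"; transcript_id "tx2";',
]

print(gene_id_mirrors_transcript_id(flat))          # True
print(gene_id_mirrors_transcript_id(hierarchical))  # False
```

For a real file, you would pass an open file handle instead of the lists, since the functions only iterate over lines.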
Thanks for the pointer. It's just so confusing that they use slightly different formats. There should be a standard format, at least internationally.
Good luck with that - I think this commentary says it all...
Yes, there is one: GFF3 => https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
Personally, I wrote a universal converter to convert all these GTF/GFF flavours to the well-defined GFF3 format, which can be used with all the GFF3 tools.
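To give a feel for the most visible syntactic difference between the two formats, here is a minimal Python sketch that rewrites a GTF attribute column (key "value"; pairs) into GFF3 style (key=value pairs joined by semicolons). This is not the converter mentioned above, and the function name is hypothetical. It only rewrites the attribute syntax; a real conversion would also have to build the ID/Parent hierarchy that GFF3 expects, which this sketch does not attempt.

```python
import re
from urllib.parse import quote

# A GTF attribute is a key followed by a quoted value or a bare token,
# e.g. gene_id "geneA"; or exon_number 1;
ATTR_RE = re.compile(r'(\w+)\s+(?:"([^"]*)"|([^";\s]+))\s*;?')


def gtf_attributes_to_gff3(column9):
    """Rewrite a GTF attribute column using GFF3 key=value syntax."""
    pairs = []
    for key, quoted, bare in ATTR_RE.findall(column9):
        value = quoted if quoted else bare
        # Percent-encode characters that are reserved in GFF3 column 9,
        # such as ';', '=', '&' and ','.
        pairs.append(f"{key}={quote(value, safe=':/')}")
    return ";".join(pairs)


# Made-up attribute string, for illustration only.
print(gtf_attributes_to_gff3('gene_id "geneA"; transcript_id "tx1"; exon_number 1;'))
# gene_id=geneA;transcript_id=tx1;exon_number=1
```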