I often find myself checking whether a mapped region overlaps known regions of the genome. To do this, I use a gene set that merges transcripts from UCSC, Ensembl, RefSeq, Gencode, and Vega.
Usually this works just fine, but now I am looking for less typical transcript types such as siRNAs, lincRNAs, and small RNAs in general. I'm not sure the annotations above are comprehensive enough for that.
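For concreteness, the kind of overlap check meant here can be sketched in a few lines of Python. This is only an illustration: the `Region` type, the 0-based half-open coordinate convention, and the toy gene names are assumptions made for the example, not part of any of the annotation sets mentioned.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """A genomic interval; coordinates are assumed 0-based, half-open."""
    chrom: str
    start: int  # inclusive
    end: int    # exclusive
    name: str = ""


def overlaps(a: Region, b: Region) -> bool:
    # Two half-open intervals overlap when each starts before the other ends.
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def find_overlaps(query: Region, annotations: List[Region]) -> List[Region]:
    # Linear scan; fine for small sets, too slow for whole-genome annotation.
    return [r for r in annotations if overlaps(query, r)]


# Toy annotation set (hypothetical names and coordinates).
genes = [
    Region("chr1", 100, 200, "geneA"),
    Region("chr1", 500, 900, "geneB"),
    Region("chr2", 100, 200, "geneC"),
]

hits = find_overlaps(Region("chr1", 150, 550, "read1"), genes)
print([h.name for h in hits])
```

For a real merged gene set, an interval tree or a dedicated tool would be a better fit than a linear scan.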
My questions to you are:
Can we (as a community) create a list of resources/websites where we can gather these genes?
For RefSeq, I would use NCBI's website (NCBI is the creator of RefSeq) and download it from their FTP site instead of from UCSC. The thing is that UCSC re-aligns RefSeqs and these models differ from the original ones.
The original RefSeq alignments come from manual curation of automatic Gnomon models, which are produced by a very powerful Genome Annotation Pipeline (aka Gpipe) and its Splign aligner. Gpipe is used for eukaryotic and now also prokaryotic annotation: http://www.ncbi.nlm.nih.gov/genome/annotation_euk/ and http://www.ncbi.nlm.nih.gov/genome/annotation_prok/. It takes different sorts of data into consideration, including curation, so both the Gnomon models and the manually curated RefSeq models are of good quality.
UCSC takes RefSeq sequences and re-aligns them to the genomes using BLAT, which is not as powerful as Gpipe/Gnomon/Splign. The cause of most problems is that an exon containing an indel gets converted into two exons with a micro-intron in the middle.
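One way to spot the micro-intron artifact described above is to look for implausibly short gaps between consecutive exons of a transcript model. The sketch below is a minimal illustration, not how UCSC or NCBI flag these cases; the function name and the 10 bp threshold are assumptions chosen for the example.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), assumed 0-based, half-open


def find_micro_introns(exons: List[Exon], max_len: int = 10) -> List[Exon]:
    """Return gaps between consecutive exons that are at most max_len bases.

    max_len is a hypothetical cutoff; real introns are far longer, so a gap
    this short more likely reflects an indel split into two exons.
    """
    ordered = sorted(exons)
    gaps = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gap = next_start - prev_end
        if 0 < gap <= max_len:
            gaps.append((prev_end, next_start))
    return gaps


# Toy transcript: a 3-base gap between the first two exons, then a real intron.
blat_like = [(1000, 1200), (1203, 1400), (5000, 5300)]
print(find_micro_introns(blat_like))  # [(1200, 1203)]
```

Running this over the exon lists of the same transcript from both sources would show where one model has a micro-intron and the other has a single exon.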
Re: "UCSC re-aligns RefSeqs and these models differ from the original ones."... I investigated this pretty thoroughly last year. The two essential results are: 1) ~886 transcripts have significant genome coordinate discrepancies between RefSeq (Splign) and UCSC (BLAT); and 2) when there is a discrepancy, Splign's alignments are usually more parsimonious than BLAT's, judged purely on sequence identity (roughly a 30:1 bias). Details are in this Slideshare deck: http://goo.gl/05sxpm. Data current as of Feb 2014.
(written 9.6 years ago by Reece; updated 2.4 years ago by Ram)
Yes, there is a difference due to obvious reasons. I agree with your conclusions, Reece. Also, thank you for sharing your Slideshare link and HGVS code, very interesting indeed!
Ensembl also contains information on small RNAs in addition to transcripts. For instance, this BioMart example query retrieves locations for several small RNA types:
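As a rough sketch of what such a query could look like, the snippet below builds a BioMart XML query for a few small RNA biotypes. It assumes the standard BioMart XML query layout; the dataset name `hsapiens_gene_ensembl`, the `biotype` filter, and the attribute names are taken from memory of the Ensembl mart and should be checked against the current BioMart interface. The helper `build_biomart_query` is hypothetical, and nothing is sent over the network here.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(biotypes, dataset="hsapiens_gene_ensembl"):
    """Build a BioMart XML query string (names below are assumptions)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    # Restrict results to the requested gene biotypes (comma-separated).
    ET.SubElement(ds, "Filter",
                  {"name": "biotype", "value": ",".join(biotypes)})
    # Columns to return: gene ID, location, and biotype.
    for attr in ("ensembl_gene_id", "chromosome_name",
                 "start_position", "end_position", "gene_biotype"):
        ET.SubElement(ds, "Attribute", {"name": attr})
    return ET.tostring(query, encoding="unicode")


xml_query = build_biomart_query(["miRNA", "snoRNA", "snRNA"])
print(xml_query)
```

The resulting string would typically be submitted as the `query` parameter to a BioMart service endpoint, which returns tab-separated rows.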
Thanks for the response. Ensembl definitely makes it easy to retrieve a list of a specific type of transcript.
However, it's my understanding that Ensembl and UCSC are incomplete. I'm not entirely sure how UCSC and Ensembl construct their annotations, but I believe that, for instance, the genes from pseudogene.org are not all annotated in Ensembl. Does anyone have an idea how many of them are, and why?
One thing that many people might not realize is the relation between Ensembl, Vega/Havana and Gencode.
Through the Gencode project, Ensembl now incorporates the manual gene annotation provided by Vega/Havana into the automatic annotation. For most cases the data is the same between Ensembl (fetched via API, database or BioMart) and Gencode (fetched from the FTP site or from UCSC). Current differences are that Gencode excludes the haplotype annotation and adds pseudogene models from the Yale and UCSC ENCODE groups. The UCSC "2way Pseudogenes" track provides those additional models where these two sets agree.
RefSeq models are incorporated in the Ensembl and Havana gene build processes. The different small RNA gene types are included in the Ensembl set.
Access to the gene set is also described here, but I find it most convenient to use the Ensembl Perl API.