Question

Are micro-first exons in GENCODE26 real ?

1

Entering edit mode

8.0 years ago

Charles Plessy ★ 2.9k

Following up on my previous question, I have inspected the length distribution of the first exons of spliced transcripts in GENCODE26.

I still find it surprising that some are tiny, often starting at the translation start site, while the transcript is not annotated as "processed". A list of all first spliced exons shorter than 4 nt can be found at the bottom of the following page https://cdn.rawgit.com/charles-plessy/gencode-feature-lengths/d55347a9/gencode-feature-lengths.html and the source code generating it is also available on GitHub.

Is there a better explanation than just imperfections in processing pipelines or annotations ?

The reason I ask is that I would like to know if there is a lower limit of first exons at the time of RNA splicing. (And the reason behind the reason is that I have a sequence library from which I can infer the length of first exons, and I would like to test if the results are reliable).

Length of first spliced exons

gencode micro-exon • 1.9k views

ADD COMMENT • link 8.0 years ago by Charles Plessy ★ 2.9k

0

Entering edit mode

Thanks Mark for the detailed answer. And thanks as well to Sanger's help desks who also always gave me prompt and detailed answers. I hope that this discussion on Biostars can be useful to others, which is why I opened it in parallel.

Back to 5′ ends, is there a way to infer from GENCODE's GFF files whether a given transcript model may be truncated ?

ADD REPLY • link 8.0 years ago by Charles Plessy ★ 2.9k

1

Entering edit mode

It is only possible to directly identify 5' truncated transcripts from the GFF, if the CDS is also truncated, which are identified using the 'cds_start_NF' tag (NF; not found). You may also want to consider using the GENCODE 'basic' gene set, as this prioritises 'full-length' coding transcripts or the longest noncoding transcript isoforms.

ADD REPLY • link 8.0 years ago by Mark Thomas ▴ 80

score 4 · Accepted Answer · 2017-04-19

This is an interesting analysis of the GENCODE v26 geneset, that highlights potential idiosyncrasies in the annotation process. There are however some important points to consider when interpreting these results.

While the current geneset contains first 'exons' with less than 20nt, these do not necessarily represent the full-length of the exon. Indeed, these first exons may not actually represent the transcriptional start of the associated transcript. This is because manually annotated GENCODE transcripts are extended to the limit of the supporting evidence, which typically consists of ESTs and cDNAs. If this evidence doesn't extend to the actual TSS, then the annotated transcript will essentially be truncated at the 5' end. This appears to be the case for most of the 'micro first exons' identified here, as they coincide with longer exons in other transcript isoforms for the same gene.

Although the annotation of such short exons may be accurate, their actual value within the geneset may be limited. It may be best therefore to exclude such transcripts from any analysis of RNA splicing. On behalf of GENCODE, we welcome any comments or questions, which can be sent to either HAVANA or GENCODE.