I found a few zero-length exons in GENCODE. Does anybody know if that is a bug, or if it has a meaning?
zcat gencode.v25.annotation.gtf.gz | awk '$3 == "exon" && ($5 - $4) == 0' | cut -f1,2,3,4,5,7,9 | cut -c1-65

chr2 ENSEMBL exon 96695297 96695297 + gene_id "ENSG00000249715.9"
chr2 ENSEMBL exon 166473892 166473892 - gene_id "ENSG00000136546.
chr4 ENSEMBL exon 1730388 1730388 + gene_id "ENSG00000013810.18";
chr4 ENSEMBL exon 169663114 169663114 + gene_id "ENSG00000109572.
chr5 ENSEMBL exon 796064 796064 - gene_id "ENSG00000188818.12"; t
chr5 HAVANA exon 88804598 88804598 - gene_id "ENSG00000081189.14"
chr11 ENSEMBL exon 71580167 71580167 - gene_id "ENSG00000204571.5
chr11 ENSEMBL exon 76191778 76191778 - gene_id "ENSG00000085741.1
chr11 ENSEMBL exon 101050949 101050949 - gene_id "ENSG00000082175
chr14 ENSEMBL exon 24632719 24632719 - gene_id "ENSG00000100453.1
chr16 ENSEMBL exon 89553267 89553267 + gene_id "ENSG00000197912.1
chr17 ENSEMBL exon 41624191 41624191 - gene_id "ENSG00000128422.1
chr17 ENSEMBL exon 43883386 43883386 - gene_id "ENSG00000108852.1
chr18 ENSEMBL exon 9887458 9887458 + gene_id "ENSG00000168454.11"
chr19 ENSEMBL exon 49836839 49836839 + gene_id "ENSG00000104973.1
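A note on the arithmetic in that filter: GTF coordinates are 1-based and inclusive at both ends, so a record with start equal to end spans one base rather than zero. A minimal sketch of the same filter with the length made explicit is below; the function name `exons_of_length` is hypothetical, and the filter assumes a tab-separated, 9-column GTF on standard input.

```shell
# Hypothetical helper (not part of any toolkit): print exon records of a given length.
# GTF start ($4) and end ($5) are 1-based and inclusive, so length = end - start + 1.
# Header lines starting with "#" have no third tab-separated field and are skipped.
exons_of_length() {
    # $1 = wanted exon length in bp; a GTF is read from standard input
    awk -F'\t' -v len="$1" '$3 == "exon" && ($5 - $4 + 1) == len'
}
```

With this helper, `zcat gencode.v25.annotation.gtf.gz | exons_of_length 1` should select the same records as the one-liner above, since `($5 - $4) == 0` is the same condition as a length of 1.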
Thanks Denise and Devon. I am now reading about microexons, which I admit I have overlooked so far. Perhaps I will contact the helpdesk about cases where the first exon is tiny, because at the moment I do not see how splicing can function in that case. See this example (the only one) where the first exon has a length of 1:
For the record, here is the whole transcript:
Edit on January 20th, 2017: corrected a small bug in the one-liner, by adding ";" after "exon_number 1". Results unchanged.
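To illustrate why the trailing ";" matters: without it, the pattern "exon_number 1" would also match "exon_number 10", "exon_number 11", and so on. The sketch below is not the original one-liner; it is an assumed reconstruction of that kind of filter, with a hypothetical function name, and it assumes the unquoted `exon_number 1;` attribute style mentioned in the edit note.

```shell
# Hypothetical helper: print 1-bp exons that are the first exon of their transcript.
# Assumes a tab-separated GTF on standard input whose attribute column ($9)
# contains an unquoted "exon_number N;" entry.
first_exons_of_length_1() {
    # $5 == $4 means a 1-bp exon (GTF coordinates are 1-based, inclusive).
    # The ";" in the pattern keeps "exon_number 10;" and similar from matching.
    awk -F'\t' '$3 == "exon" && $5 == $4 && $9 ~ /exon_number 1;/'
}
```

It would be used as `zcat gencode.v25.annotation.gtf.gz | first_exons_of_length_1`.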
ENST00000637754 is a HAVANA transcript based on RNA-seq data only. For this case, it may be quicker to contact HAVANA directly. If there is a mistake with that transcript (and its tiny first exon), they will correct it, and the revised annotation will be available in Ensembl once Ensembl's annotation is merged with HAVANA's. My guess is that the RNA-seq reads they mapped to the genome did not allow them to extend the 5' end of that model.
At least some of these make more sense if you look at them in the context of the other annotated isoforms. The Mef2c isoform is a processed transcript, whereas in the protein-coding isoforms that microexon is much larger. I bet that in most of these cases what you're seeing are truncated transcripts annotated as "processed transcript" (I still have no real clue what that means).
"Processed transcript" is the name HAVANA gives to a transcript that is not coding. Check their help on the VEGA site. A processed transcript can be a lncRNA, a ncRNA or anything else (unclassified). If later rounds of annotation bring further transcriptional evidence to expand that model, it may become possible to find an ORF and re-classify the processed transcript as something else, such as coding.
Makes sense, thanks!