That's strange, as some exons (protein-coding or lncRNA) are only 1 bp or 2 bp long. Is it a bioinformatics artifact or (probably not) biology? Has anyone ever noticed something like this with a different annotation?
The coordinates in a GTF are inclusive, so the length should be computed as $5 - $4 + 1. The exons that appear to have length 0 are therefore actually 1 bp long. It is still pretty weird to get an exon length of 1, though.
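As a minimal sketch of the inclusive-coordinate calculation above, here is the same `$5 - $4 + 1` arithmetic in Python. The function name `exon_length` and the sample line are made up for illustration; the only assumption about the file is the standard tab-separated GTF layout, with start in column 4 and end in column 5.

```python
def exon_length(gtf_line: str) -> int:
    """Return the feature length in bp for one tab-separated GTF line.

    GTF coordinates are 1-based and inclusive, so the length is
    end - start + 1 (the awk expression $5 - $4 + 1).
    """
    fields = gtf_line.rstrip("\n").split("\t")
    start, end = int(fields[3]), int(fields[4])  # columns 4 and 5
    return end - start + 1


# Hypothetical exon line whose start equals its end.
example = 'chr1\tHAVANA\texon\t11869\t11869\t.\t+\t.\tgene_id "example";'
print(exon_length(example))  # prints 1, not 0
```

A feature whose start and end are the same position covers one base, which is why the naive `end - start` gives 0 for these exons.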
I contacted the Gencode staff about this issue in February. They replied that they hoped the problem would be fixed by Gencode v16.
Apparently there was a bug in one of their scripts.
"... there should be no exons in Gencode <3bp. Alignments of <3bp can not be trusted, even when spanning known splice junctions, or confirming known UTRs/retained introns".
The current Gencode annotation (v18) still has this problem (I don't know why they haven't fixed it yet).
I would suggest filtering those exons out.
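One way to do that filtering, sketched in Python under a few assumptions: the 3 bp threshold is taken from the Gencode statement quoted above, the input is a standard tab-separated GTF, and the names `is_short_exon` and `filter_short_exons` are hypothetical. Note that this only drops the `exon` lines themselves; any CDS or UTR lines belonging to the same exon are left in place, so you may want to handle those separately.

```python
MIN_EXON_LENGTH = 3  # assumption: threshold from the quoted Gencode statement


def is_short_exon(gtf_line, min_length=MIN_EXON_LENGTH):
    """Return True if the line is an exon feature shorter than min_length bp."""
    if gtf_line.startswith("#"):
        return False  # header/comment lines are never exons
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 5 or fields[2] != "exon":
        return False
    # GTF coordinates are inclusive, hence the + 1.
    return int(fields[4]) - int(fields[3]) + 1 < min_length


def filter_short_exons(lines, min_length=MIN_EXON_LENGTH):
    """Yield every GTF line except exon lines shorter than min_length bp."""
    for line in lines:
        if not is_short_exon(line, min_length):
            yield line


# Hypothetical input: a header, a gene, a 2 bp exon and a 3 bp exon.
sample = [
    "##description: made-up example\n",
    'chr1\tHAVANA\tgene\t100\t500\t.\t+\t.\tgene_id "g1";\n',
    'chr1\tHAVANA\texon\t100\t101\t.\t+\t.\tgene_id "g1";\n',
    'chr1\tHAVANA\texon\t200\t202\t.\t+\t.\tgene_id "g1";\n',
]
kept = list(filter_short_exons(sample))
print(len(kept))  # 3: only the 2 bp exon line is dropped
```

To apply it to a real file, you could pass an open file handle as `lines` and write the yielded lines to a new GTF.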
OTTHUMT00000321563 has a 2 bp first (coding) exon because it is 5'
incomplete and those two bases align to a reference exon, though
arguably they could also align to the exon before that and to other,
more upstream exons. I have now deleted that exon.
OTTHUMT00000470867 no longer has a 1 bp exon in our internal database;
it is 227 bp long now. That should appear in a future Ensembl
update.
I will go through the short-exon list from Gencode v18 and fix where necessary.
Some non-coding RNAs located in protein-coding genes are marked with a length of 0 or 1.
Thanks, fixed it.