I have a question about transcript length: what is a reasonable range, in base pairs, for transcript length? When I use the getlength() function in the goseq Bioconductor package, which uses the UCSC Genome Browser for each combination of genome and ID, I find that the lengths range from ~300 bp to ~80,000 bp. Is a transcript that long reasonable?
Sorry, I don't know much about the underlying biology, but from a statistical perspective I would be inclined to treat it as an outlier. Is that right? Thank you very much!
It depends on what you call an outlier. The human genome codes for titin, a protein with > 30,000 amino acids, so at three nucleotides per codon its mRNA should be > 90,000 bp. So 80,000 bp is, if anything, a little short for the upper end of the range, but the number of protein-coding transcripts that long is small.
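As a quick check of the codon arithmetic above, here is a minimal Python sketch. The function name `min_coding_length_nt` and its `include_stop` parameter are made up for illustration and are not part of goseq or any other package. It only gives a lower bound for the coding sequence; a real mRNA is longer because of its untranslated regions.

```python
CODON_LENGTH_NT = 3  # each amino acid is encoded by one three-nucleotide codon


def min_coding_length_nt(n_amino_acids: int, include_stop: bool = True) -> int:
    """Lower bound on coding-sequence length (in nucleotides) for a protein.

    Hypothetical helper for illustration: it counts codons only and
    ignores the 5' and 3' untranslated regions of the mRNA.
    """
    if n_amino_acids < 0:
        raise ValueError("n_amino_acids must be non-negative")
    # Optionally add one codon for the stop signal at the end of the ORF.
    codons = n_amino_acids + (1 if include_stop else 0)
    return codons * CODON_LENGTH_NT


if __name__ == "__main__":
    # A protein of 30,000 amino acids, as in the titin example.
    print(min_coding_length_nt(30_000, include_stop=False))
    print(min_coding_length_nt(30_000))
```

Without the stop codon this gives the 90,000 figure quoted in the answer, so any transcript encoding a protein of that size has to be at least that long.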
You should not be overly concerned with outliers when using RefSeq, Ensembl or HAVANA standards to define gene and transcript coordinates and hence their lengths. Titin is a great example (+1), and isoform NM_133378 is 101520 bp long. Note the RefSeq accession. Keep in mind that non-protein-coding genes may have a very different length distribution - microRNAs are quite short and lincRNAs can be long. Transcribed pseudogenes would generally be shorter than the functional version of that gene.