Many publications state that the typical way to classify a transcript as mRNA vs. lncRNA is based on whether it has an open reading frame (ORF), and that a typical ORF must be at least 300 nucleotides long.
If that is the case (or at least the typical assumption), why do many lncRNA sequences have ORFs much longer than 300 nucleotides? How do we actually know they are non-coding then?
I don't think the answer is mass-spec, because it fails to capture most proteins.
Out-of-frame stop codon density depends on GC content: the stop codons (TAA, TAG, TGA) are AT-rich, so high-GC sequence will have few of them. Furthermore, there are 6 frames, so you have multiple chances. It would be unlikely for an AT-rich organism to have long noncoding ORFs, but that's not necessarily true for GC-rich organisms.
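To put rough numbers on the GC-content argument, here is a minimal sketch. It assumes a deliberately simple model in which every base is drawn independently, with G and C equally likely and A and T equally likely; real transcripts are not like this, so treat the output as an order-of-magnitude guide only. The function names are made up for this illustration.

```python
def stop_codon_probability(gc):
    """Chance that a random codon is TAA, TAG or TGA.

    Assumes independent bases with P(G) = P(C) = gc / 2 and
    P(A) = P(T) = (1 - gc) / 2 (a simplifying assumption).
    """
    at = (1.0 - gc) / 2.0  # probability of A, and of T
    g = gc / 2.0           # probability of G
    # TAA = T*A*A, TAG = T*A*G, TGA = T*G*A
    return at ** 3 + 2.0 * at * at * g


def prob_no_stop(gc, codons):
    """Chance that a run of `codons` consecutive codons contains no stop."""
    return (1.0 - stop_codon_probability(gc)) ** codons


if __name__ == "__main__":
    # 100 codons corresponds to the 300-nucleotide cutoff from the question.
    for gc in (0.3, 0.5, 0.7):
        print(f"GC={gc:.1f}  P(stop codon)={stop_codon_probability(gc):.4f}  "
              f"P(100 codons stop-free)={prob_no_stop(gc, 100):.4f}")
```

This is the chance for a single window in a single frame. A transcript a few kb long offers many windows across six frames, so the chance of at least one long stop-free stretch somewhere is considerably higher.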
Most genes are identified through protein alignments to databases of other (mostly predicted) proteins. Presumably, lncRNAs would not be very well conserved in amino-space (particularly with regard to frameshift mutations), so a lack of protein alignments might be a good clue that you have either a totally novel protein or a noncoding sequence. Specifically, I'd expect noncoding RNAs to be better conserved in nucleotide-space than in amino-space, whereas with most genes you'll get higher identity when aligned in amino-space.
I think my question is asking something more basic than your answer, but I do appreciate your input because it helps too. To better phrase my question: why do some lncRNAs have open reading frames > 300 base pairs? I thought that criterion was enough to consider a transcript an mRNA.
Open reading frames just randomly occur in non-translated genetic sequence. The length can be anything, and a non-coding RNA with high GC content will tend to have longer ORFs. There's nothing magical about 300.
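You can see this for yourself with a quick simulation. The sketch below generates random "transcripts" (independent bases at a chosen GC fraction, which is an assumption and not a model of real lncRNAs), scans all six frames for the longest ATG-to-stop ORF, and reports how often that ORF reaches 300 nt. All function names and parameters are hypothetical and chosen just for this example.

```python
import random

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(seq):
    """Length in nt of the longest ATG...stop ORF over all six frames.

    The stop codon is counted in the length; an ATG with no in-frame
    stop before the end of the sequence is ignored.
    """
    best = 0
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    best = max(best, i + 3 - start)
                    start = None
    return best


def random_sequence(length, gc, rng):
    """Random DNA with independent bases at the given GC fraction."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


def fraction_with_long_orf(gc, trials=50, length=2000, min_len=300, seed=0):
    """Fraction of random sequences whose longest ORF is at least min_len nt."""
    rng = random.Random(seed)
    hits = sum(
        longest_orf(random_sequence(length, gc, rng)) >= min_len
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    for gc in (0.3, 0.5, 0.7):
        print(f"GC={gc:.1f}  fraction with ORF >= 300 nt: "
              f"{fraction_with_long_orf(gc):.2f}")
```

Try different values of `length` and `gc` to see how the fraction changes. None of these sequences encode anything, which is why an ORF length cutoff on its own can't tell you a transcript is coding.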