Hello, I have been asked to identify and quantify the ncRNA transcripts in some datasets from my lab. FYI, the data is poly(A)-enriched RNA-seq (we aren't doing any analysis, just attempting to document what exists in the dataset). I found some very helpful hints here at "C: Non-coding RNA detection", suggesting that I use awk to search the third column for "ncRNA". I am using Homo_sapiens.GRCh38.86.gtf. Further inspection of my GTF file shows that "lincRNA" is the only string returned if I grep for "ncRNA", and that it appears in the biotype attribute rather than in the feature column. Would I be remiss in assuming that these are the same thing, and that searching for this string will be representative of ALL lncRNA? Any constructive criticism on this approach would be appreciated. Thanks!
Further investigation into my GTF file has revealed that lincRNAs are not the only ncRNA subtype included in the gene_biotype column. Using rtracklayer in R, I imported my GTF file, converted it to a data frame, and filtered by type == "gene". I then extracted the gene_biotype column as a vector, grepped for 'RNA', and ran unique. The returned values are below.
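For anyone who wants to reproduce the same check without R, here is a minimal sketch in plain Python rather than rtracklayer. It assumes the usual GTF layout (nine tab-separated columns, feature type in column 3, attributes such as gene_biotype in column 9). The sample records and their IDs are made up for illustration, and the helper name `count_rna_biotypes` is hypothetical.

```python
import re
from collections import Counter

# Made-up GTF records for illustration only; the IDs are not real Ensembl IDs.
SAMPLE_GTF = """\
#!genome-build example
1\thavana\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "A"; gene_biotype "lincRNA";
1\thavana\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_biotype "lincRNA";
1\tensembl\tgene\t1500\t1600\t.\t-\t.\tgene_id "G2"; gene_name "B"; gene_biotype "miRNA";
2\thavana\tgene\t300\t5000\t.\t+\t.\tgene_id "G3"; gene_name "C"; gene_biotype "protein_coding";
2\thavana\tgene\t7000\t9000\t.\t-\t.\tgene_id "G4"; gene_name "D"; gene_biotype "lincRNA";
"""

# Pulls the quoted value of the gene_biotype attribute out of column 9.
BIOTYPE_RE = re.compile(r'gene_biotype "([^"]+)"')


def count_rna_biotypes(lines):
    """Count gene-level records per biotype whose name contains 'RNA'."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep only rows whose feature type (column 3) is "gene"
        match = BIOTYPE_RE.search(fields[8])
        if match and "RNA" in match.group(1):
            counts[match.group(1)] += 1
    return counts


counts = count_rna_biotypes(SAMPLE_GTF.splitlines())
for biotype, n in sorted(counts.items()):
    print(biotype, n)
```

To run it on the real annotation, pass an open file handle instead of the sample, e.g. `with open("Homo_sapiens.GRCh38.86.gtf") as fh: counts = count_rna_biotypes(fh)`. Keep in mind that a substring match on 'RNA' will miss any non-coding biotype whose name does not contain that substring, so it is worth listing all unique gene_biotype values once before settling on a filter.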
If you truly need all non-coding RNA, then use all of these: