The RNA-seq gene count TPM files that I have contain 60,000 genes. Below is a breakdown of their types.
In a case-control analysis, should I use either: (a) only the protein coding genes or (b) all of the gene types?
gene_type | count
------------------------------------+----------------
protein_coding 19962
lncRNA 16901
processed_pseudogene 10167
unprocessed_pseudogene 2614
misc_RNA 2212
snRNA 1901
miRNA 1881
TEC 1057
snoRNA 943
transcribed_unprocessed_pseudogene 939
transcribed_processed_pseudogene 500
rRNA_pseudogene 497
IG_V_pseudogene 187
IG_V_gene 145
transcribed_unitary_pseudogene 138
TR_V_gene 106
unitary_pseudogene 98
TR_J_gene 79
scaRNA 49
polymorphic_pseudogene 48
rRNA 47
IG_D_gene 37
TR_V_pseudogene 33
Mt_tRNA 22
IG_J_gene 18
pseudogene 18
IG_C_gene 14
IG_C_pseudogene 9
ribozyme 8
TR_C_gene 6
sRNA 5
TR_J_pseudogene 4
TR_D_gene 4
IG_J_pseudogene 3
Mt_rRNA 2
translated_processed_pseudogene 2
vault_RNA 1
translated_unprocessed_pseudogene 1
scRNA 1
IG_pseudogene 1
I didn't want to bias initial responses, but I saw this paper that said:
"For most RNA-Seq sequencing projects, only mRNAs are presumably enriched and sequenced, and there is no point in mapping sequence reads to RNAs such as miRNAs or lincRNAs."
https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1308-8#:~:text=%5B6%5D%20h.,correspond%20to%20protein%20coding%20genes.
Most RNAseq library prep protocols won't capture RNA shorter than 200nt. This would immediately mean that it shouldn't capture miRNA, snRNA or snoRNA (although you do sometimes get reads mapping to snRNA or snoRNA). Or some subsets of transcripts in each of the the rest of the categories.
Many RNA-seq experiments also enrich for poly-adenylated transcripts, so you should see few reads from transcripts that are no poly-adenylated. Most protein-coding genes are poly-adenylated (although there are a subset of histone genes that use a terminal triplex formation rather than a poly-A tail). miRNA, snoRNA, snRNA and rRNAs are definately not poly-adenylated. But for the rest of the categories - some are, some arn't. The link you posted is making the common, but incorrect assertion that lncRNAs are not poly-adenylated, which is incorrect. Some lncRNAs are poly-adenylated, others aren't.
In general, the only reason not to include genes in an analysis is the extra multiple testing burden in imposes: you are doing 60,000 tests rather than 20,000. In practice, I'm not sure how much difference that has every made to the overall conclusions of analyses i've done.
I am in the middle of writing a tutorial on this situation, where I observed upon analysis the number of differentially expressed genes changes by about 30%. It goes from 150 to about 200 after removing low expression genes.
I suspect that the effect of filtering is hard to predict when it comes to FDR since producing an FDR relies on the distribution of the p-values. Those p-values could vary more substantially for lowly expressed genes because the data is less reliable.
DESeq2 automatically removes low expression genes, no?
And it does that by picking the the expression threshold that maximises the number of genes passing the FDR threshold.
thank you! wow, so RNAseq is still relatively unexplored.
I suppose all "no poly-adenylated" transcripts would at least comparable to each other in that they are all unenriched. i normalize/scale all my features so that should help reduce the bias of poly-adenylation.
Its not that we don't know which transcripts are poly-adenylated and which arn't - its just that that information is not encoded in the gene/transcript biotype.