Hi All,
I am looking for help understanding the most mathematically sound way to reduce the number of transcripts in a post-normalization RNA-seq count matrix, so that I can fit various regression/tree-based models for phenotype prediction.
Background:
- 60 "treatments": A treatment in this case is a particular full-sib family
- 200 biological replicates in total: each treatment (i.e. each full-sib family) has roughly 3-4 biological replicates
- Minimal pre-filtering was done on raw count data to remove transcripts with all 0's
- Raw count data was normalized with a linear mixed model to account for lane, index, and familial relationships
-- counts were log2 transformed with an offset of 1, i.e. log2(count + 1), prior to normalization
-- the output of the normalization process is log2 counts
The normalized count matrix is now 200 x 70,000 (samples x transcripts), and I would like to filter out transcripts in a way that removes as little biological variation as possible. The objective is to get a smaller subset of around 10-20K transcripts that I could use as the input to caret for prediction modeling.
Side questions:
Question 1) Can I filter on these log transformed counts?
Question 2) If I wanted to estimate the dispersion of my normalized counts, would this make sense to do using the log-transformed or exponentiated counts? Does it even make sense to filter on dispersion? ("cries for help")
Question 3) Generally speaking, what are common practices for filtering RNA-Seq data for the purposes of prediction (not necessarily for DGE)?
Generally speaking, I'd filter out all genes with a total count of 0 and those with an average CPM/TPM/RPKM/FPKM below 1. Then normalize the filtered counts for library size and between-subject variance. limma-voom works well for this application and can incorporate quality weights for the sequencing libraries. For the regression itself, be careful with how many predictors you include: the more you add, the more likely you are to overfit. Unfortunately, as they say, RNA-seq data sets are feature-rich, so you'll have to do some sort of feature reduction to determine which genes are most predictive of the phenotype. For this, I'd suggest looking into lasso, ridge, and elastic net regression.
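To make the first filtering step concrete, here is a minimal sketch in plain Python of the "drop all-zero genes and genes with average CPM below 1" rule. It assumes a samples x genes matrix of raw (not log-transformed) counts stored as a list of lists. The function names `cpm` and `keep_expressed`, the `min_mean_cpm` parameter, and the toy data are made up for illustration and are not taken from any package.

```python
def cpm(counts):
    """Convert a samples x genes matrix of raw counts to counts per million."""
    result = []
    for row in counts:
        lib_size = sum(row)  # total reads in this sample (library size)
        if lib_size == 0:
            raise ValueError("sample has zero total counts")
        result.append([c * 1e6 / lib_size for c in row])
    return result


def keep_expressed(counts, min_mean_cpm=1.0):
    """Return the column indices of genes that pass both filters:
    a nonzero total count and a mean CPM of at least min_mean_cpm."""
    cpm_matrix = cpm(counts)
    n_samples = len(counts)
    n_genes = len(counts[0])
    keep = []
    for g in range(n_genes):
        total = sum(row[g] for row in counts)
        mean_cpm = sum(row[g] for row in cpm_matrix) / n_samples
        if total > 0 and mean_cpm >= min_mean_cpm:
            keep.append(g)
    return keep


# Toy data (hypothetical): 3 samples x 4 genes, each library sums to 1,000,000 reads.
toy_counts = [
    [500000, 0, 1, 499999],
    [400000, 0, 0, 600000],
    [450000, 0, 1, 549999],
]

kept = keep_expressed(toy_counts)
print(kept)  # gene 1 is all zeros; gene 2 has a mean CPM of about 0.67
```

On a real 200 x 70,000 matrix you would want a vectorised version of the same logic, and in R the edgeR function `filterByExpr` implements a more careful variant of this idea. Note that this rule applies to raw counts, so it would have to be run before the log2 transformation and mixed-model normalization described in the question.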