----- Pre-analytical workflow -----
P0. Format the raw objects into a Dds object, i.e., a correctly formatted count matrix bound together with the metadata, referred to as the colData(). Your design can be ~1 at this stage (a code sketch of P0-P5 follows this list).
P1. Normalize
P2. After normalization, mark bad variants for exclusion
P3. After filtration, variance stabilization (by convention, the Dds object is now called vst or rld)
P4. Using the vst or rld object, generate distance matrices
P5. PCA on the vst object
P6. Based on visual inspection of P4 and P5, mark bad samples, which should be visible as samples that are excessively distant, even from others of the same type.
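To make the pre-analytical pass concrete, here is a minimal R sketch of P0-P5. The names raw_counts (a genes-by-samples integer matrix) and sample_info (a data frame of sample metadata) are placeholders, and the low-count cutoff and the "condition" column used to color the PCA are illustrative choices, not part of the recipe above.

    library(DESeq2)

    ## P0: bind counts and metadata into a DESeqDataSet; ~1 is a placeholder
    ## design, which is fine for a QC-only pass
    Dds <- DESeqDataSetFromMatrix(countData = raw_counts,
                                  colData   = sample_info,
                                  design    = ~ 1)

    ## P1: normalization (size factors)
    Dds <- estimateSizeFactors(Dds)

    ## P2: mark bad variants for exclusion; a simple low-count filter shown here
    keep        <- rowSums(counts(Dds)) >= 10
    badVariants <- rownames(Dds)[!keep]

    ## P3: variance stabilization of the filtered object
    ## (object named vst per the convention in P3)
    vst <- vst(Dds[keep, ], blind = TRUE)   # or rld <- rlog(Dds[keep, ])

    ## P4: sample-to-sample distance matrix from the transformed values
    sampleDists <- dist(t(assay(vst)))

    ## P5: PCA, colored by a colData() column of interest ("condition" is assumed)
    plotPCA(vst, intgroup = "condition")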
----- Analytical workflow -----
A0. As before, load the RAW metadata and count data into Dds (a sketch of A0-A6 follows this list)
A1. Exclude the bad variants identified in P2 from the count matrix of Dds, and name the result TrimDds.
A2. Remove the outlying samples identified in P5 from TrimDds, and name the result SlimTrimDds.
A3. Correctly format your metadata (i.e., colData(SlimTrimDds)). This involves removing NA values from any columns you intend to use for modeling, and correctly assigning variables to factors, ordered factors, or numerics.
A4. Specify a biologically meaningful and mathematically cogent design formula, which we will call Model1. Doing this well presupposes extensive knowledge of both the biology you are trying to model AND the purpose(s) of steps P2-P6. Few biologists have the former, few bioinformaticians master the latter, and finding both in one person is harder still.
A5. Assign that model to SlimTrimDds with design(SlimTrimDds) <- Model1
A6. Run Mod1Dds <- DESeq(SlimTrimDds), then run Mod1Res <- results(Mod1Dds)
A7. - AN. Any and all downstream, post-analytical workflows.
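A matching sketch of A0-A6 is below. It assumes that badVariants and badSamples vectors were recorded during the pre-analytical pass, and "treatment" and "time" are purely example metadata columns; the real design formula must of course come from the thinking in A4.

    library(DESeq2)

    ## A0: reload the RAW counts and metadata
    Dds <- DESeqDataSetFromMatrix(countData = raw_counts,
                                  colData   = sample_info,
                                  design    = ~ 1)

    ## A1: drop the bad variants flagged in P2
    TrimDds <- Dds[!rownames(Dds) %in% badVariants, ]

    ## A2: drop the outlying samples flagged during P4-P6
    SlimTrimDds <- TrimDds[, !colnames(TrimDds) %in% badSamples]

    ## A3: drop samples with NA in modeled columns and coerce variable types
    ok          <- !is.na(SlimTrimDds$treatment) & !is.na(SlimTrimDds$time)
    SlimTrimDds <- SlimTrimDds[, ok]
    SlimTrimDds$treatment <- factor(SlimTrimDds$treatment)
    SlimTrimDds$time      <- as.numeric(SlimTrimDds$time)

    ## A4-A5: specify the design formula and assign it to the object
    Model1 <- ~ time + treatment
    design(SlimTrimDds) <- Model1

    ## A6: fit the model and extract results
    Mod1Dds <- DESeq(SlimTrimDds)
    Mod1Res <- results(Mod1Dds)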
I completely agree this is a valid approach. Also, neither of us mentioned that you can generate a results object with results(Dds) and then access the matrix of Cook's distances early on as well, i.e.:
P0 --> pre-filter raw counts --> P1 --> P2
is as valid as
P0 --> make res object --> parse Cook's distances before or after pre-filtering
Most often these two orders produce approximately the same results, provided the filters themselves are equally stringent in both cases. ATpoint, I would love to hear if your experience differs. I'll also readily admit that your approach is more theoretically sound in terms of processor wall-time. But even for datasets of up to N=1000, either order is quick (a short sketch of the Cook's-distance route follows).
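For completeness, a minimal sketch of that alternative route. Note that DESeq() has to have been run before results() can be called and before the Cook's matrix is populated; the boxplot at the end is just one way to eyeball problem samples.

    library(DESeq2)

    ## Fit the model first so that Cook's distances are computed and stored
    Dds <- DESeq(Dds)
    res <- results(Dds)

    ## The per-gene, per-sample Cook's distances live in the "cooks" assay
    cooks <- assays(Dds)[["cooks"]]

    ## Per-sample distributions on a log scale; unusually high samples stand out
    boxplot(log10(cooks), range = 0, las = 2)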