I have RNA-seq data of four different tissues and I need to make differential expression analysis of each against the rest of them. I am using R package pachterlab/sleuth to make the differential expression tests (Wald test). The sample to covariates table is something like this:
sample tissue
S10_Lu_A Lung
S11_Lu_B Lung
S13_SP_A Spleen
S14_SP_B Spleen
S15_SP_C Spleen
S4_BM_A BM
S5_BM_B BM
S6_BM_C BM
S7_In_A Intestine
S8_In_B Intestine
S9_In_C Intestine
My first approach was assigning a dummy factor to produce the model matrix I wanted, so if for example I wanted to test BM against the rest I got something like this:
sample tissue contrast
S10_Lu_A Lung A
S11_Lu_B Lung A
S13_SP_A Spleen A
S14_SP_B Spleen A
S15_SP_C Spleen A
S7_In_A Intestine A
S8_In_B Intestine A
S9_In_C Intestine A
S4_BM_A BM B
S5_BM_B BM B
S6_BM_C BM B
And then I would simply use model.matrix(~ contrast) and perform the test on contrastB. However, I have been told this is incorrect as I need to take into account the tissue each sample comes from instead of obtaining a global mean of A.
My problem is that if I use ~ tissue to keep the individual tissues I lose the BM group in the intercept, and of course I can't add the contrast dummy as a covariate because it wouldn't be independent of the tissue. Maybe this is stupid, but I can't figure out how to model the contrast: BM vs. (lung + intestine + spleen) now.
I thought of using model.matrix( ~ tissue - 1) so I conserve all the individual groups, but then I don't get statistical significance of the contrasts.
I'm aware the simplest solution would be to rename BM to something else, as the reason it is being chosen as "default" is because it is the first alphabetical group, but I don't know if that is correct either or if I am missing some important statistical point. Can anyone shed some light on this please?
I've not used sleuth before, but for limma the following design would work: