I am trying to do differential expression analysis on some RNA microarray data. I was setting up my model matrix for limma from a csv file which has info on the samples, specifically how they were to be grouped (cre/flox status). Some example data is below:
geo_name,cre_lox,cell_type,treatment1,replicate_num
sample1,flox,c1,no,1
sample2,flox,c1,no,2
sample3,flox,c1,no,3
sample4,cre,c1,no,1
sample5,cre,c1,no,2
sample6,cre,c1,no,3
sample7,wt,c2,no,1
sample8,wt,c2,yes,1
The subset of the data I want (where cell_type=c1) has only "cre" and "flox" in the "cre_lox" column.
I selected for it using:
q1a_selected_col_data = col_data[(col_data$cell_type == 'c1'),]
However, when I call model.matrix(~q1a_selected_col_data$cre_lox)
it returns a matrix like this:
  (Intercept) q1a_selected_col_data$cre_loxflox q1a_selected_col_data$cre_loxwt
1           1                                 1                               0
2           1                                 1                               0
3           1                                 1                               0
4           1                                 0                               0
5           1                                 0                               0
6           1                                 0                               0
How did it "know" to add a column for "wt" status even though the data I passed to it does not have "wt" in it? Is there a way I can prevent things like this without having to modify the csv or remove columns from the model matrix by hand?
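A minimal sketch of what is probably happening, assuming cre_lox was read in as a factor (the read.csv default before R 4.0; stringsAsFactors = TRUE below is that assumption made explicit). Subsetting rows keeps all of the factor's original levels, and model.matrix builds one column per level, used or not. Calling droplevels on the subset removes the empty level:

```r
# Rebuild the example sample sheet. stringsAsFactors = TRUE is an assumption
# about how col_data was loaded (it was the read.csv default before R 4.0).
col_data <- read.csv(text = "geo_name,cre_lox,cell_type,treatment1,replicate_num
sample1,flox,c1,no,1
sample2,flox,c1,no,2
sample3,flox,c1,no,3
sample4,cre,c1,no,1
sample5,cre,c1,no,2
sample6,cre,c1,no,3
sample7,wt,c2,no,1
sample8,wt,c2,yes,1", stringsAsFactors = TRUE)

q1a_selected_col_data <- col_data[col_data$cell_type == "c1", ]

# Row subsetting keeps every level the factor originally had, including "wt".
levels_before <- levels(q1a_selected_col_data$cre_lox)

# droplevels() discards levels that no longer occur in any row.
q1a_selected_col_data <- droplevels(q1a_selected_col_data)
levels_after <- levels(q1a_selected_col_data$cre_lox)

# The design matrix now has no column for "wt".
design <- model.matrix(~ cre_lox, data = q1a_selected_col_data)
print(colnames(design))
```

Passing the data frame through the data argument also keeps the column names short (cre_loxflox rather than q1a_selected_col_data$cre_loxflox).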
Thanks. I found that the design matrix is the inverse of what I want. Basically the 1 and 0 in the q1a_selected_col_data$cre_loxflox column should be switched since cre is the experimental group.
I tried model.matrix(~q1a_selected_col_data$cre_lox-1) after using droplevels to invert it, and of the matrix I get now, I want the last column. Is there a way I can select the label that model.matrix should mark as "1"?
relevel?
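To illustrate the relevel suggestion, here is a small sketch. The cre_lox vector is a hypothetical stand-in for the column after unused levels have been dropped. The reference level is the one that gets no indicator column, so making "flox" the reference should leave a column that marks the cre samples with 1:

```r
# Hypothetical stand-in for the cre_lox column after droplevels().
cre_lox <- factor(c("flox", "flox", "flox", "cre", "cre", "cre"))

# By default the alphabetically first level ("cre") is the reference,
# so the single indicator column marks the flox samples.
default_cols <- colnames(model.matrix(~ cre_lox))

# relevel() moves "flox" to the front, making it the reference level.
cre_lox <- relevel(cre_lox, ref = "flox")
design <- model.matrix(~ cre_lox)

# The indicator column is now 1 for cre samples and 0 for flox samples.
print(design[, "cre_loxcre"])
```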
It doesn't really make any difference (although it might make things simpler to reason about), since you specify the experimental comparisons in your contrast matrix, not your design matrix.
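A rough base-R sketch of that idea, using made-up expression values for a single hypothetical gene. With one design column per group, the direction of the comparison is set entirely by the contrast vector. In limma the same step is normally done with makeContrasts and contrasts.fit; the sketch below avoids the package so that it runs on its own:

```r
# Group labels from the example. The expression values are made-up numbers
# for one hypothetical gene, used only to show the arithmetic.
group <- factor(c("flox", "flox", "flox", "cre", "cre", "cre"))
expr  <- c(5.1, 4.9, 5.0, 6.2, 5.8, 6.0)

# One column per group (no intercept), named after the factor levels.
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)   # column order: cre, flox

# The contrast states the comparison of interest: cre minus flox.
contrast <- c(cre = 1, flox = -1)

# Least-squares fit gives one coefficient per group (the group means here),
# in the same column order as the design matrix.
coefs <- lm.fit(design, expr)$coefficients

# Applying the contrast yields the cre-vs-flox difference.
cre_vs_flox <- sum(coefs * contrast)
print(cre_vs_flox)
```

Swapping the signs in the contrast vector would give flox minus cre from the same design matrix, which is why the choice of reference level is mostly a matter of convenience.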