edgeR and design matrix with triplicates
5.5 years ago
elb ▴ 260

Hi guys, suppose you have the following situation:

          MouseA         MouseB        MouseC

       Treatment1      Treatment1    Treatment1
       Control1        Control1      Control1  
       Treatment2      Treatment2    Treatment2
       Control2        Control2      Control2
       Treatment3      Treatment3    Treatment3
       Control3        Control3      Control3  
       Treatment4      Treatment4    Treatment4
       Control4        Control4      Control4
  

And you want to compare Treatment1 vs Control1, Treatment2 vs Control2, etc. Could you suggest how you would build the design matrix and contrasts using edgeR? I tried it myself, but I'm a little confused by the output, so I would like an independent opinion. Thank you in advance!

Edit: each condition is in biological triplicate (MouseA-C)

edgeR R RNA-Seq • 1.8k views

What tutorials have you walked through?

The edgeR tutorial.

Does your tutorial lay out the samples and their treatments in the format you have above?

No, and that is why I'm asking: I have some doubts.

5.5 years ago
ATpoint 85k

Typically you have a count matrix where rows are genes and columns are samples, with column names something like this:

colnames
 [1] "Mouse1_Control1_rep1"   "Mouse1_Treatment1_rep1" "Mouse2_Control2_rep1"   "Mouse2_Treatment2_rep1" "Mouse3_Control3_rep1"   "Mouse3_Treatment3_rep1" "Mouse1_Control1_rep2"  
 [8] "Mouse1_Treatment1_rep2" "Mouse2_Control2_rep2"   "Mouse2_Treatment2_rep2" "Mouse3_Control3_rep2"   "Mouse3_Treatment3_rep2" "Mouse1_Control1_rep3"   "Mouse1_Treatment1_rep3"
[15] "Mouse2_Control2_rep3"   "Mouse2_Treatment2_rep3" "Mouse3_Control3_rep3"   "Mouse3_Treatment3_rep3"

If it is in a different format, it would be non-standard, and I strongly recommend transforming it to follow the common standards.

From there, first build what in the SummarizedExperiment world is called coldata:

# `colnames` is the character vector of sample names printed above
coldata <- data.frame(Samples   = colnames,
                      Mouse     = sapply(strsplit(colnames, split = "_"), function(x) x[1]),
                      Condition = sapply(strsplit(colnames, split = "_"), function(x) x[2]))

Then the design like:

design <- model.matrix(~ 0 + Condition, data = coldata)

and the contrasts:

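# model.matrix() names the columns <variable><level> with no separator,
# e.g. ConditionTreatment1 (check with colnames(design))
contrasts <- makeContrasts(Contr.T1_vs_C1 = (ConditionTreatment1 - ConditionControl1),
                           Contr.T2_vs_C2 = (ConditionTreatment2 - ConditionControl2),
                           Contr.T3_vs_C3 = (ConditionTreatment3 - ConditionControl3),
                           levels = design)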
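
A quick way to see which names makeContrasts will accept is to print the column names of the design. The sketch below uses only base R and rebuilds the example sample names from above; `sample_names` and `base_names` are hypothetical stand-ins for the column names of your real count matrix. Note that model.matrix pastes the variable name and the level together without a separator, so the columns come out as ConditionControl1, ConditionTreatment1, and so on.

```r
# Rebuild the 18 example sample names (6 mouse/condition labels x 3 replicates).
# `base_names` and `sample_names` are illustrative, not part of any package.
base_names <- c("Mouse1_Control1", "Mouse1_Treatment1",
                "Mouse2_Control2", "Mouse2_Treatment2",
                "Mouse3_Control3", "Mouse3_Treatment3")
sample_names <- paste0(rep(base_names, times = 3), "_rep", rep(1:3, each = 6))

# Split each name once and pull out the mouse and condition fields.
parts <- strsplit(sample_names, split = "_", fixed = TRUE)
coldata <- data.frame(
  Samples   = sample_names,
  Mouse     = vapply(parts, `[`, character(1), 1),
  Condition = factor(vapply(parts, `[`, character(1), 2))
)

# One column per condition, no intercept.
design <- model.matrix(~ 0 + Condition, data = coldata)

colnames(design)  # the names to use inside makeContrasts()
colSums(design)   # number of samples per condition
```

If you prefer shorter names in the contrasts, one option is to overwrite them with `colnames(design) <- levels(coldata$Condition)` before calling makeContrasts.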

Thank you very much! I have another doubt. Should estimateDisp be run on the entire count matrix or only on the subset of samples in the comparison you want to look at? For example, should it use only the samples in Treatment1 and Control1, or all the samples together? My point is: if you do not compare the treatments with each other (e.g. the effect of Treatment1 vs Treatment2), why should you consider the variability of the genes across all the samples?

Ohhhh wow! The answer I was looking for! Thank you very much. EdgeR does not address this point! Thank you very much!

This is typically how I do it, mostly for practical reasons. If you run a separate estimation for each group, results would change if you later need to compare samples which have not been processed together in the first place. Dispersion estimation over all samples and then contrasting the respective groups is for me the most reasonable approach.
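
As a rough sketch of that workflow, assuming edgeR is installed: the counts below are simulated stand-ins (not real data), the object names are illustrative, and the quasi-likelihood functions are one reasonable choice among the tests edgeR offers. Dispersion is estimated once over all 18 samples, and each comparison is then pulled out as a contrast from the same fit.

```r
library(edgeR)  # attaches limma too, which provides makeContrasts()

set.seed(1)
condition <- factor(rep(c("Control1", "Treatment1", "Control2", "Treatment2",
                          "Control3", "Treatment3"), times = 3))

# Simulated counts: 1000 hypothetical genes x 18 samples.
counts <- matrix(rnbinom(1000 * 18, mu = 100, size = 10), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("s", 1:18)))

design <- model.matrix(~ 0 + condition)
colnames(design) <- levels(condition)  # short names: Control1, Treatment1, ...

y <- DGEList(counts = counts, group = condition)
keep <- filterByExpr(y, design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)
y <- estimateDisp(y, design)           # one dispersion estimate from all samples

fit <- glmQLFit(y, design)
contr <- makeContrasts(T1_vs_C1 = Treatment1 - Control1,
                       T2_vs_C2 = Treatment2 - Control2,
                       T3_vs_C3 = Treatment3 - Control3,
                       levels = design)

# Test one comparison at a time from the shared fit.
res_T1 <- glmQLFTest(fit, contrast = contr[, "T1_vs_C1"])
topTags(res_T1, n = 5)
```

Testing another pair only needs a different column of `contr`; the fit and the dispersion estimates stay the same, which is what keeps the comparisons consistent with each other.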

Yes, I know! My problem is that my samples are highly variable, but as you correctly say, the cross-comparison of results could be affected if samples are considered in a "paired" analysis.

It probably comes down to "whatever floats your boat". If some samples inflate dispersion and prevent you from getting significant results between some groups (maybe even though you know about some true positives there), then remove them. Differential analysis is never a 100% exact science, as p-values themselves are tail probabilities and the choice of cutoffs and parameters strongly influences the outcome. In any case, the main conclusion you draw from the data should stand if you run several analyses with slightly different cutoffs and parameters. Single genes can change, but the main message must stand.
