So I have two different RNA-seq experiments which I am going to call experiment 1 and 2.
Experiment 1 and 2 overviews in r:
glimpse(Experiment_1)
Rows: 47,069
Columns: 10
$ Gene <chr> "ENSMUSG00000000001", "ENSMUSG00000000003", "ENSMUSG00000000028", "ENSMUSG00000000031", "ENSMUSG00000000037", "ENSMUSG000000…
$ Symbol <chr> "Gnai3", "Pbsn", "Cdc45", "H19", "Scml2", "Apoh", "Narf", "Cav2", "Klf6", "Scmh1", "Cox5a", "Tbx2", "Tbx4", "Zfy2", "Ngfr", …
$ DMSO_Jour2_N2 <int> 23857, 0, 2696, 63, 3, 59, 4735, 327, 8869, 4549, 19894, 4, 0, 1, 3, 825, 27, 967, 8475, 3726, 8928, 2012, 2413, 6674, 0, 39…
$ DMSO_Jour2_N1 <int> 19501, 0, 2234, 36, 7, 52, 4116, 302, 7344, 4118, 18673, 3, 0, 1, 3, 738, 21, 722, 6512, 3175, 6986, 1721, 2068, 6143, 1, 34…
$ X4OHT_Jour2_N4 <int> 26304, 0, 2218, 176, 4, 40, 4719, 1800, 12798, 4648, 17333, 2, 0, 0, 0, 425, 41, 1155, 9205, 3685, 5419, 1854, 2429, 5037, 0…
$ X4OHT_Jour2_N1 <int> 20747, 0, 1745, 126, 10, 32, 3904, 1164, 8172, 3849, 14201, 4, 0, 0, 1, 292, 41, 905, 6955, 2909, 3903, 1502, 1766, 4834, 0,…
$ DMSO_Jour2_N4 <int> 23692, 0, 2411, 84, 7, 45, 3988, 386, 10012, 4000, 17047, 3, 0, 1, 3, 903, 18, 937, 8061, 3335, 9281, 1814, 2242, 5423, 1, 3…
$ DMSO_Jour2_N3 <int> 24641, 0, 2268, 58, 4, 52, 4919, 284, 8327, 3784, 18543, 2, 0, 0, 0, 663, 21, 914, 7923, 3511, 7063, 1886, 2022, 6116, 0, 37…
$ X4OHT_Jour2_N3 <int> 18933, 0, 1620, 75, 3, 30, 2894, 1200, 7935, 3237, 12891, 1, 0, 1, 0, 315, 41, 861, 6441, 2754, 3683, 1372, 1499, 4007, 0, 3…
$ X4OHT_Jour2_N2 <int> 23235, 0, 1877, 84, 5, 50, 3804, 1138, 9085, 3867, 16628, 4, 0, 0, 0, 355, 39, 978, 7224, 3318, 4277, 1620, 1712, 5443, 1, 4…
glimpse(Experiment_2)
Rows: 47,069
Columns: 8
$ Gene <chr> "ENSMUSG00000000001", "ENSMUSG00000000003", "ENSMUSG00000000028", "ENSMUSG00000000031", "ENSMUSG00000000037", "ENSMUSG000000…
$ Symbol <chr> "Gnai3", "Pbsn", "Cdc45", "H19", "Scml2", "Apoh", "Narf", "Cav2", "Klf6", "Scmh1", "Cox5a", "Tbx2", "Tbx4", "Zfy2", "Ngfr", …
$ X4OHT_Wnt3A_N3 <int> 11155, 0, 1258, 23, 4, 14, 1661, 1158, 6316, 2237, 7656, 0, 0, 0, 2, 283, 23, 543, 4489, 1649, 3288, 769, 1141, 2043, 1, 175…
$ X4OHT_Wnt3A_N2 <int> 8759, 0, 1228, 20, 1, 4, 1520, 859, 4903, 2076, 6249, 2, 0, 0, 1, 327, 8, 469, 4249, 1474, 5149, 769, 1149, 1427, 0, 1202, 1…
$ X4OHT_Wnt3A_N1 <int> 9200, 0, 1158, 35, 0, 10, 1217, 1138, 5631, 1913, 6097, 0, 0, 0, 1, 357, 16, 509, 4175, 1516, 5052, 817, 1201, 1662, 2, 1233…
$ DMSO_Wnt3A_N3 <int> 8061, 0, 899, 27, 0, 14, 1788, 142, 3114, 1416, 6476, 0, 0, 0, 0, 346, 4, 342, 3070, 1167, 3567, 683, 880, 1865, 0, 1130, 73…
$ DMSO_Wnt3A_N1 <int> 8808, 0, 1277, 14, 1, 9, 1049, 647, 5472, 2097, 6888, 0, 0, 0, 0, 558, 3, 440, 4058, 1552, 6196, 751, 1118, 1974, 0, 1170, 1…
$ DMSO_Wnt3A_N2 <int> 9096, 0, 1373, 32, 0, 13, 1360, 451, 4771, 2058, 6831, 1, 0, 0, 0, 591, 13, 437, 4166, 1590, 6693, 822, 1224, 1772, 0, 1320,…
In experiment 1, I compared condition "DMSO_Jour2" vs condition "X4OHT_Jour2", 4 replicates each;
In experiment 2, I compared condition "DMSO_Wnt3A" vs condition "X4OHT_Wnt3A", 3 replicates each.
Now I would like to get the gene_counts of those two different experiments and combine them (they have the same number of rows, aka genes) into two rearranged tables in a way to make two differential analyses, one comparing DMSO_Jour2
to DMSO_Wnt3A
(called experiment "DMSO", see below) and another one comparing X4OHT_Jour2
to X4OHT_Wnt3A
(called experiment "OHT", see below). Should I have to add a batch effect when I design my matrix?
So far this is how I make batch effect in R, following was an attempted first option to include batch effect in design matrix, which I am not able to...
replicate_DMSO <- factor(c(1,1,1,1,2,2,2))
replicate_OHT <- factor(c(1,1,1,1,2,2,2))
Batch <- factor(c(1,1,1,1,2,2,2))
design_DMSO <- model.matrix(~replicate_DMSO + Batch)
design_OHT <- model.matrix(~replicate_OHT + Batch)
design_DMSO
(Intercept) replicate_DMSO2 Batch2
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
5 1 1 1
6 1 1 1
7 1 1 1
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$replicate_DMSO
[1] "contr.treatment"
attr(,"contrasts")$Batch
[1] "contr.treatment"
design_OHT
(Intercept) replicate_OHT2 Batch2
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
5 1 1 1
6 1 1 1
7 1 1 1
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$replicate_OHT
[1] "contr.treatment"
attr(,"contrasts")$Batch
[1] "contr.treatment"
The problem is then when I try to run later in edgeR analysis common dispersion I got those errors:
y_DMSO <- estimateGLMCommonDisp(y_DMSO, design_DMSO, verbose=TRUE)
Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset, :
Design matrix not of full rank. The following coefficients not estimable:
Batch2
y_OHT <- estimateGLMCommonDisp(y_OHT, design_OHT, verbose=TRUE)
Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset, :
Design matrix not of full rank. The following coefficients not estimable:
Batch2
I know there is something off with my design matrix so I am wondering, do I really have to consider a batch effect in this case and how to address it to the design of the matrix.
When I do differential expression without batch effect I get tons of over-expressed genes.
As a second option to treat batch effect I used ComBat_seq (without normalizing first) from package sva on the total count matrix including experiment 1 and 2, then splitting into two table matrices for running the differential analyses DMSO and OHT.
Here, no error, but it kills all the variation and nothing is differentially expressed. I generated a dendrogram before sva batch effect and it looks pretty much I have two clusters that corresponded to the two batches...they disappear when I treat for batch with sva.
Any insights? Do I really have to treat for batch effect in this case?
Thank you very much.
It sounds like your 2 batches have different conditions. In other words, none of the samples in either batch should be similar. Is that correct?
that is correct! Thanks