I have an RNA-seq dataset with two treatments (A and B) and four environments in a classic +/+, +/-, -/+, -/- design with four replicates for environment for a total of 16 samples. There are also four batches: 1 replicate of each environment occurs in the first batch, 2 replicate of each environment in second batch and so on.
So far, I have used the glm methods of EdgeR with the model ~batch+A*B and gotten out the effects of A and B and interactions.
However, more senior researchers wants me to provide them with results for a mixed model approach and use the lmer() function from the lme4 package to model the batch as a random factor instead. Their concern was, as I have understood it, that the effects of B might be drowned out if batch is treated as a fixed factor. So they want to find how much variation in B contributes to the observed expression values.
From reading the manuals, vignettes and tutorials, I understand the basic idea of the lmer() function on their example datasets, but the input format seems widely different from mine.
Their format:
> head(Dyestuff)
Batch Yield
1 A 1545
2 A 1440
3 A 1440
4 A 1520
5 A 1580
6 B 1540
> str(Dyestuff)
'data.frame': 30 obs. of 2 variables:
$ Batch: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 2 2 2 2 2 ...
$ Yield: num 1545 1440 1440 1520 1580 ...
> summary(Dyestuff)
Batch Yield
A:5 Min. :1440
B:5 1st Qu.:1469
C:5 Median :1530
D:5 Mean :1528
E:5 3rd Qu.:1575
F:5 Max. :1635
My format, on the other hand, is a matrix with 16 columns and many thousands rows, each row is one gene and each column is a sample (a standard RNA-seq input format). I know how to define experimental factors for this experimental design, but when I try to run this code:
#!/usr/bin/Rscript
# Load libraries.
library(lme4)
# Loads counts file from e. g. RSEM
data.set <- read.table("genes.counts.matrix")
data.set <- round(data.set)
# Creates factors for A, B and batch effect
A <- factor(substring(colnames(data.set),1,6))
B <- factor(substring(colnames(data.set),7,10))
batch <- factor(substring(colnames(data.set),12,12))
fm01 <- lmer(data.set ~ A*B + (1|batch))
I get this error message:
Error in model.frame.default(drop.unused.levels = TRUE, formula = data.set ~ :
invalid type (list) for variable 'data.set'
Calls: lmer ... <Anonymous> -> eval -> eval -> model.frame -> model.frame.default
Execution halted
I suspect this is because the input format is wrong. Making it a data.frame()
did not help. What are some productive approaches to this issue? Make a new table with log2 FC for B as the response variable (from, say, only those eight samples were A is unchanged) and calculate it from each batch independently?