Question

Lmer() for RNA-seq with batch as random effect: how to prepare input?

0

Entering edit mode

9.2 years ago

Ekarl2 ▴ 120

I have an RNA-seq dataset with two treatments (A and B) and four environments in a classic +/+, +/-, -/+, -/- design with four replicates for environment for a total of 16 samples. There are also four batches: 1 replicate of each environment occurs in the first batch, 2 replicate of each environment in second batch and so on.

So far, I have used the glm methods of EdgeR with the model ~batch+A*B and gotten out the effects of A and B and interactions.

However, more senior researchers wants me to provide them with results for a mixed model approach and use the lmer() function from the lme4 package to model the batch as a random factor instead. Their concern was, as I have understood it, that the effects of B might be drowned out if batch is treated as a fixed factor. So they want to find how much variation in B contributes to the observed expression values.

From reading the manuals, vignettes and tutorials, I understand the basic idea of the lmer() function on their example datasets, but the input format seems widely different from mine.

Their format:

> head(Dyestuff)
  Batch Yield
1     A  1545
2     A  1440
3     A  1440
4     A  1520
5     A  1580
6     B  1540

> str(Dyestuff)
'data.frame':   30 obs. of  2 variables:
 $ Batch: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 2 2 2 2 2 ...
 $ Yield: num  1545 1440 1440 1520 1580 ...

> summary(Dyestuff)
 Batch     Yield
 A:5   Min.   :1440
 B:5   1st Qu.:1469
 C:5   Median :1530
 D:5   Mean   :1528
 E:5   3rd Qu.:1575
 F:5   Max.   :1635

My format, on the other hand, is a matrix with 16 columns and many thousands rows, each row is one gene and each column is a sample (a standard RNA-seq input format). I know how to define experimental factors for this experimental design, but when I try to run this code:

#!/usr/bin/Rscript

# Load libraries.
library(lme4)

# Loads counts file from e. g. RSEM
data.set <- read.table("genes.counts.matrix")
data.set <- round(data.set)

# Creates factors for A, B and batch effect
A <- factor(substring(colnames(data.set),1,6))
B <- factor(substring(colnames(data.set),7,10))
batch <- factor(substring(colnames(data.set),12,12))

fm01 <- lmer(data.set ~ A*B  + (1|batch))

I get this error message:

Error in model.frame.default(drop.unused.levels = TRUE, formula = data.set ~  :
  invalid type (list) for variable 'data.set'
Calls: lmer ... <Anonymous> -> eval -> eval -> model.frame -> model.frame.default
Execution halted

I suspect this is because the input format is wrong. Making it a data.frame() did not help. What are some productive approaches to this issue? Make a new table with log2 FC for B as the response variable (from, say, only those eight samples were A is unchanged) and calculate it from each batch independently?

lmer rna-seq batch mixed model • 4.2k views

ADD COMMENT • link updated 6.6 years ago by Ram 45k • written 9.2 years ago by Ekarl2 ▴ 120

Ram · Accepted Answer · 2016-02-22

You'll end up needing to make a data frame with the relevant design and then mapply() a function with that to the matrix, since as is you're trying to fit the whole matrix in one go rather than fitting each row. Play around with one row of your matrix and get lmer to work with that and then you'll have a good starting point for scaling up to the full matrix.

As an aside, this is largely a waste of your time (your senior colleagues might want to look up how empirical Bayes methods are any some respects similar to what you end up doing in a mixed-effect model).