fixed effect vs random effects RNA analysis
3.5 years ago
Will ▴ 20

Hi, when choosing a formula for differential expression analysis in R, e.g. on RNA count data, how do I decide which variables to put in (1|x), i.e. when should a variable be modeled as a random effect rather than a fixed effect? Is the decision based only on whether the variable is continuous or categorical?

What are the differences between ~ feature + AGE + (1|SEX) + (1|GROUP) + (1|Individual) and ~ feature + AGE + SEX + GROUP + (1|Individual),

where GROUP is the category of a subject and Individual is just the ID used for dealing with repeated measures?

variancePartition voom LIMMA DE RNA • 5.0k views

I note that you've specified Limma/Voom in your tags, but as far as I am aware, limma does not handle random effects with this sort of design formula. Instead, you can account for repeated measurements on the same individual with the duplicateCorrelation function. There is also dream, in the variancePartition package, which extends the limma framework to handle arbitrary random effects. Is this perhaps what you want to use?


Yes, sorry. I use voomWithDreamWeights instead of voom and dream instead of lmFit

3.5 years ago
Steven Lakin ★ 1.8k

The second formula is probably the one you want: ~ feature + AGE + SEX + GROUP + (1|Individual)

A very simple way of looking at this is: you should use a fixed effect if you directly observe a feature from this particular group of data, and a random effect if you are modeling a small sample from a larger population. The features SEX, AGE, and GROUP are observed for these data, so you would model them as fixed. In contrast, Individual is a sample drawn from a larger population of individuals (i.e. you could replace this set of individuals with any other drawn from the same population), so it is modeled as a random effect.
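For readers who prefer notation, the second formula can be written out roughly as a random-intercept model for one gene. This is only a sketch under assumptions that are not stated in the thread: SEX and GROUP are treated as two-level factors (one coefficient each), i indexes individuals, j indexes samples within an individual, and the voom precision weights are ignored.

```latex
y_{ij} = \beta_0 + \beta_1\,\mathrm{feature}_{ij} + \beta_2\,\mathrm{AGE}_{i}
       + \beta_3\,\mathrm{SEX}_{i} + \beta_4\,\mathrm{GROUP}_{i} + b_i + \varepsilon_{ij},
\qquad b_i \sim \mathcal{N}(0, \sigma_b^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

The fixed effects are the \beta coefficients, each estimated directly. The term (1|Individual) corresponds to b_i: rather than estimating a separate coefficient per individual, the model estimates the variance \sigma_b^2 of the population those individuals were drawn from.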

A caveat: how to model mixed effects in frequentist regression is a more complicated and nuanced topic than I explain above, and it is discussed elsewhere by experts like Gelman. If you're doing a lot of this kind of work, it's probably worth reading an introductory textbook to fully understand the subject. I like "Data Analysis Using Regression and Multilevel/Hierarchical Models" by Gelman and Hill for a gentle introduction to the topic with code in R.


Also, depending on the model, the fixed-effect/random-effect specification may determine which comparisons you can or cannot do across your levels. There are some nice Biostars answers/discussions here or here.


In this case, why are the subject IDs handled as a random effect? Is it because, since we have repeated measures for each subject, it would be like dividing the data into subgroups?


Your question is a good one but complicated to answer without pointing to an entire course syllabus of concepts in regression. The easiest way to think about it is to understand what the goal of modeling is:

Given a dataset that has variation in the data points, we wish to describe that variation in ways that make organizational sense to us (i.e. in terms of factors in our experimental design, like treatment group, sex, age, etc).

One way to describe variance is to split it into four major categories:

  1. Process variance: variance due to systematic factors, such as our experimental treatment, sex, age, etc.
  2. Sampling variance: variance due to incomplete sampling (we observe a set of points that is smaller than the whole population, therefore by random chance over repeated samples we expect some variation around the true population central tendency on any given draw)
  3. Calibration/observation variance: variance due to error in our instruments or measurement process
  4. Group/hierarchical variance: variance due to higher order factors like geographical location (sampling on one continent versus another)

Fixed effects are often (but not always) used to model process variance: the variance you want to describe due to the systematic structure of your experiment, such as age, sex, and group in your design.

The factor you've called "Individual" (AKA ID, animal_id, person_id, etc), on the other hand, we expect to capture sampling variance, since your coding of individual describes the random sampling of a larger population. Therefore, we should try to model it as a random effect, since it arises from a random variable descriptive of the population at large.

By capturing that sampling variation in a random effect, you should improve the tightness of your estimates around the fixed effects. Think of it like an equation:

All variation = process + sampling + calibration + group

If you remove variation due to sampling from the pool of variance, then the variance estimates around the central tendencies of your fixed effects should be smaller (i.e. they should tighten). If that is not the case, or if inspection of your residuals reveals a poor model fit, that model may not be appropriate, which is why we always perform diagnostic checks to make sure our assumptions hold during modeling.

This is a long way of saying: using an appropriate random effect to model sampling variance will improve the estimates around your fixed effects, and it is simply the appropriate way to model data for this particular experimental design. Any more complex answer will require that you take a course in regression or refer to a textbook.
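The tightening described above can be seen in a small toy calculation. This is a minimal sketch, not dream or limma: the data are hypothetical (six made-up individuals, each measured once under a control and once under a treated condition), and a simple within-individual difference stands in for a model with a per-individual term. It compares the standard error of the treatment effect when the individual is ignored against the standard error when between-individual variation is removed.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical toy data (not from the thread): six individuals whose baseline
# expression differs a lot, plus small measurement noise in each condition.
baseline = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
noise_control = [0.1, -0.1, 0.2, -0.2, 0.0, 0.0]
noise_treated = [-0.1, 0.1, 0.0, 0.0, 0.2, -0.2]
true_effect = 2.0

control = [b + e for b, e in zip(baseline, noise_control)]
treated = [b + true_effect + e for b, e in zip(baseline, noise_treated)]


def se_ignoring_individual(a, b):
    """Standard error of mean(b) - mean(a), treating all samples as independent."""
    return sqrt(variance(a) / len(a) + variance(b) / len(b))


def se_within_individual(a, b):
    """Standard error of the mean within-individual difference (b - a)."""
    diffs = [y - x for x, y in zip(a, b)]
    return sqrt(variance(diffs) / len(diffs))


effect = mean(treated) - mean(control)
print("estimated effect:", round(effect, 3))
print("SE ignoring individual:", round(se_ignoring_individual(control, treated), 3))
print("SE within individual:", round(se_within_individual(control, treated), 3))
```

Both approaches give the same estimated effect, but the standard error is far smaller once the variation between individuals is taken out of the pool, which is the same reasoning behind adding (1|Individual) to the formula.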

As a side note, I would be careful using the term "repeated measures," because that would mean you collected multiple data points from individuals across time (or location, etc), and that would need to be included in your formula in a different way.


Thanks a lot! Last question: what is the difference between ~ and ~0? As I understand it, ~0 removes the intercept.


~ 0 removes the intercept term, which forces the fit through the origin when all covariates (independent variables) are set to zero. That is not often a valid assumption, and you would need to know exactly what effects it would have on your model for its use to be valid. It is essentially saying: "when all of my covariates are set to 0 (whether they are categorical, numeric, or ordinal), then the outcome (dependent variable) should be 0". Since you are regressing on gene expression, you would be setting the expression level to zero. Personally, I don't think it fits that use case, but that determination must be made on a case-by-case basis by experts in the subject being modeled.
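To make the effect of dropping the intercept concrete, here is a minimal sketch with a single numeric covariate. The data are hypothetical and generated from y = 1 + 2x, and the two helper functions are illustrative closed-form least-squares fits, not anything from limma or dream. It covers only the single numeric covariate case.

```python
def fit_with_intercept(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def fit_through_origin(x, y):
    """Least squares for y = slope * x, i.e. the intercept is fixed at zero."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


# Hypothetical data that follow y = 1 + 2x exactly.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

intercept, slope = fit_with_intercept(x, y)
slope_origin = fit_through_origin(x, y)

print("with intercept: intercept =", intercept, "slope =", slope)
print("through origin: slope =", round(slope_origin, 4))
```

With the intercept, the fit recovers the generating line. Without it, the slope has to absorb the missing offset and is biased upward, which is why forcing the fit through the origin is only safe when the outcome really should be zero at zero covariates.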
