Question

How to choose the right model?

0

4.2 years ago

Aspire ▴ 360

I am analyzing human RNA-Seq data, with the help of DESeq2. I was supplied with a large metadata file, which has about 15 additional characteristics of the subjects.

1) Conceptually, what is the right way to choose which of the 15 variables should be included in the model?

2) Technically, if I am considering adding a few variables, I can add each one of them in the model, and see whether that enlarges the number of significantly DE genes. Is there a simpler way to do it?

deseq2 RNA-Seq glm • 1.0k views

updated 4.2 years ago by caggtaagtat ★ 1.9k • written 4.2 years ago by Aspire ▴ 360

score 2 · Accepted Answer · 2020-08-31

4.2 years ago

caggtaagtat ★ 1.9k

Hi, I would choose the factors that you know influence your count matrix and those whose influence on the count matrix you want to study: for example batch, condition, biological individual, or gender.

The absolute number of DE genes is not a measure of the quality of a DGE analysis. If you increase the number of false positives, you also increase the number of DE genes.

0

1) You suggest a theory-driven approach, based on the ideas of what the important factors are. What is the right way to perform a "data-driven" approach, and to decide the right factors using the data, and not prior ideas of what the important variables are?

2) Thanks, I see your point. What would be the right way to decide whether a specific factor is important for the model? For example, in multiple linear regression we could check the significance level of a variable we added. What would be the equivalent in DESeq2?

Reply • 4.2 years ago by Aspire ▴ 360

0

1) I mean that if you have different batches, you should always include batch in your design. The same goes for other factors that definitely influence gene expression, like Individual_ID when the individuals are not clones with exactly the same genome. I do not know how you would see it in the data if you forgot to include one of these important factors (if they apply to your data). I guess that in a PCA of the expression data the distinct sample groups might cluster together better, but that is not always the case, especially if you have sequenced primary cells.

What belongs in the design also depends on which comparisons you want to make later with the results function.
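To make the batch point concrete, here is a rough sketch in plain Python. It is not DESeq2 and not how DESeq2 fits its negative binomial models: it uses ordinary least squares on made-up log-expression values for a single hypothetical gene, and the helper `ols` and all the numbers are assumptions for illustration only. It compares a design with condition alone against a design with batch and condition.

```python
def ols(X, y):
    """Ordinary least squares via the normal equations.

    Returns (coefficients, residual sum of squares).
    """
    n, p = len(X), len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
         + [sum(X[k][i] * y[k] for k in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    rss = sum((y[k] - sum(X[k][j] * beta[j] for j in range(p))) ** 2
              for k in range(n))
    return beta, rss


# Made-up data for one gene: 8 samples, balanced over 2 batches x 2 conditions.
condition = [0, 0, 1, 1, 0, 0, 1, 1]
batch = [0, 0, 0, 0, 1, 1, 1, 1]
log_expr = [5.1, 4.9, 6.1, 5.9, 6.9, 7.1, 7.9, 8.1]

# Design "~ condition": intercept + condition.
X_reduced = [[1.0, c] for c in condition]
# Design "~ batch + condition": intercept + batch + condition.
X_full = [[1.0, b, c] for b, c in zip(batch, condition)]

beta_reduced, rss_reduced = ols(X_reduced, log_expr)
beta_full, rss_full = ols(X_full, log_expr)

n = len(log_expr)
# Residual variance = RSS / (samples - fitted coefficients).
var_reduced = rss_reduced / (n - len(beta_reduced))
var_full = rss_full / (n - len(beta_full))

print("condition effect without batch:", round(beta_reduced[1], 3),
      "residual variance:", round(var_reduced, 4))
print("condition effect with batch:   ", round(beta_full[2], 3),
      "residual variance:", round(var_full, 4))
```

In this toy example the estimated condition effect is the same in both designs (the layout is balanced), but the residual sum of squares drops from 8.08 to 0.08 once batch is in the model, so the same condition effect is judged against far less unexplained noise.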

Reply • 4.2 years ago by caggtaagtat ★ 1.9k

0

Is there a way that you know of to determine whether a specific factor is relevant if it has a small influence? For example, BMI theoretically can be influential, but it is questionable. I doubt that the inclusion of BMI will be immediately visible in much better clustering.

Reply • 4.2 years ago by Aspire ▴ 360

0

If I understood correctly, it is recommended to include in the design any factor that could be responsible for variance in gene expression across individuals, so that the effects of the factors of interest stand out more clearly. In the case of BMI, I think it would be fair to expect an overweight person to show some differences in gene expression compared to an underweight person, at least in most tissues. Height alone probably would not.
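One way to ask the data whether a doubtful covariate such as BMI adds anything is to compare a model with it against a model without it. DESeq2 offers a likelihood ratio test for this kind of nested comparison (DESeq called with test="LRT" and a reduced design). The plain-Python sketch below only illustrates the idea and is not DESeq2's implementation: it assumes a Gaussian linear model on log-expression for one hypothetical gene, made-up values, a hypothetical helper `lrt_for_covariate`, and a chi-square approximation with 1 degree of freedom that is crude for 8 samples.

```python
import math


def center_within_groups(values, groups):
    """Subtract each group's mean from its members."""
    means = {}
    for g in set(groups):
        members = [v for v, gg in zip(values, groups) if gg == g]
        means[g] = sum(members) / len(members)
    return [v - means[g] for v, g in zip(values, groups)]


def lrt_for_covariate(y, covariate, groups):
    """Likelihood ratio test of '~ group + covariate' against '~ group'.

    Assumes Gaussian errors and a single numeric covariate, so the test
    statistic is compared to a chi-square distribution with 1 df.
    """
    # Removing the group means from y and from the covariate is equivalent
    # to fitting the group term first (Frisch-Waugh-Lovell theorem).
    yc = center_within_groups(y, groups)
    xc = center_within_groups(covariate, groups)
    rss_reduced = sum(v * v for v in yc)           # model without covariate
    sxx = sum(v * v for v in xc)
    slope = sum(a * b for a, b in zip(xc, yc)) / sxx
    rss_full = rss_reduced - slope ** 2 * sxx      # model with covariate
    stat = len(y) * math.log(rss_reduced / rss_full)
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return slope, rss_reduced, rss_full, stat, p_value


# Made-up data: 4 samples per condition, log-expression of one gene.
condition = [0, 0, 0, 0, 1, 1, 1, 1]
bmi = [20, 24, 28, 32, 21, 25, 29, 33]
log_expr = [6.1, 6.1, 6.3, 6.7, 7.15, 7.15, 7.35, 7.75]

slope, rss_reduced, rss_full, lrt_stat, p_value = lrt_for_covariate(
    log_expr, bmi, condition)
print("BMI slope:", slope)
print("RSS without BMI:", rss_reduced, "with BMI:", rss_full)
print("LRT statistic:", lrt_stat, "approximate p-value:", p_value)
```

In a real analysis this comparison would be run per gene, so the result is a list of genes for which the covariate matters rather than a single verdict on the covariate.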

Which cells do you analyse and what factors do you have, if you don't mind me asking? :)

Reply • 4.2 years ago by caggtaagtat ★ 1.9k

0

Actually, I would rather not share the details, if that's okay. In any case, I am trying to get an answer to the general question. I just gave BMI as an example of something that probably has an effect, but not certainly.

Reply • 4.2 years ago by Aspire ▴ 360