I am analyzing human RNA-Seq data, with the help of DESeq2. I was supplied with a large metadata file, which has about 15 additional characteristics of the subjects.
1) Conceptually, what is the right way to choose which of the 15 variables should be included in the model?
2) Technically, if I am considering adding a few variables, I can add each one of them to the model and see whether that increases the number of significantly DE genes. Is there a simpler way to do it?
1) You suggest a theory-driven approach, based on the ideas of what the important factors are. What is the right way to perform a "data-driven" approach, and to decide the right factors using the data, and not prior ideas of what the important variables are?
2) Thanks, I see your point. What would be the right way to decide whether a specific factor is important for the model? For example, when performing multiple linear regression, we could check the significance level of a variable we added. What would be the equivalent in DESeq2?
1) What I mean is that if you have different batches, you should always include batch in your design. The same goes for other factors that definitely influence gene expression, such as Individual_ID when the individuals are not clones with exactly the same genome. I do not know how you would see in the data that you forgot to include one of these important factors (if they apply to your data). I guess that if you do a PCA on the expression data, the distinct sample groups might cluster together more clearly, but that is not always the case, especially if you have sequenced primary cells.
For the design parameter it also matters what comparisons you want to make with the results function later.
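To make the "is this factor worth including" question a bit more concrete: one common way to judge a single variable is to compare two nested models, one with the variable and one without it. In DESeq2 this is available as a likelihood ratio test, e.g. DESeq(dds, test="LRT", reduced= ~ batch + condition) when the full design is ~ batch + condition + BMI (the column names here are only placeholders). The snippet below is not DESeq2 and does not use its negative binomial model. It is only a rough Python sketch of the same idea on one gene, assuming an ordinary linear model on log-expression, with made-up numbers and hypothetical names throughout.

```python
import math

def fit_rss(X, y):
    """Fit ordinary least squares and return the residual sum of squares."""
    p = len(X[0])
    # Augmented normal equations [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

def lrt_pvalue(X_full, X_reduced, y):
    """Likelihood ratio test for ONE extra column in X_full (1 degree of freedom)."""
    n = len(y)
    lr = n * math.log(fit_rss(X_reduced, y) / fit_rss(X_full, y))
    # Survival function of a chi-squared distribution with 1 df
    return math.erfc(math.sqrt(max(lr, 0.0) / 2.0))

# Made-up example: 8 samples, a condition of interest and a candidate covariate
condition = [0, 0, 0, 0, 1, 1, 1, 1]
bmi = [19, 24, 29, 34, 20, 25, 30, 35]
noise = [0.05, -0.04, 0.02, -0.03, -0.02, 0.03, -0.05, 0.04]

# Hypothetical log-expression of two genes: one depends on BMI, one does not
gene_with_effect = [5 + c + 0.1 * b + e for c, b, e in zip(condition, bmi, noise)]
gene_without_effect = [5 + c + e for c, e in zip(condition, noise)]

X_reduced = [[1.0, float(c)] for c in condition]                  # ~ condition
X_full = [[1.0, float(c), float(b)] for c, b in zip(condition, bmi)]  # ~ condition + bmi

p_with_effect = lrt_pvalue(X_full, X_reduced, gene_with_effect)
p_without_effect = lrt_pvalue(X_full, X_reduced, gene_without_effect)
print("gene depending on BMI:    p =", p_with_effect)
print("gene independent of BMI:  p =", p_without_effect)
```

A small p-value means the extra variable explains more than you would expect by chance for that gene. With only a handful of samples the chi-squared approximation is rough, so treat this as an illustration of the principle rather than a recipe.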
Is there a way that you know of to determine whether a specific factor is relevant if it has a small influence? For example, BMI theoretically can be influential, but it is questionable. I doubt that the inclusion of BMI will be immediately visible in much better clustering.
If I understood correctly, it is recommended to include in the design any factor that could potentially be responsible for variance in gene expression across individuals, so that the effects of the factors of interest stand out more clearly. In the case of BMI, I think it would be fair to expect an overweight person to show some differences in gene expression compared to an underweight person, at least in most tissues. Height alone probably would not.
Which cells do you analyse and what factors do you have, if you don't mind me asking? :)
Actually I wouldn't like to share the details, if that's okay. But anyway, I am trying to get the answer in the abstract. I just gave BMI as an example of 'something that probably has an effect', but not certainly.