10.2 years ago
tucanj
I have tumor age (a nominal variable, divided into age brackets), gene expression (in RPKM), and sex (binary) for many tumors from people of different ages (not paired, i.e. not a time series of the same tumors, but different tumors at each age). I want to find the genes that are most differentially expressed with age, controlling for sex. What would be the best way to do this?
More specifically:
- Do I need to transform the RPKM values to rank (as per this thread)? The difference is that they use regularized regression.
- Is a linear regression appropriate? This paper seems to suggest that quantile regression is better than linear regression, which is used in many of the microarray-based aging gene expression studies (however, their algorithm has other features, and their age is not ordinal).
- Because I am repurposing data and do not know the batches, would it be appropriate to run sva (or something similar) on it?
Thanks!
If I could get read counts instead of RPKM, would that affect my analysis?
Quite likely, yes. There are a number of issues with using RPKMs, and raw counts tend to make life much simpler.
Can you please elaborate? Would I do the linear regression on the counts instead of RPKM? Would I need to include other factors in the regression such as gene length?
If you can get the counts then you'll use a GLM rather than straight linear regression (just use the DESeq2, edgeR or limma/voom Bioconductor packages). There's no reason to include gene length in the design (at least unless samples are significantly biased differently by it, but that's pretty unusual).
From my research, I cannot find a way to use an ordinal variable in one of these packages and find the trend of a gene's expression without doing a comparison between two groups. The closest I think I could do would be Age 3 vs Age 2 and Age 2 vs Age 1, and then take the union of the differentially expressed genes. Is there a way to find the linear trend (i.e. Gene A increases with age)? Or can I just fit a negative binomial GLM in R and adjust all the p-values for FDR?
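For reference, the FDR adjustment asked about here is usually the Benjamini-Hochberg procedure (what R's `p.adjust(..., method = "BH")` computes). Below is a minimal sketch of it in plain Python; the function name and the four p-values are made up for illustration and are not from any dataset in this thread.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values for the age coefficient.
pvals = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(pvals)
for p, q in zip(pvals, adj):
    print(f"p = {p:.3f}  BH-adjusted = {q:.3f}")
```

With one p-value per gene for the age term, genes would then be called at a chosen threshold on the adjusted values (0.05 is a common but arbitrary choice).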
You just use it as a covariate, so something like a design of `~sex + age`.
The coefficient on age is then the change per unit of age (so, per year, month, etc.). You might be able to use an ordered factor too; I've never tried it and don't know how model.matrix() treats it.
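To illustrate what "age as a covariate" means in the design, here is a minimal sketch in plain Python for a single gene. It is ordinary least squares on log2-transformed values, not the negative binomial GLM that DESeq2 or edgeR would fit on counts. The toy data, the integer coding of the age brackets, and the helper `fit_ols` are all assumptions made for the example. The values are constructed without noise so that the fitted coefficients are easy to check.

```python
import math


def fit_ols(rows, y):
    """Least-squares coefficients for y ~ rows, solved via the normal equations."""
    p = len(rows[0])
    # Build X'X and X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    aug = [xtx[i] + [xty[i]] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]


# Hypothetical toy data: one gene measured in six tumors.
age_bracket = [1, 1, 2, 2, 3, 3]      # age brackets coded as an integer score
sex = [0, 1, 0, 1, 0, 1]              # binary sex indicator
rpkm = [7, 15, 31, 63, 127, 255]      # made-up, noise-free expression values
log_expr = [math.log2(v + 1) for v in rpkm]

# Design matrix columns: intercept, sex, age (the analogue of ~sex + age).
design = [[1.0, float(s), float(a)] for s, a in zip(sex, age_bracket)]
intercept, sex_coef, age_coef = fit_ols(design, log_expr)

print(f"age coefficient (log2 units per bracket): {age_coef:.3f}")
print(f"sex coefficient (log2 units): {sex_coef:.3f}")
```

The age coefficient is the estimated change in log2 expression per one-bracket increase with sex held fixed, which is the "linear trend" being asked about. Coding the brackets as 1, 2, 3 assumes they are equally spaced; if the brackets have different widths, bracket midpoints in years may be a more sensible score.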
Furthermore, would it be better to correct for sex using ComBat or as a factor in the linear regression?
Use a factor for sex rather than ComBat.
Hi Devon Ryan,
Although it's late, I hope to get your response. You mentioned, "Give linear regression a try first, there are a lot more tools for it. I'd only deal with quantile regression if the results were unsatisfactory". In general, I did not find many papers that used quantile regression in the context of gene expression data. Could you please explain the reason?
In my work, I found no linear relationship between the eigengene value (obtained by WGCNA) and some quantitative traits, like age or survival. So in this situation, would quantile regression be the correct approach rather than linear regression analysis?
Thanks