I am running a multiple linear regression model using lm function in R. to study the impact of some characteristics on the gene expression level. my data matrix contains one continuous dependent variable (i.e. gene expression levels) and 50 explanatory variables which are the count of these characteristics on each gene and many of these counts are zeros. I checked all of the regression assumptions and I found two issues the first one is the Heteroskedasticity and the other one is the autocorrelation problem. the later is not series. I wonder I using the multiple linear regression is correct or no and if there is any other regression techniques can be used to solve these problems.
I suggest using
lmFit
from limma for multiple linear regression andplotSA
for assumption check. Usually autocorrelation is not a problem and limma has way to deal with heteroskedasticity witharrayweights
. But I'm cufused when you say one continuous dependent variable (i.e. gene expression levels) and 50 explanatory variables which are the count of these characteristics on each gene and many of these counts are zeros. Why you want to use linear regression to study the impact directly rather than doing some exploratory analysis (PCA, MDS, etc), see the correlation of top PCs with your variable of interest and then incorporate highly correlated variables in your design matrix ? I don't think it is a good idea to have 50 variables in your linear model unless you have large effective size and thus enough degree of freedoms for fitting.I used stepwise method and I got just 11 significant variables among those 50. But when I checked the heteroskidacty, I found it still appear in the reduced model. I never used limma package, but I will try it and see what will the results of it. I used PCA before but the only issue with it you can't know which are the variables names in each of the components to use them for the model fitting and l wonder if there is any way to know the variables names in each component (i.e. Return the variables names back). BTW my sample size is 15,000 genes. (15,000 rows and 50 columns). Yes, I have the gene expression as the dependent variable and 50 independent variables and I count how many of these characteristics represent in each gene. The image below shows is one of model diagnostic related the heteroskidacty problem.
Your plot is not bad. PCA works with linear combinations of gene expression value whose variances are largest (PC1), second largest (PC2), ... across all samples. You will not know the variables names in each component. You need to visualize your PCA plot (say PC1 vs PC2), usually labeled your points with different colors corresponding to different variables and see how well samples are seperated along PCs. This is equivalent to what I said before, see the correlation of top PCs with your variable of interest. Here is an example. Another point you should pay attention to is the variance explained by each PC, usually it will decrease fast and you would not expect all of your 50 variables have similar impact on gene expression.
You could make a PCA from you 50 variable then use the principal components to build your regression. This should get rid of the autocorrelation issue
I used PCA before for variables reduction, but the only issue with it you can't know which are those variables names in each of the components to use them for the model fitting and l wonder if there is any way to know the variables names in each component (i.e. Return the variables names back). It will be easy if I know those significant variables in each components. I wonder if someone help me with this issue.
You can have a look at the
loadings()
, the correlation of the original variables with the new components. Example here.