A linear regression model assumes that the residuals are normally (Gaussian) distributed and that they have equal variance (homoscedasticity).
Therefore, you have to check the homogeneity of variance of the residuals. In R, you can plot the residuals as a function of the fitted values (as you have done, I think) and first check by eye whether the residuals depend on the fitted values. You could then also run Levene's test, which is a specific test for homogeneity of variance.
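As far as I know, base R has no built-in Levene's test (the car package provides leveneTest()), and the test compares groups, so for a regression the fitted values have to be binned first. Here is a minimal sketch of a Levene-type check using only base R. The simulated data, the choice of four quantile-based groups, and the use of group medians (the Brown-Forsythe variant) are all my assumptions, not something from the question.

```r
# Simulated data (hypothetical): the error spread grows with x
set.seed(1)
x <- runif(200, 0, 10)
y <- 2 + 0.5 * x + rnorm(200, sd = 0.2 + 0.3 * x)
fit <- lm(y ~ x)

# Assumption: split the fitted values into 4 quantile-based groups
res <- residuals(fit)
grp <- cut(fitted(fit),
           breaks = quantile(fitted(fit), probs = seq(0, 1, 0.25)),
           include.lowest = TRUE)

# Levene-type test (Brown-Forsythe variant): one-way ANOVA on the
# absolute deviations of the residuals from their group medians
abs_dev <- abs(res - ave(res, grp, FUN = median))
levene <- anova(lm(abs_dev ~ grp))
p_value <- levene[["Pr(>F)"]][1]
print(levene)
```

A small p-value suggests that the spread of the residuals differs between the groups. The result depends on how the groups are chosen, so treat it as a complement to the plot rather than a replacement.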
Finally, for the other assumption, you can draw a quantile-quantile (Q-Q) plot, that is, a plot of the empirical quantiles of the residuals as a function of the theoretical quantiles of a Gaussian distribution. If the residuals are normally distributed, the points of this plot will lie along the diagonal reference line.
You can obtain both plots by running plot(your_linearmodel).
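To make this concrete, here is a small sketch of the two plots. The data are simulated only so the example runs on its own; your_linearmodel stands in for your own fitted lm object.

```r
# Simulated data (hypothetical), standing in for your own model
set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
your_linearmodel <- lm(y ~ x)

# Residuals vs fitted values (which = 1) and normal Q-Q plot (which = 2)
par(mfrow = c(1, 2))
plot(your_linearmodel, which = 1:2)

# The same kind of Q-Q plot built by hand from the standardized residuals
par(mfrow = c(1, 1))
std_res <- rstandard(your_linearmodel)
qqnorm(std_res)
qqline(std_res)   # reference line through the first and third quartiles
```

Calling plot(your_linearmodel) without the which argument cycles through the default set of diagnostic plots, which includes these two.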
In my opinion, all these things are well described here. In your plot, the residuals seem to depend on the fitted values, so the first assumption that I mentioned could be violated. But I suggest running plot(your_linearmodel), as said before and as the others also suggest.
You need to produce other diagnostic plots. Use the command
plot(your_linear_model(specification_and_data))
Four plots will be produced.
Show your input data. Are they integers instead of continuous numbers?
Thanks for your reply. The input data are the average (mean) of three integers. For example, a participant could have three scores of 5, 5 and 6; the mean of the three scores is the input value for that participant.
So do you have many repeated values across your observations? And is this for the Y variable or the X variable? Are both variables this type of data? I think the pattern in the plot could come from having repeated, non-continuous values, because the variable can take only a small number of possible values. For example, I can generate a similar plot by fitting integers:
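A minimal sketch of that kind of example is below. The data are made up (an integer-valued outcome obtained by rounding) and are not the data from the question.

```r
# Hypothetical data: an outcome restricted to a few integer values
set.seed(123)
x <- runif(150, 0, 10)
y <- round(3 + 0.3 * x + rnorm(150))   # rounding makes y discrete
fit <- lm(y ~ x)

# For a fixed value k of y, residual = k - fitted, so each distinct
# outcome value traces its own straight line with slope -1
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

Each parallel band in the resulting plot corresponds to one distinct value of the outcome, which is why a discrete Y produces stripes even when the model itself is reasonable.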
EDIT: I just found a discussion on a similar issue here.
Thanks for the link and help above - this was really helpful!
To answer your questions, I have (up to) three repeated values for each observation, but I have taken the mean of these so that there is one value per observation. This is for the outcome variable. The predictor and covariates have only one value for each observation, so no mean was created.
I think you are right that the parallel lines on the plot reflect the fact that the outcome variable is non-continuous and can only take a small number of possible values.
In a textbook (Andy Field, Discovering Statistics) I have seen that you can do a robust regression if assumptions are violated, that is, a regression with bootstrapping. Would a robust regression (i.e. bootstrapping) be useful here?
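For reference, this is roughly what I understand a bootstrapped regression to involve: a case-resampling bootstrap of the slope, sketched in base R. The data are simulated (an outcome built as the mean of three rounded scores), and the 2000 resamples and the percentile interval are my own assumptions rather than the textbook's exact procedure.

```r
# Hypothetical data: outcome is the mean of three integer scores
set.seed(2024)
n <- 80
x <- rnorm(n)
# x is recycled down each of the 3 columns, so row i uses x[i] three times
scores <- matrix(round(5 + x + rnorm(3 * n)), ncol = 3)
dat <- data.frame(x = x, y = rowMeans(scores))

# Case-resampling bootstrap: resample rows, refit, keep the slope
B <- 2000
boot_slopes <- replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)
  coef(lm(y ~ x, data = dat[idx, ]))[["x"]]
})

# Percentile 95% confidence interval for the slope
ci <- quantile(boot_slopes, probs = c(0.025, 0.975))
print(ci)
```

This changes how the uncertainty of the coefficients is estimated (standard errors and confidence intervals), while the fitted line itself stays the ordinary least squares fit.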
I have little experience with the type of data you describe, maybe others can chime in.