Goodmorning everyone,
I hope to be in the right place for this kind of statistical question. I'm working with an ecological index who give be the number of species within different district of the same area. At the same time i collected some climatical information about these 3 areas and I would like to know if there is a statistical way to detect possible influence of climate on biodivediversity level within this area. These are my data example:
Area District N.Species summerMin summerMax summerRain summerCloud summerHumidity summerUV summerSun
Area1 district1 421 9 14.25 36.425 5.5 35.25 4 110
Area1 district2 484 9 14.25 36.425 5.5 35.25 4 110
Area1 district3 325 9 14.25 36.425 5.5 35.25 4 110
Area1 district4 514 9 14.25 36.425 5.5 35.25 4 110
Area1 district5 311 9 14.25 36.425 5.5 35.25 4 110
Area2 district6 265 21.75 31.5 30.5 7.75 52.25 7 146
Area2 district7 356 21.75 31.5 30.5 7.75 52.25 7 146
Area2 district8 327 21.75 31.5 30.5 7.75 52.25 7 146
Area2 district9 176 21.75 31.5 30.5 7.75 52.25 7 146
Area2 district10 208 21.75 31.5 30.5 7.75 52.25 7 146
Area3 district11 454 9 16.75 114.5 33.5 80 3.25 126
Area3 district12 235 9 16.75 114.5 33.5 80 3.25 126
Area3 district13 217 9 16.75 114.5 33.5 80 3.25 126
Area3 district14 257 9 16.75 114.5 33.5 80 3.25 126
I have tried with a linear model but I dont think this could be the best...I get a result just for the first two variables and than only NA, like that:
> summary(gg)
Call:
glm(formula = Observed ~ summerMin + summerMax + summerRain +
summerSun + summerHumidity + N.Species, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-176.73 -65.71 -11.73 70.83 179.75
Coefficients: (5 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 489.71 80.12 6.112 1.35e-06 ***
summerMin 44.90 22.59 1.988 0.0567 .
summerMax -36.99 17.65 -2.096 0.0453 *
summerRain NA NA NA NA
summerSun NA NA NA NA
summerHumidity NA NA NA NA
N.SpeciesArea2 NA NA NA NA
N.SpeciesArea3 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 9018.12)
Null deviance: 297330 on 30 degrees of freedom
Residual deviance: 252507 on 28 degrees of freedom
AIC: 375.14
Number of Fisher Scoring iterations: 2
Any suggestion? which analysis could be the best to perform for my aim? thanks a lot for your help.
Andrea
Good morning / evening Andrea. What is in your column named 'Observed'? Also, can you paste the commands that you have used prior to
summary(gg)
?Obtaining NA values may be a result of there being too few levels for those factor levels returning NA.
Good Morning/Evening Kevin, thanks a lot for your reply. Sorry I made a mistake pasting it. Observed is N.Species and the command line i used before summary gg is that:
I see - thank you for clarifying that. When you put all of the variables together in the model in that way, the amount of information contributing to the model by each variable is 'adjusted' by every other variable present in the model.
I believe that you should be running the modelling this way, by first testing each predictor independently. I also believe that you should be using
lm()
and notglm()
, as your outcome variable is continuous:Upon observing these, pick the predictors that are statistically significant. You can also justify the inclusion of ones that show 'trend' toward statistical significance, i.e., those that have p values between 0.05 and 0.1. Build a 'final' model with these:
With the final model, you should test the model assumptions via various metrics that I have listed here:
One of the most important is assessing the predictive potential via r^2 (r-squared) shrinkage. This assesses the correlation of the model-predicted values versus the actual values, via 3-fold cross validation:
This model is not reliable because the r^2 is sshrunk by too much - it should remain consistent. However, it makes sense that it's a poor model because only created it on a small amount of data (i.e. the data that you pasted).
We can nevertheless then also plot this:
upload image
Really thanks thanks thanks a lot Kevin! your explaination have been really very clear!
following your example I get a significant r2.Shrunk (0.02) but a r2 raw of 0.30 .... what does it means? how can I interpret the fact that is significant but very low r2 value?
Hello, good to hear that.
You do not, necessarily, have to achieve a high value for r2. If you have statistically significant predictors, then it says that there is still a relationship between your predictors and your outcome variable (N.Species), irrespective of a high or low r2. The worrying, thing, however, is the large difference between your raw r2 and your shrunk r2. The model would be more reliable if these were less different.
Can you paste some values for your 'Estimate' (beta coefficient) and the p value from each predictor?