Question

How to fit climatic data with ecological index?

0

Entering edit mode

7.0 years ago

AQ7 ▴ 30

Goodmorning everyone,

I hope to be in the right place for this kind of statistical question. I'm working with an ecological index who give be the number of species within different district of the same area. At the same time i collected some climatical information about these 3 areas and I would like to know if there is a statistical way to detect possible influence of climate on biodivediversity level within this area. These are my data example:

Area    District    N.Species   summerMin   summerMax   summerRain  summerCloud summerHumidity  summerUV    summerSun
Area1   district1   421 9   14.25   36.425  5.5 35.25   4   110
Area1   district2   484 9   14.25   36.425  5.5 35.25   4   110
Area1   district3   325 9   14.25   36.425  5.5 35.25   4   110
Area1   district4   514 9   14.25   36.425  5.5 35.25   4   110
Area1   district5   311 9   14.25   36.425  5.5 35.25   4   110
Area2   district6   265 21.75   31.5    30.5    7.75    52.25   7   146
Area2   district7   356 21.75   31.5    30.5    7.75    52.25   7   146
Area2   district8   327 21.75   31.5    30.5    7.75    52.25   7   146
Area2   district9   176 21.75   31.5    30.5    7.75    52.25   7   146
Area2   district10  208 21.75   31.5    30.5    7.75    52.25   7   146
Area3   district11  454 9   16.75   114.5   33.5    80  3.25    126
Area3   district12  235 9   16.75   114.5   33.5    80  3.25    126
Area3   district13  217 9   16.75   114.5   33.5    80  3.25    126
Area3   district14  257 9   16.75   114.5   33.5    80  3.25    126

I have tried with a linear model but I dont think this could be the best...I get a result just for the first two variables and than only NA, like that:

> summary(gg)

Call:
glm(formula = Observed ~ summerMin + summerMax + summerRain + 
    summerSun + summerHumidity + N.Species, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-176.73   -65.71   -11.73    70.83   179.75  

Coefficients: (5 not defined because of singularities)
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      489.71      80.12   6.112 1.35e-06 ***
summerMin         44.90      22.59   1.988   0.0567 .  
summerMax        -36.99      17.65  -2.096   0.0453 *  
summerRain           NA         NA      NA       NA    
summerSun            NA         NA      NA       NA    
summerHumidity       NA         NA      NA       NA    
N.SpeciesArea2          NA         NA      NA       NA    
N.SpeciesArea3          NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 9018.12)

    Null deviance: 297330  on 30  degrees of freedom
Residual deviance: 252507  on 28  degrees of freedom
AIC: 375.14

Number of Fisher Scoring iterations: 2

Any suggestion? which analysis could be the best to perform for my aim? thanks a lot for your help.

Andrea

R linear models • 1.6k views

ADD COMMENT • link updated 7.0 years ago by Biostar 20 • written 7.0 years ago by AQ7 ▴ 30

0

Entering edit mode

Good morning / evening Andrea. What is in your column named 'Observed'? Also, can you paste the commands that you have used prior to summary(gg)?

Obtaining NA values may be a result of there being too few levels for those factor levels returning NA.

ADD REPLY • link 7.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Good Morning/Evening Kevin, thanks a lot for your reply. Sorry I made a mistake pasting it. Observed is N.Species and the command line i used before summary gg is that:

glm(formula = N.Species ~ summerMin + summerMax + summerRain + summerSun + summerHumidity, data = data)

ADD REPLY • link 7.0 years ago by AQ7 ▴ 30

1

Entering edit mode

I see - thank you for clarifying that. When you put all of the variables together in the model in that way, the amount of information contributing to the model by each variable is 'adjusted' by every other variable present in the model.

I believe that you should be running the modelling this way, by first testing each predictor independently. I also believe that you should be using lm() and not glm(), as your outcome variable is continuous:

ggMin <- lm(N.Species ~ summerMin, data=data)
ggMax <- lm(N.Species ~ summerMax, data=data)
ggRain <- lm(N.Species ~ summerRain, data=data)
ggSun <- lm(N.Species ~ summerSun, data=data)
ggHumidity <- lm(N.Species ~ summerHumidity, data=data)
summary(ggMin)
summary(ggMax)
summary(ggRain)
summary(ggSun)
summary(ggHumidity)

Upon observing these, pick the predictors that are statistically significant. You can also justify the inclusion of ones that show 'trend' toward statistical significance, i.e., those that have p values between 0.05 and 0.1. Build a 'final' model with these:

ggFinal <- lm(N.Species ~ summerMax + summerSun, data=data)

With the final model, you should test the model assumptions via various metrics that I have listed here:

A: Resources for gene signature creation

One of the most important is assessing the predictive potential via r^2 (r-squared) shrinkage. This assesses the correlation of the model-predicted values versus the actual values, via 3-fold cross validation:

#Assessing R2 shrinkage using 3-Fold Cross-Validation 
library(bootstrap)
theta.fit <- function(x,y){lsfit(x,y)}
theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}
xvals <- as.matrix(data[,which(colnames(data) %in% names(summary(ggFinal)$coefficients[,1]))])
yvals <- as.matrix(data[,c("N.Species")]) 
results <- crossval(xvals, yvals, theta.fit, theta.predict, ngroup=3)
#Raw R2 
r2 <- cor(yvals, ggFinal$fitted.values)**2
#Cross-validated R2
r2.Shrunk <- cor(yvals, results$cv.fit)**2
paste(r2.Shrunk)
paste(r2)

This model is not reliable because the r^2 is sshrunk by too much - it should remain consistent. However, it makes sense that it's a poor model because only created it on a small amount of data (i.e. the data that you pasted).

We can nevertheless then also plot this:

data$pred = predict(ggFinal)

library(ggplot2)
p1 <- ggplot(data, aes(x=N.Species, y=pred)) +
    geom_point() +

    #xlim(0,56) + ylim(0,56) +

    geom_smooth(method="lm", formula=y~x) +

    stat_smooth(method="lm", fullrange=TRUE, colour="royalblue") +

    #Set the size of the plotting window
    theme_bw(base_size=24) +

    #Modify various aspects of the plot text and legend
    theme(
        legend.position="none",
        legend.background=element_rect(),
        plot.title=element_text(angle=0, size=12, face="bold", vjust=1),

        axis.text.x=element_text(angle=0, size=12, face="bold", vjust=0.5),
        axis.text.y=element_text(angle=0, size=12, face="bold", vjust=0.5),
        axis.title=element_text(size=12),

        #Legend
        legend.key=element_blank(),     #removes the border
        legend.key.size=unit(1, "cm"),      #Sets overall area/size of the legend
        legend.text=element_text(size=12),  #Text size
        title=element_text(size=12)) +      #Title text size

        #Change the size of the icons/symbols in the legend
        guides(colour=guide_legend(override.aes=list(size=2.5))) +

    #Set x- and y-axes labels
    xlab("Number of species") +
    ylab("Model-predicted number of species") +

    geom_label(x=350, y=450, size=5, family="sans", parse=TRUE, label=paste("~r^{2}~", r2)) +
    geom_label(x=350, y=465, size=5, family="sans", parse=TRUE, label=paste("~Cross~validated~r^{2}~", r2.Shrunk)) +

    ggtitle("Title")
plot(p1)

upload image

ADD REPLY • link 7.0 years ago by Kevin Blighe 89k

1

Entering edit mode

Really thanks thanks thanks a lot Kevin! your explaination have been really very clear!

ADD REPLY • link 7.0 years ago by AQ7 ▴ 30

0

Entering edit mode

following your example I get a significant r2.Shrunk (0.02) but a r2 raw of 0.30 .... what does it means? how can I interpret the fact that is significant but very low r2 value?

ADD REPLY • link 7.0 years ago by AQ7 ▴ 30

0

Entering edit mode

Hello, good to hear that.

You do not, necessarily, have to achieve a high value for r2. If you have statistically significant predictors, then it says that there is still a relationship between your predictors and your outcome variable (N.Species), irrespective of a high or low r2. The worrying, thing, however, is the large difference between your raw r2 and your shrunk r2. The model would be more reliable if these were less different.

Can you paste some values for your 'Estimate' (beta coefficient) and the p value from each predictor?

ADD REPLY • link 7.0 years ago by Kevin Blighe 89k