One-way ANOVA in R for many observations
2
3
Entering edit mode
6.8 years ago
darzenis ▴ 60

Hello,

I am trying to do one-way ANOVA in R to check for significant variations in biochemical concentrations between treatment groups.

This is how my data table is set up:

Treatment     Biochem_1     Biochem_2
A                   2.33                0.42
A                   3.21                0.56
B                   1.21                1.34
B                   2.11                0.99

I am able to run a simple code for one of the biochemicals (collumns):

biochem_1 <- aov(biochem_1 ~ Treatment, data = data_name)

And then get a summary of stat results:

summary(biochem_1)

The problem is that I have over a hundred biochemicals (collumns) in my data set and I do not want to run the same code and over again by manually changing the collumn name.

How could I write my code to run ANOVA for all the biochemicals?

Thank you!

R • 14k views
ADD COMMENT
12
Entering edit mode
6.8 years ago

Edit: 3rd December 2018

Edit: 27th December 2020

The original solution using a for loop is below. Here is a simpler / quicker solution using lapply:

# create test data
df <- data.frame(
  Treatment=c("A","A","B","B"),
  Biochem_1=c(2.33, 3.21, 1.21, 2.11),
  Biochem_2=c(0.42, 0.56, 1.34, 0.99),
  Biochem_3=c(3,2,1,1))

df
  Treatment Biochem_1 Biochem_2 Biochem_3
1         A      2.33      0.42         3
2         A      3.21      0.56         2
3         B      1.21      1.34         1
4         B      2.11      0.99         1

set Treatment as factor

df$Treatment <- factor(df$Treatment, levels=c("A","B"))

store all formulae in a list

formulae <- lapply(colnames(df)[2:ncol(df)], function(x) as.formula(paste0(x, " ~ Treatment")))

go through list and run aov()

res <- lapply(formulae, function(x) summary(aov(x, data = df)))
names(res) <- format(formulae)
res
$`Biochem_1 ~ Treatment`
            Df Sum Sq Mean Sq F value Pr(>F)
Treatment    1 1.2321  1.2321   3.111   0.22
Residuals    2 0.7922  0.3961               

$`Biochem_2 ~ Treatment`
            Df Sum Sq Mean Sq F value Pr(>F)  
Treatment    1 0.4556  0.4556   12.82 0.0699 .
Residuals    2 0.0711  0.0355                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

$`Biochem_3 ~ Treatment`
            Df Sum Sq Mean Sq F value Pr(>F)  
Treatment    1   2.25    2.25       9 0.0955 .
Residuals    2   0.50    0.25                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

extract just p-values

p <- unlist(lapply(res, function(x) x[[1]]$"Pr(>F)"[1]))
p
Biochem_1 ~ Treatment Biochem_2 ~ Treatment Biochem_3 ~ Treatment 
           0.21983666            0.06989405            0.09546597

data.frame(
  Biochemical = sub(' ~ Treatment', '', names(p)),
  pvalue = p)
                      Biochemical     pvalue
Biochem_1 ~ Treatment   Biochem_1 0.21983666
Biochem_2 ~ Treatment   Biochem_2 0.06989405
Biochem_3 ~ Treatment   Biochem_3 0.09546597


----------------------------------------------------------------------------------

----------------------------------------------------------------------------------

Original answer:

Here you go, Sire:

data_name
  Treatment Biochem_1 Biochem_2
1         A      2.33      0.42
2         A      3.21      0.56
3         B      1.21      1.34
4         B      2.11      0.99

#Ensure that category 'A' is the reference level (does not make much difference here) 
data_name$Treatment <- factor(data_name$Treatment, levels=c("A","B"))

for (i in 2:ncol(data_name)) {
   formula <- as.formula(paste(colnames(data_name)[i], " ~ Treatment", sep=""))
   model <- aov(formula, data = data_name)

   cat("\n-----\n\n")
   cat(colnames(data_name)[i])
   cat("\n")
   print(summary(model))
}

-----

Biochem_1
            Df Sum Sq Mean Sq F value Pr(>F)
Treatment    1 1.2321  1.2321   3.111   0.22
Residuals    2 0.7922  0.3961               

-----

Biochem_2
            Df Sum Sq Mean Sq F value Pr(>F)  
Treatment    1 0.4556  0.4556   12.82 0.0699 .
Residuals    2 0.0710  0.0355                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

You can also wrap everything in sink() commands so that all output goes directly to a file:

sink("Output.txt")
  My Code
sink()
ADD COMMENT
0
Entering edit mode

Also take a look at my code here to see how you could run many hundreds of thousands of models via parallel processing: R functions edited for parallel processing (see the section entitled Any type of LM, GLM, clogit (here: clogit))

ADD REPLY
1
Entering edit mode

Thank you! This was very helpful!

ADD REPLY
0
Entering edit mode

@Kevin Blighe

Is it possible that I can export a table with two columns: one for the observation and one for its calculated p-value?

Thanks,

ADD REPLY
0
Entering edit mode

Yes, sorry, I have modified the procedure to do this. Start from 'go through list and run aov()'

ADD REPLY
1
Entering edit mode

Thank you so much, The script works perfectly.

Additionally, is it possible to have the pair-wise posthoc test p-value after ANOVA? It will be really helpful to have this statistical test data in the same result table. My apology for this additional question.

Thanks,

ADD REPLY
0
Entering edit mode

like this?

condition data             model      term      contrast null.value estimate conf.low conf.high adj.p.value
  <chr>     <list>           <list>     <chr>     <chr>         <dbl>    <dbl>    <dbl>     <dbl>       <dbl>
1 Biochem_1 <tibble [4 × 2]> <TukeyHSD> Treatment B-A               0   -1.11    -3.82      1.60       0.220 
2 Biochem_2 <tibble [4 × 2]> <TukeyHSD> Treatment B-A               0    0.675   -0.135     1.49       0.0698
3 Biochem_3 <tibble [4 × 2]> <TukeyHSD> Treatment B-A               0   -1.5     -3.65      0.649      0.0954
ADD REPLY
0
Entering edit mode

Hi cpad0112,

Thank you so much for your reply. I would like to have all data in one table like this one (suppose to have 4 variables instead of 2 as the above example):

          pvalue_anova value_constrast_A_vs_B pvalue_constrast_A_vs_C pvalue_constrast_A_vs_D
biochem_1            1                      2                       3                       4
biochem_2            8                      9                      10                      11
biochem_3           15                     16                      17                      18
          pvalue_constrast_B_vs_C pvalue_constrast_B_vs_D pvalue_constrast_C_vs_D
biochem_1                       5                       6                       7
biochem_2                      12                      13                      14
biochem_3                      19                      20                      21

Thanks,

ADD REPLY
0
Entering edit mode

I realize it's been many years but hopefully someone can still answer my question:

Is there any way I can extract the residuals instead of the p values when using the lapply method? I've tried every way I can think of but it just keeps replying "NULL"

Thank you!

ADD REPLY
1
Entering edit mode
6.8 years ago
brandontsinn ▴ 10

Here is a link to a software carpentry workshop on loops (and their faster alternatives) in R, that I think you might find to be useful.

ADD COMMENT

Login before adding your answer.

Traffic: 2374 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6