Hello,
I think this question might be more a statistics understanding than a code troubleshooting. Still, I am posting here because there are more R-users with a better knowledge of biological applications.
I want to understand some differences to represent the allele presence/absence with lm()
function in R. As a dependent variable, I have ICU_days
period, which I am trying to relate to HLA-B alleles. So, as an independent variable, I have HLA-B alleles
adjusted for Age
.
The first representation would be a dataset containing all alleles present in my sample as lm_designmatrix2
:
id ICU_days Age allele
1 49 74 B*35
2 45 73 B*07
3 20 63 B*07
4 52 55 B*08
5 29 46 B*07
6 66 77 B*38
7 44 76 B*08
8 76 65 B*08
9 47 34 B*14
10 25 58 B*13
11 30 57 B*35
12 7 80 B*07
13 30 27 B*15
14 28 79 B*40
15 2 68 B*07
16 20 56 B*07
17 20 85 B*07
18 24 58 B*35
19 31 47 B*44
20 11 58 B*07
The linear model:
summary(lm(ICU_days ~ allele + Age, data = lm_designmatrix2))
Call:
lm(formula = ICU_days ~ allele + Age, data = lm_designmatrix2)
Residuals:
Min 1Q Median 3Q Max
-29.627 -11.032 -2.047 7.942 63.348
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.37896 7.75141 2.371 0.0190 *
alleleB*08 7.79160 6.49281 1.200 0.2321
alleleB*13 20.30319 7.62279 2.663 0.0086 **
alleleB*14 10.58582 7.56816 1.399 0.1640
alleleB*15 13.78044 6.81690 2.022 0.0450 *
alleleB*18 7.68169 7.56604 1.015 0.3116
alleleB*27 -17.64699 13.32670 -1.324 0.1875
alleleB*35 8.23656 5.62620 1.464 0.1453
alleleB*37 35.60011 18.31386 1.944 0.0538 .
alleleB*38 25.62458 9.83688 2.605 0.0101 *
[...] etc
Age 0.05827 0.10997 0.530 0.5970
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 17.8 on 147 degrees of freedom
Multiple R-squared: 0.1607, Adjusted R-squared: 0.02362
F-statistic: 1.172 on 24 and 147 DF, p-value: 0.2766
However, even though I know that is a dumb question, I didn't realize the difference between the above model with a dataset with dummy variables for each allele:
ICU_days Age `B*07` `B*08` `B*13` `B*14`
<int> <int> <int> <int> <int> <int>
1 30 27 0 0 0 0
2 47 34 0 0 0 1
3 21 35 0 0 0 0
4 22 37 1 0 0 0
5 3 39 0 0 0 0
6 14 39 1 1 0 0
7 49 39 0 0 0 1
8 84 39 1 0 1 0
9 30 40 0 0 0 0
10 32 40 0 0 0 0
And it performs linear modeling for each allele separately, like this:
summary(lm(ICU_days ~ Age + `B*13`, data = lm_designmatrix))
Call:
lm(formula = ICU_days ~ Age + `B*13`, data = lm_designmatrix)
Residuals:
Min 1Q Median 3Q Max
-27.29 -14.72 -3.76 10.74 47.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.70441 8.70438 2.608 0.0108 *
Age 0.09689 0.14743 0.657 0.5129
`B*13` 14.02777 6.68684 2.098 0.0390 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 17.8 on 83 degrees of freedom
Multiple R-squared: 0.05161, Adjusted R-squared: 0.02876
F-statistic: 2.258 on 2 and 83 DF, p-value: 0.1109
Could anyone explain how to interpret these different representations?
Thanks LChart ! So would the second approach be a loss in the "accuracy" of the estimates?
And Anyway, the two ways are not wrong, since depending on the statistical question can I choose one of them?
The estimates are accurate in both cases, but as you say, the questions are different. It is fairly common for contrasts of the second form to be used to make statements about contrasts of the first form -- such usage is inappropriate.