Question

LInear Models with HLA-B alleles in R

0

Entering edit mode

2.5 years ago

schagas • 0

Hello,

I think this question might be more a statistics understanding than a code troubleshooting. Still, I am posting here because there are more R-users with a better knowledge of biological applications.

I want to understand some differences to represent the allele presence/absence with lm() function in R. As a dependent variable, I have ICU_days period, which I am trying to relate to HLA-B alleles. So, as an independent variable, I have HLA-B alleles adjusted for Age.

The first representation would be a dataset containing all alleles present in my sample as lm_designmatrix2:

id ICU_days Age allele
1         49  74   B*35
2         45  73   B*07
3         20  63   B*07
4         52  55   B*08
5         29  46   B*07
6         66  77   B*38
7         44  76   B*08
8         76  65   B*08
9         47  34   B*14
10        25  58   B*13
11        30  57   B*35
12         7  80   B*07
13        30  27   B*15
14        28  79   B*40
15         2  68   B*07
16        20  56   B*07
17        20  85   B*07
18        24  58   B*35
19        31  47   B*44
20        11  58   B*07

The linear model:

summary(lm(ICU_days ~ allele + Age, data = lm_designmatrix2))

Call:
lm(formula = ICU_days ~ allele + Age, data = lm_designmatrix2)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.627 -11.032  -2.047   7.942  63.348 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)  18.37896    7.75141   2.371   0.0190 * 
alleleB*08    7.79160    6.49281   1.200   0.2321   
alleleB*13   20.30319    7.62279   2.663   0.0086 **
alleleB*14   10.58582    7.56816   1.399   0.1640   
alleleB*15   13.78044    6.81690   2.022   0.0450 * 
alleleB*18    7.68169    7.56604   1.015   0.3116   
alleleB*27  -17.64699   13.32670  -1.324   0.1875   
alleleB*35    8.23656    5.62620   1.464   0.1453   
alleleB*37   35.60011   18.31386   1.944   0.0538 . 
alleleB*38   25.62458    9.83688   2.605   0.0101 * 
[...] etc

Age           0.05827    0.10997   0.530   0.5970   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.8 on 147 degrees of freedom
Multiple R-squared:  0.1607,    Adjusted R-squared:  0.02362 
F-statistic: 1.172 on 24 and 147 DF,  p-value: 0.2766

However, even though I know that is a dumb question, I didn't realize the difference between the above model with a dataset with dummy variables for each allele:

ICU_days   Age `B*07` `B*08` `B*13` `B*14`
      <int> <int>  <int>  <int>  <int>  <int>
 1       30    27      0      0      0      0
 2       47    34      0      0      0      1
 3       21    35      0      0      0      0
 4       22    37      1      0      0      0
 5        3    39      0      0      0      0
 6       14    39      1      1      0      0
 7       49    39      0      0      0      1
 8       84    39      1      0      1      0
 9       30    40      0      0      0      0
10       32    40      0      0      0      0

And it performs linear modeling for each allele separately, like this:


summary(lm(ICU_days ~ Age + `B*13`, data = lm_designmatrix))

Call:
lm(formula = ICU_days ~ Age + `B*13`, data = lm_designmatrix)

Residuals:
   Min     1Q Median     3Q    Max 
-27.29 -14.72  -3.76  10.74  47.00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) 22.70441    8.70438   2.608   0.0108 *
Age          0.09689    0.14743   0.657   0.5129  
`B*13`      14.02777    6.68684   2.098   0.0390 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.8 on 83 degrees of freedom
Multiple R-squared:  0.05161,   Adjusted R-squared:  0.02876 
F-statistic: 2.258 on 2 and 83 DF,  p-value: 0.1109

Could anyone explain how to interpret these different representations?

R HLA • 974 views

ADD COMMENT • link updated 2.5 years ago by LChart 4.8k • written 2.5 years ago by schagas • 0

score 2 · Answer 1 · 2022-08-08

2

Entering edit mode

2.5 years ago

LChart 4.8k

Your first approach can be visualized as this picture:

Multi-linear regression

With the "(Intercept)" coefficient, in your case, corresponding to the mean of Group 1.

The second approach can be visualized as this sequence of pictures:

Marginal regression 1

Marginal regression 2

Marginal regression 3

where your "(Intercept)" corresponds to the mean value of "not_group". Note that if you are interested in the effect of an HLA allele compared to other HLA alleles, the second approach can be problematic, as the variance of "other group" is high due to mixing HLA alleles together into one group.

For instance, based on the last plot (and the corresponding regression), you may (incorrectly) infer that Group 3 is "not all that different" from other HLA groups, when in fact the first figure provides a clear readout that it is visually different from groups 1,2,4 and 7.

ADD COMMENT • link 2.5 years ago by LChart 4.8k

0

Entering edit mode

Thanks LChart ! So would the second approach be a loss in the "accuracy" of the estimates?

And Anyway, the two ways are not wrong, since depending on the statistical question can I choose one of them?

ADD REPLY • link 2.5 years ago by schagas • 0

1

Entering edit mode

The estimates are accurate in both cases, but as you say, the questions are different. It is fairly common for contrasts of the second form to be used to make statements about contrasts of the first form -- such usage is inappropriate.

ADD REPLY • link 2.5 years ago by LChart 4.8k