Question

R: Linear model with interaction: quantitative variable that depends on both quantitative and qualitative variables.

1

Entering edit mode

8.3 years ago

rednalf ▴ 90

I have a data frame that looks like this:

ID  SEX  PHENO  HLA1  HLA2  HLA3  KIR1  KIR2      Output        PC1
A1    1      1     2     1     1     1     1    4.643453    -0.0062   
A2    2      1     1     1     1     1    NA    4.954243    -0.0207

I am interested in analysing the effect of the interaction between the HLAs and the KIRs, depending on several covariates, using R.

In the data set, I have the following variables:

ID: individual IDs
SEX: categorical variable having two categories (male and female) = covariate
HLA (1 to 3): observed HLA phenotype = categorical variable (NA: missing; 1: absent; 2: present) =
KIR (1 to 2): observed KIR phenotype = categorical variable (NA: missing; 1: absent; 2: present)
Output: continous quantitative variable
PC1: continous quantitative variable = covariate

Per KIR/HLA interaction, I was thinking in creating the interaction model using a linear regression model with interaction as follows:

lm(Output ~ HLA + KIR + HLA*KIR + PC1 + SEX)

...where PC1 and SEX are the covariates.

Does someone know if this is a good model since I have both categorical and quantitative variables in my dataset? Should I use another model?

Thank you for your help.

SNP Statistics LinearModel Interaction Variables • 4.0k views

ADD COMMENT • link 8.3 years ago by rednalf ▴ 90

0

Entering edit mode

HLA*KIR is the same as HLA + KIR + HLA:KIR, so you have a bit of redundancy. Aside from that I don't see anything obviously wrong.

ADD REPLY • link 8.3 years ago by Devon Ryan 105k

0

Entering edit mode

Standard practise is to encode your categorical variates as a numerical value (a dummy / indicator variable). R's model formulas will convert categorical variables (ie, factors) into dummies automatically, unfortunately you've encoded HLA (and KIR) in a way that won't permit that. For your mdel formula to work, HLA would have to be encoded as a factor in a single column, presumably this would have 8 levels (None, HLA1, HLA2, HLA3, HLA12, ..., HLA123), although I can't quite follow what the difference between 'missing' and 'absent' is in your description.

The model you have proposed is very complicated, and I would strongly urge you to compare the results against smaller nested models

ADD REPLY • link 8.3 years ago by russhh 5.7k

0

Entering edit mode

Thank you for your answer. In the example above, missing (NA) means that no phenotype data is available, while absent (1) means that a certain phenotype is known to be absent for the individual (vs. present). If I change the coding from 1/2 to 0/1, 0 being absent and 1 present can R then convert the categorical variables into dummies automatically?

ADD REPLY • link 8.3 years ago by rednalf ▴ 90

0

Entering edit mode

Not quite. Modifying the dataframe directly wouldn't include the HLA/KIR interaction term and if you want to define your own design matrix, you should use lm.fit() instead of lm().

ADD REPLY • link 8.3 years ago by russhh 5.7k