R: Linear model with interaction: quantitative variable that depends on both quantitative and qualitative variables.
0
1
Entering edit mode
8.3 years ago
rednalf ▴ 90

I have a data frame that looks like this:

ID  SEX  PHENO  HLA1  HLA2  HLA3  KIR1  KIR2      Output        PC1
A1    1      1     2     1     1     1     1    4.643453    -0.0062   
A2    2      1     1     1     1     1    NA    4.954243    -0.0207

I am interested in analysing the effect of the interaction between the HLAs and the KIRs, depending on several covariates, using R.

In the data set, I have the following variables:

  • ID: individual IDs
  • SEX: categorical variable having two categories (male and female) = covariate
  • HLA (1 to 3): observed HLA phenotype = categorical variable (NA: missing; 1: absent; 2: present) =
  • KIR (1 to 2): observed KIR phenotype = categorical variable (NA: missing; 1: absent; 2: present)
  • Output: continous quantitative variable
  • PC1: continous quantitative variable = covariate

Per KIR/HLA interaction, I was thinking in creating the interaction model using a linear regression model with interaction as follows:

lm(Output ~ HLA + KIR + HLA*KIR + PC1 + SEX)

...where PC1 and SEX are the covariates.

Does someone know if this is a good model since I have both categorical and quantitative variables in my dataset? Should I use another model?

Thank you for your help.

SNP Statistics LinearModel Interaction Variables • 4.0k views
ADD COMMENT
0
Entering edit mode

HLA*KIR is the same as HLA + KIR + HLA:KIR, so you have a bit of redundancy. Aside from that I don't see anything obviously wrong.

ADD REPLY
0
Entering edit mode

Standard practise is to encode your categorical variates as a numerical value (a dummy / indicator variable). R's model formulas will convert categorical variables (ie, factors) into dummies automatically, unfortunately you've encoded HLA (and KIR) in a way that won't permit that. For your mdel formula to work, HLA would have to be encoded as a factor in a single column, presumably this would have 8 levels (None, HLA1, HLA2, HLA3, HLA12, ..., HLA123), although I can't quite follow what the difference between 'missing' and 'absent' is in your description.

The model you have proposed is very complicated, and I would strongly urge you to compare the results against smaller nested models

ADD REPLY
0
Entering edit mode

Thank you for your answer. In the example above, missing (NA) means that no phenotype data is available, while absent (1) means that a certain phenotype is known to be absent for the individual (vs. present). If I change the coding from 1/2 to 0/1, 0 being absent and 1 present can R then convert the categorical variables into dummies automatically?

ADD REPLY
0
Entering edit mode

Not quite. Modifying the dataframe directly wouldn't include the HLA/KIR interaction term and if you want to define your own design matrix, you should use lm.fit() instead of lm().

ADD REPLY

Login before adding your answer.

Traffic: 1871 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6