Entering edit mode
8.2 years ago
rednalf
▴
90
I have a data frame that looks like this:
ID SEX PHENO HLA1 HLA2 HLA3 KIR1 KIR2 Output PC1
A1 1 1 2 1 1 1 1 4.643453 -0.0062
A2 2 1 1 1 1 1 NA 4.954243 -0.0207
I am interested in analysing the effect of the interaction between the HLAs and the KIRs, depending on several covariates, using R.
In the data set, I have the following variables:
- ID: individual IDs
- SEX: categorical variable having two categories (male and female) = covariate
- HLA (1 to 3): observed HLA phenotype = categorical variable (NA: missing; 1: absent; 2: present) =
- KIR (1 to 2): observed KIR phenotype = categorical variable (NA: missing; 1: absent; 2: present)
- Output: continous quantitative variable
- PC1: continous quantitative variable = covariate
Per KIR/HLA interaction, I was thinking in creating the interaction model using a linear regression model with interaction as follows:
lm(Output ~ HLA + KIR + HLA*KIR + PC1 + SEX)
...where PC1 and SEX are the covariates.
Does someone know if this is a good model since I have both categorical and quantitative variables in my dataset? Should I use another model?
Thank you for your help.
HLA*KIR
is the same asHLA + KIR + HLA:KIR
, so you have a bit of redundancy. Aside from that I don't see anything obviously wrong.Standard practise is to encode your categorical variates as a numerical value (a dummy / indicator variable). R's model formulas will convert categorical variables (ie, factors) into dummies automatically, unfortunately you've encoded HLA (and KIR) in a way that won't permit that. For your mdel formula to work, HLA would have to be encoded as a factor in a single column, presumably this would have 8 levels (None, HLA1, HLA2, HLA3, HLA12, ..., HLA123), although I can't quite follow what the difference between 'missing' and 'absent' is in your description.
The model you have proposed is very complicated, and I would strongly urge you to compare the results against smaller nested models
Thank you for your answer. In the example above, missing (NA) means that no phenotype data is available, while absent (1) means that a certain phenotype is known to be absent for the individual (vs. present). If I change the coding from 1/2 to 0/1, 0 being absent and 1 present can R then convert the categorical variables into dummies automatically?
Not quite. Modifying the dataframe directly wouldn't include the HLA/KIR interaction term and if you want to define your own design matrix, you should use lm.fit() instead of lm().