Hello,
I’ve been studying causal inference recently, but I’m still unsure how to approach my analysis properly, so I would really appreciate your guidance. I’m working with the dataset described below and want to answer the following question:
Goal: For each individual, can we predict whether Treatment A or Treatment B would be more effective?
Dataset Summary:
N = 88 patients
Treatment assignment: A or B (binary)
Outcome: binary response (1 = favorable response, 0 = unfavorable)
Covariates:
A binary variable for the presence of a specific gene mutation
A continuous variable for the expression level of a specific gene
Questions:
Since this is a small dataset (N = 88), would it still make sense to split the data into training and test sets, as in conventional supervised learning workflows?
I am considering using causal_forest() from the grf package to estimate individual treatment effects (ITEs).
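After estimating the ITEs, defined here as the probability of a favorable response under Treatment A minus that under Treatment B, is it reasonable to decide: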
ITE > 0 => Prefer Treatment A
ITE < 0 => Prefer Treatment B
Is this interpretation valid and commonly used in practice?
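To make the rule concrete, here is a minimal Python sketch of what I have in mind. It assumes the ITE is the effect of A relative to B, so positive values favor A. The function names are hypothetical, and the numbers are placeholders for illustration only, not estimates from my data. Alongside the plain sign rule, I also sketch a more cautious variant that only recommends a treatment when an approximate 95% interval excludes zero, in case that is the more sensible approach at this sample size.

```python
def sign_rule(ite: float) -> str:
    """Plain rule from the question: the sign of the estimate decides."""
    if ite > 0:
        return "A"
    if ite < 0:
        return "B"
    return "either"


def cautious_rule(ite: float, se: float, z: float = 1.96) -> str:
    """Recommend only when the approximate interval ite +/- z * se excludes zero.

    ite: estimated effect of A relative to B (assumed sign convention).
    se:  standard error of that estimate (assumed to be available).
    """
    lower, upper = ite - z * se, ite + z * se
    if lower > 0:
        return "A"
    if upper < 0:
        return "B"
    return "uncertain"


# Placeholder (estimate, standard error) pairs, for illustration only.
estimates = [(0.30, 0.10), (-0.25, 0.08), (0.05, 0.20)]

plain = [sign_rule(ite) for ite, _ in estimates]
cautious = [cautious_rule(ite, se) for ite, se in estimates]

print(plain)     # ['A', 'B', 'A']
print(cautious)  # ['A', 'B', 'uncertain']
```

The third placeholder patient shows where the two rules disagree: the point estimate is positive, but the interval is wide enough to include zero.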
I’m aware that with such a small sample size, variance and overfitting could be major issues. If there are any recommendations regarding cross-validation strategies, feature regularization, or alternative models (e.g., T-Learner, S-Learner), I’d love to hear them.
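For reference, this is roughly how I understand a T-Learner: fit one outcome model per treatment arm, then take the difference of the two predicted response probabilities. The sketch below is in Python using only the standard library so that it is self-contained. The data are simulated (not my real data), and the L2-penalized logistic base learner, its hyperparameters, and all function names are my own assumptions for illustration.

```python
import math
import random


def predict_proba(w, x):
    """P(Y = 1 | x) under a logistic model with weights w = [intercept, w1, ...]."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    z = max(-30.0, min(30.0, z))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000, l2=0.1):
    """L2-penalized logistic regression fitted by batch gradient descent.

    The intercept is not penalized. Hyperparameters are arbitrary assumptions.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j in range(d):
                grad[j + 1] += err * xi[j]
        for j in range(d + 1):
            penalty = l2 * w[j] if j > 0 else 0.0
            w[j] -= lr * (grad[j] / n + penalty)
    return w


def t_learner(X, treatment, y):
    """Fit one outcome model per arm; return x -> P(Y=1 | A, x) - P(Y=1 | B, x)."""
    models = {}
    for arm in ("A", "B"):
        X_arm = [x for x, t in zip(X, treatment) if t == arm]
        y_arm = [yi for yi, t in zip(y, treatment) if t == arm]
        models[arm] = fit_logistic(X_arm, y_arm)
    return lambda x: predict_proba(models["A"], x) - predict_proba(models["B"], x)


# Simulated data only: 88 patients, a binary mutation flag, and a standardized
# expression level. By construction, A works well for mutation carriers and
# poorly for non-carriers, while B has a 50% response rate for everyone.
random.seed(0)
X, treatment, y = [], [], []
for _ in range(88):
    mutation = random.randint(0, 1)
    expression = random.gauss(0.0, 1.0)
    arm = random.choice(["A", "B"])
    p = (0.9 if mutation else 0.1) if arm == "A" else 0.5
    X.append([mutation, expression])
    treatment.append(arm)
    y.append(1 if random.random() < p else 0)

effect = t_learner(X, treatment, y)
effect_carrier = effect([1, 0.0])     # carrier at average expression
effect_noncarrier = effect([0, 0.0])  # non-carrier at average expression
print(round(effect_carrier, 2), round(effect_noncarrier, 2))
```

Each arm's model here is fitted on only about half of the 88 patients, which is part of why I am worried about variance with this kind of approach.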
Thank you very much in advance for your help!