Question

cell type adjustment - how many cell types to include?

14 months ago

kaybelter • 0

Hi all,

I have EWAS data from saliva samples (pediatric, a variety of ages), and I've estimated cell type proportions with the Houseman method, using both {ewastools} and {methylclock} (which uses {meffil}).

How many of the resulting cell types should I use as adjustment variables? I don't see this discussed, even in articles on cell type adjustment in EWAS such as "Accounting for cellular heterogeneity is critical in epigenome-wide association studies", so I assume the answer is well known or obvious, but I'm not sure how to make this decision. Several cell types are near zero, and there is close-to-linear dependence even with just a few cell types.

Thank you!

celltypes ewas • 902 views

updated 14 months ago by LChart 5.0k • written 14 months ago by kaybelter • 0

score 0 · Answer 1 · 2024-03-30

14 months ago

LChart 5.0k

Because you're using these cell types as covariates in a linear regression model and don't much care about estimating their coefficients precisely, multicollinearity between them isn't a concern: you can add all of them, or perhaps only those with a non-trivial estimate in >25% of samples. If you had more cell types than samples this would become a bigger issue, but I'm assuming you have far more samples than cell types.
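A minimal sketch of the ">25% of samples" rule, in plain Python. The cell type names, the example proportions, and the `min_prop` cutoff for what counts as "non-trivial" are all made-up assumptions for illustration; the thread does not fix a value for that cutoff.

```python
# Hypothetical Houseman-style estimates: one dict per sample, values are proportions.
props = [
    {"Epi": 0.90, "Neu": 0.06, "Mono": 0.03, "NK": 0.01, "Eos": 0.00},
    {"Epi": 0.85, "Neu": 0.10, "Mono": 0.00, "NK": 0.04, "Eos": 0.01},
    {"Epi": 0.95, "Neu": 0.04, "Mono": 0.00, "NK": 0.01, "Eos": 0.00},
    {"Epi": 0.80, "Neu": 0.15, "Mono": 0.00, "NK": 0.05, "Eos": 0.00},
]


def keep_cell_types(props, min_prop=0.01, min_frac=0.25):
    """Return the cell types whose estimate exceeds min_prop
    in more than min_frac of the samples."""
    n_samples = len(props)
    kept = []
    for cell_type in props[0]:
        n_above = sum(1 for sample in props if sample[cell_type] > min_prop)
        if n_above / n_samples > min_frac:
            kept.append(cell_type)
    return kept


# min_prop=0.02 is an arbitrary choice here, not a recommendation.
covariates = keep_cell_types(props, min_prop=0.02)
print(covariates)
```

If the estimates sum to 1 in every sample, it is common to leave one retained cell type out of the model as well, since the full set is then collinear with the intercept.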

14 months ago by LChart 5.0k

Thanks so much for your reply. That definitely makes some sense to me, but I do end up with convergence issues when I include all cell types. Do you have a feeling for what non-trivial would mean here? My samples are primarily buccal. For the other cell types, the proportion reached in >25% of samples varies from 0.5% to 8%. Do you think 5% is non-trivial? 1%? Something much smaller?

Thank you!!

14 months ago by kaybelter • 0

The very rare cell types are unlikely to contribute substantially to the methylation signal, so you can probably drop the lowest-abundance cell types until the total abundance dropped reaches ~5%.
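One way to read the "drop until ~5%" rule, sketched in plain Python. It assumes "abundance" means the mean proportion across samples, which is an interpretation and not something stated above; the cell type names and values are hypothetical.

```python
def drop_rare_cell_types(mean_abundance, max_dropped=0.05):
    """Drop the rarest cell types for as long as the cumulative
    abundance dropped stays at or below max_dropped."""
    dropped = []
    total_dropped = 0.0
    # Walk through cell types from rarest to most abundant.
    for cell_type, abundance in sorted(mean_abundance.items(), key=lambda kv: kv[1]):
        if total_dropped + abundance > max_dropped:
            break
        dropped.append(cell_type)
        total_dropped += abundance
    kept = [ct for ct in mean_abundance if ct not in dropped]
    return kept, dropped


# Hypothetical mean proportions across samples (they sum to 1).
mean_abundance = {"Epi": 0.86, "Neu": 0.08, "NK": 0.04, "Mono": 0.015, "Eos": 0.005}
kept, dropped = drop_rare_cell_types(mean_abundance)
print(kept, dropped)
```

Here the two rarest types together account for 2%, and dropping the next one would push the total past 5%, so the loop stops there.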

Alternatively: this is a linear model. Multicollinearity between these covariates indicates that the proportion matrix is (nearly) low rank. You can linearly transform your cell proportions into a smaller full-rank matrix without sacrificing the ability to correct for changes in cell type abundance. The tradeoff is that the coefficients can no longer be interpreted as corresponding to any individual cell type, only to linear combinations of them.

The easiest way to do this is to run PCA on the proportion matrix and use the per-sample scores (the samples' coordinates on the principal components) for all components whose eigenvalues are not effectively zero (say >0.1).
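A rough sketch of the PCA route with NumPy, on simulated proportions. Two choices here are assumptions and not part of the suggestion above: the columns are standardized before PCA, so the eigenvalues are on the correlation scale (on raw proportions, whose variances are small, a cutoff of 0.1 would be very strict), and `cell_type_pcs` is a hypothetical helper name.

```python
import numpy as np


def cell_type_pcs(props, min_eigenvalue=0.1):
    """PCA on a samples x cell-types proportion matrix.

    Returns (scores, eigenvalues) for the components whose
    eigenvalue exceeds min_eigenvalue.
    """
    X = np.asarray(props, dtype=float)
    X = X[:, X.std(axis=0, ddof=1) > 0]  # constant columns carry no information
    # Assumption: standardize columns so eigenvalues are on the correlation scale.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    eigvals = s**2 / (Z.shape[0] - 1)
    keep = eigvals > min_eigenvalue
    scores = U[:, keep] * s[keep]  # per-sample coordinates on the kept PCs
    return scores, eigvals[keep]


# Simulated proportions for 50 samples and 5 cell types; each row sums to 1.
rng = np.random.default_rng(0)
props = rng.dirichlet([20, 3, 1, 1, 0.5], size=50)

scores, eigvals = cell_type_pcs(props)
print(scores.shape, np.round(eigvals, 2))
```

The columns of `scores` would then replace the individual cell type proportions as covariates in the EWAS model. Because the simulated rows sum to 1, at least one eigenvalue is numerically zero, so at most four components survive the cutoff.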

14 months ago by LChart 5.0k