I am interested in finding differentially expressed genes between conditions control/treated. But, I need to control for family-level variation in my GLM model.
My initial approach was ~family+condition. Based on my PCA, I think my family structure is too heterogenous.
I think using just the family as categorical fixed effect may be too simplistic. So, I am considering using the PCA PCs as covariates in my model. ~pc1+pc2+pc3+condition. Is this a good approach? Any examples of something like this?
I'm not seeing the heterogeneity in family structure that you are alluding to. All the 20s cluster together, likewise the 30s, the 27s and the 21s. You might end up losing some genuinely differentially expressed genes if you include PCs as covariates, what with the controls being slightly downandleft of the cases in your PCA plot
Yes, they do cluster fairly well by family, hence the need for family correction. By 'heterogeneity', I mean to say that the individuals are quite spread out within families and not exactly forming a small tight cluster. This is the reason why I was wondering if a PCA covariate approach might be better than a fixed family. I am wondering about the positives and negatives of both approaches.
You would only have to adjust for family structure through PC covariates if the families were blatantly segregated from each other on the PCA bi-plot (like we would adjust for ethnicity through PC covariates in a genetics study). I don't see the need to do anything here - sorry. The dataset looks good to go without any adjustments.
The one thing that you do not want to do is adjust for something for which there is no confounding.
The variation on PC1 is also low (29%).
Edit - if you had a very good clinical reason for including family as a covariate, then just do that as you were originally planning to do. Like, if one family lives at an altitude of 5,000 metres versus another who live at sea-level, or if one family family lives in a polluted city like New York versus another who live in the beautiful countryside in Ireland.
@Kevin Thanks for the input. Yes, I did get the idea from GWAS. I am a bit confused by some part of your comment.
'...there is no confounding.'
In this dataset, I am looking for DEGs between case and control. I am not interested in families. So I would say family is a confounding effect. The family structure is killing all signal for what I am actually interested in: case/control.
Yes, I was aware of the primary focus on case/control.
How do you gauge 'killing all signal' - this PCA bi-plot? If the families are segregated, it's primarily only along PC2 and the variance there is just 9.8%. This indicates, in fact, that only a small proportion of your dataset shows family-related differences.
I don't believe in fact that there is any answer for your question. It will only be resolved by trying various things, including both adjusting and not adjusting for family.
Out of curiosity, too, why are you building a GLM when DESeq2 already provides a differential expression test?
If you build a standard GLM and apply it to normalised RNA-seq counts, the results will be biased because the default setting of glm() in R assume a binomial distributon, whereas RNA-seq counts follow the negative binomial distribution.
I would have included family in the DESeq2 design model
Hi Kevin,
I understand your point that family wide variations are not that big but they definitely seem to bigger than case vs control and they seem to cluster more than families as such for patient 20 and and 30.
What you could use is the R package sva. It can estimate the number and the values of your covariates and ultimately whether it's worth to correct for it or not. It does a better job than PCA, as shown in their paper, avoiding the question of how many principal components you would include in the model.
What i see from your PCA: family has a clear effect, and is responsible for the major part in variance explained in your PCA. Condition seems to have an effect, but dependent on family: I see families 20 and 30 sub-clustering on condition, but not families 21 and 27.
I don't think using the PCs is a good idea, unless you have reason to believe there are batch effects or other kind of artifacts. Using the PCs could overfit the data, without providing a good biological explanation. Also, it is well documented batch correction methods tend to over-correct.
By the way, really nice PCA plot, did you do it in R?
As you mentioned, the covariate family has a strong effect on the data. If you were to analyse the data without correcting for it (~condition) I guess you would find very few significant results. Since the stated aim is to find treatment differences common to all families the way to go is to correct for it (~family+condition), in my opinion. Even better, a continuous value like principal components can take into account things like the fact that 27 and 21 are rather similar and sva has shown to be more accurate than PCA.
A whole different question would be to look for family specific effects.
I would not say "strong". The percent variation explained by PC1 is relatively low compared to, for example, a batch effect, which could be anything up to 90% variation. This is a tricky case because you could run the risk of over-correcting for a confounding variable that only affects a small fraction of your data.
Edit: I would look at the component loadings for both PC1 and PC2 to see which genes, exactly, are driving that source of variation. Then you may better understand what exactly is going on. It may enlighten you if you did this, to be honest.
Edit 2: Remember not to be fooled by a PCA bi-plot. PCA will always segregate your samples but it is the percent variation that is important. If I did PCA on just 2 samples (monozygotic twins), PCA would still segregate their expression profiles, with 1 sample appearing on the left and the other on the right. The percent variation explained by PC1 may only be 1%, though, which is meaningless...
I'm not seeing the heterogeneity in family structure that you are alluding to. All the 20s cluster together, likewise the 30s, the 27s and the 21s. You might end up losing some genuinely differentially expressed genes if you include PCs as covariates, what with the controls being slightly downandleft of the cases in your PCA plot
Yes, they do cluster fairly well by family, hence the need for family correction. By 'heterogeneity', I mean to say that the individuals are quite spread out within families and not exactly forming a small tight cluster. This is the reason why I was wondering if a PCA covariate approach might be better than a fixed family. I am wondering about the positives and negatives of both approaches.
You would only have to adjust for family structure through PC covariates if the families were blatantly segregated from each other on the PCA bi-plot (like we would adjust for ethnicity through PC covariates in a genetics study). I don't see the need to do anything here - sorry. The dataset looks good to go without any adjustments.
The one thing that you do not want to do is adjust for something for which there is no confounding.
The variation on PC1 is also low (29%).
Edit - if you had a very good clinical reason for including family as a covariate, then just do that as you were originally planning to do. Like, if one family lives at an altitude of 5,000 metres versus another who live at sea-level, or if one family family lives in a polluted city like New York versus another who live in the beautiful countryside in Ireland.
@Kevin Thanks for the input. Yes, I did get the idea from GWAS. I am a bit confused by some part of your comment.
'...there is no confounding.'
In this dataset, I am looking for DEGs between case and control. I am not interested in families. So I would say family is a confounding effect. The family structure is killing all signal for what I am actually interested in: case/control.
Yes, I was aware of the primary focus on case/control.
How do you gauge 'killing all signal' - this PCA bi-plot? If the families are segregated, it's primarily only along PC2 and the variance there is just 9.8%. This indicates, in fact, that only a small proportion of your dataset shows family-related differences.
I don't believe in fact that there is any answer for your question. It will only be resolved by trying various things, including both adjusting and not adjusting for family.
Out of curiosity, too, why are you building a GLM when DESeq2 already provides a differential expression test?
If you build a standard GLM and apply it to normalised RNA-seq counts, the results will be biased because the default setting of
glm()
in R assume a binomial distributon, whereas RNA-seq counts follow the negative binomial distribution.I would have included family in the DESeq2 design model
The formula is used in DESeq2. Perhaps I should've mentioned that somewhere.
Hi Kevin, I understand your point that family wide variations are not that big but they definitely seem to bigger than case vs control and they seem to cluster more than families as such for patient 20 and and 30.