Hi all,
I've used SVA to account for hidden batch effects in my RNA-seq experiment, where I'm trying to predict disease status (healthy/AC or disease/GBM). Now I'm trying to find out whether accounting for these batch effects has improved my clustering.

Without accounting for batch effects, I get the following plot:

[PCA plot, without batch correction]

When accounting for batch effects, I get the following plot:

[PCA plot, with batch correction]

The variance explained by the principal components in particular has improved, but the clustering has only somewhat improved. I'm having a hard time interpreting these results. Does this mean that the latter PCA plot is 'good' because much of the variance can be explained? But then why is the clustering so bad? I appreciate your help!
There doesn't appear to be an appreciable separation between the two conditions for most of your samples. Instead of immediately going to SVA, have you explored whether another effect such as biological sex, sample collection date, etc. is explaining the separation? You may also want to check more than the first 2 PCs to see if your conditions are separating in other dimensions.
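A minimal sketch of the "check more than the first 2 PCs" idea, in Python with NumPy on simulated data. Everything here is an assumption for illustration: the matrix layout (samples in rows, genes in columns), the helper names `pca_scores` and `pc_group_separation`, and the simulated batch and disease effects. It computes the variance explained by each PC and a simple standardized mean difference between the two conditions on each PC, so you can see whether the condition signal sits on a later component.

```python
import numpy as np

def pca_scores(expr, n_components=4):
    """PCA via SVD. expr is a samples x genes matrix (hypothetical layout)."""
    centered = expr - expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    var_explained = (s ** 2) / np.sum(s ** 2)
    return scores, var_explained[:n_components]

def pc_group_separation(scores, labels):
    """Absolute standardized mean difference between two groups, per PC."""
    labels = np.asarray(labels)
    groups = np.unique(labels)
    a = scores[labels == groups[0]]
    b = scores[labels == groups[1]]
    pooled_sd = np.sqrt((a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)) / 2)
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / pooled_sd

# Simulated data (not from the thread): a strong batch effect and a
# weaker disease effect that is balanced across batches.
rng = np.random.default_rng(0)
n_samples, n_genes = 40, 500
labels = np.array(["AC"] * 20 + ["GBM"] * 20)
batch = np.tile([0, 1], 20)
expr = rng.normal(size=(n_samples, n_genes))
expr[batch == 1, :200] += 3.0            # batch effect on 200 genes
expr[labels == "GBM", 200:250] += 2.0    # disease effect on 50 genes

scores, var_explained = pca_scores(expr, n_components=4)
separation = pc_group_separation(scores, labels)
for i, (v, d) in enumerate(zip(var_explained, separation), start=1):
    print(f"PC{i}: {v:.1%} of variance, condition separation {d:.2f}")
```

In this simulated setup PC1 tracks the batch and the conditions separate on PC2, which is why a high percentage of variance explained on PC1 does not by itself say anything about how well the conditions cluster.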
Going along with this, consider generating a plot with sex coded as shape, condition coded as color (as you already have), age coded as dot size, and so on. That will help you put it all together. I can send you the code if needed.
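For illustration only (this is not the code offered above), here is a rough Python sketch of that kind of plot. The metadata keys `condition`, `sex` and `age`, the helper names `build_aesthetics` and `plot_pca`, and the output file name are all hypothetical. The plotting function assumes matplotlib is installed.

```python
def build_aesthetics(metadata):
    """Map sample metadata to a marker, color and dot size per sample.

    metadata: list of dicts with hypothetical keys 'condition', 'sex', 'age'.
    """
    markers = {"F": "o", "M": "^"}
    colors = {"AC": "tab:blue", "GBM": "tab:red"}
    ages = [m["age"] for m in metadata]
    lo, hi = min(ages), max(ages)
    span = (hi - lo) or 1  # avoid division by zero if all ages are equal
    return [
        {
            "marker": markers[m["sex"]],
            "color": colors[m["condition"]],
            "size": 30 + 120 * (m["age"] - lo) / span,
        }
        for m in metadata
    ]

def plot_pca(scores, metadata, path="pca_covariates.png"):
    """Scatter PC1 vs PC2; scores is a sequence of (PC1, PC2, ...) rows."""
    import matplotlib
    matplotlib.use("Agg")  # render to file, no display needed
    import matplotlib.pyplot as plt

    aes = build_aesthetics(metadata)
    fig, ax = plt.subplots()
    # scatter() takes a single marker per call, so draw one call per shape.
    for marker in sorted({a["marker"] for a in aes}):
        idx = [i for i, a in enumerate(aes) if a["marker"] == marker]
        ax.scatter(
            [scores[i][0] for i in idx],
            [scores[i][1] for i in idx],
            c=[aes[i]["color"] for i in idx],
            s=[aes[i]["size"] for i in idx],
            marker=marker,
        )
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    fig.savefig(path)
    plt.close(fig)

# Made-up metadata for four samples
metadata = [
    {"condition": "AC", "sex": "F", "age": 30},
    {"condition": "GBM", "sex": "M", "age": 60},
    {"condition": "GBM", "sex": "F", "age": 45},
    {"condition": "AC", "sex": "M", "age": 52},
]
aes = build_aesthetics(metadata)
print(aes)
```

With real data you would call `plot_pca(scores, metadata)` using your own PC scores and sample sheet, then look for whether points group by shape or size rather than by color.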
Consider removing the two outliers and then re-running SVA as well.
Did you limit the analysis to the top 1000 genes or some such? What was your preparatory procedure?
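If "top 1000 genes" is read as the 1000 most variable genes (an assumption; the thread does not say how the genes would be chosen), a minimal Python sketch of that filtering step could look like the following. The function name `top_variable_genes`, the samples x genes layout and the simulated matrix are all hypothetical, and the input is assumed to already be on a log or variance-stabilized scale.

```python
import numpy as np

def top_variable_genes(expr, n_top=1000):
    """Return column indices of the n_top highest-variance genes.

    expr: samples x genes matrix on a log or variance-stabilized scale.
    """
    gene_var = expr.var(axis=0, ddof=1)
    n_top = min(n_top, expr.shape[1])
    order = np.argsort(gene_var)[::-1]   # highest variance first
    return np.sort(order[:n_top])        # keep original gene order

# Simulated matrix (not from the thread): 12 samples, 5000 genes,
# with the first 50 genes made artificially more variable.
rng = np.random.default_rng(1)
expr = rng.normal(size=(12, 5000))
expr[:, :50] *= 4.0

keep = top_variable_genes(expr, n_top=1000)
filtered = expr[:, keep]
print(filtered.shape)
```

Whether and how genes were filtered before PCA changes both the percentage of variance explained and how the samples cluster, so it is worth stating alongside the plots.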