I have 15 samples, 3 replicates per condition, with intensity values for ~ 9500 proteins. Samples have some missingness (worst case 15%) and imputation has been performed prior to the differential expression analysis.
Looking at my imputed data in a PCA plot, I think there are some outliers that may bias my DEA results:
After looking at my DEA results, I find that the contrasts involving the red group may be a bit inflated. For example, comparing the red group with the green group, I get ~ 1500 significant proteins. The people in charge of the project would prefer not to eliminate any replicates because of the small sample size for each condition.
Is it a valid approach to run the DE analysis with all samples, then run it again without those two outliers, and keep as significant only those proteins that are significant in both analyses?
Thanks in advance for any suggestion.
There are a couple of options other than removing them:
1) Use sample weights, for example arrayWeights() in limma (see its manual), to downweight outliers in a data-driven fashion.
2) Include the replicate information into the design, basically treating each replicate as a batch.
3) Use something like the sva package to estimate surrogate variables which capture unwanted variation, and then include these into the design.
There seems to be a clear condition difference, so I would start with 1) since it is easy and quick to do, and then see what comes out.
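As a rough illustration of option 1), here is a minimal sketch of how sample weights could be plugged into a limma fit. The matrix `expr`, the group labels A to E, and the contrast name `AvsB` are hypothetical stand-ins for the real imputed log2 intensities and conditions; the simulated data only exists to make the example runnable.

```r
library(limma)

set.seed(1)
# Hypothetical stand-in for the imputed log2 intensity matrix (proteins x samples).
n_prot <- 500
group <- factor(rep(c("A", "B", "C", "D", "E"), each = 3))
expr <- matrix(rnorm(n_prot * 15, mean = 20, sd = 0.3), nrow = n_prot,
               dimnames = list(paste0("prot", seq_len(n_prot)),
                               paste0(group, "_", rep(1:3, times = 5))))
# Make the first sample much noisier to mimic an outlying replicate.
expr[, 1] <- expr[, 1] + rnorm(n_prot, sd = 1.5)

# Group-means design: one coefficient per condition.
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Estimate one quality weight per sample; noisy samples get weights below 1.
aw <- arrayWeights(expr, design)
print(round(aw, 2))

# Pass the sample weights to the linear model, then test one contrast.
fit <- lmFit(expr, design, weights = aw)
contr <- makeContrasts(AvsB = A - B, levels = design)
fit2 <- eBayes(contrasts.fit(fit, contr))
res <- topTable(fit2, coef = "AvsB", number = Inf)
```

Comparing the number of significant proteins with and without `weights = aw` gives a feel for how much the suspected outliers drive the results, without discarding any replicate.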
To determine whether imputation might induce this effect, what does a PCA on the common proteins (those quantified in all samples) look like before imputation? You can easily check this with limma::plotMDS().
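A minimal sketch of that check, assuming the pre-imputation data is a log2 intensity matrix with NA for missing values. The name `raw`, the group labels, and the simulated values are hypothetical placeholders for the real data.

```r
library(limma)

set.seed(2)
# Hypothetical stand-in for the log2 intensity matrix before imputation.
group <- factor(rep(c("A", "B", "C", "D", "E"), each = 3))
raw <- matrix(rnorm(300 * 15, mean = 20), nrow = 300)
colnames(raw) <- paste0(group, "_", rep(1:3, times = 5))
# Scatter some missing values across the matrix.
raw[sample(length(raw), 200)] <- NA

# Keep only proteins observed in every sample, so no imputed value is involved.
complete <- raw[complete.cases(raw), , drop = FALSE]

# With gene.selection = "common" the same proteins are used for all sample pairs.
mds <- plotMDS(complete, labels = colnames(complete),
               col = as.integer(group), gene.selection = "common")
```

If the same samples sit apart from their group in this plot, the outliers are probably not an artifact of the imputation.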