nico - PCA is a good start. What you are figuring out with a PCA plot is, roughly, whether there are clear differences among samples that might complicate an analysis, and, also, if any samples look anomalous.
Im both cases, what you need to determine to proceed is, what is the technical or biological phenomenon that is producing the difference?
One way to do this is to extensively annotate your data, then correlate your PC loadings against all the other potential covariates you have in your study. Consider the following code:
AddPcLoadingsToDESeqColData<-function(DESeq_Object) {
rv<-rowVars(assay(DESeq_Object))
select<-order(rv, decreasing=TRUE)[seq_len(min(1000, length(rv)))] # select the ntop genes by variance
pca<-prcomp(t(assay(DESeq_Object)[select, ])) # perform a PCA on the data in assay(x) for the selected genes
percentVar<-pca$sdev^2 / sum( pca$sdev^2 ) # the contribution to the total variance for each component
PC_df_info<-data.frame(PC1=pca$x[,1])
i=2
while (percentVar[i] > 0.01) {
PC_df_info[[ paste0("PC", as.character(i) ) ]]<-pca$x[, i]
i = i+1
}
colData(DESeq_Object)<-cbind(colData(DESeq_Object), PC_df_info)
metadata(DESeq_Object)$percentVar<-percentVar
return(DESeq_Object)
}
AllColumnCorrelations<-function(InputDF) {
InputDF <- InputDF %>% mutate_if(is.character,as.factor)
InputDF <- InputDF %>% mutate_if(is.factor,as.numeric)
InputDF <- InputDF %>% mutate_if(is.integer,as.numeric)
r_values <- combn(seq_along(InputDF), 2, function(x) cor(InputDF[[x[1]]], InputDF[[x[2]]], method = 'pearson', use="complete.obs"))
cov_names <- combn(names(InputDF), 2, paste0, collapse = '-')
OutputDF <- data.frame(cov_names, r_values)
OutputDF<-separate(data = OutputDF, col = cov_names, into = c("Cov1", "Cov2"), sep = "-")
OutputDF$r_values<-round(OutputDF$r_values,4)
OutputDF<-OutputDF[order(-abs(OutputDF$r_values)),]
OutputDF<-OutputDF[abs(OutputDF$r_values) > 0.05,]
return(OutputDF)
}
Given a matrix such as coldata(dds), this function will compute pearson's r for all pairs of columns. While this is not appropriate in all cases, it is a good sanity check to investigate whether the covariates you intend to include in your model are collinear or not.
# first add Pcs to the metadata:
dds<-AddPcLoadingsToDESeqColData(dds)
CollinearCovScan<-AllColumnCorrelations(as.data.frame(GlassMdUpdate)); CollinearCovScan
# then find correlations between all pairs of Covs in colData(dds)
returns a Df formatted as follows:
CollinearCovScan[grepl( 'COX', paste(CollinearCovScan$Cov1, ' - ', CollinearCovScan$Cov2),ignore.case=TRUE),]
Cov1 Cov2 r_values
2175 COX4I1RawCts COX4I1Status -0.7720
2174 COX4I1RawCts COX4I1NormCts 0.7062
2141 sizeFactor COX4I1RawCts 0.6737
2206 COX4I1NormCts COX4I1Status -0.6067
2143 sizeFactor COX4I1Status -0.5464
2208 COX4I1NormCts PC2 -0.4897
2237 COX4I1Status PC1 -0.4292
778 tumor_pair_barcode COX4I1RawCts 0.4118
since you are attempting to get info. out of the PCA, in your case, you would be looking at whether the PC loadings for a given PC significantly correlate with a variable in your metadata.
If they do, the approach depends on how much covariance they have, how important the variable is to your model, etc. If including the PC in the model allows for control of variance that you suspect is a real (batch or other) effect that you otherwise would not be able to partial out, you could include as a covariate. Alternatively, suppose you have two groups of PC1 loadings, low and high. If there is a problem using that PC in the model, you might decide instead to stratify your model, then do hypothesis testing on the two groups of samples separately.
Finally, for datasets that you expect are still behaving badly for one reason or another, I recommend trying something like GSVA. Note that non-trivial effort has been devoted to deriving latent variables such as may account for batch effect. See, for instance, the gsva
package.
Finally, for your specific case, the problem is that the point belonging to group C is just WAY out there. Not only is it not near other observations, but it is not even near to other members of group C. As such, I would likely observe it via distance matrices, and in other ways, to see if it conforms to your expectations for this type of data (despite its PCs). If it is otherwise OK, perhaps you keep it. However, if by contrast singling out this sample helps you to remeber that this was actually the sample that someone may have contaminated, now you have better reason to exclude it...
Any ideas as to why this may be the case from experimental side of things? Were the sample prepared differently? Were they all sequenced on the same kind of FC/pore/chemistry and on the same/different minION?
Samples were prepared in the same way, but on different days. They were sequenced on the same minion but with different chips (for some reason we can't understand we cannot recover pores after washing the chips)
I would recommend barcoding and running one library on multiple flow cells, next time.
Yes, same pore version.
We barcoded the samples and run each experiment on a different flowcell. Are you suggesting to split all replicates on all flowcells instead?
Were the flowcells using the same pore versions?