Which is better for excluding outlier replicates from normalised microarray or similar datasets: a PCA plot or a heatmap?
3.9 years ago
Microuser • 0

Hello,

I have a general question about finding outliers in microarray data. For my normalised datasets, I have generated PCA and heatmap plots with sample clustering. The heatmap shows the triplicates clustering together. On the PCA plot, however, one replicate can sit much further along PC1 than the other two, for example two at +60 and the third at -20. PC1 explains at least 55% of the variance, and all the replicates have rather similar positions along PC2. My question is: which of the two, the PCA plot or the heatmap, is more reliable for deciding which outliers to exclude, and why?

Your opinions are much appreciated. Thank you.

microarray outlier pca heatmap • 9.2k views
3.9 years ago

Determining whether a sample is an outlier is usually a judgement call, and it is something that comes with the experience of having worked on dozens, possibly hundreds, of datasets.

The numbers on the PCA axes are unfortunately not a good metric to use on their own.

PCA

Stat ellipse

You could instead generate a stat ellipse at the 95% confidence level, as I do HERE, where an outlier would be any sample falling outside of its respective group's ellipse:

[Image: example PCA plot with a stat ellipse drawn per group]
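The same "inside or outside the ellipse" check can be done numerically instead of by eye. What follows is only a minimal sketch, not the method used by any particular plotting package: it assumes a normal-theory ellipse, so a point is outside the 95% ellipse when its squared Mahalanobis distance from the group centre exceeds 5.991 (the 95% quantile of a chi-square distribution with 2 degrees of freedom). Plotting tools may build their ellipses differently, so borderline points can disagree. The function name `outside_ellipse` and the PC scores are made up for illustration.

```python
import statistics

# 95% quantile of the chi-square distribution with 2 degrees of freedom.
CHI2_95_2DF = 5.991


def outside_ellipse(points, threshold=CHI2_95_2DF):
    """Return one bool per (pc1, pc2) point: True if it lies outside the
    normal-theory ellipse fitted to all points of the group."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    n = len(points)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    # Entries of the 2x2 sample covariance matrix (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("points are collinear; the ellipse is degenerate")
    flags = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Squared Mahalanobis distance, using the closed-form 2x2 inverse.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        flags.append(d2 > threshold)
    return flags


# Hypothetical PC1/PC2 scores for one group: nine tight samples and one far away.
group = [(-20, 1), (-22, -1), (-19, 0), (-21, 2), (-20, -2),
         (-18, 1), (-22, 0), (-21, -1), (-19, 2), (60, 0)]
print(outside_ellipse(group))
```

One caveat of this sketch: because the suspect sample is included when the covariance is estimated, its squared distance can never exceed (n - 1)^2 / n for a group of n samples. That is 4/3 for a triplicate, and it only passes 5.991 from n = 8 upwards, so the check cannot flag anything in very small groups.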

Z-scores

You could also generate Z-scores from the PC1 values and flag as an outlier any sample with |Z| > 3 (or, more conservatively, |Z| > 6).
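As a rough illustration of the Z-score idea, the sketch below standardises a vector of PC1 scores and reports the samples beyond a cut-off. It assumes the scores of all samples are pooled and that the mean and standard deviation are computed from those same samples; the function name `zscore_outliers` and the numbers are hypothetical.

```python
import statistics


def zscore_outliers(pc1_scores, cutoff=3.0):
    """Return the indices of samples whose |Z| on PC1 exceeds the cut-off."""
    mean = statistics.fmean(pc1_scores)
    sd = statistics.stdev(pc1_scores)  # sample standard deviation (n - 1)
    return [i for i, score in enumerate(pc1_scores)
            if abs(score - mean) / sd > cutoff]


# Hypothetical PC1 scores for twelve samples; the last one sits far away.
pc1 = [-2, -1, 0, 1, 2, -2, -1, 0, 1, 2, 0, 80]
print(zscore_outliers(pc1))
```

Note that when the suspect sample is part of the data used to compute the mean and standard deviation, |Z| cannot exceed (n - 1) / sqrt(n). A lone triplicate therefore tops out at about 1.15, and |Z| > 3 is only reachable with at least 11 samples, so the cut-off needs to be read against the number of samples that went into the PCA.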

-----------------------

Hierarchical clustering

In a dendrogram, an outlier will lie in its own branch, which may extend from the very root of the tree. You can again attempt to quantify this by setting cut-offs based on the distance metric that is used. For example, if a sample branches off into its own leaf / node at a height (Euclidean distance) of 8, then it may be an outlier.

Take a quick look at what I do here: A: extract dendrogram cluster from pheatmap
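To make the height cut-off concrete, here is a small, dependency-free sketch. It assumes Euclidean distance with complete linkage (the default of R's `hclust`), records the height at which each sample first joins another cluster, and flags samples that stay alone up to the cut-off. The cut-off of 8 is simply the figure from the example above and depends entirely on the scale of the data; the function name and the toy profiles are hypothetical.

```python
import math
from itertools import combinations


def first_merge_heights(samples):
    """Naive complete-linkage clustering. Returns, for each sample, the
    dendrogram height at which it first joins another cluster."""
    clusters = [[i] for i in range(len(samples))]
    heights = [None] * len(samples)

    def linkage(a, b):
        # Complete linkage: the largest pairwise Euclidean distance.
        return max(math.dist(samples[i], samples[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Merge the two clusters that are closest under complete linkage.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        height = linkage(a, b)
        for cluster in (a, b):
            if len(cluster) == 1:  # a lone sample is being attached here
                heights[cluster[0]] = height
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return heights


# Hypothetical profiles: two tight triplicates and one distant sample.
profiles = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6), (20, 20)]
heights = first_merge_heights(profiles)
cutoff = 8  # illustrative only; choose it from the scale of your own tree
print([i for i, h in enumerate(heights) if h >= cutoff])
```

For real datasets a library routine would be the more usual choice; the loop above is written out only to show where the "height" of a lone leaf comes from.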

-----------------

General

  • Cook's Distance: a measure of how strongly each observation influences a fitted regression model, routinely used in statistics.
  • +/- 1.5 * IQR: values more than 1.5 * IQR below the first quartile or above the third quartile are flagged. This is commonly used in statistics and there is much material online about it.
  • Bonferroni test on studentised residuals: If you feel up to it, you can try to implement this, but it depends on your input data. I cannot really see it being used in your case - https://www.rdocumentation.org/packages/car/versions/3.0-10/topics/outlierTest
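Of the general options listed above, the 1.5 * IQR rule is the quickest to try. A minimal sketch follows, assuming it is applied to a single vector of values such as PC1 scores; the function name and the data are hypothetical, and different quartile conventions can shift the fences slightly.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the indices of values lying beyond k * IQR from the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical PC1 scores for twelve samples; the last one sits far away.
scores = [-2, -1, 0, 1, 2, -2, -1, 0, 1, 2, 0, 80]
print(iqr_outliers(scores))
```

Because the quartiles are barely affected by a single extreme value, this rule is less easily masked by the outlier itself than a mean-and-SD cut-off, although it still needs more than a handful of samples to be meaningful.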

Thanks a lot Kevin. I went through the PCAtools tutorial, but I couldn't reproduce the same stat ellipse on the same dataset. I used this tool for my own data, too. Can I say that GSM4910611 is the outlier of the sessile group? Besides, I can't tell whether all the planktonic samples are within their ellipse with no outlier among them, because the ellipse is cut in half and looks odd. Thank you in advance.

my data stat ellipse: https://ibb.co/Rjd56CM
tutorial stat ellipse: https://ibb.co/PYRJ79s


Oh, you need to widen the axis limits so that the ellipses are drawn in full. I do not see any outliers in your data.

3.9 years ago
Mensur Dlakic ★ 28k

I will preface this by saying that I don't use PCA for the same purpose you do, so my advice may be of limited use to you.

In many machine learning datasets I have handled, the first two PCs are not very reliable for identifying outliers. To digress for a moment, it would have been helpful if you had shown the image rather than describing the outcome in words. Sometimes it is the higher PCs, despite describing only a small fraction of the variance, that capture the outliers better. I don't have the time or the proper knowledge to explain why that is, but I know from experience that it is the case for many diverse datasets. Some of it is explained here, and Google will help you find additional info. Yet another option is to try robust PCA algorithms, which are designed to deal with datasets that contain corrupted data points. This toolkit may be useful as well.


Good to get the Python perspective!
