Hi all,
I am looking for an automated approach to detect outliers in RNA-seq data. Until now I have looked at a PCA plot and decided visually, but I would like to automate this. I have been looking at the PcaHubert() function in the rrcov package, which flags suspected outliers as FALSE:
library(rrcov)
pca <- PcaHubert(t(rna.data))
# flag is TRUE for regular samples and FALSE for suspected outliers
outliers <- which(!pca@flag)
Would this be a good option? Or are there others better suited for RNA-seq data?
Thanks for your input!
Hello Kevin. I am also looking for an automatic outlier detection method but have not found one yet. Cook's distance looks good, but it may be better suited to detecting outlier genes than outlier samples. Isolation forest may be a good option, but I need to try it first. Your method may not be very suitable for patient (clinical) data, where the proportion of variance explained by PC1 may not be high, for example only around 0.5.
Thank you very much for your input! I used voom-transformed RSEM values, so log2CPM. Do you know if the algorithm can work with this? Thanks!
Hi friend,
You may want to check the distribution of the data with the hist() function in R, and then share the figure here (or just decide yourself whether it's normally distributed). To share an image here, just upload it and then share the URL in your comment/reply.
In the manual for rrcov, which provides PcaHubert(), they state:
[source: https://cran.r-project.org/web/packages/rrcov/vignettes/rrcov.pdf]
So, it looks like having your data normally distributed would be optimal.
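As a rough illustration of the distribution check suggested above, here is a minimal sketch in base R. The matrix is simulated and only stands in for a genes x samples matrix of log2CPM values; the object names and the skewness summary are my own assumptions, not something taken from the rrcov manual.

```r
# Simulated stand-in for a genes x samples matrix of log2CPM values
# (hypothetical data; replace with your own voom-transformed matrix).
set.seed(1)
rna.data <- matrix(rnorm(1000 * 6, mean = 5, sd = 2), nrow = 1000,
                   dimnames = list(paste0("gene", 1:1000),
                                   paste0("sample", 1:6)))

# Pool all values and bin them; plot = FALSE returns the histogram object
# without drawing it, so the bin counts can also be inspected numerically.
values <- as.vector(rna.data)
h <- hist(values, breaks = 50, plot = FALSE)

# Sample skewness as a crude numeric summary: values near 0 are consistent
# with a symmetric distribution, large absolute values are not.
skewness <- mean((values - mean(values))^3) / sd(values)^3
```

Calling plot(h), or hist(values, breaks = 50) directly, draws the figure that can then be shared.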
Hello Kevin, is there any chance you remember which paper used this PC1 z-score method (|z| > 3)? I would like to read and/or cite it. Thanks!
Hey, I do not have any citations - it is just a general way to detect outliers. It would likely only appear in supplementary methods, or not at all. I think that it is okay to justify the removal of outliers by eye, too.
In most statements, people would write: "X samples were removed after visual inspection of a PCA bi-plot"
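For anyone who wants to automate the PC1 z-score rule discussed here, the following is a minimal sketch in base R, not a definitive implementation. The data are simulated, the sample and gene names are hypothetical, and the use of prcomp() without scaling is my assumption; only the |z| > 3 cutoff comes from this thread.

```r
# Simulated genes x samples matrix standing in for log2CPM values
# (hypothetical data and names).
set.seed(42)
n.genes <- 500
n.samples <- 20
rna.data <- matrix(rnorm(n.genes * n.samples, mean = 5, sd = 1),
                   nrow = n.genes,
                   dimnames = list(paste0("gene", seq_len(n.genes)),
                                   paste0("sample", seq_len(n.samples))))

# Shift 200 genes in one sample so that it behaves as an outlier.
rna.data[1:200, "sample7"] <- rna.data[1:200, "sample7"] + 4

# PCA with samples as rows; genes are centred but not scaled (an assumption).
pca <- prcomp(t(rna.data), center = TRUE, scale. = FALSE)

# Convert the PC1 scores to z-scores and flag samples with |z| > 3.
# The sign of a principal component is arbitrary, hence abs().
pc1.z <- as.vector(scale(pca$x[, "PC1"]))
names(pc1.z) <- rownames(pca$x)
outliers <- names(pc1.z)[abs(pc1.z) > 3]
```

One caveat: with n samples, the largest possible z-score is (n - 1) / sqrt(n), so with 10 or fewer samples no sample can ever exceed 3 and a lower cutoff or a different rule would be needed.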
Kevin Blighe, that is a great answer. We are having the same issue with outliers. For the second part, using > 3 SD, do you have a reference that I can cite as well? Thanks in advance.
There is actually, well, a sort of citation off the top of my head. Please see the PLINK documentation (section 'Outlier detection diagnostics'): https://zzz.bwh.harvard.edu/plink/strat.shtml
PLINK is as good a reference as any.
Thanks, much appreciated. I also found that simply applying Tukey's outlier method works well.
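In case it helps others, here is a minimal sketch of Tukey's rule (values beyond 1.5 x IQR from the quartiles) in base R. The post above does not say which quantity the rule was applied to, so applying it to one score per sample, such as PC1, is an assumption; the function name tukey_outliers and the example values are hypothetical.

```r
# Flag values lying more than k * IQR below the first quartile
# or above the third quartile (Tukey's fences, k = 1.5 by default).
tukey_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Hypothetical per-sample scores, e.g. PC1 coordinates.
scores <- c(s1 = -1.2, s2 = 0.4, s3 = 0.1, s4 = -0.3,
            s5 = 0.8, s6 = 9.5, s7 = -0.6, s8 = 0.2)

flagged <- names(scores)[tukey_outliers(scores)]
```

Unlike the z-score rule, the fences are based on quartiles, so a single extreme sample does not inflate the spread estimate used to judge it.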