Question

Automatic Outlier Detection for RNA-seq data

1

7.1 years ago

JJ ▴ 710

Hi all,

I am looking for an automated approach to detect outliers in RNA-seq data. Until now I have looked at a PCA plot and decided visually, but I would now like to automate this. I have been looking at the PcaHubert() function in the rrcov package, which marks suspected outliers with a flag of FALSE:

library(rrcov)
pca <- PcaHubert(t(rna.data))  # samples in rows, genes in columns
outliers <- which(!pca@flag)   # flag is FALSE for suspected outliers

Would this be a good option? Or are there others better suited for RNA-seq data?

Thanks for your input!

RNA-Seq outlier rrcov • 13k views

ADD COMMENT • link updated 3.2 years ago by simplitia ▴ 130 • written 7.1 years ago by JJ ▴ 710

score 14 · Answer 1 · 2017-11-05

14

7.1 years ago

Kevin Blighe 88k

People generally inspect for outliers visually by observing the PCA bi-plot for principal components 1 and 2 (see my post here: A: PCA in a RNA seq analysis ). For RNA-seq, a sample that has genuinely 'failed', and whose data is skewed by extraneous factors unrelated to the biological condition of interest, will typically sit roughly 200 to 1,000 units away from the main group of samples along PC1. These are very easy to identify and don't usually require statistical justification.

If we do want to quantify what it is to be an outlier (to misquote Shakespeare: "To be an outlier, or not to be"), we usually flag any sample that falls more than 3 standard deviations from the main group of samples along PC1. Mathematically, all that you need to do is convert your PC1 values to Z-scores and then check for those with |Z| > 3. In R, get these by using prcomp() and then accessing the 'x' element of the returned object, e.g., pca <- prcomp(t(rna.data)); pca$x
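The Z-score check described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the matrix is simulated rather than real RNA-seq data, the shifted 'sample30' is a made-up failed sample, and the names pc1_z and outliers are hypothetical.

```r
# Simulated stand-in for a genes x samples matrix of log-scale expression
# values (assumption: real data would already be normalised and logged)
set.seed(1)
n_genes <- 500
n_samples <- 30
rna.data <- matrix(rnorm(n_genes * n_samples, mean = 5, sd = 1),
                   nrow = n_genes,
                   dimnames = list(paste0("gene", seq_len(n_genes)),
                                   paste0("sample", seq_len(n_samples))))

# Hypothetical 'failed' sample: shift every gene in sample30
rna.data[, "sample30"] <- rna.data[, "sample30"] + 4

# prcomp() expects samples in rows, so transpose the matrix
pca <- prcomp(t(rna.data))

# Convert the PC1 scores to Z-scores (centre, then divide by the sample SD)
pc1_z <- as.numeric(scale(pca$x[, "PC1"]))
names(pc1_z) <- rownames(pca$x)

# Flag samples more than 3 standard deviations from the mean along PC1
outliers <- names(pc1_z)[abs(pc1_z) > 3]
print(outliers)
```

One caveat with this rule: when the sample standard deviation is used, |Z| cannot exceed (n - 1) / sqrt(n), so with 10 or fewer samples no sample can ever reach 3 and a lower cut-off would be needed.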

The method that you've mentioned is published in a reputable journal and therefore justified, in my opinion. I would just ask that you check the following before using it: Does the algorithm expect counts that follow a negative binomial distribution (e.g. normalized counts in edgeR or DESeq2) or a normal distribution (logged normalised counts)?

Good luck

Kevin

ADD COMMENT • link 5.6 years ago by Kevin Blighe 88k

1

Hello Kevin. I am looking for an automatic outlier detection method but haven't found one yet. Cook's distance looks good, but it may be more suitable for detecting outlier genes than outlier samples. Isolation forest may be a good approach; I need to try it first. Your method may not be very suitable for patient (clinical) data, where PC1's proportion of variance may not be high, for example only around 0.5.

ADD REPLY • link 4.6 years ago by MatthewP ★ 1.4k

0

Thank you very much for your input! I used voom-transformed RSEM values, so log2CPM. Do you know if the algorithm can work with this? Thanks!

ADD REPLY • link 7.1 years ago by JJ ▴ 710

1

Hi friend,

You may want to check the distribution of the data with the hist() function in R, and then share the figure here (or just decide yourself if it's normally distributed). To share an image here, just upload it, and then share the URL in your comment/reply.
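A minimal sketch of that check, assuming a genes x samples matrix of log2CPM values. The matrix here is simulated and the name log_cpm is hypothetical; with real data you would pass your own transformed matrix instead.

```r
# Simulated stand-in for a genes x samples matrix of log2CPM values (assumption)
set.seed(42)
log_cpm <- matrix(rnorm(2000, mean = 4, sd = 2), nrow = 200, ncol = 10)

# Pool all values and draw a histogram of their distribution
values <- as.vector(log_cpm)
h <- hist(values, breaks = 50,
          main = "Distribution of log2CPM", xlab = "log2CPM")

# A normal Q-Q plot is a complementary visual check:
# points close to the line suggest approximate normality
qqnorm(values)
qqline(values)
```

Real log2CPM data often shows a second peak from lowly expressed genes, so it may be worth looking at the histogram both before and after filtering those out.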

In the manual for rrcov, the package that provides PcaHubert(), they state:

These estimates are optimal if the data come from a multivariate normal distribution but are extremely sensitive to the presence of even a few outliers (atypical values, anomalous observations, gross errors) in the data.

[source: https://cran.r-project.org/web/packages/rrcov/vignettes/rrcov.pdf]

So, it looks like having your data normally distributed would be optimal.

ADD REPLY • link 5.7 years ago by Kevin Blighe 88k

0

Hello Kevin, is there any chance you remember which paper used this PC1 Z-score (|Z| > 3) method? I would like to read and/or cite it. Thanks!

ADD REPLY • link 5.7 years ago by manninm • 0

0

Hey, I do not have any citations - it is just a general way to detect outliers. It would likely only appear in supplementary methods, or not at all. I think that it is okay to justify the removal of outliers by eye, too.

In most statements, people would write: "X samples were removed after visual inspection of a PCA bi-plot"

ADD REPLY • link 5.7 years ago by Kevin Blighe 88k

0

Kevin Blighe, that is a great answer. We are having the same issue with outliers. For the second part, using > 3 SD, do you have a reference that I can cite as well? Thanks in advance.

ADD REPLY • link 3.2 years ago by simplitia ▴ 130

0

There is actually, well, a sort of citation off the top of my head. Please see the PLINK documentation (section 'Outlier detection diagnostics'): https://zzz.bwh.harvard.edu/plink/strat.shtml

PLINK is as good a reference as any.

ADD REPLY • link 3.2 years ago by Kevin Blighe 88k

1

Thanks, much appreciated. I also found that simply using Tukey's outlier method works well.
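For anyone reading later, Tukey's method flags values that lie more than 1.5 x IQR below the first quartile or above the third quartile. Below is a rough sketch, assuming it is applied to PC1 scores (the thread does not say which values it was applied to). The scores are simulated and all names are hypothetical.

```r
set.seed(7)
# Hypothetical PC1 scores for 25 samples; the last one is placed far away
pc1 <- c(rnorm(24, mean = 0, sd = 10), 150)
names(pc1) <- paste0("sample", seq_along(pc1))

# Tukey's fences: 1.5 x IQR beyond the first and third quartiles
q <- quantile(pc1, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

# Samples whose PC1 score falls outside the fences
tukey_outliers <- names(pc1)[pc1 < lower | pc1 > upper]
print(tukey_outliers)
```

Because the fences are built from quartiles rather than the mean and standard deviation, a single extreme sample has little influence on them, which can make this rule more usable than a Z-score cut-off when there are few samples.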

ADD REPLY • link 3.2 years ago by simplitia ▴ 130

score 1 · Answer 2 · 2021-04-07

1

3.6 years ago

Melisa ▴ 10

I think this publication justifies removing outliers in that way: https://iopscience.iop.org/article/10.1088/1742-6596/705/1/012003/pdf

ADD COMMENT • link 3.6 years ago by Melisa ▴ 10