Question

Forum: Very generally speaking, what are valid criteria to decide that a sample is an outlier?

1

4.6 years ago

Aspire ▴ 390

A newbie's quandary:

It seems to me that in order to decide that a sample is an outlier (and not an unusual biological phenomenon), one always has to use some reasons that have to do with QC.

For example, a sample can be judged an outlier based on a low alignment percentage, a low read count, or anything else that indicates a technical problem.
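A minimal sketch of what such a QC-based check could look like. The metric names, the thresholds, and the sample values below are all made up for illustration; sensible cutoffs depend on the assay and the protocol.

```python
# Hypothetical thresholds (assumptions, not recommendations).
MIN_ALIGNMENT_PCT = 70.0
MIN_READ_COUNT = 5_000_000


def flag_technical_failures(qc_table):
    """Return {sample: [reasons]} for samples that fail at least one QC threshold."""
    flagged = {}
    for sample, metrics in qc_table.items():
        reasons = []
        if metrics["alignment_pct"] < MIN_ALIGNMENT_PCT:
            reasons.append(
                f"alignment {metrics['alignment_pct']:.1f}% is below {MIN_ALIGNMENT_PCT:.1f}%"
            )
        if metrics["read_count"] < MIN_READ_COUNT:
            reasons.append(
                f"read count {metrics['read_count']:,} is below {MIN_READ_COUNT:,}"
            )
        if reasons:
            flagged[sample] = reasons
    return flagged


# Made-up QC summary for four samples.
qc_table = {
    "S1": {"alignment_pct": 92.4, "read_count": 21_500_000},
    "S2": {"alignment_pct": 90.1, "read_count": 19_800_000},
    "S3": {"alignment_pct": 41.7, "read_count": 18_900_000},
    "S4": {"alignment_pct": 91.3, "read_count": 1_200_000},
}

flagged = flag_technical_failures(qc_table)
for sample, reasons in flagged.items():
    print(sample, "->", "; ".join(reasons))
```

The point of recording the reasons alongside the sample name is that the exclusion is tied to a stated technical criterion rather than to how the sample behaves in the downstream analysis.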

If there is no technical problem, then one should not really remove a sample from the analysis, no matter how different it is from the others.

If there is no technical problem, and no reason to suspect one, then any decision to remove a sample on biological grounds would ultimately be impossible to distinguish from fitting the data to support your own hypothesis.

Do you agree?

outliers • 3.7k views

updated 2.2 years ago by Ram 45k • written 4.6 years ago by Aspire ▴ 390

2

Well yes, generally speaking, removing an outlier should always be justified, and cherry-picking the data that best fit a hypothesis should be absolutely avoided. That being said, I think there is room for discussion about what counts as a valid justification for removing an outlier. Potential technical issues are not always easily identified. In many cases, one can have reasons to suspect that something went wrong without knowing exactly what. Perhaps if you could provide a more specific example, we could try to provide more specific answers.

4.6 years ago by Carlo Yague 9.0k

0

Thanks, I want to understand the issue conceptually first; then I will know how to apply it myself to different cases. The point I was arguing for is that the only way one can distinguish between a justified removal of an outlier and cherry-picking of the data is if there is some technical problem with the outlier. How to identify the specific technical problem is another issue, though of course a challenge in itself.

Is that wrong? Is there another criterion you can suggest for distinguishing conceptually between justified removal and cherry-picking?

4.6 years ago by Aspire ▴ 390

0

I agree that identifying a technical problem is the only way to know for sure that removing an outlier is fully justified. However, I think that suspicion of a technical problem, based on PCA for instance (to take Istvan's example), can also provide sufficient justification for removing an outlier. If 9 out of 10 replicates cluster together perfectly while the last one appears completely different, then most likely something went wrong with that replicate. After all, it is very common for samples or whole experiments to fail; ask a wet lab biologist if you don't believe me.

Some level of judgment is needed, of course. To give you a counter-example, if two out of three replicates cluster together in a PCA, then I would not find it justified to remove the third one. The type of data can also affect your interpretation; that is why I asked for a specific example. ChIP experiments, for instance, are infamous for being technically hard to get working as planned.

4.6 years ago by Carlo Yague 9.0k

score 3 · Answer 1 · 2020-12-30

Beyond the technical quantities that you list, clustering and PCA plots may be used to identify both outliers and mislabeled samples.

Here, of course, the number of replicates matters: the more, the better. With four replicates, where the data is otherwise consistent, a mislabeled sample sticks out like a sore thumb.
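To make the "sticks out like a sore thumb" point concrete, here is a rough sketch. It does not run a PCA or a clustering; as a simpler stand-in it correlates each sample with the mean profile of every group (leaving the sample itself out) and reports samples that match a different group better than their own label. The expression values, sample names, and group labels are invented for illustration, and a flagged sample is a prompt to investigate, not an automatic exclusion.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mean_profile(profiles):
    """Feature-wise mean across a list of sample profiles."""
    return [sum(values) / len(values) for values in zip(*profiles)]


def suspicious_samples(expression, labels):
    """Return {sample: better_matching_group} for samples whose profile
    correlates best with a group other than the one they are labeled as."""
    result = {}
    for sample, profile in expression.items():
        scores = {}
        for group in sorted(set(labels.values())):
            # Leave the sample itself out of its own group's mean profile.
            others = [
                expression[s]
                for s in expression
                if labels[s] == group and s != sample
            ]
            if others:
                scores[group] = pearson(profile, mean_profile(others))
        best_group = max(scores, key=scores.get)
        if best_group != labels[sample]:
            result[sample] = best_group
    return result


# Invented data: four replicates per condition, five features each.
# ctrl_4 is labeled "control" but carries a treated-like profile.
expression = {
    "ctrl_1": [10.1, 12.0, 3.2, 4.1, 8.0],
    "ctrl_2": [9.8, 11.7, 2.9, 3.8, 8.2],
    "ctrl_3": [10.3, 12.2, 3.1, 4.0, 7.9],
    "ctrl_4": [3.1, 4.2, 11.0, 12.1, 8.1],
    "trt_1": [3.0, 4.0, 11.2, 11.9, 8.0],
    "trt_2": [2.8, 4.1, 10.9, 12.2, 7.8],
    "trt_3": [3.2, 3.9, 11.1, 12.0, 8.3],
    "trt_4": [3.1, 4.3, 10.8, 11.8, 8.1],
}
labels = {name: ("control" if name.startswith("ctrl") else "treated") for name in expression}

print(suspicious_samples(expression, labels))
```

With only two or three replicates per group, the leave-one-out mean is itself dominated by one or two samples, so a check like this becomes much less trustworthy; that is the same reason more replicates make a mislabeled sample easier to spot.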

Several times, when I reported a problem with a sample, my collaborators immediately said, "yeah, that was a weird sample", or "I had a lot of trouble extracting from that one", or "the cells seemed in bad shape", etc. Those too are, in my opinion, valid reasons to exclude a sample. Of course, it would have been better not to get there in the first place.

But as always, the important thing is not whether you are allowed to exclude a sample or not. It is the transparency of it.

Are you clearly stating what you did and why, and not just somewhere in the supplementary docs?

Put it out there in the open, state it clearly. Let the results speak or not. It would be unwise to toss out a whole experiment because one sample is an obvious outlier. I remember a case where I was quite certain the labels must have been swapped on two samples (they would cluster perfectly the other way around). In the end we dropped both, though we could have included them.

Long story short: when you exclude an outlier, the rest of the data needs to be more consistent to carry the burden of proof.