Forum: Very generally speaking, what are valid criteria to decide that a sample is an outlier?
4.0 years ago
Aspire ▴ 370

A newbie's quandary:

It seems to me that in order to decide that a sample is an outlier (and not an unusual biological phenomenon), one always has to use some reasons that have to do with QC.

For example, a sample can be judged an outlier based on a low alignment percentage, a low read count, or anything else that signals a technical problem.

If there is no technical problem, then one should not really remove a sample from the analysis, no matter how different it is from the others.

If there is no technical problem, and no reason to suspect one, then any decision to remove a sample on biological grounds would ultimately be impossible to distinguish from fitting the data to support your own hypothesis.

Do you agree?

outliers • 2.5k views

Well yes, generally speaking, removing an outlier should always be justified, and cherry-picking the data that best fit a hypothesis should be absolutely avoided. That being said, I think there is room for discussion about what counts as a valid justification for removing an outlier. Potential technical issues are not always easily identified. In many cases, one can have reasons to suspect that something went wrong without knowing what exactly. Perhaps if you could provide a more specific example, we could try to provide more specific answers.


Thanks. I want to understand the issue conceptually first; then I will know how to apply it myself to different cases. The point I was arguing for is that the only way one can distinguish between a justified removal of an outlier and cherry-picking of the data is if there is some technical problem with the outlier. How to identify the specific technical problem is another issue, though a challenge in itself of course.

Is that wrong? Is there another criterion you can suggest to distinguish conceptually between justified removal and cherry-picking?


I agree that the identification of a technical problem is the only way to know for sure that removing an outlier is fully justified. However, I think that suspicion of a technical problem, based on PCA for instance (to take Istvan's example), can also provide sufficient justification for removing an outlier. If 9 out of 10 replicates cluster together perfectly while the last one appears completely different, then it most likely means that something went wrong with that replicate. After all, it is very common for samples or whole experiments to fail; ask a wet lab biologist if you don't believe me.

Some level of judgment is needed, of course. To give you a counter-example, if in a PCA two out of three replicates cluster together, then I would not find it justified to remove the third one. The type of data can also affect your interpretation; that is why I asked for a specific example. ChIP experiments, for instance, are infamous for being technically hard to get working as planned.

4.0 years ago

Beyond the technical quantities that you list, clustering and PCA plots may be used to identify both outliers and mislabeled samples.

Here, of course, the number of replicates matters: the more the better. With four replicates, where the data is otherwise consistent, a mislabeled sample sticks out like a sore thumb.
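
As a rough illustration of the replicate-consistency idea, here is a minimal sketch. It is not PCA or clustering itself; it swaps in a simpler stand-in, the median Pearson correlation of each replicate to the other replicates of the same condition. The function names, the 0.9 cutoff, and the expression values are all made up for the example.

```python
from math import sqrt
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def flag_inconsistent_replicates(profiles, min_median_corr=0.9):
    """Return replicates whose median correlation to the other replicates
    is below min_median_corr (an arbitrary, hypothetical cutoff)."""
    flagged = []
    for name, values in profiles.items():
        corrs = [pearson(values, other)
                 for other_name, other in profiles.items()
                 if other_name != name]
        if median(corrs) < min_median_corr:
            flagged.append(name)
    return flagged


# Hypothetical log-scale expression values for four replicates of one condition.
replicates = {
    "rep1": [5.1, 2.0, 8.3, 0.5, 6.7],
    "rep2": [5.0, 2.2, 8.1, 0.7, 6.9],
    "rep3": [5.3, 1.9, 8.4, 0.4, 6.5],
    "rep4": [1.0, 7.5, 2.2, 6.8, 0.9],  # behaves like a different condition
}

suspects = flag_inconsistent_replicates(replicates)
print(suspects)
```

A flag from a check like this is only a prompt to go and look for a technical explanation; it does not say why the replicate differs.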

Several times, when I have reported a problem with a sample, my collaborators immediately said "yeah, that was a weird sample", or "I had a lot of trouble extracting from that one", or "the cells seemed in bad shape", etc. Those reasons too are, in my opinion, valid reasons to exclude a sample. Of course, it would have been better not to get there in the first place.

But as always, the important thing is not whether you are allowed to exclude a sample or not. It is the transparency of it.

Are you clearly stating what you did and why, and not just somewhere in the supplementary docs?

Put it out there in the open, state it clearly. Let the results speak or not. It would be unwise to toss out a whole experiment because one sample is an obvious outlier. I remember a case where I was quite certain the labels must have been swapped on two samples (they would cluster perfectly the other way around). In the end we dropped both, though we could have included them.

Long story short, when you exclude an outlier, the rest of the data needs to be more consistent to carry the burden of proof.


The issue with clustering and PCA plots is that by themselves they do not provide any way to distinguish between technical and biological variation between samples.

They can be very useful in providing hints and clues that help distinguish between technical and biological variation, but they cannot make that distinction by themselves.

Take the example you gave where the labels were possibly swapped. If there is no additional evidence from the data that they were swapped (such as specific genes being expressed in a condition where they have been knocked out), I think it would not be fair to declare that they were indeed swapped based on the clustering.

The reasons are that

  1. Based on the clustering alone, it's impossible to know whether there was really a sample swap or whether an unusual biological phenomenon has occurred.

  2. Ultimately, one can't distinguish between declaring that the samples were probably swapped because the data makes more sense when this is assumed, and declaring that the samples were probably swapped because the data then fits the researchers' hypothesis better.
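
The kind of additional evidence mentioned above (a knocked-out gene being expressed in a sample labeled as knockout) could be checked with something like the sketch below. Everything in it is hypothetical: the gene name `GeneX`, the sample names, the "KO"/"WT" labels, and the expression cutoff. Depending on how the knockout was made, the transcript may still be detectable, so the cutoff would have to be chosen per experiment.

```python
def find_label_contradictions(expression, labels, ko_gene, max_ko_level=1.0):
    """Return samples whose ko_gene expression contradicts their label.

    expression: {sample: {gene: value}}
    labels:     {sample: "KO" or "WT"}
    max_ko_level is an arbitrary cutoff above which the gene counts as expressed.
    """
    contradictions = []
    for sample, label in labels.items():
        expressed = expression[sample][ko_gene] > max_ko_level
        # A KO sample should not express the gene; a WT sample should.
        if (label == "KO" and expressed) or (label == "WT" and not expressed):
            contradictions.append(sample)
    return contradictions


# Hypothetical data in which s2 and s3 look as if their labels were exchanged.
expression = {
    "s1": {"GeneX": 9.2},
    "s2": {"GeneX": 0.1},
    "s3": {"GeneX": 8.7},
    "s4": {"GeneX": 0.2},
}
labels = {"s1": "WT", "s2": "WT", "s3": "KO", "s4": "KO"}

contradictions = find_label_contradictions(expression, labels, "GeneX")
print(contradictions)
```

Unlike the clustering, a contradiction of this kind does not depend on the hypothesis being tested, which is what makes it usable as independent evidence of a swap.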
