I want to identify genes whose mean expression may be unrepresentative because of high variability or outliers.
Specifically, I am working with single-cell data that is pre-processed (normalized and clustered to cell types). For my downstream analysis, I need to use the mean expression per gene and cluster.
However, the question arose whether outliers could skew the mean and affect my downstream analysis. Since I need to use the mean, the idea is to exclude specific genes in specific clusters if outliers might be skewing it.
My brain is in a bit of a knot over how to go about this. I've tried and considered different things, but can't think of a good systematic approach. I would greatly appreciate feedback and tips.
- Idea 1: For each cluster and each gene, check whether a boxplot shows outliers (values above the upper quartile + 1.5 * interquartile range, or below the lower quartile - 1.5 * interquartile range). This may be too conservative though, as one outlier in 200 single cells would barely impact the mean.
- Idea 2: Check the mean and the standard deviation of each gene per cluster. If a gene has a standard deviation above a certain threshold (e.g. mean * 1.6), exclude that gene in that cluster. I'm unsure whether this is a good approach and which threshold would be advisable.
- Idea 3: Check the variability of all genes per cluster, and identify outliers at that level (e.g. with the boxplot approach: if the variance of a gene is greater than the upper quartile + 1.5 * IQR of the variances of all genes in my data, exclude it). I'm unsure whether this is a good approach, and whether it may be biased because the variability of gene expression differs biologically between genes.
I feel like there must be an established approach for this, but I don't know it. My personal tendency at the moment is idea 2 ...
I'd be very grateful for any tips or feedback.