If you want clarity about, and control over, how your outliers are handled, without relying on the "black box" of other people's code, there are a couple of common approaches you can implement yourself:
- Trimming
- Winsorization
In both approaches, you specify a percentage cutoff. You sort the data from low to high, and the method is applied to that percentage of values, split evenly between the lowest and highest ends.
For instance, if you have 1000 points and a 10% cutoff, the values you deal with are the top 50 and bottom 50 values. Each of these subsets makes up 5% of the total dataset, or 10% in total.
With the trimming method, any value from your dataset which falls in this cutoff is removed. If you start with 1000 values and have a 10% cutoff, you end up with a dataset containing 900 values.
With the Winsorization approach, unlike trimming, any value which falls in this cutoff is not removed, but is instead replaced with the nearest value that lies outside the cutoff (the lowest kept value for the bottom tail, the highest kept value for the top tail). You still end up with 1000 values.
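As a minimal base-R sketch of the two definitions above, assuming a rank-based cutoff split evenly between the tails (the helper names tail_count, trim_by_rank and winsorize_by_rank, and the toy vector, are made up for illustration):

```r
# Hypothetical helpers for illustration; not taken from any package.

# Number of values affected in each tail for a given total cutoff.
tail_count <- function(n, cutoff) round(n * cutoff / 2)

# Trimming: drop the k lowest and k highest values.
trim_by_rank <- function(x, cutoff) {
  n <- length(x)
  k <- tail_count(n, cutoff)
  sorted <- sort(x)
  sorted[seq_len(n - 2 * k) + k]
}

# Winsorization: clamp the k lowest and k highest values to the
# nearest values that would survive trimming.
winsorize_by_rank <- function(x, cutoff) {
  n <- length(x)
  k <- tail_count(n, cutoff)
  sorted <- sort(x)
  lo <- sorted[k + 1]
  hi <- sorted[n - k]
  pmin(pmax(x, lo), hi)
}

x <- c(-50, 1:18, 100)            # toy data: 20 values, two obvious outliers
trimmed_x <- trim_by_rank(x, 0.10)
winsorized_x <- winsorize_by_rank(x, 0.10)

length(trimmed_x)     # 18: one value dropped from each tail
length(winsorized_x)  # 20: same length as x
range(winsorized_x)   # 1 18: the two outliers were pulled in
```

Note that trim_by_rank returns the values in sorted order, while winsorize_by_rank keeps the original order of x.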
Both approaches change your distribution, but both reduce the influence of outliers: trimming drops them, while Winsorization pulls them in.
How many values are affected depends entirely on your choice of cutoff.
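To make the effect of the cutoff concrete, here is a small sketch of the arithmetic for a dataset of 1000 values, assuming the cutoff is split evenly between the two tails. The cutoffs chosen are arbitrary examples:

```r
n <- 1000                                # size of the dataset
cutoffs <- c(0.02, 0.10, 0.20)           # example total cutoffs

per_tail <- round(n * cutoffs / 2)       # values affected in each tail
remaining <- n - 2 * per_tail            # values left after trimming

data.frame(cutoff = cutoffs, per_tail = per_tail, remaining = remaining)
# per_tail is 10, 50, 100 and remaining is 980, 900, 800
```

With Winsorization the per_tail counts are the same, but those values are replaced rather than dropped, so the length stays at 1000.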
In R, you could trim simply by excising a number of rows from a dataset that meet the criteria (e.g., using a 10% cutoff):
q <- quantile(x, probs = c(5, 95)/100)
trimmed_x <- x[x>q[1] & x<q[2]]
In R, you could Winsorize with the winsorize function from statar (e.g., using a 10% cutoff):
winsorized_x <- winsorize(x, probs = c(0.05, 0.95))
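If you would rather not depend on statar, a rough base-R equivalent can be sketched with quantile(), pmax() and pmin(). The function name winsorize_base and the synthetic data are hypothetical, and this version clamps to the quantile values themselves, which may differ in detail from what statar does:

```r
# Hypothetical helper: clamp x to the given lower and upper quantiles.
winsorize_base <- function(x, probs = c(0.05, 0.95)) {
  q <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])   # raise low values to q[1], lower high values to q[2]
}

set.seed(1)                    # arbitrary seed so the sketch is repeatable
x <- c(rnorm(98), -40, 60)     # synthetic data with two extreme values
winsorized_x <- winsorize_base(x)

length(winsorized_x) == length(x)  # TRUE: nothing is removed
range(winsorized_x)                # the extremes are now the 5% and 95% quantiles
```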
Then plot trimmed_x or winsorized_x, as if it were your original dataset.
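For a quick side-by-side comparison, something like the following base-R sketch could be used. The data are synthetic, and the pmin/pmax line is a base-R stand-in for the statar call, used here only so the example has no package dependencies:

```r
set.seed(1)                                   # arbitrary seed
x <- c(rnorm(995), 25, 30, -28, 40, -35)      # synthetic data with a few extremes

q <- quantile(x, probs = c(0.05, 0.95))       # 10% cutoff, 5% per tail
trimmed_x <- x[x > q[1] & x < q[2]]           # trimming: drop the tails
winsorized_x <- pmin(pmax(x, q[1]), q[2])     # Winsorizing: clamp the tails

# One box per version of the data, on a shared axis
boxplot(list(original = x, trimmed = trimmed_x, winsorized = winsorized_x),
        ylab = "value")
```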
Use outlier.shape = NA in ggplot to hide outliers.
Try changing notchwidth values.
Instead of box plots, try beeswarm or violin plots with jitter.
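A hedged sketch of two of these options, assuming ggplot2 is installed; the data frame and its column names are invented for the example, and a beeswarm plot is not shown because it needs an additional package:

```r
library(ggplot2)   # assumes ggplot2 is installed

set.seed(1)        # arbitrary seed
df <- data.frame(group = "x", value = c(rnorm(200), 15, -12, 20))  # synthetic data

# Box plot in which the outlier points are not drawn
p_box <- ggplot(df, aes(x = group, y = value)) +
  geom_boxplot(outlier.shape = NA)

# Violin plot with the raw points jittered on top
p_violin <- ggplot(df, aes(x = group, y = value)) +
  geom_violin() +
  geom_jitter(width = 0.1, alpha = 0.4)

print(p_box)
print(p_violin)
```

Hiding the outlier points only changes what is drawn; the underlying data are untouched, unlike with trimming or Winsorization.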