I'm creating a wrapper around edgeR's exactTest. I noticed (and understand why) that the results differ depending on whether I estimate dispersion from the entire dataset or from just the two groups I'm comparing. My question is whether one option is preferred. My intuition tells me it's better to estimate dispersion from the entire dataset, even if I'm only going to be comparing conditions pairwise.
In this admittedly contrived example (chosen for simplicity), I'm using the iris dataset. There are 150 samples and 4 "genes" (['sepal_length', 'sepal_width', 'petal_length', 'petal_width']) with 3 "conditions" (['setosa', 'versicolor', 'virginica']). I'm treating setosa as my "reference" condition, so everything will be relative to it.
If I estimate dispersion for each pair individually (i.e., setosa vs. versicolor and setosa vs. virginica), I get the following output from exactTest:
If I instead estimate dispersion from the entire dataset first, I get this:
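For reference, the two workflows I'm comparing look roughly like the sketch below. This is illustrative only: iris measurements aren't counts, so I scale and round them to fake integers just so edgeR will accept the matrix, and I only show the setosa vs. versicolor comparison.

```r
library(edgeR)

data(iris)
# 4 "genes" x 150 samples; scaled/rounded purely so the values look like counts
counts <- t(round(as.matrix(iris[, 1:4]) * 10))
group  <- iris$Species

# Option 1: estimate dispersion from only the two groups being compared
keep   <- group %in% c("setosa", "versicolor")
y_pair <- DGEList(counts = counts[, keep], group = droplevels(group[keep]))
y_pair <- estimateDisp(y_pair)
et_pair <- exactTest(y_pair, pair = c("setosa", "versicolor"))

# Option 2: estimate dispersion from all three groups, then test one pair
y_all <- DGEList(counts = counts, group = group)
y_all <- estimateDisp(y_all)
et_all <- exactTest(y_all, pair = c("setosa", "versicolor"))

# The dispersion estimates, and hence the p-values, differ between the two
topTags(et_pair)
topTags(et_all)
```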
What is preferred by the bioinformatics community?
What are the pros and cons of using one way over another?