I am trying to find differentially expressed genes in microarray data. I preferred to do this using excel because I did not know how to do it in R (performing large number of t-tests simultaneously). After performing t-test in excel and obtaining p-values, I try t-test for a few number of rows in R, but the results of R and Excel are not the same.
Other answers contain references for my first sentence. It sounds like a joke, but it is true most of the time, I should explain that maybe a bit better. Why to trust R over Excel?
R is open source, Excel is closed source, you can easily inspect the code for the t.test in R if you have doubts, while in Excel you cannot
R uses open-source libraries, these provide algorithms with very well-understood numerical properties, Excel uses whatever
R's build process contains automated tests, we do not know which test model Excel developers use
R's main field of application is statistics, and possibly t.test is one of the most used functions, while Excel is used for all kinds of things and t-test and statistics is a niche application.
If yes did you check the parameters each t-test uses? In R and Excel the standard parameters may differ, which causes different results.
In R some parameters you can set are for example:
alternative a character string specifying the alternative hypothesis,
must be one of "two.sided" (default), "greater" or
"less". You can specify just the initial letter.
paired a logical indicating whether you want a paired t-test.
var.equal a logical variable indicating whether to treat the two
variances as being equal. If TRUE then the pooled variance
is used to estimate the variance otherwise the Welch (or
Satterthwaite) approximation to the degrees of freedom is
used.
ADD COMMENT
• link
updated 2.7 years ago by
Ram
44k
•
written 9.9 years ago by
linus
▴
360
0
Entering edit mode
Thanks a lot for your comment
When I added var.equal=TRUE in R in t.test function, the results became the same as excel.
You should also read a few papers that discuss problems with t-tests and other statistics in Excel, for example Heiser 2006. Statistical tests, tests of significance, and tests of a hypothesis using Excel. Journal of Applied Statistical Methods, 5:155-171 or McCullough and Heiser 2008. On the accuracy of statistical procedures in Microsoft Excel 2007. Computational Statistics & Data Analysis 52:4570-4578. Here's a snippet from the second paper listing a few problems...
Inaccurate t-test results in the presence of missing values. Does not correctly calculate t test values when there are missing data cells in the range. Microsoft identified this fault (KBA 829252) in previous Excel versions, but did not fix it.
Inaccurate p-values from a t-test. Excel uses the Welch test method, which calculates a non-integer degrees-of-freedom value. The Excel algorithm uses the TTEST function, which truncates the degrees-of-freedom input to an integer. This gives an incorrect p value for the test.
You say that you tested a few rows in R, suggesting that the input data for the t-test in R is not the same as the input for the t-test in Excel?
Another source of difference is that R will handle floating point numbers differently, which will influence the outcome of your statistic. This is not a negative against R, however; R is better.
this is an interesting point actually - though personally I think the original poster simply is not using the test the same way.
back to the issue - internally Excel seems to use a much larger precision library to compute the probability distributions, and as consequence it will compute very tiny p-values well beyond what actually makes sense. This is reserved strictly to p-values for all other computation the representation double precision.
This can be extremely annoying when communicating with some life scientists as they hate the idea of seeing 0 for p-value and much prefer something like 10^(-100) as that looks more scientific. Educating someone on the spot in the intricacies of floating point representation is hopeless.
Anyway just something popped into my mind - that feeling of having to do something in Excel because it is "more accurate" - thankfully I can mostly refuse doing that
Other answers contain references for my first sentence. It sounds like a joke, but it is true most of the time, I should explain that maybe a bit better. Why to trust R over Excel?