Question

Why use the qvalue package instead of p.adjust in R stats?

14

Entering edit mode

9.8 years ago

rmccloskey ▴ 240

A lot of papers I've read have used the qvalue package from Bioconductor to do multiple testing correction. There is a function in the basic R stats package called p.adjust which seems to do the same thing. What is the advantage of the qvalue package, and why is it so frequently used instead of the built in R version?

qvalue R • 24k views

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by rmccloskey ▴ 240

0

Entering edit mode

9.8 years ago

Jean-Karim Heriche 27k

The qvalue package implements FDR methods taylored for genomics studies. The p.adjust function implements "standard" FDR methods. The qvalue package also has a plotting function and a GUI.

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

I know I'm a bit late to the party, but I wanted to point out that the qvalue package is not necessarily tailored to genomic studies. The qvalue algorithm has underlying assumptions that only allow its applicability when there is no dependence in the data, and with a specific p-value distribution. IF there is dependence in the data, and a "U-shaped" p-value distribution, then qvalue is not applicable (according to the q-value package vignette). I'm honestly not sure if dependence affects the BH method of p.adjust.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by reilly.brian.m ▴ 70

0

Entering edit mode

Thanks for the clarification. My understanding is that the BH procedure still controls the FDR when the tests statistics are positively correlated. The BY modification can be applied for other cases but is more conservative.
Edit: I associated qvalue with genomics studies from this paper.

ADD REPLY • link 8.8 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

Yup that paper is the same paper that I associate the q-value with as well. The vignette I was speaking of was from the R/Bioconductor qvalue package, which I think was developed from the authors of the PNAS paper. Here is the vignette, specifically section 5.2 talks about the applicability of the method with different pvalue distributions.

I have also read that the BH procedure is applicable in positive dependence, and that supposedly the BHY method is applicable in both positive and negative dependence. My main issue was understanding what positive/negative correlation/dependence looked like, and how to tell when you have those scenarios.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by reilly.brian.m ▴ 70

1

Entering edit mode

The property that must be satisfied by the test statistics (if they are not independent) in order for the BH procedure to be guaranteed to work is called positive regression dependence on a subset (abbreviated as PRDS). As I understand it, this means that given a series of tests, the probability of a test corresponding to a true null hypothesis doesn't decrease as the p-values increase. Or put another way, when you order the p-values of your tests from lowest to highest, the probability of the null hypothesis being true should not decrease. This means you should know how the distributions of your test statistics depend on the null hypotheses before applying the BH procedure. In general we don't know if the PRDS condition is satisfied. However, my initial understanding was that the PRDS condition is satisfied if the test statistics are positively correlated to each other but this has apparently only been proven for Gaussian distributed one-sided test statistics. It seems that in all other cases, one should be applying the BY procedure. So strictly speaking, before using the BH procedure, one should justify that either the test statistics are independent or that the PRDS condition is satisfied. My suspicion is that in practice the BH procedure is robust enough to departure from PRDS to still be controlling the FDR adequately.

ADD REPLY • link 8.8 years ago by Jean-Karim Heriche 27k

Ram · Accepted Answer · 2015-01-30

21

Entering edit mode

9.8 years ago

Devon Ryan 104k

p.adjust() and the qvalue package aren't actually doing exactly the same thing. They are doing quite similar things with similar ends in mind, but the algorithms are different, so they produce different results (Gordon Smyth, who wrote p.adjust(), has a short summary of the history here and Tim Triche gives a nice explanation of the differences in the same thread). Interestingly, the packages I use almost always use p.adjust() rather than the qvalue package, though if you read a paper that uses them the authors might actually say "q-value" rather than "BH adjusted p-value".

BTW, q-values are a bit more difficult to calculate and you actually need quite a few data points. In effect, you have to use the data at hand to estimate the expected null rate and then multiply that by the BH adjusted p-value. Consequently, p.adjust() ends up being more generally applicable, but in cases where q-values are appropriate they tend to give more power.

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks for the great explanation!

ADD REPLY • link 9.8 years ago by rmccloskey ▴ 240

0

Entering edit mode

Hello -- The link under the phrase "the same thread" points to https://stat.ethz.ch/pipermail/bioconductor/attachments/20121219/00dc27b1/attachment.pl which is a perl file. What link did you intend? Thanks!

Something to add: When I went to the original post, below it is the following statement:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: https://stat.ethz.ch/pipermail/bioconductor/attachments/20121219/00dc27b1/attachment.pl

But I'd love to read more explanations of the differences!

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 7.6 years ago by ariel.balter ▴ 260

1

Entering edit mode

I was able to look at that weird attachment. For posterity, here is Tim Triche's reply:

p-value = extremal probability for a test statistic under the null hypothesis, not accounting for multiple comparisons BH p-value, pBH = extremal probability for the same, after accounting for multiple comparisons to upper-bound the overall false positive rate at <= p q-value = direct estimate of the FDR associated with pBH

see http://genomics.princeton.edu/storeylab/papers/directfdr.pdf for the original, and quite well written paper, where on page 485,

The basic point that we make is that using the Benjamini and Hochberg (1995) method to control FDR at level α/π0 is equivalent to (i.e. rejects the same p-values as) using the proposed method to control FDR at level α. The gain in power from our approach is clear--we control a smaller error rate (α <= α/π0), yet reject the same number of tests.

q-values depend also on the estimated fraction of test p-values in the chance or uniform component of the distribution at some pFDR p. pi0 = estimated probability (overall) of a given result being truly null (i.e., false positive) at p | FDR q - value = BH p-value * pi0 (probability that test t incorrectly rejects the null at pBH)

So q = pBH * pi0 (++) as can be verified from the output, and directly estimates the pFDR for test t assuming independence among the tests. The mathematical justification for this is given in the paper; the basic machinery can be, and has been, extended to many other situations.

(++) If pi0 is estimated as at or very near 1.0, then pBH and q will be the same for any given test t, to the limit of machine precision (see paper).

At least that's how it appears to be implemented last time I looked at the code and the paper :-)

ADD REPLY • link 5.5 years ago by PS ▴ 10