How does edgeR's filterByExpr work?
8 weeks ago
Haad • 0

I was reading the edgeR 4.0 paper, but all it says is that "edgeR's filterByExpr function is used to keep only those genes or features that have sufficient counts to achieve statistical significance when meaningful differential abundance is present." How is this done exactly? In the edgeR User's Guide (page 11) they just write keep <- filterByExpr(y), even though the reference manual says that the minimum count is required. Can someone explain how this function works?


Type ?filterByExpr and read the Details section, which covers what the function does.

8 weeks ago
Gordon Smyth ★ 7.7k

Thanks for reading our edgeR 4.0 paper, and also for reading the User's Guide and the reference manual!

I think you're not reading the meaning of the help page and reference manual quite correctly. First, let me note that the reference manual available from https://www.bioconductor.org/packages/release/bioc/manuals/edgeR/man/edgeR.pdf is simply a pdf collation of the help pages for all the functions in edgeR. It gives exactly the same information that you would get from typing ?filterByExpr at the R prompt, which gives the help page just for that function.

The filterByExpr help page gives the following usage line:

filterByExpr(y, design = NULL, group = NULL, lib.size = NULL,
             min.count = 10, min.total.count = 15, large.n = 10, min.prop = 0.7, ...)

which shows the default value of each argument; for example, the default value of min.count is 10. Arguments shown as NULL in the usage line have defaults that depend on the data, and these are explained in the Details section of the documentation.

You can see from the above usage line that only the y argument is compulsory in the function call because it is the only argument that doesn't have a default value. (This is the same documentation convention that is used by all the base packages in R. The help pages for all the base functions in R can be read this way.) Users will usually specify either design or group as well, but the other arguments are usually left at their defaults.
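To make the convention concrete, here is a minimal base R sketch. The function `toy_filter` is a hypothetical stand-in written only for this illustration; it borrows the style of the usage line above but is not edgeR's filterByExpr and does no filtering.

```r
# Hypothetical stand-in: same signature style as the usage line, no real filtering.
toy_filter <- function(y, design = NULL, group = NULL, min.count = 10) {
  force(y)  # y has no default, so omitting it is an error
  list(group_supplied = !is.null(group), min.count = min.count)
}

# Only the compulsory argument is supplied; min.count falls back to its default.
res <- toy_filter(matrix(1:4, nrow = 2))
res$min.count

# The same default can be read directly off the function definition.
formals(toy_filter)$min.count

# Omitting y fails because there is no default value to fall back on.
err <- tryCatch(toy_filter(), error = function(e) conditionMessage(e))
err
```

The call with only `y` succeeds and reports a min.count of 10, while the call with no arguments produces an error message about the missing `y`.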

You say in your question that "in the reference manual it is written that the minimum count is required", but the word "required" in the min.count documentation simply refers to the fact that rows of the count matrix are required to satisfy this minimum. It doesn't mean that min.count is a required (compulsory) argument in the function call.

In the "quick start" example on page 11 of the edgeR User's Guide, y is a DGEList. In this case, the group argument is read from the DGEList as group <- y$samples$group and all the other filterByExpr arguments are set to their defaults. So in this case the call

keep <- filterByExpr(y)

is exactly equivalent to

keep <- filterByExpr(y, group=group, min.count=10)

I see that we have not explained on the help page how filterByExpr takes information from a DGEList object, so my apologies for that. By default, filterByExpr extracts the library sizes and the experimental design from the DGEList.
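The equivalence of the two calls can be checked on simulated data. This is only a sketch: the counts are random numbers rather than real data, the object names are chosen for this example, and the block does nothing if edgeR is not installed.

```r
# Sketch only: simulated counts, 1000 genes by 6 samples in two groups of 3.
if (requireNamespace("edgeR", quietly = TRUE)) {
  library(edgeR)
  set.seed(1)
  counts <- matrix(rnbinom(1000 * 6, mu = 20, size = 2), nrow = 1000)
  group <- factor(rep(c("A", "B"), each = 3))
  y <- DGEList(counts = counts, group = group)

  # Call with defaults only; group and library sizes are taken from y
  keep1 <- filterByExpr(y)

  # Same call with group and min.count written out explicitly
  keep2 <- filterByExpr(y, group = y$samples$group, min.count = 10)

  same <- identical(keep1, keep2)
  print(same)
  print(table(keep1))
}
```

If the two calls are equivalent as described, `same` should be TRUE.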

8 weeks ago
Basti ★ 2.0k

You will find the explanation of how the function works here: https://rdrr.io/bioc/edgeR/man/filterByExpr.html

This function implements the filtering strategy that was intuitively described by Chen et al. (2016). Roughly speaking, the strategy keeps genes that have at least min.count reads in a worthwhile number of samples. More precisely, the filtering keeps genes that have count-per-million (CPM) above k in n samples, where k is determined by min.count and by the sample library sizes, and n is determined by the design matrix.

n is essentially the smallest group sample size or, more generally, the minimum inverse leverage of any fitted value. If all the group sizes are larger than large.n, then this is relaxed slightly, but with n always greater than min.prop of the smallest group size (70% by default).
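As a rough illustration of the rule just described, the sketch below restates it in base R for the simple case where samples fall into groups. The function name `sketch_filter` and the toy counts are made up for this example. The exact formulas used here (a CPM cutoff taken at the median library size, the large.n/min.prop relaxation, and the min.total.count check) reflect one reading of the current help page and are assumptions to verify against ?filterByExpr, not edgeR's implementation. The design-matrix (leverage) case is ignored.

```r
# Hypothetical helper: a simplified restatement of the rule, NOT edgeR's code.
sketch_filter <- function(counts, group, min.count = 10, min.total.count = 15,
                          large.n = 10, min.prop = 0.7) {
  counts <- as.matrix(counts)
  lib.size <- colSums(counts)

  # n: smallest non-empty group size, relaxed when groups are large
  sizes <- table(group)
  n <- min(sizes[sizes > 0])
  if (n > large.n) n <- large.n + (n - large.n) * min.prop

  # k: CPM cutoff corresponding to min.count at the median library size
  k <- min.count / median(lib.size) * 1e6

  # Counts per million for every gene in every sample
  cpm <- t(t(counts) / lib.size) * 1e6

  keep.cpm <- rowSums(cpm >= k) >= n
  keep.total <- rowSums(counts) >= min.total.count
  keep.cpm & keep.total
}

# Toy data: 4 samples in 2 groups; the "rest" row pads every library to 1e6 reads.
counts <- rbind(
  geneA = c(100, 100, 100, 100),
  geneB = c(0, 0, 50, 60),
  geneC = c(20, 0, 0, 0),
  geneD = c(1, 1, 1, 1),
  rest  = c(999879, 999899, 999849, 999839)
)
group <- factor(c("A", "A", "B", "B"))
keep <- sketch_filter(counts, group)
keep
```

Here every library size is one million, so CPM equals the raw count and k is 10, with n equal to 2. geneB is kept because it reaches the cutoff in both samples of one group, geneC passes in only one sample, and geneD never reaches the cutoff, so the last two are dropped. Under the same assumed formula, two groups of 20 samples would give n = 10 + (20 - 10) * 0.7 = 17.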


What if min.count is not provided? In the example in the documentation, min.count is not provided:

## Not run: 
keep <- filterByExpr(y, design)
y <- y[keep,]

## End(Not run)

See my answer below.

Also be aware that the edgeR documentation on rdrr.io is nearly four years old. To get the latest documentation, use the help that comes with edgeR as advised by ATpoint. Or alternatively, the documentation for all edgeR functions is available as a pdf here, which I think you are already reading: https://www.bioconductor.org/packages/release/bioc/manuals/edgeR/man/edgeR.pdf

The current filterByExpr documentation gives explicit mathematical formulas that define how the function works.
