Question

Filtering the genes based on cpm in edgeR


5.0 years ago

elb ▴ 260

Hi guys, I have a simple question about edgeR and the cpm filtering (i.e. filtering of genes) based on the total number of samples and conditions. The experimental design is the following:

  Condition_a: Sa1, Sa2, Sa3
  Condition_b: Sb1, Sb2, Sb3
  Condition_c: Sc1, Sc2, Sc3
  Condition_d: Sd1, Sd2, Sd3
  Condition_e: Se1, Se2, Se3
  Condition_f: Sf1, Sf2, Sf3
  Condition_ctrl: Sctrl1, Sctrl2, Sctrl3, Sctrl4.

Briefly, each condition has 3 biological replicates, except the control, which has 4.

To perform the differential expression analysis, my boss asked me to keep genes with cpm > 1 in at least 2 of the 3 replicates of at least one condition. Since the control has 4 replicates, he asked me to require 3 of 4 for the control. So I'm a little confused about the total number of samples that must satisfy the condition:

   keep <- rowSums(cpm(y)>1) >= ?

Can anyone help me please?

Thank you in advance

RNA-Seq edgeR • 4.8k views


score 0 · Answer 1 · 2020-04-06


5.0 years ago

ATpoint 87k

edgeR has a dedicated function for this kind of filtering, filterByExpr, which I strongly recommend since 1) it does not require you to pick an arbitrary threshold yourself and 2) it is recommended by the authors. Please check the manual for it. I personally find it best (as in the manual) to first create your DGEList object, then put the design into that DGEList, and then run the filter function. It is important that the DGEList contains the design, since the filtering function needs it to know which samples belong to which group.

In short, filterByExpr removes genes with consistently low counts that have no chance of being called differential with the given design, number of replicates, and sequencing depth. This helps to decrease the multiple testing burden (maybe it also helps with dispersion estimation and normalization, I do not recall, but this is described in the manual, which also references the underlying paper).
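As a minimal sketch of that workflow, assuming the sample layout from the question: the counts below are simulated purely for illustration, and the group labels and the `~ 0 + group` design are assumptions rather than anything taken from the actual data.

```r
library(edgeR)

set.seed(1)

# Sample layout from the question: 6 conditions x 3 replicates + 4 controls
group <- factor(c(rep(c("a", "b", "c", "d", "e", "f"), each = 3),
                  rep("ctrl", 4)))

# Simulated counts (illustration only): 500 low-count genes, then 500
# well-expressed genes, for all 22 samples
n_genes <- 1000
counts <- matrix(
  rnbinom(n_genes * length(group), size = 5, mu = rep(c(1, 100), each = 500)),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), NULL)
)

# Build the DGEList with the grouping, then filter using the design
y <- DGEList(counts = counts, group = group)
design <- model.matrix(~ 0 + group)
keep <- filterByExpr(y, design = design)

# Drop the filtered genes and recompute library sizes before normalization
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)
```

According to the function's documentation, the number of samples that must pass is derived from the smallest group size in the design (3 here), so it does not need to be supplied by hand.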



OK, thank you very much, but I'm reading that it requires: min.count: numeric. Minimum count required for at least some samples. My question remains how to determine the number of "at least some samples" based on the design of the experiment I have to analyse.

ADD REPLY • link 5.0 years ago by elb ▴ 260
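For reference, the rule described in the question (cpm > 1 in at least 2 of 3 replicates of at least one condition, 3 of 4 for the control) cannot be written as a single rowSums(cpm(y) > 1) >= n, because that counts samples across all conditions at once. A minimal base-R sketch of the per-condition version follows. The helper names (`cpm_manual`, `keep_by_condition`), the toy counts, and the column order are hypothetical, and CPM is computed by hand so that no package is needed.

```r
# Counts per million computed by hand (counts / library size * 1e6)
cpm_manual <- function(counts) t(t(counts) / colSums(counts)) * 1e6

# Sample layout from the question; columns are assumed to be in this order
group <- factor(c(rep(c("a", "b", "c", "d", "e", "f"), each = 3),
                  rep("ctrl", 4)))

# Replicates that must pass per condition: 2 of 3, or 3 of 4 for the control
min_samples <- ifelse(levels(group) == "ctrl", 3, 2)
names(min_samples) <- levels(group)

keep_by_condition <- function(counts, group, min_samples, cpm_cutoff = 1) {
  above <- cpm_manual(counts) > cpm_cutoff
  # genes x conditions matrix: replicates above the cutoff in each condition
  n_above <- do.call(cbind, lapply(levels(group), function(g) {
    rowSums(above[, group == g, drop = FALSE])
  }))
  # a gene is kept if at least one condition reaches its own threshold
  passes <- sweep(n_above, 2, min_samples[levels(group)], ">=")
  rowSums(passes) >= 1
}

# Toy counts: a large background gene keeps library sizes near one million
counts <- matrix(0, nrow = 4, ncol = length(group),
                 dimnames = list(c("bg", "gene1", "gene2", "gene3"), NULL))
counts["bg", ] <- 1e6
counts["gene1", c(1, 2)] <- 5      # 2 of 3 replicates of condition a
counts["gene2", c(19, 20)] <- 5    # only 2 of 4 control replicates
counts["gene3", c(1, 4, 7)] <- 5   # 1 replicate each in conditions a, b, c

keep <- keep_by_condition(counts, group, min_samples)
print(keep)
```

In this toy example gene3 is above the cutoff in three samples overall, yet it is dropped because no single condition reaches its threshold; a global rowSums(...) >= 2 would have kept it.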


Yes, but there are default settings for this which I would not change. It is of course OK to use your own settings and be more stringent than the defaults, but in this case the parameters come from one of the leading RNA-seq analysis groups, with extensive experience in the field. I guess these defaults are well tested for most applications. The thing is that you also must not overfilter, since edgeR makes certain assumptions about the count distributions in order to work properly. Sticking with the defaults is pretty safe unless you have expert knowledge (which I, at least, do not have to that extent).