Hi guys, I have a simple question about edgeR and CPM filtering (i.e. filtering of genes) based on the number of samples and conditions. The experimental design is the following:
Condition_a: Sa1, Sa2, Sa3
Condition_b: Sb1, Sb2, Sb3
Condition_c: Sc1, Sc2, Sc3
Condition_d: Sd1, Sd2, Sd3
Condition_e: Se1, Se2, Se3
Condition_f: Sf1, Sf2, Sf3
Condition_ctrl: Sctrl1, Sctrl2, Sctrl3, Sctrl4
Briefly, each condition has 3 biological replicates, except for the control, which has 4.
In order to perform the differential gene expression analysis, my boss asked me to keep genes with cpm > 1 in at least 2/3 of the replicates of at least one condition. Since the control has 4 replicates, he asked me to use 3/4 for the control. So I'm a little confused about the total number of samples that must satisfy the condition.
keep <- rowSums(cpm(y)>1) >= ?
Can anyone help me please?
Thank you in advance
OK, thank you very much, but I'm reading that it requires: "min.count: numeric. Minimum count required for at least some samples." My question remains how to determine the number of "at least some samples" based on the design of the experiment I have to analyse.
Yes, but there are default settings for this, which I would not change. It is of course fine to use your own settings and be more stringent than the defaults, but in this case the parameters come from one of the leading RNA-seq analysis groups, with extensive experience in the field. I guess these defaults are well tested for most applications. The thing is that you also must not overfilter, since edgeR makes certain assumptions about the count distributions in order to work properly. Sticking with the defaults is pretty safe unless you have expert knowledge (which I, at least, do not have to that extent).
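For reference, a minimal sketch of what filtering with the defaults could look like. It assumes the function under discussion is edgeR's filterByExpr (the thread only quotes its min.count argument), and it uses simulated counts in place of the real data; the object names are hypothetical.

```r
library(edgeR)

set.seed(1)
# Hypothetical data: 1000 genes x 22 samples, laid out like the design in the question
group  <- factor(c(rep(c("a", "b", "c", "d", "e", "f"), each = 3), rep("ctrl", 4)))
# First 500 genes simulated as lowly expressed, last 500 as well expressed
counts <- matrix(rnbinom(1000 * length(group), mu = rep(c(0.1, 20), each = 500), size = 2),
                 nrow = 1000)
y <- DGEList(counts = counts, group = group)

# When 'group' is supplied, the "at least some samples" number is derived
# from the smallest group size (3 here) rather than being set by hand.
keep <- filterByExpr(y, group = group)

# Subset the genes and recompute the library sizes
y_filtered <- y[keep, , keep.lib.sizes = FALSE]
```

With this design the smallest group has 3 samples, so the defaults already tie the sample requirement to the replicate count without any manual bookkeeping.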
Yes, I understand and agree, but I have to do what was explicitly required.
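If the explicit rule has to be applied as stated (cpm > 1 in at least 2 of 3 replicates of at least one condition, or 3 of 4 for the control), one way to sketch it is to count the passing samples per condition instead of across all samples. This is only an illustration under assumptions: the counts are simulated, and names such as above, n_above and required are hypothetical.

```r
library(edgeR)

set.seed(1)
# Hypothetical data: 1000 genes x 22 samples, laid out like the design in the question
group  <- factor(c(rep(c("a", "b", "c", "d", "e", "f"), each = 3), rep("ctrl", 4)))
counts <- matrix(rnbinom(1000 * length(group), mu = rep(c(0.1, 20), each = 500), size = 2),
                 nrow = 1000)
y <- DGEList(counts = counts, group = group)

# genes x samples matrix, TRUE where CPM > 1
above <- cpm(y) > 1

# Number of samples above the threshold per condition (conditions x genes).
# rowsum() needs numeric input, hence the multiplication by 1.
n_above <- rowsum(t(above) * 1, group = group)

# Required number of passing replicates per condition: 2 of 3, and 3 of 4 for control
required <- c(a = 2, b = 2, c = 2, d = 2, e = 2, f = 2, ctrl = 3)[rownames(n_above)]

# Keep a gene if at least one condition meets its own requirement
keep <- colSums(n_above >= required) >= 1

y_filtered <- y[keep, , keep.lib.sizes = FALSE]
```

A single rowSums(cpm(y) > 1) >= k cannot express this rule exactly, because it counts samples across all conditions. k = 2 is the closest loose bound: every gene that passes the per-condition rule has at least 2 samples above the threshold, but the reverse does not hold.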