3.6 years ago
Will
Hi, I have a DGEList object created using edgeR. Its dimensions are 57820 x 1013. I need to choose a filtering strategy and I am not sure that my choice is correct. The norm.factors in x$samples are all 1, and summary(x$samples$lib.size) gives:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 6557050 31326322 36019156 35935285 40766618 79411964
I tried keep.exprs <- rowSums(cpm(x)>0.4) >= 5 and keep.exprs <- filterByExpr(x). When I run x_filtered <- x[keep.exprs,], the first gives dimensions of 52082 x 1013, while the second gives 24045 x 1013.
Which filtering is better, and why?
filterByExpr is preferred because it filters based on group information rather than arbitrarily on >= some integer value. You have 1013 samples? Make sure that your DGEList has proper group information for the filter to be meaningful.
Looks like there are some arbitrary integers set in the filterByExpr call: min.count=10 and min.total.count=15.
I was referring to filtering based on the group information rather than on an arbitrary value (here, the 5 samples that need to have a CPM above 0.4). Those five samples could be spread randomly over multiple groups, and each group could still lack the power for that gene to be called significant. That is why I think filterByExpr is preferred. After all, the aim of the filtering is to remove genes that inflate the multiple-testing burden, so a strategy that respects the size of the groups makes sense to me. The thresholds you mention are probably debatable, I agree.
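To make the group-size argument concrete, here is a rough base R sketch of the two thresholds that filterByExpr derives, as I understand them from its documentation and defaults (min.count = 10, large.n = 10, min.prop = 0.7). The median library size comes from the question. The group sizes are hypothetical, since the thread does not give them, and min_samples is a helper name made up for this illustration.

```r
# Sketch of the thresholds filterByExpr() derives, in base R (no edgeR needed).
# Defaults as documented for filterByExpr; check ?filterByExpr for your version.
min.count <- 10
large.n   <- 10
min.prop  <- 0.7

# Median library size reported in the question.
median.lib.size <- 36019156

# CPM cutoff that corresponds to min.count reads in a median-sized library.
cpm.cutoff <- min.count / median.lib.size * 1e6

# Hypothetical helper: number of samples that must reach the cutoff,
# given the size of the smallest group.
min_samples <- function(smallest.group) {
  if (smallest.group > large.n) {
    large.n + (smallest.group - large.n) * min.prop
  } else {
    smallest.group
  }
}

# Assumed scenarios: no grouping at all (one group of 1013 samples)
# versus an assumed smallest group of 300 samples.
n.required.ungrouped <- min_samples(1013)
n.required.grouped   <- min_samples(300)

print(round(cpm.cutoff, 3))
print(n.required.ungrouped)
print(n.required.grouped)
```

Under these assumptions the CPM cutoff is lower than the 0.4 used in the manual filter, but the number of samples that must pass it is in the hundreds rather than 5, which would be consistent with filterByExpr keeping far fewer genes.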
Thanks! So you suggest using keep.exprs <- filterByExpr(x, group=x$samples$group), where group in my case is the condition (healthy or not) of my subjects?
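A minimal sketch of that call on simulated toy data, assuming edgeR is installed. The counts, the 12-sample design and the Healthy/Disease labels are all made up for illustration. As far as I know, filterByExpr on a DGEList already falls back to x$samples$group when neither group nor design is given, so the key point is that the group column holds the real condition; if no group was supplied when the DGEList was built, every sample ends up in a single group.

```r
library(edgeR)

set.seed(1)
# Simulated toy counts: 200 genes x 12 samples (illustrative only).
counts <- matrix(rnbinom(200 * 12, mu = 20, size = 2), nrow = 200)
rownames(counts) <- paste0("gene", seq_len(200))

# Hypothetical condition labels, 6 samples per group.
condition <- factor(rep(c("Healthy", "Disease"), each = 6))

# Store the condition as the group when building the DGEList.
x <- DGEList(counts = counts, group = condition)

# Check that the grouping is what you expect before filtering.
print(table(x$samples$group))

# Filter using the group information, then recompute library sizes.
keep.exprs <- filterByExpr(x, group = x$samples$group)
x_filtered <- x[keep.exprs, , keep.lib.sizes = FALSE]

# The norm.factors are all 1 until normalization is run.
x_filtered <- calcNormFactors(x_filtered)
print(dim(x_filtered))
```

Checking table(x$samples$group) first is a cheap way to confirm that the filter sees the intended groups rather than one group containing every sample.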