mariannapauletto:
Hi guys,
I'm working with DE genes from an edgeR analysis comparing two experimental groups (n = 3 per group). I noticed that some DEGs show high variability within replicates. Does the program take this aspect into account? Are these DEGs reliable or not? If not, how can I filter the edgeR output to make it more reliable? Is there specific code in edgeR to cope with this issue?
Thanks for your help!
Best, Marianna
ATpoint:
Yes, this is a key feature of (basically any) statistical approach.
Can you give a concrete example, including the expression values and p-values of such a gene? Even if the dispersion is higher for certain genes, if the relative expression is decent and the fold changes are large it can still be statistically significant.
You can always decrease the FDR cutoff to be more conservative, but at the cost of false negatives.
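For instance, a stricter cutoff applied to the full results table (a minimal sketch, assuming lrt is the result of a glmLRT call; the 1% threshold is just an example):

library(edgeR)

# Pull the full results table, then keep genes below a stricter FDR
tt  <- topTags(lrt, n = Inf)$table
sig <- tt[tt$FDR < 0.01, ]  # stricter than the usual 5%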
Please also share the code you used. Did you use filterByExpr?

mariannapauletto: Sorry, I made a mess with the comments; see the reply below.
mariannapauletto:
Hi ATpoint,
thank you for your quick reply.
I'm talking about the results of an analysis conducted in the past, with an edgeR version that lacked the filterByExpr command. In any case, I filtered the data with the approach suggested by the manual of that version (see below).
This is the code:
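(A minimal sketch of the classic edgeR pipeline implied here, not necessarily the exact original code: it assumes counts is the raw count matrix and group a two-group factor, and uses the CPM < 0.5 filter and the glmLRT test discussed below.)

library(edgeR)

# Build the DGEList from raw counts (genes x samples)
y <- DGEList(counts = counts, group = group)

# Old-style filter: keep genes with CPM > 0.5 in at least 3 samples
# (3 = size of the smallest group)
keep <- rowSums(cpm(y) > 0.5) >= 3
y <- y[keep, , keep.lib.sizes = FALSE]

# TMM normalization and dispersion estimation
y <- calcNormFactors(y)
design <- model.matrix(~ group)
y <- estimateDisp(y, design)

# Likelihood-ratio test, the common default in older edgeR workflows
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef = 2)
topTags(lrt)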
This is an example of a DEG with a high SD (values in CPM):
group 1: 0.256892, 0.321829, 0.06487 | group 2: 5.078998, 0.367778, 1.278966
logFC: 3.28; logCPM: -0.088; LR: 12.41; p-value: 0.00042; FDR: 0.075
ATpoint:
Since I typically filter for FDR < 0.05, it would not be significant in my eyes. The logCPM is also quite low. I would probably not trust it. If you still have the raw counts, I would plug them into the current edgeR version, use filterByExpr, and also the glmQLF framework, which is (from what I understand) what the developers currently recommend as the default approach. If you google glmLRT vs glmQLF you will find posts at Bioconductor where they explain why they think it is superior in most cases.
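A minimal sketch of that recommended workflow (same assumed counts and group objects as in the sketch above):

library(edgeR)

y <- DGEList(counts = counts, group = group)

# filterByExpr chooses a data-driven expression threshold
keep <- filterByExpr(y, group = group)
y <- y[keep, , keep.lib.sizes = FALSE]

y <- calcNormFactors(y)
design <- model.matrix(~ group)
y <- estimateDisp(y, design)

# Quasi-likelihood pipeline: accounts for uncertainty in the gene-wise
# dispersion estimates, giving more conservative, better-calibrated tests
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = 2)
topTags(qlf)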
mariannapauletto:
Thank you for your reply.
That's clear. Actually, with recent datasets I've implemented exactly the approach you mention (filterByExpr and glmQLF).
But, in your opinion, looking at the results obtained with the previous version and the code I used (including filtering out low expression, CPM < 0.5, i.e. about 10-15 reads), is there any reason to conclude that significant genes (FC > 1.5, FDR < 10%) are not reliable?
Yes, I get the point of the FDR cutoff, but a looser cutoff only increases the expected proportion of false positives in the list as a whole; it says nothing about the reliability of a single DEG. Isn't that right?
Thank you very much for this fruitful discussion!
Best
Marianna
mariannapauletto: Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to the original question. Thank you.

mariannapauletto: Sorry for that.