I estimated DEGs from RNASeq using edgeR method. But the final DEG list has several genes with low counts and are having high Fold change
logFC logCPM F PValue FDR Ctrl1 Ctrl2 Str1 Str2
Gene1 7.673 -2.658 33.203 1.27E-05 0.000118 0 0 0.738954 0.657007
Gene5 7.605 -2.951 26.996 3.36E-05 0.000255 0 0 0.674697 0.657007
Gene35 7.499 -2.072 22.724 8.47E-05 0.0034 0 0 0.642569 0.594435
Gene67 7.462 -2.151 36.849 2.49E-06 3.32E 0 0 0.642569 0.563149
Initialy, I filtered out low count genes using
keep<-filterByExpr(y,group = group)
y<-y[keep,keep.lib.size=FALSE]
But I am not sure why genes normalized count value close to zero are retained and listed as DEGs. How can I solve this? I was thinking to use min.count parameter in filterByExpr.
But, is that the right way? Suppose if I use, keep<-filterByExpr(y,group = group, min.count=500)
, the genes with greater than 1 normalized counts are retained.
To be more specific, where should I apply filterByExpr
function, before normaliztion (ie on raw counts) or after normaliztion ?
Code:
group<-factor(paste(targets$Genotype,targets$Time,targets$Treatment,sep="."))
cbind(targets,Group=group)
y<-DGEList(counts = raw_counts, group = group)
#Filterout low count genes
keep<-filterByExpr(y,group = group)
y<-y[keep,keep.lib.size=FALSE]
#Data Normalization
y<-calcNormFactors(y)
#Design matrix
design<-model.matrix(~0+group)
colnames(design)<-levels(group)
#Dsipersion estimation
y<-estimateDisp(y,design)
fit<-glmQLFit(y,design)
Dear all,
As far as I know, CPM threshold 0.5 (corresponded to 10 counts) is recommended for a 20 million reads library, isn't it? Could you kindly clear me which CPM threshold will be used for filtering by
filterByExpr
function?See the help page
help(filterByExpr)
.