Question

losing many genes while doing DGE on the RNAseq data due to read count threshold

0

Entering edit mode

17 months ago

Sara ▴ 270

I am working with RNAseq data from patients and conditions I have and want to compare to each other for the gene expression analysis, are before and after treatment. the threshold I used for the read count is 20. I have 2 questions:

1- we expect to see high expression of some genes after treatment (which were reported in other studies and we also have seen the high expression of those genes) but in this data we did not see them in gene expression results. could it be due to low count of those genes in one or few samples and they are filtered out after applying the read count cut off. what can I do to fix this issue? shall I use 10 as threshold?

2- we know some genes are not expressed in the before treatment samples but their expression is high in after treatment samples. how can I do the DGE correctly not to lose those genes?

DGE RNAseq • 1.7k views

ADD COMMENT • link updated 17 months ago by i.sudbery 20k • written 17 months ago by Sara ▴ 270

0

Entering edit mode

Could you clarify how did you set the threshold (code lines) ?

ADD REPLY • link 17 months ago by Basti ★ 2.0k

0

Entering edit mode

@Basti: for that I did in excel manually and I used DESeq2 for DGE.

ADD REPLY • link 17 months ago by Sara ▴ 270

2

Entering edit mode

It is not relevant to use excel for such tasks. You'd better import your raw counts matrix into R and apply the filter thereafter. I still do not understand specifically what is the criteria you applied because you said your threshold for read counts was 20. But for how many samples not hitting the threshold did you remove a gene ?

ADD REPLY • link 17 months ago by Basti ★ 2.0k

0

Entering edit mode

I have 30 samples. 13 samples have less read count than 20. what do you mean by "what is the criteria " ? what should be the criteria?

ADD REPLY • link 17 months ago by Sara ▴ 270

0

Entering edit mode

In edgeR::filterByExpr() for instance, the criteria is to "keep genes that have count-per-million (CPM) above k" (read count=20 in your example) "in n samples, [...] " and "n is essentially the smallest group sample size". As recommended by ATpoint, I would go with filterByExpron your data.

ADD REPLY • link 17 months ago by Basti ★ 2.0k

score 2 · Answer 1 · 2023-07-21

My advise is to use the function edgeR::filterByExpr() which is automated with reasonable defaults for prefiltering prior to DE. Don't ever do custom things in Excel, this is error-prone and not reproducible.

Load the count matrix into R, filter with this function, and then apply the normal DESeq2 workflow as in the vignette.

Basically the keep line here http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#pre-filtering would be replaced by

keep <- edgeR::filterByExpr(counts(dds), group=dds$group)

where group would be the column that has the before/after treatment information. In most cases, very very roughly, you retain about 15,000-25,000 genes after such a filter.

score 1 · Answer 2 · 2023-07-21

1

Entering edit mode

17 months ago

i.sudbery 20k

If you are using DESeq2, there is no need to do any filtering by expression before doing DGE, because DESeq2 will automatically choose the expression threshold that provides the largest number of DE genes by itself. This independent filtering is controlled by the two parameters to the results function in DESeq - independentFiltering and alpha. independentFiltering controls whether to do filtering at all. The default is TRUE. alpha is the FDR threshold you intend to use for selecting differential genes. The default is 0.1, but most people I know use 0.05. Which ever you use, DESeq will try a range of different expression thresholds, and select the one that produces the largest number of differential genes at the alpha threshold you selected.

More detail here: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#indfilt

and

here: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#indfilttheory

ADD COMMENT • link 17 months ago by i.sudbery 20k

0

Entering edit mode

I personally would always prefilter, to avoid skewed estimates of logFCs and pvalues due to small counts, like very large FCs for some genes that are all-zero in one condition and then have lots of zeros plus one or two large outliers in the other group. This is especially true if n is large. After all, prefiltering and independent filtering are two different concepts that are not mutually exclusive.

ADD REPLY • link 17 months ago by ATpoint 86k

0

Entering edit mode

Outliers shouldn't skew the p-value in DESeq. For a start, there is outlier detection, which removes any replicate that unduely affects the model fit for a given gene using cooks distrance.

But even absent that, because we are fitting a negative binomial model, which should explicitly model the probability of high count outliers, and so under normal circumstances, you wouldn't expect a high count outlier against a 0 count background to lead to a significant p-value, the evidence really was strong enough to say that the mean count was higher than 0 in that condition.

For logfoldchanges, using a count threshold doesn't prevent the mean counts having an effect of the distribution of logfoldchanges, you just get rid of the most extreme examples, but that just makes the effect more insidious and easier to miss. Better to either shrink the lfc estimates using lfcShrink, use a non-default alternative hypothesis, via the lfcThreshold option to results, that asks that we are confident the lfc is above a certain threshold before marking a gene as significant, or at least take notice of the standard error estimates for lfc that DESeq provides.

No harm in pre-filtering of course, as long as its fairly light-touch. I just don't like using things that have arbitrary thresholds, particularly when the meaning of, say, 10 reads, differs sample to sample and experiment to experiment.

ADD REPLY • link 17 months ago by i.sudbery 20k