I am working with RNA-seq data from patients. The two conditions I want to compare in the gene expression analysis are before and after treatment. The read count threshold I used is 20. I have 2 questions:
1- We expect to see high expression of some genes after treatment (this was reported in other studies, and we have also observed it ourselves), but those genes do not appear in the gene expression results for this data. Could that be because they have low counts in one or a few samples and were filtered out by the read count cutoff? What can I do to fix this? Should I use 10 as the threshold instead?
2- We know that some genes are not expressed in the before-treatment samples but are highly expressed in the after-treatment samples. How can I run the DGE analysis correctly so that those genes are not lost?
Could you clarify how you set the threshold (code lines)?
@Basti: I did that manually in Excel, and I used DESeq2 for the DGE.
Excel is not a suitable tool for this kind of task. You would be better off importing your raw count matrix into R and applying the filter there. I still do not understand exactly which criterion you applied: you said your read count threshold was 20, but in how many samples did a gene have to fall below that threshold before you removed it?
I have 30 samples. 13 samples have a read count below 20. What do you mean by "what is the criteria"? What should the criterion be?
In edgeR::filterByExpr(), for instance, the criterion is to "keep genes that have count-per-million (CPM) above k" (k is derived from the minimum read count, 20 in your example, and the library sizes) "in n samples, [...]" and "n is essentially the smallest group sample size". As recommended by ATpoint, I would go with filterByExpr on your data.
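To make the quoted rule concrete, here is a minimal sketch of the "CPM above k in at least n samples" criterion. It is written in Python for illustration only and is not edgeR's implementation: the real function is in R, and it also applies a total-count check and an adjustment for large groups, both omitted here. The function name filter_by_cpm, the gene names, and the toy counts are all made up for this example. The "strict" rule at the end is a hypothetical stand-in for a manual cutoff that requires every sample to pass, not necessarily what was done in Excel.

```python
from statistics import median

def filter_by_cpm(counts, groups, min_count=10):
    """Simplified illustration of 'CPM above k in n samples'.

    counts: dict mapping gene name -> list of raw counts (one per sample)
    groups: list of group labels, one per sample
    Returns a dict mapping gene name -> True (keep) or False (drop).
    """
    n_samples = len(groups)
    # Library size = total raw counts in each sample.
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    # k: the CPM value that corresponds to min_count at the median library size.
    cpm_cutoff = min_count / median(lib_sizes) * 1e6
    # n: the size of the smallest group.
    min_group = min(groups.count(g) for g in set(groups))
    keep = {}
    for gene, row in counts.items():
        cpm = [c / lib * 1e6 for c, lib in zip(row, lib_sizes)]
        n_above = sum(v >= cpm_cutoff for v in cpm)
        keep[gene] = n_above >= min_group
    return keep

# Toy data (hypothetical): 3 "before" and 3 "after" samples.
groups = ["before"] * 3 + ["after"] * 3
counts = {
    "housekeeping": [5000, 5200, 4800, 5100, 4900, 5000],
    "induced":      [0, 0, 0, 250, 300, 280],   # off before, on after
    "sporadic":     [0, 0, 0, 0, 0, 40],        # high in a single sample
    "low":          [1, 0, 2, 1, 0, 1],
}

keep = filter_by_cpm(counts, groups, min_count=20)

# Hypothetical strict rule for comparison: every sample must reach 20 reads.
strict = {gene: all(c >= 20 for c in row) for gene, row in counts.items()}

for gene in counts:
    print(gene, "group-aware:", keep[gene], "strict:", strict[gene])
```

In this toy case the "induced" gene passes the group-aware rule because it clears the cutoff in all three after-treatment samples, which equals the smallest group size, while the strict rule drops it because of the zeros before treatment. That is the situation described in question 2, and it is why lowering the threshold from 20 to 10 would not by itself rescue such genes.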