Hello all, I have a quick RNAseq (Quantseq) question for you all!
I am analyzing the Quantseq data for 500 patients and am finding my way through the bioinformatic forest.
Currently I am working on a way to filter out lowly expressed genes, and I am using Bioconductor in R to do so. I have thought of a way to do this, but I don't know if I am completely right in doing so, and comments are greatly appreciated.
I am planning to filter lowly expressed genes by CPM; my library sizes range from 1.7M to 7.9M. I want to keep genes that have >10 counts, but I want to filter using CPM instead of raw counts, as this also corrects for library size. Can I just use the following formula: "raw count cutoff" / "minimum libsize in millions" = "CPM threshold"? So this would correspond to a CPM filter value of 10/1.7 = 5.9?
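To make the arithmetic concrete, here is a minimal sketch in plain Python (not edgeR code). The function name `cpm_threshold` is made up for illustration; the only inputs are the 10-count cutoff and the two library sizes mentioned above.

```python
def cpm_threshold(min_count, lib_size):
    """CPM value equivalent to `min_count` raw reads in a library of `lib_size` reads."""
    return min_count / (lib_size / 1e6)

smallest_lib = 1.7e6  # smallest library size quoted in the post
largest_lib = 7.9e6   # largest library size quoted in the post

thr_smallest = cpm_threshold(10, smallest_lib)
thr_largest = cpm_threshold(10, largest_lib)

print(round(thr_smallest, 2))  # 5.88
print(round(thr_largest, 2))   # 1.27
```

So the same 10-count cutoff corresponds to roughly 5.9 CPM in the smallest library but only about 1.3 CPM in the largest one.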
Thinking about this, it seems to me that a lot of 'data' is wasted: there are bigger libraries in the dataset, but only the smallest library determines the cutoff for filtering. This means that in some libraries genes with a count of >10 are discarded. Would another cutoff, for example one based on the mean library size, not be better?
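A rough sketch of that trade-off, again in plain Python rather than edgeR. The helper names are hypothetical, and the 4.8M "mean" is only an assumed value (the midpoint of the 1.7M to 7.9M range), not the real mean of the dataset.

```python
def cpm_threshold(min_count, lib_size):
    """CPM value equivalent to `min_count` raw reads in a library of `lib_size` reads."""
    return min_count / (lib_size / 1e6)

def counts_at_cpm(cpm_value, lib_size):
    """Raw read count that a given CPM value represents in a library of `lib_size` reads."""
    return cpm_value * lib_size / 1e6

smallest_lib, largest_lib = 1.7e6, 7.9e6
assumed_mean_lib = 4.8e6  # ASSUMPTION: midpoint of the range, not the true mean

# Cutoff derived from the smallest library: how many reads does it demand in the largest one?
thr_min = cpm_threshold(10, smallest_lib)
reads_needed_in_largest = counts_at_cpm(thr_min, largest_lib)
print(round(reads_needed_in_largest, 2))  # 46.47

# Cutoff derived from the assumed mean: how few reads does it accept in the smallest library?
thr_mean = cpm_threshold(10, assumed_mean_lib)
reads_needed_in_smallest = counts_at_cpm(thr_mean, smallest_lib)
print(round(reads_needed_in_smallest, 2))  # 3.54
```

Under these assumptions, the minimum-based cutoff requires about 46 reads in the largest library, while a mean-based cutoff would let through genes with fewer than 4 reads in the smallest library.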
After this I want to keep genes with a CPM >5.9. In the manual of the edgeR package they keep a gene if its CPM is above the threshold in 2 or more samples, as they use 2 biological replicates. As I don't have any biological replicates, can I just select the genes with a CPM >5.9 in any of the samples?
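The two selection rules (at least 2 samples versus any sample) can be sketched as follows in plain Python. This is not the edgeR implementation; the gene names, counts, and the middle library size are toy values invented for illustration, and the library sizes are passed in explicitly rather than summed from the toy counts.

```python
def cpm(counts, lib_sizes):
    """Convert a genes x samples table of raw counts to counts per million."""
    return [[c / (n / 1e6) for c, n in zip(row, lib_sizes)] for row in counts]

def keep_genes(counts, lib_sizes, threshold, min_samples=1):
    """Flag genes whose CPM exceeds `threshold` in at least `min_samples` samples."""
    flags = []
    for row in cpm(counts, lib_sizes):
        n_pass = sum(value > threshold for value in row)
        flags.append(n_pass >= min_samples)
    return flags

# Toy data (hypothetical): 3 genes x 3 samples
genes = ["geneA", "geneB", "geneC"]
counts = [
    [12, 30, 60],  # above 5.9 CPM in every sample
    [11, 5, 9],    # above 5.9 CPM only in the smallest library
    [2, 20, 40],   # more than 10 reads in two samples, but never above 5.9 CPM
]
lib_sizes = [1.7e6, 4.0e6, 7.9e6]

keep_any = keep_genes(counts, lib_sizes, threshold=5.9, min_samples=1)
keep_two = keep_genes(counts, lib_sizes, threshold=5.9, min_samples=2)

print(dict(zip(genes, keep_any)))  # {'geneA': True, 'geneB': True, 'geneC': False}
print(dict(zip(genes, keep_two)))  # {'geneA': True, 'geneB': False, 'geneC': False}
```

In this toy example the "any sample" rule keeps geneB on the strength of a single sample, and geneC is dropped under both rules even though it has more than 10 reads in the two larger libraries.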
Any guidance through the forest would be greatly appreciated!
Benformatics, thanks for your reply! Makes sense just to filter on raw counts as I am using QuantSeq.
Just tried both approaches: with one I keep 14561 genes, with the other 11785. As the extra genes kept in the first approach are probably relatively lowly expressed, I think this would not make a big difference in the DE analysis.
Anyway, many thanks for your help, it is greatly appreciated!