As far as I know, there is no universally agreed-upon threshold or method for filtering out low-expression genes. I want to remove the genes that don't contribute (in other words, the noise genes) BEFORE I normalize the data with CPM, TPM, or any other approach.
I picked the threshold arbitrarily, and I tried not to set it too high so that I don't delete important genes that might have informative value. This is my code:
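# TRUE where a value exceeds 0.5
thresh = data > 0.5
# rowSums() counts passing samples per gene, so >= 1.5 means at least 2 samples
keep = rowSums(thresh) >= 1.5
data = data[keep, , drop = FALSE]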
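To make the effect of this filter concrete, here is a minimal sketch on a made-up count matrix. The gene names, sample names, and counts are hypothetical and only for illustration, as are the helper names `toy`, `n_pass`, and `filtered`.

```r
# Toy raw-count matrix: 4 hypothetical genes x 3 hypothetical samples.
toy <- matrix(
  c( 0, 0,  0,
     1, 0,  0,
     3, 2,  0,
    10, 8, 12),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                  c("s1", "s2", "s3"))
)

thresh <- toy > 0.5          # TRUE where a sample's value exceeds 0.5
n_pass <- rowSums(thresh)    # per gene: number of samples that pass
keep   <- n_pass >= 1.5      # n_pass is a whole number, so this equals n_pass >= 2
filtered <- toy[keep, , drop = FALSE]

print(n_pass)                # geneA 0, geneB 1, geneC 2, geneD 3
print(rownames(filtered))    # "geneC" "geneD"
```

So with these settings a gene is kept only if at least two samples have a value above 0.5, and on raw integer counts "above 0.5" just means a count of at least 1.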
What do you think? Thanks!