Hello,
What is the wisest way to filter out lowly expressed genes (by TPM) from an RNA-seq dataset? I've seen some empirically based methods that did not totally convince me.
What's your opinion about it? Thanks in advance.
It depends on what your downstream analysis is. If your aim is to filter lowly expressed genes to increase power in a differential expression analysis, I recommend reading:
Data-driven hypothesis weighting increases detection power in genome-scale multiple testing
If you want to divide genes into expressed and non-expressed for a biological reason, there really isn't a good way to do it. A rule of thumb might be:
There are about 200,000 transcript molecules in a cell at any one time (a very approximate, order-of-magnitude estimate). Thus a TPM of 5 represents about 1 transcript per cell on average, since 5 per million is the same as 1 per 200,000.
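The arithmetic behind that rule of thumb can be sketched in a few lines of Python. The function name and the default of 200,000 transcripts per cell are illustrative assumptions taken from the rough estimate above, not measured values for any particular cell type.

```python
def tpm_to_copies_per_cell(tpm, transcripts_per_cell=200_000):
    """Approximate average transcript copies per cell for a given TPM.

    transcripts_per_cell is an order-of-magnitude assumption, not a measurement.
    """
    # TPM is "per million transcripts", so scale by the assumed pool size.
    return tpm * transcripts_per_cell / 1_000_000


print(tpm_to_copies_per_cell(5))  # 1.0
```

If your cells are likely to hold more or fewer transcripts than this, change `transcripts_per_cell` and the TPM that corresponds to one copy per cell shifts accordingly.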
If you are interested in whether read counts are above background noise (e.g. perhaps they come from contaminating DNA molecules in your library preps), you could use the method described here.
I guess the worry here is that lowly expressed genes have more noise and thus will screw up the correlations. I'm not sure anyone has ever really considered this question, nor can I think of a principled approach.
The key thing with correlations is probably to get rid of the zeros. Too many zeros can cause a real problem. Other than that, you probably want to keep some fairly lowly expressed genes: you can't have a correlation if you only keep highly expressed genes. You could use the simulate-and-local-FDR method I linked to above, but I'm guessing it's not worth the bother. Your results are unlikely to be significantly different from what you would get with a simple 1 TPM type threshold. Remember, the aim of bioinformatics is to extract biologically meaningful results, not to be mathematically 100% correct. If a trend in your data is strong enough to be biologically meaningful, it's probably strong enough to be insensitive to a range of sensible expression thresholds.
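As a rough illustration of that kind of filter, here is a minimal Python sketch that drops genes with too many zeros and keeps genes that reach a TPM threshold in enough samples. The function name, the input layout (a dict of gene name to per-sample TPM values) and all three cutoffs are assumptions for the example, not recommended values.

```python
def filter_genes(tpm_by_gene, min_tpm=1.0, min_fraction=0.5, max_zero_fraction=0.2):
    """Keep genes that pass a simple TPM threshold and are not dominated by zeros.

    All cutoffs are illustrative assumptions; tune them for your own data.
    """
    kept = {}
    for gene, values in tpm_by_gene.items():
        n = len(values)
        if n == 0:
            continue  # no samples, nothing to judge
        # Fraction of samples where the gene reaches the TPM threshold.
        frac_expressed = sum(v >= min_tpm for v in values) / n
        # Fraction of samples where the gene is exactly zero.
        frac_zero = sum(v == 0 for v in values) / n
        if frac_expressed >= min_fraction and frac_zero <= max_zero_fraction:
            kept[gene] = values
    return kept


# Toy data: made-up TPM values for three hypothetical genes in four samples.
example = {
    "geneA": [12.0, 8.5, 15.2, 9.9],  # clearly expressed
    "geneB": [0.0, 0.0, 0.3, 0.0],    # mostly zeros
    "geneC": [1.2, 0.8, 2.5, 1.9],    # low, but mostly above 1 TPM
}
print(sorted(filter_genes(example)))  # ['geneA', 'geneC']
```

Re-running the downstream correlation with a few different values of `min_tpm` (say 0.5, 1 and 2) is a cheap way to check that a result does not depend on the exact threshold.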
I have tried both TPM and RPKM. There was a bias with respect to gene size. I wonder if more filters are required.
Can you be more specific?
A threshold of 1 FPKM is a commonly used filter.