Question

Filtering out low expressed genes in RNA-Seq data

1

7.7 years ago

lessismore ★ 1.4k

Hello,

What is the wisest way to filter out lowly expressed genes (TPM) from an RNA-seq dataset? I've seen some empirically based methods that did not totally convince me.

What's your opinion? Thanks in advance.

RNA-Seq preprocessing TPM • 18k views

ADD COMMENT • link updated 7.7 years ago by i.sudbery 21k • written 7.7 years ago by lessismore ★ 1.4k

0

I have tried both TPM and RPKM. There was a bias with respect to gene size. I wonder if more filters are required.

ADD REPLY • link 7.7 years ago by Satyajeet Khare ★ 1.6k

0

Can you be more specific?

ADD REPLY • link 7.7 years ago by lessismore ★ 1.4k

0

1 FPKM is a standard filter.

ADD REPLY • link 7.7 years ago by Pappu ★ 2.1k

score 4 · Answer 1 · 2017-09-28

4

7.7 years ago

i.sudbery 21k

It depends on what your downstream analysis is. If your aim is to filter lowly expressed genes to increase power in a differential expression analysis, I recommend reading

Data-driven hypothesis weighting increases detection power in genome-scale multiple testing

If you want to divide genes into expressed and non-expressed for a biological reason, then there really isn't a good way to do it. A rule of thumb might be:

There are about 200,000 transcript molecules in a cell at any one time (very approximate, an order-of-magnitude estimate). Thus a TPM of 5 represents about 1 transcript per cell (on average).
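The arithmetic behind that rule of thumb can be written out as a rough sketch. The 200,000 figure is the order-of-magnitude assumption from above, not a measured value, and the function names are only illustrative:

```python
# Assumption: ~200,000 transcript molecules per cell (order of magnitude only).
TRANSCRIPTS_PER_CELL = 200_000


def tpm_to_copies_per_cell(tpm, transcripts_per_cell=TRANSCRIPTS_PER_CELL):
    """Convert a TPM value into an approximate average copy number per cell.

    TPM is "transcripts per million", so the fraction of the transcript pool
    is tpm / 1e6; multiplying by the pool size gives copies per cell.
    """
    return tpm * transcripts_per_cell / 1_000_000


def copies_per_cell_to_tpm(copies, transcripts_per_cell=TRANSCRIPTS_PER_CELL):
    """Inverse conversion: the TPM that corresponds to a given copy number."""
    return copies * 1_000_000 / transcripts_per_cell


if __name__ == "__main__":
    for tpm in (1, 5, 50):
        print(f"TPM {tpm} is about {tpm_to_copies_per_cell(tpm)} copies per cell")
```

If you believe a different total for your cell type, pass it as `transcripts_per_cell` and the threshold shifts proportionally.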

If you are interested in whether read counts are above background noise (e.g. perhaps they come from contaminating DNA molecules in your library preps), you could use the method described here.

ADD COMMENT • link 7.7 years ago by i.sudbery 21k

0

Hey, thanks for your suggestions. I am interested in doing that for a co-expression network analysis; in particular, I would be interested in only the positively correlated genes. What do you think?

ADD REPLY • link 7.7 years ago by lessismore ★ 1.4k

2

I guess the worry here is that lowly expressed genes have more noise and thus will screw up the correlations. I'm not sure anyone has ever really considered this question, nor can I think of a principled approach.

The key thing with correlations is probably to get rid of the zeros. Too many zeros can cause a real problem. Other than that, you probably want to keep some pretty lowly expressed genes: you can't have a correlation if you only keep highly expressed genes. You could use the simulate and local FDR method I linked to above, but I'm guessing it's not worth the bother. Your results are unlikely to be significantly different from what you would get with a simple 1 TPM type threshold. Remember, the aim of bioinformatics is to extract biologically meaningful results, not to be mathematically 100% correct. If a trend in your data is strong enough to be biologically meaningful, it's probably strong enough to be insensitive to a range of sensible expression thresholds.
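For what it's worth, a minimal sketch of that kind of filter before computing correlations might look like the following. The thresholds (`min_tpm`, `min_samples`, `max_zero_frac`) and the toy matrix are illustrative assumptions, not recommended values; the point is only to drop zero-heavy genes and apply a simple 1 TPM type cutoff:

```python
import numpy as np


def filter_genes(tpm, min_tpm=1.0, min_samples=2, max_zero_frac=0.2):
    """Return a boolean mask of genes (rows) to keep.

    A gene is kept if it is zero in at most `max_zero_frac` of the samples
    and reaches `min_tpm` in at least `min_samples` samples.
    All three thresholds are hypothetical defaults for illustration.
    """
    tpm = np.asarray(tpm, dtype=float)
    zero_frac = (tpm == 0).mean(axis=1)          # fraction of zero samples per gene
    n_expressed = (tpm >= min_tpm).sum(axis=1)   # samples at or above the cutoff
    return (zero_frac <= max_zero_frac) & (n_expressed >= min_samples)


# Toy genes x samples TPM matrix (made-up numbers).
tpm = np.array([
    [10.0, 12.0, 8.0, 15.0, 11.0],  # clearly expressed
    [0.0, 0.0, 0.0, 3.0, 0.0],      # mostly zeros
    [1.5, 2.0, 0.8, 1.2, 3.0],      # low but consistently detected
    [0.1, 0.2, 0.1, 0.3, 0.2],      # below the cutoff everywhere
])

keep = filter_genes(tpm)

# Gene-by-gene Pearson correlation on log-transformed values of the kept genes.
log_expr = np.log2(tpm[keep] + 1.0)
corr = np.corrcoef(log_expr)
```

Rerunning this with a few different values of `min_tpm` and `max_zero_frac` is a cheap way to check whether the correlations you care about are sensitive to the threshold.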

ADD REPLY • link 7.7 years ago by i.sudbery 21k