Question

RNA-Seq mRNA/Gene Count Data Filtering

13.4 years ago

Sudeep ★ 1.7k

Hi all,

I have a question regarding RNA-seq data filtering. Once the sample mRNA reads have been mapped to known mRNAs and filtered, is it a good idea to remove mRNA or gene hits that rarely occur across the samples? I read about it in this edgeR tutorial, where gene hits that do not reach a count of 1 read per million (rpm) in at least two samples are removed. If it is a good idea, what are the common methods to look into?

Thanks in advance

rna next-gen sequencing data filter • 5.4k views

updated 13.1 years ago by Duff ▴ 670 • written 13.4 years ago by Sudeep ★ 1.7k

score 6 · Answer 1 · 2011-11-14

Hi,

I am generally not comfortable with filtering, especially with RNA-Seq data. Filtering makes some sense for microarray data, since non-expressed genes also produce a low-intensity signal (even though there is no gold-standard filtering method).

For RNA-Seq there is no such drawback, and the presence of at least one unambiguously mapped read on a gene should normally be evidence of transcription. Filtering out very lowly transcribed genes means assuming that those genes are not functional but rather transcriptional noise. That might be true in some cases (maybe most of them), but there is, to my knowledge, no clear evidence for it.

My suggestion would rather be to check whether the results of your analyses are biased by such genes, by dividing your initial gene set into several bins of expression. If the bins containing lowly expressed genes show a pattern similar to the bins containing genes with intermediate or high expression, then rare transcripts do not influence the results of your analysis. If, on the contrary, the bins differ, it is harder to draw conclusions, since the difference could be explained by several biological or technical factors that cannot necessarily be distinguished.
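A minimal sketch of this binning check, in plain Python for illustration only. The function names, the toy gene IDs, and the choice of absolute log fold change as the per-gene result are all assumptions made for the example; any per-gene statistic from your own analysis could be used instead.

```python
from statistics import mean


def bin_by_expression(mean_expr, n_bins=4):
    """Split gene IDs into n_bins groups of roughly equal size,
    ordered from lowest to highest mean expression."""
    ordered = sorted(mean_expr, key=mean_expr.get)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional gene each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def summarise_bins(bins, statistic):
    """Average a per-gene statistic within each bin (None for an empty bin)."""
    return [mean(statistic[g] for g in b) if b else None for b in bins]


# Hypothetical toy data: mean normalised expression per gene ...
mean_expr = {"g1": 0.2, "g2": 0.5, "g3": 3.0, "g4": 8.0, "g5": 40.0, "g6": 120.0}
# ... and an assumed per-gene result (absolute log fold change).
abs_lfc = {"g1": 2.5, "g2": 3.5, "g3": 1.0, "g4": 1.0, "g5": 0.5, "g6": 1.5}

bins = bin_by_expression(mean_expr, n_bins=3)
per_bin = summarise_bins(bins, abs_lfc)
for genes, value in zip(bins, per_bin):
    print(genes, value)
```

If the lowest-expression bin stands apart from the others, as it does in this made-up example, that is the situation where the interpretation becomes difficult.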

score 4 · Answer 2 · 2012-03-08

Hi Sudeep

I think that the edgeR authors recommend filtering the data so that genes with less than one count across half the samples are removed, because such genes cannot achieve statistical significance. In the first vignette example in the edgeR documentation, they say:

"We will filter out very lowly expressed tags. Those which have fewer than 5 counts in total cannot possibly achieve statistical significance for DE, so we filter out these tags."

So, if you are using edgeR, why keep genes that are expressed at low levels (and that may be expressed in some samples but not in others under the same conditions) and that cannot give you any information about regulation? These just become statistical noise. The biology, however, may be relevant, as Philippe points out; it's just that you can't say anything statistically about regulation (well, not with edgeR anyway).
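To make the two filtering rules mentioned in this thread concrete, here is a rough sketch in plain Python. It is not edgeR code (edgeR is an R package), and the function names and the toy count table are made up for illustration; the thresholds are the ones quoted above (at least 5 counts in total) and in the question (more than 1 read per million in at least two samples).

```python
def counts_per_million(counts):
    """counts: dict mapping gene -> list of raw counts, one per sample.
    Assumes every sample has a non-zero library size."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [c * 1e6 / lib_sizes[j] for j, c in enumerate(row)]
        for gene, row in counts.items()
    }


def keep_by_cpm(counts, min_cpm=1.0, min_samples=2):
    """Keep genes whose CPM exceeds min_cpm in at least min_samples samples."""
    cpm = counts_per_million(counts)
    return [g for g, row in cpm.items() if sum(v > min_cpm for v in row) >= min_samples]


def keep_by_total(counts, min_total=5):
    """Keep genes with at least min_total reads summed over all samples."""
    return [g for g, row in counts.items() if sum(row) >= min_total]


# Hypothetical toy table: three samples, each with a library size of 1,000,000,
# so that CPM values equal the raw counts.
counts = {
    "geneA": [999994, 999997, 999994],
    "geneB": [5, 3, 0],
    "geneC": [1, 0, 0],
    "geneD": [0, 0, 6],
}

print(keep_by_cpm(counts))
print(keep_by_total(counts))
```

In this toy table geneD passes the total-count rule (6 reads) but fails the per-sample CPM rule, because all of its reads come from a single sample. That is the kind of gene the "expressed in some samples but not others" remark refers to.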

Best

duff