Dear Bounlu,
The question you raise above is crucial but also very general. In the microarray field, for instance, there are many kinds of filtering: initial non-specific filtering on intensity, on variance, or on a detection p-value threshold, which some platforms (e.g. Illumina) provide directly. The basic idea is to remove probesets/genes that are "characterized" as absent or not expressed, according to the metric you chose, in most of your samples or conditions (assuming, "naively", that in most cases the majority of genes are not expressed in the analyzed tissue, etc.). It also depends heavily on the specific platform/technology used (Affymetrix, Illumina, ...).
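To make the basic idea concrete, here is a minimal, dependency-free sketch of an intensity-based non-specific filter. The function name `keep_expressed`, the cutoff of 5.0 and the requirement of 2 samples are all hypothetical choices for illustration only; sensible values depend on your platform and normalization.

```python
def keep_expressed(expr, threshold, min_samples):
    """Return one boolean per probeset: True if its value exceeds
    `threshold` in at least `min_samples` samples."""
    return [sum(v > threshold for v in row) >= min_samples for row in expr]


# Toy log2 intensities (made up): rows are probesets, columns are samples.
expr = [
    [3.1, 2.9, 3.3, 3.0],    # low in every sample
    [8.2, 7.9, 3.2, 3.1],    # above the cutoff in two samples only
    [9.5, 9.8, 10.1, 9.9],   # high in every sample
]

# Hypothetical cutoff and sample requirement, chosen only for this toy data.
mask = keep_expressed(expr, threshold=5.0, min_samples=2)
filtered = [row for row, keep in zip(expr, mask) if keep]

print(mask)  # [False, True, True]
```

Setting `min_samples` to roughly the size of your smallest condition keeps genes that are expressed in only one group, which is usually what you want before a DE analysis.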
On the other hand, RNA-seq is a whole different field, with its own experimental design and theory, although it shares some similarities with microarrays in its general methodology. As a simple starting point, see page 119 of the limma user's guide (https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf), which shows that a kind of filtering can be performed on the "total counts".
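As a rough illustration of filtering on total counts, the sketch below assumes the total is taken per gene across all samples. This is not the code from the limma guide: the function name `keep_by_total_count` and the cutoff of 15 are hypothetical and only serve the toy data.

```python
def keep_by_total_count(counts, min_total):
    """Return one boolean per gene: True if the gene's read counts,
    summed over all samples, reach `min_total`."""
    return [sum(row) >= min_total for row in counts]


# Toy raw counts (made up): rows are genes, columns are samples.
counts = [
    [0, 1, 0, 2],      # total 3
    [15, 22, 9, 30],   # total 76
    [3, 4, 2, 1],      # total 10
]

# Hypothetical cutoff, for illustration only.
mask = keep_by_total_count(counts, min_total=15)
kept = [row for row, keep in zip(counts, mask) if keep]

print(mask)  # [False, True, False]
```

Note that a fixed total-count cutoff depends on library size and on the number of samples, so it should be revisited for each dataset.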
Finally, one last important point I would like to emphasize: both in the literature and in many bioinformatics groups, variance filtering is not recommended for either RNA-seq or microarrays when you intend to use some of the most reliable DE methodologies (limma, edgeR, etc.), or when your data show a decreasing mean-variance relationship.
You could also check this very useful article (http://www.ncbi.nlm.nih.gov/pubmed/20460310).
Hope it helps
Best,
Efstathios
Thanks for the explanation. More specifically, how can you assess non-expressed genes in the Level 3 microarray expression data from TCGA? Since the data are already processed there, I think the filtering steps you mention are already skipped. As all genes have a certain expression value, should I assume that all of them are expressed?
I have never worked with TCGA data so far (although I have some projects that I have not started yet), but as far as I have searched and read, Level 3 refers to post-normalization data, and I am not sure whether any kind of non-specific filtering has been performed at this level. Do you have any knowledge or information about this? In any case, you should definitely draw some density or histogram plots and inspect the range of the probeset expression values across your samples. If you see strongly bimodal distributions (for instance, a high peak at low intensity values), then possibly no filtering has been performed. Also, which specific microarray platform are you using?
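The histogram inspection can be sketched without any plotting library, as below. The data here are simulated, not TCGA values: 700 values around 4 stand in for unexpressed probesets and 300 values around 10 for expressed ones. The helper `histogram` and the cutoff of 6.0 are hypothetical, picked only to suit this simulation.

```python
import random


def histogram(values, n_bins):
    """Bin `values` into `n_bins` equal-width bins; return (centers, counts)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    return centers, counts


# Simulated log2 expression values (assumption: a mixture of two normals).
rng = random.Random(0)
values = [rng.gauss(4.0, 0.5) for _ in range(700)]
values += [rng.gauss(10.0, 1.0) for _ in range(300)]

centers, counts = histogram(values, n_bins=20)

# Crude text histogram: one '#' per 5 values in the bin.
for center, n in zip(centers, counts):
    print(f"{center:6.2f} | {'#' * (n // 5)}")

# Location of the tallest bin, and share of values below a hypothetical cutoff.
peak_center = centers[counts.index(max(counts))]
low_fraction = sum(v < 6.0 for v in values) / len(values)
```

If the tallest bin sits at the low end and a large share of values falls below the valley between the two modes, that suggests unexpressed probesets are still present, so no filtering was applied upstream. On real data you would run this on one sample at a time, or on the per-gene average, and choose the cutoff by eye from the plot.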