Hi guys, I have a question about how to define a gene as expressed in RNA-Seq analysis. Is it better to use log2(RPKM) or RPKM without the log2 transformation? I know that the log is only a transformation, so applying the same numeric cutoff to log2(RPKM) means a gene needs a higher RPKM to be considered expressed than when the cutoff is applied to plain RPKM. However, I think that RPKM alone leads to calling genes expressed when they have too few reads.
Any ideas?
Thank you in advance.
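To make the cutoff question concrete, here is a minimal Python sketch. The RPKM values and the cutoff of 1 are made up for illustration and are not a recommendation. It shows that reusing the same number as a cutoff on the log2 scale is stricter, while a cutoff converted to the log2 scale selects exactly the same genes as the raw cutoff.

```python
import numpy as np

# Made-up RPKM values for five hypothetical genes (illustration only).
rpkm = np.array([0.05, 0.5, 1.5, 3.0, 8.0])
cutoff = 1.0  # assumed cutoff, chosen only for the example

# Cutoff applied to raw RPKM.
raw_calls = rpkm > cutoff

# Same number reused on the log2 scale: log2(RPKM) > 1 means RPKM > 2.
same_number_on_log = np.log2(rpkm) > cutoff

# Cutoff converted to the log2 scale: log2(RPKM) > log2(1) = 0.
matched_log = np.log2(rpkm) > np.log2(cutoff)

print("RPKM > 1:            ", raw_calls)
print("log2(RPKM) > 1:      ", same_number_on_log)
print("log2(RPKM) > log2(1):", matched_log)
```

Because log2 is monotonic, the transformation alone does not change which genes pass; only the cutoff value does. Note that log2(0) is undefined, so a small pseudocount is usually added before taking the log when zero values are present.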
Thank you Kevin! You help me a lot every time! Thank you again!
While I have seen zFPKM used in at least one paper, I didn't realize that there was a zFPKM Bioconductor package. So, I am glad that Kevin pointed that out. However, that package says "Reference recommends using zFPKM > -3 to select expressed genes".
This alternative suggestion in the Bioconductor package is closer to what I would expect if using an FPKM of 0.1 as a rough approximation for expressed genes (or at least as a rounding threshold to place less emphasis on high fold-change values in genes with low expression / counts). If prioritizing candidates for validation / future study, I might focus more on those with FPKM > 1 (possibly within a functionally relevant category), but I would expect FPKM = 1 to be closer to the mean among all genes. So, I am not sure that the standard use of |z-score| > 2 is necessarily the best strategy for defining expressed genes (there are probably a lot of expressed genes with 0 < Z < 2, for example). An exception would be if you have a separate category for your baseline, such as using normal expression for the z-score and using disease samples to test for differences, although that could also be done with a more typical differential expression test.
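As a rough illustration of how these cutoffs compare, here is a minimal Python sketch on simulated data. Everything in it is an assumption made for the example: the FPKM values are simulated so that the median is about 1, the pseudocount of 0.01 is arbitrary, and `plain_zscore_log2` is a hypothetical helper that computes an ordinary within-sample z-score of log2(FPKM). It is a simplified stand-in and not the zFPKM algorithm from the Bioconductor package.

```python
import numpy as np


def plain_zscore_log2(fpkm, pseudocount=0.01):
    """Ordinary z-score of log2(FPKM + pseudocount) within one sample.

    Hypothetical helper for illustration; NOT the zFPKM algorithm.
    """
    log_fpkm = np.log2(np.asarray(fpkm, dtype=float) + pseudocount)
    return (log_fpkm - log_fpkm.mean()) / log_fpkm.std()


# Simulated FPKM values for one sample (assumption: log2(FPKM) ~ N(0, 2),
# so the median FPKM is about 1). Real data will look different.
rng = np.random.default_rng(0)
fpkm = 2.0 ** rng.normal(loc=0.0, scale=2.0, size=10_000)
z = plain_zscore_log2(fpkm)

# Number of genes passing each candidate rule.
counts = {
    "FPKM > 0.1": int((fpkm > 0.1).sum()),
    "FPKM > 1": int((fpkm > 1).sum()),
    "z > -3": int((z > -3).sum()),
    "|z| > 2": int((np.abs(z) > 2).sum()),
}

for rule, n in counts.items():
    print(f"{rule:>10}: {n} of {fpkm.size} genes")
```

On data shaped like this, a one-sided rule such as z > -3 keeps nearly every gene, while |z| > 2 keeps only the tails of the distribution, which is why the two rules answer quite different questions.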
Thanks for the input Charles. Yes, the thresholds can of course be modified - there will be a lot of factors going into this. There are also many other ways to identify genes that are representative of a tissue/cell.
That is a good point that I didn't previously notice: if you have a panel of cell/tissue types, I could see how a per-gene z-score could be useful for identifying cell/tissue-specific markers. That could also be true per-sample, which is what I thought was being asked about, but I don't believe I've tried that before, and the disease-normal z-score that I mentioned would also be per-gene rather than per-sample.