I've been doing some RNA-Seq analysis and wanted to get some opinions from the community. I primarily work on yeast (~6000 genes), and after aligning reads (usually with bowtie2) and counting reads in features (with HTSeq or featureCounts) I often find that the vast majority of genes have at least one read mapped. This has got me thinking about how to tell whether a gene is expressed and how other people in the community quantify expression.
I'm not suggesting that every gene with a single read mapped is expressed, and I always include a cutoff to exclude genes with few reads mapped for any differential expression analysis, but it does raise some questions: What value of RPKM or TPM would people use to say a gene is expressed? Is it usual to find counts for the majority of genes when we might expect that only a subset of genes is functioning at any one time? If you found that only, say, 10% of the transcriptome had reads mapped, would you be skeptical of the data?
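To make the threshold question concrete, here is a minimal sketch of how TPM is derived from raw counts and feature lengths, followed by an arbitrary cutoff. The function names (`counts_to_tpm`, `genes_above`), the toy counts and lengths, and the threshold value are all made up for illustration; nothing here is a recommended cutoff.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts to TPM, given feature lengths in base pairs."""
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # Reads per kilobase: normalise each gene's count by its length.
    rpk = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    total = sum(rpk)
    if total == 0:
        return [0.0 for _ in rpk]
    # Scale so that the values in one sample sum to one million.
    return [r / total * 1e6 for r in rpk]


def genes_above(tpm, threshold):
    """Return the indices of genes whose TPM is at or above the threshold."""
    return [i for i, value in enumerate(tpm) if value >= threshold]


# Toy data (hypothetical): four genes, one with a single read, one with none.
counts = [1, 50, 2000, 0]
lengths_bp = [1000, 2000, 1000, 500]

tpm = counts_to_tpm(counts, lengths_bp)
# The threshold is an arbitrary illustration, not a recommendation.
kept = genes_above(tpm, threshold=1000.0)
print(tpm)
print(kept)
```

Because TPM is rescaled within each sample, any fixed cutoff depends on sequencing depth and on what else is in the library, which is part of why a single universal value is hard to defend.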
I'd be interested to hear people's opinions and happy to be directed to any relevant literature.
What if we drop the dichotomy of expressed / not expressed once and for all?
It seems pretty clear to me that genes can at times show very marginal expression. In my opinion, the number of counts only makes sense in comparison with something else (i.e. a second condition).
check this question as well
Thanks. For the most part I'd be happy to not think about it as expressed/not expressed. As you say, differential expression or building co-expression networks relies on comparisons or correlations of counts or expression estimates between samples and genes.
Hear, hear!
Transcription is a biochemical reaction, which is entirely dependent upon the local concentrations of reagents, catalysts, and inhibitors. While some biochemical pathways exhibit cooperativity/ultrasensitivity, they are still probabilistic rather than deterministic (all-or-none). When viewed in this light, marginal/spurious expression is to be expected.
After re-reading my old comment (somebody just left an upvote), I feel like one of the main points was not mentioned in this discussion: when analysing bulk RNA-seq we're often ignoring cell-to-cell differences, which can be a major source of variation. In a population of millions of cells there might be a few that express a certain (otherwise silent) gene at reasonable levels. That can make its overall level pretty low, but a high fold-change in this gene could still be meaningful, representing a relative increase in abundance of the cell subtype that expresses it.
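A small arithmetic sketch of that last point, treating the bulk signal as a simple weighted average over cells. The function name `bulk_level` and all the numbers (fractions of cells, per-cell levels) are invented for illustration, and the averaging model itself is a simplifying assumption.

```python
def bulk_level(fraction_expressing, level_in_expressing, level_in_others=0.0):
    """Average expression over a population where only a fraction of cells express the gene."""
    return (fraction_expressing * level_in_expressing
            + (1.0 - fraction_expressing) * level_in_others)


# Hypothetical numbers: the expressing subtype grows from 0.1% to 0.4% of cells,
# while the per-cell level in that subtype stays the same.
before = bulk_level(0.001, 100.0)
after = bulk_level(0.004, 100.0)
fold_change = after / before

print(before, after, fold_change)
```

Under these assumptions the bulk level stays far below the per-cell level in the expressing subtype, yet the fold-change between conditions tracks the shift in that subtype's abundance.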