An undergraduate student I am working with had this thought, and it seemed to make sense to me. Based on a quick Google search, and asking Copilot and GPT-4o, it doesn't seem like this is a common practice. My question is: why don't people do this? The point he made was that if these genes are present in all cells, they might be non-informative, and thus could be eliminated to speed things up.
I think the problem here is with the term "housekeeping gene". It assumes that there is a well-defined set of genes that are expressed at the same level in every cell, but as far as I'm aware, no one has been able to identify such a group of genes. Yes, there are important genes that must be expressed in every cell, but that doesn't mean they are necessarily expressed at the same level. Thus, as noted by @yora.grabovska, it is better to determine which genes are invariant from the data itself, rather than taking a guess based on prior knowledge (that's really what true housekeeping genes are, statistically: invariant genes).
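For illustration, a minimal sketch of that data-driven approach might look like the following (pure NumPy; the count matrix and gene names are made up just to show the idea):

```python
import numpy as np

# Minimal sketch: find statistically invariant genes from the data itself,
# rather than relying on a prior "housekeeping" list. Assumes `counts` is a
# (cells x genes) matrix; the simulated data and gene names are placeholders.
rng = np.random.default_rng(0)
counts = rng.poisson(5, size=(500, 2000)).astype(float)
gene_names = np.array([f"gene{i}" for i in range(counts.shape[1])])

# Library-size normalise and log-transform, then score each gene by its
# variance across cells.
libsize = counts.sum(axis=1, keepdims=True)
lognorm = np.log1p(counts / libsize * 1e4)
gene_var = lognorm.var(axis=0)

# Genes in the lowest-variance decile are the empirically "invariant" ones -
# the data-driven analogue of a housekeeping set.
invariant = gene_names[gene_var <= np.quantile(gene_var, 0.1)]
print(len(invariant), "empirically invariant genes")
```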
It's also worth noting that a lot of our methods assume that most genes don't change (I'm particularly thinking of DESeq2/limma/edgeR type algorithms, but this also applies to many normalisation methods): they need that mass of unchanging genes in order to normalise the data and calibrate their results.
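To see why, here is a stripped-down sketch of the median-of-ratios idea behind DESeq2's size factors (not the package's actual implementation): the median ratio is only a sensible scaling factor because it is taken over a bulk of genes assumed not to change between samples.

```python
import numpy as np

# Simplified median-of-ratios normalisation (the idea behind DESeq2's size
# factors), not the package's actual code.
def size_factors(counts):
    # counts: (genes x samples) matrix of raw counts
    log_counts = np.log(counts)
    log_geo_means = log_counts.mean(axis=1)        # per-gene geometric mean (log scale)
    finite = np.isfinite(log_geo_means)            # drop genes with any zero count
    ratios = log_counts[finite] - log_geo_means[finite, None]
    return np.exp(np.median(ratios, axis=0))       # per-sample size factor

counts = np.array([[100, 200], [50, 100], [30, 60], [10, 20]], dtype=float)
print(size_factors(counts))   # ~[0.71, 1.41]: sample 2 was sequenced ~2x deeper
```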
But your undergrad is not totally off base: reducing the number of genes in an analysis has an effect more important than saving computation time. It increases statistical power in analyses that produce per-gene p-values, by reducing the multiple testing burden, so there is a balance/trade-off.
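A toy calculation makes that concrete (numbers are made up, and Bonferroni is used only because it is the simplest correction to show):

```python
# The same raw p-value can pass or fail significance depending only on how
# many genes were tested alongside it.
alpha, p = 0.05, 1e-5

for n_genes in (2_000, 20_000):
    threshold = alpha / n_genes   # Bonferroni-corrected per-gene threshold
    print(f"{n_genes} genes tested: threshold {threshold:.1e}, "
          f"p = {p:.0e} {'passes' if p < threshold else 'fails'}")
```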
When we run scRNA-seq analysis, and I'm assuming you're specifically discussing expression, we identify hypervariable genes and then run dimensionality reduction and clustering using only those genes. We don't remove the rest of the data from the experiment; we just define our clustering based on a set of the most variable genes. In that sense, removing housekeeping genes wouldn't improve algorithmic efficiency unless they fall within your hypervariable set, in which case removing them would defeat the point of your statement.
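As a rough illustration, the Scanpy version of that workflow looks something like this (Seurat follows the same logic; the dataset is just the PBMC example bundled with Scanpy):

```python
import scanpy as sc

# All genes stay in `adata`; only the flagged highly variable subset drives
# PCA and clustering.
adata = sc.datasets.pbmc3k()                          # small example dataset

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)  # flags genes, removes nothing

sc.pp.pca(adata)        # by default uses only the flagged highly variable genes
sc.pp.neighbors(adata)
sc.tl.leiden(adata)
```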
When we calculate differential expression, it's not necessarily the number of features but rather the number of observations that has the biggest impact on computation time, in the sense of a traditional genome-wide expression experiment. Obviously, if you had 100,000 features vs 10,000 features there would be a noticeable effect on computation, though our methods are pretty efficient these days. But removing a handful of housekeeping genes doesn't speed things up enough to justify removing features in a supervised way without a strong biological rationale.
There are some cases where you might want to remove specific genes from your set of hypervariable genes. MALAT1 is one example: for some cells, 20-40% or even more of total cell expression can be dominated by non-coding RNAs like MALAT1.
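Continuing the Scanpy sketch above, one way to do that is simply to un-flag the gene after HVG selection (a hypothetical snippet; the raw data stay in the object):

```python
import scanpy as sc

# Drop a dominant gene such as MALAT1 from the hypervariable set so it no
# longer drives PCA/clustering, while keeping it for everything else.
adata = sc.datasets.pbmc3k()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

adata.var.loc[adata.var_names.isin(["MALAT1"]), "highly_variable"] = False

sc.pp.pca(adata)        # PCA now ignores MALAT1 even if it was flagged as hypervariable
sc.pp.neighbors(adata)
sc.tl.leiden(adata)
```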