to regress out cell cycle influence or other "noise". But I am confused why we cannot drop these genes in the feature selection step. For example, we could drop these genes from the highly variable gene set, thereby protecting the subsequent PCA and cluster analysis from their influence.
But I am confused why we cannot drop these genes in the feature selection step.
You absolutely can. This is actually the "mildest" method, as it does not alter the counts of any other genes. I personally prefer to do exactly that. It is a good idea if you see that, e.g., cell cycle drives some unwanted cluster separation purely based on these genes. Here is an example from an as-yet unpublished project of mine where that was the case.
tl;dr
The plot below shows our dataset colored by cluster (A) and by three canonical lineage markers (B-D) for the cell types we were interested in. The UMAP is split into two large "islands" of clusters (shown in E as the left/right groups), and each island contains cells that highly express these lineage markers. That means the cells were being separated by some factor we did not know about. We then observed that the left group expressed the cell cycle gene Mki67 highly and the right group lowly. Mki67 is canonically low in cells in G1 phase, which suggested to us that the "right" group was mainly in G1 phase and the "left" one in non-G1/mixed phases. This was confirmed by running the cyclone classifier from scran (F), and by running differential expression between "left" and "right", with the DE genes being enriched for cell-cycle-related terms (G).
That said, simply dropping the DE genes from left_vs_right from the highly variable genes was completely sufficient to remove the separation into two islands. Rerunning the clustering and UMAP after the removal resulted in a separation driven by the expected cell type differences, with no sign of the earlier cell cycle confounding. In this case we dropped about 70 genes out of 1000 highly variable ones. Of note, the DE analysis really was necessary. Simply dropping genes annotated with the "Cell Cycle" GO term was not sufficient to remove the confounding, so the gene selection really must be data-driven.
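To make the filtering step concrete, here is a minimal sketch in plain Python rather than the R tooling mentioned above. The function name and the gene lists are hypothetical placeholders, not the actual project data. The only point it illustrates is that the DE genes are removed from the highly variable gene list, while the count matrix itself stays untouched.

```python
def drop_genes_from_hvgs(hvgs, genes_to_drop):
    """Return the HVG list without the unwanted genes, preserving order."""
    blocked = set(genes_to_drop)
    return [gene for gene in hvgs if gene not in blocked]


# Hypothetical gene lists, for illustration only.
hvgs = ["Mki67", "Top2a", "Cd3e", "Cd19", "Lyz2"]
de_left_vs_right = ["Mki67", "Top2a", "Hist1h1b"]

# Genes in the DE list that are not HVGs (here Hist1h1b) are simply ignored.
filtered_hvgs = drop_genes_from_hvgs(hvgs, de_left_vs_right)
print(filtered_hvgs)
```

The filtered list would then be used as the feature set for PCA, clustering and UMAP in whatever framework you work with.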
That is an example where simply dropping genes was sufficient. Regression would probably be preferable when you have evidence that many non-cell-cycle genes are co-regulated with the cell cycle phase, so that removing cell cycle genes alone would not be enough, but I do not have a dataset or example to demonstrate that.
Also see OSCA (http://bioconductor.org/books/3.15/OSCA.advanced/cell-cycle-assignment.html) for a discussion on cell cycle. I basically agree with @jared.andrews07 above that regression (and generally methods that alter the counts of all genes) should only be applied if there is good data-driven evidence that this is necessary and beneficial.
Great answer. We actually just got data back where a cell cycle shift is the main phenotype we see:
But notably, if we'd regressed this out, we wouldn't have seen it. We could now, however, follow @atpoint's suggestion and remove the cell cycle genes from the variable genes prior to DE to better pin down changes unrelated to the cell cycle.
I still always advise people not to regress out cell cycle status or info. These can be biologically interesting and a big part of many phenotypes. Differing proportions of cycling cells between conditions or samples are both low-hanging fruit and something that's easily validated at the bench. I really don't get why people try to remove that, as it's very simple to label populations as "cycling monocytes" and "monocytes" (or whatever) and create supersets as necessary.
I'd prefer that Seurat drop that part of their cell cycle vignette; I have yet to see a case where it's actually helpful.
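As a rough sketch of that kind of comparison, the snippet below computes the fraction of cycling cells per sample in plain Python. The sample names, the labels and the function name are all made up for illustration. In a real analysis the per-cell cycling label would come from a classifier or from cluster annotation.

```python
from collections import defaultdict


def cycling_fraction_per_sample(cells):
    """cells: iterable of (sample, is_cycling) pairs, one per cell."""
    totals = defaultdict(int)
    cycling = defaultdict(int)
    for sample, is_cycling in cells:
        totals[sample] += 1
        if is_cycling:
            cycling[sample] += 1
    return {sample: cycling[sample] / totals[sample] for sample in totals}


# Hypothetical per-cell labels, for illustration only.
cells = [
    ("control", False), ("control", False), ("control", True), ("control", False),
    ("treated", True), ("treated", True), ("treated", False), ("treated", True),
]

fractions = cycling_fraction_per_sample(cells)
print(fractions)
```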
I wonder what your opinion is about a practice that is common at my workplace: routinely regressing out the nFeatures, nCounts and mt.pct variables from all datasets. I haven't seen any tutorials online doing that...
nCounts and mt.pct seem reasonable, as that should just help mix lower quality/dying cells with more intact ones. nFeatures is a little dicey, as there are some cell types/states that just generally have fewer genes expressed than others, so you're potentially obfuscating real biology by regressing that out.
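To illustrate the nFeatures concern, here is a minimal sketch of what "regressing out" a covariate means conceptually: fit a linear model of one gene's expression on the covariate and keep the residuals. This is plain Python with a single covariate and hypothetical numbers, not Seurat's actual implementation. In the toy data, two cells of a cell type with few detected genes lack a marker that the other two cells express.

```python
def regress_out(values, covariate):
    """Residuals of a simple least-squares fit of values on one covariate."""
    n = len(values)
    mean_x = sum(covariate) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    if sxx == 0:
        # Constant covariate: nothing to regress on, so only center the values.
        return [y - mean_y for y in values]
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(covariate, values)]


# Hypothetical values: cells 1-2 are a cell type with few detected genes
# and no marker expression; cells 3-4 express the marker.
n_features = [1000, 1000, 3000, 3000]
marker_expression = [0.0, 0.0, 2.0, 2.0]

residuals = regress_out(marker_expression, n_features)
print(residuals)  # all residuals are (numerically) zero
```

Because the marker difference lines up perfectly with nFeatures here, the regression removes it entirely, which is the kind of lost biology described above.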