According to the scvi-tools tutorials, it is recommended to select highly variable genes before training the SCVI model. Here is an excerpt from the harmonization tutorial (https://docs.scvi-tools.org/en/stable/user_guide/notebooks/harmonization.html):
adata.layers["counts"] = adata.X.copy()  # preserve the raw counts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata  # keep full dimension safe
sc.pp.highly_variable_genes(
    adata,
    flavor="seurat_v3",
    n_top_genes=2000,
    layer="counts",
    batch_key="batch",
    subset=True,
)
What confuses me is that they set subset=True. As far as I understand, this means they are not filtering out the non-variable genes, only marking the highly variable ones. Then they train the SCVI model:
scvi.data.setup_anndata(adata, layer="counts", batch_key="batch")
vae = scvi.model.SCVI(adata)
vae.train()
How does SCVI know which genes are highly variable and which are not? Is it because of the "counts" layer? Does anybody know whether this works because the "counts" layer only contains the highly variable genes, or because the layer marks the highly variable genes in a way SCVI understands?