Does SCVI automatically use highly variable genes?
3.8 years ago

According to the SCVI tutorials, it is recommended to pre-select highly variable genes before training the SCVI model. Here is a piece of code from the harmonization tutorial: https://docs.scvi-tools.org/en/stable/user_guide/notebooks/harmonization.html

adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata  # keep full dimension safe
sc.pp.highly_variable_genes(
    adata,
    flavor="seurat_v3",
    n_top_genes=2000,
    layer="counts",
    batch_key="batch",
    subset=True,
)

What confuses me is that they set subset=True, which I took to mean that the non-variable genes are not filtered out and the highly variable ones are only marked. Then they train the SCVI model:

scvi.data.setup_anndata(adata, layer="counts", batch_key="batch")
vae = scvi.model.SCVI(adata)
vae.train()

How does SCVI know which genes are highly variable and which are not? Is it because of the counts layer? Does anybody know whether the counts layer contains only the highly variable genes, or whether the layer marks the highly variable genes in a way SCVI understands?

scRNA-seq SCVI Highly variable genes • 1.9k views
3.1 years ago
valehvpa ▴ 10

Hi, and thanks for reaching out. I am a member of the scvi-tools team and can offer some help. Please also feel free to reach out to us on Discourse.

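The subset=True parameter indicates that we do want to filter down to the highly variable genes. Scanpy updates adata in place so that it contains only the highly variable genes (code reference); with subset=False, the default, it would only flag them in adata.var["highly_variable"]. Because the whole object is subset, the counts layer is reduced to the same genes. We then use this same adata object for the later steps you mentioned, such as training.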
