Part-1 here: Single-cell RNA-seq: Preprocessing: Data integration and batch correction
Part-2 here: Single-cell RNA-seq: Preprocessing: Data integration and batch correction-3
Full article lifted from: https://omicverse.readthedocs.io/en/latest/Tutorials-single/t_single_batch/
scVI
An important task of single-cell analysis is the integration of several samples, which we can perform with scVI. For integration, scVI treats the data as unlabelled. When our dataset is fully labelled (perhaps in independent studies, or independent analysis pipelines), we can obtain an integration that better preserves biology using scANVI, which incorporates cell type annotation information. Here we demonstrate this functionality with an integrated analysis of cells from the lung atlas integration task from the scIB manuscript. The same pipeline would generally be used to analyze any collection of scRNA-seq datasets.
adata_scvi=ov.single.batch_correction(adata,batch_key='batch',
methods='scVI',n_layers=2, n_latent=30, gene_likelihood="nb")
adata
...Begin using scVI to correct batch effect
Global seed set to 0
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 300/300: 100%|| 300/300 [05:51<00:00, 1.18s/it, loss=1.51e+03, v_num=1]
`Trainer.fit` stopped: `max_epochs=300` reached.
Epoch 300/300: 100%|| 300/300 [05:51<00:00, 1.17s/it, loss=1.51e+03, v_num=1]
AnnData object with n_obs × n_vars = 26707 × 3000
obs: 'GEX_n_genes_by_counts', 'GEX_pct_counts_mt', 'GEX_size_factors', 'GEX_phase', 'ADT_n_antibodies_by_counts', 'ADT_total_counts', 'ADT_iso_count', 'cell_type', 'batch', 'ADT_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker', 'is_train', 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'n_genes', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'topic_0', 'topic_1', 'topic_2', 'topic_3', 'topic_4', 'topic_5', 'topic_6', 'topic_7', 'topic_8', 'topic_9', 'topic_10', 'topic_11', 'topic_12', 'topic_13', 'topic_14', 'LDA_cluster', '_scvi_batch', '_scvi_labels'
var: 'feature_types', 'gene_id', 'mt', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'batch_colors', 'cell_type_colors', 'hvg', 'layers_counts', 'log1p', 'scaled|original|cum_sum_eigenvalues', 'scaled|original|pca_var_ratios', 'scrublet', 'topic_dendogram', '_scvi_uuid', '_scvi_manager_uuid'
obsm: 'ADT_X_pca', 'ADT_X_umap', 'ADT_isotype_controls', 'GEX_X_pca', 'GEX_X_umap', 'X_combat', 'X_harmony', 'X_mde_combat', 'X_mde_harmony', 'X_mde_mira', 'X_mde_mira_feature', 'X_mde_mira_topic', 'X_mde_pca', 'X_mde_scanorama', 'X_scanorama', 'X_topic_compositions', 'X_umap_features', 'scaled|original|X_pca', 'X_scVI'
varm: 'scaled|original|pca_loadings', 'topic_feature_activations', 'topic_feature_compositions'
layers: 'counts', 'lognorm', 'scaled'
adata.obsm["X_mde_scVI"] = ov.utils.mde(adata.obsm["X_scVI"])
ov.utils.embedding(adata,
basis='X_mde_scVI',frameon='small',
color=['batch','cell_type'],show=False)
[<AxesSubplot: title={'center': 'batch'}, xlabel='X_mde_scVI1', ylabel='X_mde_scVI2'>,
<AxesSubplot: title={'center': 'cell_type'}, xlabel='X_mde_scVI1', ylabel='X_mde_scVI2'>]
MIRA+CODAL
Topic modeling of batched single-cell data is challenging because these models cannot typically distinguish between biological and technical effects of the assay. CODAL (COvariate Disentangling Augmented Loss) uses a novel mutual information regularization technique to explicitly disentangle these two sources of variation.
LDA_obj=ov.utils.LDA_topic(adata,feature_type='expression',
highly_variable_key='highly_variable_features',
layers='counts',batch_key='batch',learning_rate=1e-3)
INFO:mira.adata_interface.topic_model:Predicting expression from genes from col: highly_variable_features
mira have been install version: 2.1.0
Gathering dataset statistics: 0%| | 0/26707 [00:00<?, ?it/s]
Learning rate range test: 0%| | 0/98 [00:00<?, ?it/s]
INFO:mira.topic_model.base:Set learning rates to: (0.0061957597093704065, 0.22248668375233174)
LDA_obj.plot_topic_contributions(6)
Gathering dataset statistics: 0%| | 0/26707 [00:00<?, ?it/s]
Epoch 0: 0%| | 0/24 [00:00<?, ?it/s]
Predicting latent vars: 0%| | 0/105 [00:00<?, ?it/s]
LDA_obj.predicted(15)
INFO:mira.adata_interface.topic_model:Predicting expression from genes from col: highly_variable_features
running LDA topic predicted
Gathering dataset statistics: 0%| | 0/26707 [00:00<?, ?it/s]
Epoch 0: 0%| | 0/24 [00:00<?, ?it/s]
INFO:mira.topic_model.base:Moving model to device: cpu
Predicting latent vars: 0%| | 0/105 [00:00<?, ?it/s]
INFO:mira.adata_interface.core:Added key to obsm: X_topic_compositions
INFO:mira.adata_interface.core:Added key to obsm: X_umap_features
INFO:mira.adata_interface.topic_model:Added cols: topic_0, topic_1, topic_2, topic_3, topic_4, topic_5, topic_6, topic_7, topic_8, topic_9, topic_10, topic_11, topic_12, topic_13, topic_14
INFO:mira.adata_interface.core:Added key to varm: topic_feature_compositions
INFO:mira.adata_interface.core:Added key to varm: topic_feature_activations
INFO:mira.adata_interface.topic_model:Added key to uns: topic_dendogram
finished: found 15 clusters and added
'LDA_cluster', the cluster labels (adata.obs, categorical)
adata.obsm["X_mde_mira_topic"] = ov.utils.mde(adata.obsm["X_topic_compositions"])
adata.obsm["X_mde_mira_feature"] = ov.utils.mde(adata.obsm["X_umap_features"])
ov.utils.embedding(adata,
basis='X_mde_mira_topic',frameon='small',
color=['batch','cell_type'],show=False)
[<AxesSubplot: title={'center': 'batch'}, xlabel='X_mde_mira_topic1', ylabel='X_mde_mira_topic2'>,
<AxesSubplot: title={'center': 'cell_type'}, xlabel='X_mde_mira_topic1', ylabel='X_mde_mira_topic2'>]
ov.utils.embedding(adata,
basis='X_mde_mira_feature',frameon='small',
color=['batch','cell_type'],show=False)
[<AxesSubplot: title={'center': 'batch'}, xlabel='X_mde_mira_feature1', ylabel='X_mde_mira_feature2'>,
<AxesSubplot: title={'center': 'cell_type'}, xlabel='X_mde_mira_feature1', ylabel='X_mde_mira_feature2'>]
Benchmarking test
The methods demonstrated here are selected based on results from benchmarking experiments including the single-cell integration benchmarking project [Luecken et al., 2021]. This project also produced a software package called scib that can be used to run a range of integration methods as well as the metrics that were used for evaluation. In this section, we show how to use this package to evaluate the quality of an integration.
adata.write_h5ad('neurips2021_batch_all.h5ad',compression='gzip')
adata=sc.read('neurips2021_batch_all.h5ad')
adata.obsm['X_pca']=adata.obsm['scaled|original|X_pca'].copy()
adata.obsm['X_mira_topic']=adata.obsm['X_topic_compositions'].copy()
adata.obsm['X_mira_feature']=adata.obsm['X_umap_features'].copy()
from scib_metrics.benchmark import Benchmarker
bm = Benchmarker(
adata,
batch_key="batch",
label_key="cell_type",
embedding_obsm_keys=["X_pca", "X_combat", "X_harmony",
'X_scanorama','X_mira_topic','X_mira_feature','X_scVI'],
n_jobs=8,
)
bm.benchmark()
bm.plot_results_table(min_max_scale=False)
<plottable.table.Table at 0x7f0d9a429d60>
We can find that harmony removes the batch effect the best of the three methods that do not use the GPU, scVI is method to remove batch effect using GPU.