Part-1 here: Single-cell RNA-seq: Preprocessing: Data integration and batch correction
Part-3 here: Single-cell RNA-seq: Preprocessing: Data integration and batch correction-3
Full article lifted from: https://omicverse.readthedocs.io/en/latest/Tutorials-single/t_single_batch/
Harmony
Harmony is an algorithm for integrating single-cell genomics datasets. For details, see the Harmony manuscript in Nature Methods.
The function ov.single.batch_correction supports three methods, selected with the methods argument: harmony, combat and scanorama.
adata_harmony = ov.single.batch_correction(adata, batch_key='batch',
                                           methods='harmony', n_pcs=50)
adata
...Begin using harmony to correct batch effect
... as `zero_center=True`, sparse input is densified and may lead to large memory consumption
2023-11-19 20:25:03,351 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans...
INFO:harmonypy:Computing initial centroids with sklearn.KMeans...
2023-11-19 20:25:12,444 - harmonypy - INFO - sklearn.KMeans initialization complete.
INFO:harmonypy:sklearn.KMeans initialization complete.
2023-11-19 20:25:12,725 - harmonypy - INFO - Iteration 1 of 10
INFO:harmonypy:Iteration 1 of 10
2023-11-19 20:25:19,161 - harmonypy - INFO - Iteration 2 of 10
INFO:harmonypy:Iteration 2 of 10
2023-11-19 20:25:25,779 - harmonypy - INFO - Iteration 3 of 10
INFO:harmonypy:Iteration 3 of 10
2023-11-19 20:25:32,350 - harmonypy - INFO - Iteration 4 of 10
INFO:harmonypy:Iteration 4 of 10
2023-11-19 20:25:38,889 - harmonypy - INFO - Iteration 5 of 10
INFO:harmonypy:Iteration 5 of 10
2023-11-19 20:25:43,768 - harmonypy - INFO - Converged after 5 iterations
INFO:harmonypy:Converged after 5 iterations
AnnData object with n_obs × n_vars = 26707 × 3000
obs: 'GEX_n_genes_by_counts', 'GEX_pct_counts_mt', 'GEX_size_factors', 'GEX_phase', 'ADT_n_antibodies_by_counts', 'ADT_total_counts', 'ADT_iso_count', 'cell_type', 'batch', 'ADT_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker', 'is_train', 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'n_genes', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'topic_0', 'topic_1', 'topic_2', 'topic_3', 'topic_4', 'topic_5', 'topic_6', 'topic_7', 'topic_8', 'topic_9', 'topic_10', 'topic_11', 'topic_12', 'topic_13', 'topic_14', 'LDA_cluster'
var: 'feature_types', 'gene_id', 'mt', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'layers_counts', 'log1p', 'hvg', 'scaled|original|pca_var_ratios', 'scaled|original|cum_sum_eigenvalues', 'batch_colors', 'cell_type_colors', 'topic_dendogram'
obsm: 'ADT_X_pca', 'ADT_X_umap', 'ADT_isotype_controls', 'GEX_X_pca', 'GEX_X_umap', 'scaled|original|X_pca', 'X_mde_pca', 'X_topic_compositions', 'X_umap_features', 'X_mde_mira', 'X_mde_mira_topic', 'X_mde_mira_feature', 'X_harmony'
varm: 'scaled|original|pca_loadings', 'topic_feature_compositions', 'topic_feature_activations'
layers: 'counts', 'scaled', 'lognorm'
adata.obsm["X_mde_harmony"] = ov.utils.mde(adata.obsm["X_harmony"])
ov.utils.embedding(adata,
                   basis='X_mde_harmony', frameon='small',
                   color=['batch', 'cell_type'], show=False)
[<AxesSubplot: title={'center': 'batch'}, xlabel='X_mde_harmony1', ylabel='X_mde_harmony2'>,
<AxesSubplot: title={'center': 'cell_type'}, xlabel='X_mde_harmony1', ylabel='X_mde_harmony2'>]
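The embedding above is judged by eye: after correction, cells from different batches should overlap while cell types stay separate. As a rough numeric companion, here is a minimal sketch of a neighbor-based mixing score. The helper `neighbor_batch_mixing` is hypothetical (it is not part of omicverse or harmonypy), and the toy data merely stands in for `adata.obsm["X_harmony"]` and `adata.obs["batch"]`.

```python
import numpy as np

def neighbor_batch_mixing(embedding, batches, k=15):
    """Mean fraction of each cell's k nearest neighbors that belong to another batch.

    Hypothetical helper for illustration; values near 0 mean batches are separated.
    """
    embedding = np.asarray(embedding, dtype=float)
    batches = np.asarray(batches)
    # Pairwise squared Euclidean distances (O(n^2) memory, small demos only).
    sq = np.sum(embedding ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * embedding @ embedding.T
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor
    nn_idx = np.argsort(dist, axis=1)[:, :k]
    different = batches[nn_idx] != batches[:, None]
    return float(different.mean())

# Toy data: two batches drawn from the same distribution.
rng = np.random.default_rng(0)
cells_a = rng.normal(size=(100, 2))
cells_b = rng.normal(size=(100, 2))
labels = np.array(["a"] * 100 + ["b"] * 100)

separated = np.vstack([cells_a, cells_b + 10.0])  # strong batch offset
mixed = np.vstack([cells_a, cells_b])             # no batch offset

score_separated = neighbor_batch_mixing(separated, labels)
score_mixed = neighbor_batch_mixing(mixed, labels)
print(score_separated, score_mixed)
```

Because the dense distance matrix grows quadratically, the 26707 cells in this dataset would need subsampling or a proper nearest-neighbor index before a score like this could be computed.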
Combat
ComBat is a batch-effect correction method that is widely used for bulk RNA-seq data and can also be applied to single-cell sequencing data.
adata_combat = ov.single.batch_correction(adata, batch_key='batch',
                                          methods='combat', n_pcs=50)
adata
...Begin using combat to correct batch effect
Standardizing Data across genes.
Found 3 batches
Found 0 numerical variables:
Fitting L/S model and finding priors
Finding parametric adjustments
Adjusting data
AnnData object with n_obs × n_vars = 26707 × 3000
obs: 'GEX_n_genes_by_counts', 'GEX_pct_counts_mt', 'GEX_size_factors', 'GEX_phase', 'ADT_n_antibodies_by_counts', 'ADT_total_counts', 'ADT_iso_count', 'cell_type', 'batch', 'ADT_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker', 'is_train', 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'n_genes', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'topic_0', 'topic_1', 'topic_2', 'topic_3', 'topic_4', 'topic_5', 'topic_6', 'topic_7', 'topic_8', 'topic_9', 'topic_10', 'topic_11', 'topic_12', 'topic_13', 'topic_14', 'LDA_cluster'
var: 'feature_types', 'gene_id', 'mt', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'layers_counts', 'log1p', 'hvg', 'scaled|original|pca_var_ratios', 'scaled|original|cum_sum_eigenvalues', 'batch_colors', 'cell_type_colors', 'topic_dendogram'
obsm: 'ADT_X_pca', 'ADT_X_umap', 'ADT_isotype_controls', 'GEX_X_pca', 'GEX_X_umap', 'scaled|original|X_pca', 'X_mde_pca', 'X_topic_compositions', 'X_umap_features', 'X_mde_mira', 'X_mde_mira_topic', 'X_mde_mira_feature', 'X_harmony', 'X_mde_harmony', 'X_combat'
varm: 'scaled|original|pca_loadings', 'topic_feature_compositions', 'topic_feature_activations'
layers: 'counts', 'scaled', 'lognorm'
adata.obsm["X_mde_combat"] = ov.utils.mde(adata.obsm["X_combat"])
ov.utils.embedding(adata,
                   basis='X_mde_combat', frameon='small',
                   color=['batch', 'cell_type'], show=False)
[<AxesSubplot: title={'center': 'batch'}, xlabel='X_mde_combat1', ylabel='X_mde_combat2'>,
<AxesSubplot: title={'center': 'cell_type'}, xlabel='X_mde_combat1', ylabel='X_mde_combat2'>]
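The log above mentions standardizing the data and fitting an L/S (location/scale) model. To make that idea concrete, here is a simplified sketch of a per-batch location and scale adjustment. It is not ComBat itself: it omits the empirical Bayes shrinkage of the batch parameters and any covariates, and the function name `location_scale_adjust` and the toy data are assumptions made for illustration.

```python
import numpy as np

def location_scale_adjust(X, batches, eps=1e-8):
    """Align per-gene mean and spread across batches (simplified, no empirical Bayes)."""
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    labels = np.unique(batches)
    grand_mean = X.mean(axis=0)

    # Pooled within-batch standard deviation per gene.
    within_ss = np.zeros(X.shape[1])
    for b in labels:
        block = X[batches == b]
        within_ss += ((block - block.mean(axis=0)) ** 2).sum(axis=0)
    pooled_std = np.sqrt(within_ss / X.shape[0])

    adjusted = np.empty_like(X)
    for b in labels:
        mask = batches == b
        mu = X[mask].mean(axis=0)   # batch location per gene
        sd = X[mask].std(axis=0)    # batch scale per gene
        # Standardize within the batch, then map back to the shared scale.
        adjusted[mask] = (X[mask] - mu) / (sd + eps) * pooled_std + grand_mean
    return adjusted

# Toy data: two batches with different means and variances for 3 genes.
rng = np.random.default_rng(0)
batch_labels = np.array([0] * 50 + [1] * 60)
expr = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 3)),
    rng.normal(loc=5.0, scale=2.0, size=(60, 3)),
])
corrected = location_scale_adjust(expr, batch_labels)
print(corrected[batch_labels == 0].mean(axis=0))
print(corrected[batch_labels == 1].mean(axis=0))
```

After the adjustment both batches share the same per-gene mean and nearly the same per-gene standard deviation. Note that this naive version would also erase real biological differences if the batches had different cell-type compositions.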
Scanorama
Integrating single-cell RNA sequencing (scRNA-seq) data from multiple experiments, laboratories and technologies can uncover biological insights, but many scRNA-seq integration methods require the datasets to derive from functionally similar cells. Scanorama is an algorithm that identifies and merges the shared cell types among all pairs of datasets, so that heterogeneous collections of scRNA-seq data can be integrated accurately.
adata_scanorama = ov.single.batch_correction(adata, batch_key='batch',
                                             methods='scanorama', n_pcs=50)
adata
...Begin using scanorama to correct batch effect
s1d3
s2d1
s3d7
Found 3000 genes among all datasets
[[0. 0.50093205 0.5758346 ]
[0. 0. 0.60733037]
[0. 0. 0. ]]
Processing datasets (1, 2)
Processing datasets (0, 2)
Processing datasets (0, 1)
(26707, 50)
AnnData object with n_obs × n_vars = 26707 × 3000
obs: 'GEX_n_genes_by_counts', 'GEX_pct_counts_mt', 'GEX_size_factors', 'GEX_phase', 'ADT_n_antibodies_by_counts', 'ADT_total_counts', 'ADT_iso_count', 'cell_type', 'batch', 'ADT_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker', 'is_train', 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'n_genes', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'topic_0', 'topic_1', 'topic_2', 'topic_3', 'topic_4', 'topic_5', 'topic_6', 'topic_7', 'topic_8', 'topic_9', 'topic_10', 'topic_11', 'topic_12', 'topic_13', 'topic_14', 'LDA_cluster'
var: 'feature_types', 'gene_id', 'mt', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'layers_counts', 'log1p', 'hvg', 'scaled|original|pca_var_ratios', 'scaled|original|cum_sum_eigenvalues', 'batch_colors', 'cell_type_colors', 'topic_dendogram'
obsm: 'ADT_X_pca', 'ADT_X_umap', 'ADT_isotype_controls', 'GEX_X_pca', 'GEX_X_umap', 'scaled|original|X_pca', 'X_mde_pca', 'X_topic_compositions', 'X_umap_features', 'X_mde_mira', 'X_mde_mira_topic', 'X_mde_mira_feature', 'X_harmony', 'X_mde_harmony', 'X_combat', 'X_mde_combat', 'X_scanorama'
varm: 'scaled|original|pca_loadings', 'topic_feature_compositions', 'topic_feature_activations'
layers: 'counts', 'scaled', 'lognorm'
adata.obsm["X_mde_scanorama"] = ov.utils.mde(adata.obsm["X_scanorama"])
ov.utils.embedding(adata,
                   basis='X_mde_scanorama', frameon='small',
                   color=['batch', 'cell_type'], show=False)
[<AxesSubplot: title={'center': 'batch'}, xlabel='X_mde_scanorama1', ylabel='X_mde_scanorama2'>,
<AxesSubplot: title={'center': 'cell_type'}, xlabel='X_mde_scanorama1', ylabel='X_mde_scanorama2'>]
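Scanorama's matching of shared cell types between pairs of datasets rests on mutual nearest neighbors: a pair of cells counts as a match only if each is among the other's nearest neighbors in the other dataset. Here is a minimal brute-force sketch of that matching step. The function `mutual_nearest_neighbors` and the toy datasets are assumptions made for illustration; Scanorama itself works in a reduced space with approximate neighbor search and then uses the matches to merge the datasets, none of which is shown.

```python
import numpy as np

def mutual_nearest_neighbors(A, B, k=5):
    """Return sorted (i, j) pairs where A[i] and B[j] are in each other's k nearest neighbors."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Squared Euclidean distances between every cell in A and every cell in B.
    dist = (np.sum(A ** 2, axis=1)[:, None]
            + np.sum(B ** 2, axis=1)[None, :]
            - 2.0 * A @ B.T)
    nn_of_a = np.argsort(dist, axis=1)[:, :min(k, B.shape[0])]    # nearest B cells per A cell
    nn_of_b = np.argsort(dist.T, axis=1)[:, :min(k, A.shape[0])]  # nearest A cells per B cell
    a_to_b = {(i, int(j)) for i in range(A.shape[0]) for j in nn_of_a[i]}
    b_to_a = {(int(i), j) for j in range(B.shape[0]) for i in nn_of_b[j]}
    return sorted(a_to_b & b_to_a)

# Toy data: each dataset has one shared cell type (rows 0-29) and one unique type (rows 30-59).
rng = np.random.default_rng(0)
shared_a = rng.normal(loc=0.0, scale=0.5, size=(30, 2))
only_a = rng.normal(loc=20.0, scale=0.5, size=(30, 2))
shared_b = rng.normal(loc=0.0, scale=0.5, size=(30, 2))
only_b = rng.normal(loc=[-20.0, 20.0], scale=0.5, size=(30, 2))
dataset_a = np.vstack([shared_a, only_a])
dataset_b = np.vstack([shared_b, only_b])

pairs = mutual_nearest_neighbors(dataset_a, dataset_b, k=5)
matched_fraction_a = len({i for i, _ in pairs}) / dataset_a.shape[0]
print(len(pairs), matched_fraction_a)
```

In this toy example every match falls inside the shared cell type, while the dataset-specific types find no mutual partner. That is why a mutual-neighbor criterion can tolerate datasets that only partly overlap in composition.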
Wouldn't it make much more sense to make one single post that lists all available tutorials you have and then just give a link to the GitHub repo that stores the ipynbs? You make a flood of posts here, it is hard to follow, and tutorials spanning several posts (like part 1/2/3) are also tedious to link in other threads. I don't see the point.
Your images for the harmony and combat approaches are the same. The original page shows that they should be different. Please fix this.