The count table, a numeric matrix of genes × cells, is the basic input data structure in the analysis of single-cell RNA-sequencing data. A common preprocessing step is to adjust the counts for variable sampling efficiency and to transform them so that the variance is similar across the dynamic range.
Choosing suitable methods to preprocess scRNA-seq data is important. Here, we introduce some preprocessing steps that make downstream analysis easier.
Users can compare this tutorial with the scanpy tutorial to learn how to use omicverse.
import omicverse as ov
import scanpy as sc
The data consist of 3k PBMCs from a healthy donor and are freely available from 10x Genomics. On a Unix system, you can run the following commands to download and unpack the data. The last line creates a directory for writing processed data.
mkdir data
wget -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
mkdir write
adata = sc.read_10x_mtx(
    'data/filtered_gene_bc_matrices/hg19/',  # the directory with the `.mtx` file
    var_names='gene_symbols',                # use gene symbols for the variable names (variables-axis index)
    cache=True)                              # write a cache file for faster subsequent reading
... reading from cache file cache/data-filtered_gene_bc_matrices-hg19-matrix.h5ad
AnnData object with n_obs × n_vars = 2700 × 32738
var: 'gene_ids'
Quality control
For single-cell data, quality control is required before analysis, including the removal of doublets, low-expressing cells, and low-expressing genes. In addition, we filter on the mitochondrial gene ratio, the number of transcripts, the number of genes expressed per cell, cell complexity, and so on. For a detailed description of the different QCs please see the document:
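# QC thresholds: mitochondrial fraction, UMI count, and detected genes per cell
adata = ov.pp.qc(adata, tresh={'mito_perc': 0.05, 'nUMIs': 500, 'detected_genes': 250})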
Calculate QC metrics
End calculation of QC metrics.
Original cell number: 2700
Begin of post doublets removal and QC plot
Running Scrublet
filtered out 19024 genes that are detected in less than 3 cells
normalizing counts per cell
finished (0:00:00)
extracting highly variable genes
finished (0:00:00)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
normalizing counts per cell
finished (0:00:00)
normalizing counts per cell
finished (0:00:00)
Embedding transcriptomes using PCA...
Automatically set threshold at doublet score = 0.31
Detected doublet rate = 1.4%
Estimated detectable doublet fraction = 35.1%
Overall doublet rate:
Expected = 5.0%
Estimated = 4.0%
Scrublet finished (0:00:02)
Cells retained after scrublet: 2662, 38 removed.
End of post doublets removal and QC plots.
Filters application (seurat or mads)
Lower treshold, nUMIs: 500; filtered-out-cells: 0
Lower treshold, n genes: 250; filtered-out-cells: 3
Lower treshold, mito %: 0.05; filtered-out-cells: 56
Filters applicated.
Total cell filtered out with this last --mode seurat QC (and its chosen options): 59
Cells retained after scrublet and seurat filtering: 2603, 97 removed.
filtered out 19107 genes that are detected in less than 3 cells
AnnData object with n_obs × n_vars = 2603 × 13631
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes'
var: 'gene_ids', 'mt', 'n_cells'
uns: 'scrublet'
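The three threshold filters reported in the log above can be illustrated with a small, dependency-free sketch. It is not omicverse's implementation: the function names, the `MT-` prefix used to flag mitochondrial genes, the inclusive comparisons, and the toy counts and thresholds are all assumptions made for illustration.

```python
def qc_metrics(counts, gene_names, mito_prefix="MT-"):
    """Compute per-cell QC metrics from raw counts (one row per cell)."""
    # Assumption: mitochondrial genes are recognised by their name prefix.
    mito_idx = [i for i, g in enumerate(gene_names) if g.startswith(mito_prefix)]
    metrics = []
    for row in counts:
        n_umis = sum(row)
        detected = sum(1 for c in row if c > 0)
        mito = sum(row[i] for i in mito_idx)
        metrics.append({
            "nUMIs": n_umis,
            "detected_genes": detected,
            "mito_perc": mito / n_umis if n_umis > 0 else 0.0,
        })
    return metrics


def passes_qc(m, tresh):
    """Keep cells with enough UMIs and genes and a low mitochondrial fraction."""
    # Assumption: thresholds are inclusive; the real tool may differ.
    return (m["nUMIs"] >= tresh["nUMIs"]
            and m["detected_genes"] >= tresh["detected_genes"]
            and m["mito_perc"] <= tresh["mito_perc"])


# Toy data: 3 cells x 4 genes (hypothetical values, not the PBMC data).
genes = ["MT-CO1", "CD3E", "LYZ", "NKG7"]
counts = [
    [1, 40, 30, 29],   # passes all three filters
    [30, 40, 20, 10],  # high mitochondrial fraction
    [0, 5, 0, 3],      # too few UMIs and detected genes
]
toy_tresh = {"mito_perc": 0.05, "nUMIs": 50, "detected_genes": 3}

metrics = qc_metrics(counts, genes)
kept = [i for i, m in enumerate(metrics) if passes_qc(m, toy_tresh)]
print("cells kept:", kept)
```

Only the first toy cell survives: the second fails the mitochondrial filter and the third fails both lower bounds.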
Highly variable gene detection
Here we use Pearson residuals to select highly variable genes. This approach has been proposed as superior to ordinary normalisation; see the article in Nature Methods for details.
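To make the idea concrete, here is a minimal sketch of analytic Pearson residuals under a negative binomial null model, followed by ranking genes by the variance of their residuals. It uses plain Python lists so it runs without dependencies. The overdispersion `theta=100`, the clipping bound of sqrt(number of cells), and the toy counts are assumptions chosen for illustration; this is not the code omicverse runs.

```python
import math


def pearson_residual_variances(counts, theta=100.0):
    """Per-gene variance of analytic Pearson residuals.

    counts: list of cells, each a list of raw counts per gene.
    theta:  assumed negative binomial overdispersion (illustrative value).
    """
    n_cells = len(counts)
    n_genes = len(counts[0])
    cell_sums = [sum(row) for row in counts]
    gene_sums = [sum(row[g] for row in counts) for g in range(n_genes)]
    total = sum(cell_sums)
    clip = math.sqrt(n_cells)  # assumed clipping bound for residuals

    variances = []
    for g in range(n_genes):
        residuals = []
        for c in range(n_cells):
            # Expected count if the gene only followed sequencing depth.
            mu = cell_sums[c] * gene_sums[g] / total
            if mu == 0:
                residuals.append(0.0)
                continue
            z = (counts[c][g] - mu) / math.sqrt(mu + mu * mu / theta)
            residuals.append(max(-clip, min(clip, z)))
        mean = sum(residuals) / n_cells
        variances.append(sum((r - mean) ** 2 for r in residuals) / n_cells)
    return variances


# Toy data: 4 cells x 3 genes; only the last gene is cell-specific.
toy_counts = [
    [10, 5, 0],
    [10, 5, 0],
    [10, 5, 20],
    [10, 5, 0],
]
res_var = pearson_residual_variances(toy_counts)
ranking = sorted(range(len(res_var)), key=lambda g: res_var[g], reverse=True)
print("genes ranked by residual variance:", ranking)
```

Genes whose counts deviate most from what sequencing depth alone predicts get the largest residual variance; in the toy matrix that is the gene expressed in a single cell.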
Sometimes we need to recover the original counts for some single-cell calculations, but storing them only in a layer may result in missing data, so we provide two functions here, a store function and a retrieve function, to save the original data.
We set layers='counts', so the counts are stored in adata.uns['layers_counts'].
......The X of adata have been stored in counts
AnnData object with n_obs × n_vars = 2603 × 13631
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes'
var: 'gene_ids', 'mt', 'n_cells'
uns: 'scrublet', 'layers_counts'
The mode argument uses | to control the two preprocessing steps: the part before | selects the normalisation step, either shiftlog or pearson, and the part after | selects the highly variable gene calculation step, either pearson or seurat. Our default is shiftlog|pearson.
- if you use shiftlog normalisation, you need to set target_sum=50*1e4. Many people prefer target_sum=1e4, but in our tests 50*1e4 gave better results.
- if you use pearson normalisation, you don't need to set target_sum.
If the version of omicverse is lower than 1.4.13, the mode can only be set to scanpy or pearson.
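As a rough illustration of what the shiftlog step and its target_sum do, the sketch below rescales every cell to the same total and then applies log(1 + x). The function name `shiftlog` and the toy counts are hypothetical, and the sketch deliberately omits details such as excluding highly expressed genes from the size factor, so treat it as a simplified model and not as the library's code.

```python
import math


def shiftlog(counts, target_sum=50 * 1e4):
    """Scale each cell to `target_sum` total counts, then apply log(1 + x)."""
    normalized = []
    for row in counts:
        total = sum(row)
        # Cells with no counts are left at zero.
        factor = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(c * factor) for c in row])
    return normalized


# Two toy cells with the same composition but a 10x difference in depth.
cells = [[1, 3], [10, 30]]
norm = shiftlog(cells)
print(norm[0])
print(norm[1])

# A smaller target_sum shrinks all values, which changes how much
# the logarithm compresses high counts relative to low counts.
norm_small = shiftlog(cells, target_sum=1e4)
print(norm_small[0])
```

After normalisation the two toy cells have identical profiles, which is the point of correcting for sampling efficiency before comparing cells.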
Begin robust gene identification
After filtration, 13631/13631 genes are kept. Among 13631 genes, 13631 genes are robust.
End of robust gene identification.
Begin size normalization: shiftlog and HVGs selection pearson
normalizing counts per cell The following highly-expressed genes are not considered during normalization factor computation:
finished (0:00:00)
extracting highly variable genes
--> added
'highly_variable', boolean vector (adata.var)
'highly_variable_rank', float vector (adata.var)
'highly_variable_nbatches', int vector (adata.var)
'highly_variable_intersection', boolean vector (adata.var)
'means', float vector (adata.var)
'variances', float vector (adata.var)
'residual_variances', float vector (adata.var)
End of size normalization: shiftlog and HVGs selection pearson
AnnData object with n_obs × n_vars = 2603 × 13631
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes'
var: 'gene_ids', 'mt', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'layers_counts', 'log1p', 'hvg'
layers: 'counts'
Set the .raw attribute of the AnnData object to the normalized and logarithmized raw gene expression for later use in differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object.
adata.raw = adata
adata = adata[:, adata.var.highly_variable_features]
View of AnnData object with n_obs × n_vars = 2603 × 2000
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes'
var: 'gene_ids', 'mt', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'layers_counts', 'log1p', 'hvg'
layers: 'counts'
At this point the adata.X matrix is normalized, and so is the data in raw. To get the unnormalized counts back, we can use the retrieve function ov.utils.retrieve_layers.
adata_counts = adata.copy()
ov.utils.retrieve_layers(adata_counts, layers='counts')  # assumes the same layers name used when storing
print('normalize adata:', adata.X.max())
print('raw count adata:', adata_counts.X.max())
......The X of adata have been stored in raw
......The layers counts of adata have been retreved
normalize adata: 11.381063
raw count adata: 419.0
AnnData object with n_obs × n_vars = 2603 × 2000
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes'
var: 'gene_ids', 'mt', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'layers_counts', 'log1p', 'hvg', 'layers_raw'
layers: 'counts'
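If we wish to recover the original count matrix for all genes, not only the highly variable ones, we can try the following code
adata_counts = adata.raw.to_adata().copy()
ov.utils.retrieve_layers(adata_counts, layers='counts')  # assumes the same layers name used when storing
print('normalize adata:', adata.X.max())
print('raw count adata:', adata_counts.X.max())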
......The X of adata have been stored in raw
......The layers counts of adata have been retreved
normalize adata: 11.381063
raw count adata: 419.0
AnnData object with n_obs × n_vars = 2603 × 13631
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes'
var: 'gene_ids', 'mt', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'layers_counts', 'log1p', 'hvg', 'layers_raw'
Principal component analysis
In contrast to scanpy, we do not scale the expression matrix in place; instead, we store the variance-scaled result in a layer. Scaling may change the data distribution, and we have not found it to be meaningful in any scenario other than principal component analysis.
... as `zero_center=True`, sparse input is densified and may lead to large memory consumption
AnnData object with n_obs × n_vars = 2603 × 2000
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes'
var: 'gene_ids', 'mt', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'layers_counts', 'log1p', 'hvg'
layers: 'counts', 'scaled'
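The design choice of keeping the scaled values next to the expression matrix, instead of overwriting it, can be sketched as follows. A plain dictionary stands in for the AnnData object, and the sample standard deviation and the clipping at `max_value=10` are assumptions for illustration, not confirmed omicverse defaults.

```python
import math


def scale_genes(matrix, max_value=10.0):
    """Return a z-scored copy of `matrix` (cells x genes), clipped to +/- max_value."""
    n_cells = len(matrix)
    n_genes = len(matrix[0])
    scaled = [[0.0] * n_genes for _ in range(n_cells)]
    for g in range(n_genes):
        column = [row[g] for row in matrix]
        mean = sum(column) / n_cells
        # Assumption: sample variance (divide by n - 1).
        var = sum((v - mean) ** 2 for v in column) / (n_cells - 1)
        std = math.sqrt(var) or 1.0  # constant genes stay at zero
        for c in range(n_cells):
            z = (column[c] - mean) / std
            scaled[c][g] = max(-max_value, min(max_value, z))
    return scaled


# A dictionary standing in for an AnnData object (hypothetical toy values).
data = {"X": [[1.0, 0.0], [2.0, 0.0], [3.0, 4.0]], "layers": {}}
data["layers"]["scaled"] = scale_genes(data["X"])

print("X (unchanged):", data["X"])
print("scaled layer: ", data["layers"]["scaled"])
```

Because the scaled values live in their own layer, steps that expect log-normalized expression can keep reading X, while PCA can read the scaled layer.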
If you want to perform PCA on the lognorm layer, you can set layer='lognorm', but we think scaling is necessary for PCA.
AnnData object with n_obs × n_vars = 2603 × 2000
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes'
var: 'gene_ids', 'mt', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'layers_counts', 'log1p', 'hvg', 'scaled|original|pca_var_ratios', 'scaled|original|cum_sum_eigenvalues'
obsm: 'scaled|original|X_pca'
varm: 'scaled|original|pca_loadings'
layers: 'counts', 'scaled', 'lognorm'