Part-2 here: Single-cell RNA-seq: Preprocessing: Data integration and batch correction
Full article lifted from: https://omicverse.readthedocs.io/en/latest/Tutorials-single/t_preprocess/
Embedding the neighborhood graph
We suggest embedding the graph in two dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preserves trajectories. In some occasions, you might still observe disconnected clusters and similar connectivity violations. They can usually be remedied by running:
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50,
use_rep='scaled|original|X_pca')
computing neighbors
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:01)
To visualize the PCA's embeddings, we use the pymde
package wrapper in omicverse. This is an alternative to UMAP that is GPU-accelerated.
adata.obsm["X_mde"] = ov.utils.mde(adata.obsm["scaled|original|X_pca"])
adata
AnnData object with n_obs × n_vars = 2603 × 2000
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes'
var: 'gene_ids', 'mt', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'layers_counts', 'log1p', 'hvg', 'scaled|original|pca_var_ratios', 'scaled|original|cum_sum_eigenvalues', 'neighbors'
obsm: 'scaled|original|X_pca', 'X_pca', 'X_mde'
varm: 'scaled|original|pca_loadings'
layers: 'counts', 'scaled', 'lognorm'
obsp: 'distances', 'connectivities'
ov.utils.embedding(adata,
basis='X_mde',
color='CST3',
frameon='small')
You also can use umap
to visualize the neighborhood graph
sc.tl.umap(adata)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:02)
ov.utils.embedding(adata,
basis='X_umap',
color='CST3',
frameon='small')
Clustering the neighborhood graph
As with Seurat and many other frameworks, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag et al. (2018). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.
sc.tl.leiden(adata)
running Leiden clustering
finished: found 11 clusters and added
'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
We redesigned the visualisation of embedding to distinguish it from scanpy's embedding by adding the parameter fraemon='small'
, which causes the axes to be scaled with the colourbar
ov.utils.embedding(adata,
basis='X_mde',
color=['leiden', 'CST3', 'NKG7'],
frameon='small')
We also provide a boundary visualisation function ov.utils.plot_ConvexHull
to visualise specific clusters.
Arguments:
color
: if None will use the color of clustersalpha
: default is 0.2
import matplotlib.pyplot as plt
fig,ax=plt.subplots( figsize = (4,4))
ov.utils.embedding(adata,
basis='X_mde',
color=['leiden'],
frameon='small',
show=False,
ax=ax)
ov.utils.plot_ConvexHull(adata,
basis='X_mde',
cluster_key='leiden',
hull_cluster='0',
ax=ax)
leiden_colors
<AxesSubplot: title={'center': 'leiden'}, xlabel='X_mde1', ylabel='X_mde2'>
If you have too many labels, e.g. too many cell types, and you are concerned about cell overlap, then consider trying the ov.utils.gen_mpl_labels
function, which improves text overlap.
In addition, we make use of the patheffects
function, which makes our text have outlines
adjust_kwargs
: it could be found in packageadjusttext
text_kwargs
: it could be found in classplt.texts
from matplotlib import patheffects
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4,4))
ov.utils.embedding(adata,
basis='X_mde',
color=['leiden'],
show=False, legend_loc=None, add_outline=False,
frameon='small',legend_fontoutline=2,ax=ax
)
ov.utils.gen_mpl_labels(
adata,
'leiden',
exclude=("None",),
basis='X_mde',
ax=ax,
adjust_kwargs=dict(arrowprops=dict(arrowstyle='-', color='black')),
text_kwargs=dict(fontsize= 12 ,weight='bold',
path_effects=[patheffects.withStroke(linewidth=2, foreground='w')] ),
)
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']
sc.pl.dotplot(adata, marker_genes, groupby='leiden',
standard_scale='var');
Finding marker genes
Let us compute a ranking for the highly differential genes in each cluster. For this, by default, the .raw attribute of AnnData is used in case it has been initialized before. The simplest and fastest method to do so is the t-test.
sc.tl.dendrogram(adata,'leiden',use_rep='scaled|original|X_pca')
sc.tl.rank_genes_groups(adata, 'leiden', use_rep='scaled|original|X_pca',
method='t-test',use_raw=False,key_added='leiden_ttest')
sc.pl.rank_genes_groups_dotplot(adata,groupby='leiden',
cmap='Spectral_r',key='leiden_ttest',
standard_scale='var',n_genes=3)
Storing dendrogram info using `.uns['dendrogram_leiden']`
ranking genes
finished: added to `.uns['leiden_ttest']`
'names', sorted np.recarray to be indexed by group ids
'scores', sorted np.recarray to be indexed by group ids
'logfoldchanges', sorted np.recarray to be indexed by group ids
'pvals', sorted np.recarray to be indexed by group ids
'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)
cosg
is also considered to be a better algorithm for finding marker genes. Here, omicverse provides the calculation of cosg
Paper: Accurate and fast cell marker gene identification with COSG
Code: https://github.com/genecell/COSG
sc.tl.rank_genes_groups(adata, groupby='leiden',
method='t-test',use_rep='scaled|original|X_pca',)
ov.single.cosg(adata, key_added='leiden_cosg', groupby='leiden')
sc.pl.rank_genes_groups_dotplot(adata,groupby='leiden',
cmap='Spectral_r',key='leiden_cosg',
standard_scale='var',n_genes=3)
ranking genes
finished: added to `.uns['rank_genes_groups']`
'names', sorted np.recarray to be indexed by group ids
'scores', sorted np.recarray to be indexed by group ids
'logfoldchanges', sorted np.recarray to be indexed by group ids
'pvals', sorted np.recarray to be indexed by group ids
'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)
finished identifying marker genes by COSG
Other plotting
Next, let's try another chart, which we call the Stacked Volcano Chart. We need to prepare two dictionaries, a data_dict and a color_dict, both of which have the same key requirements.
For data_dict
. we require the contents within each key to be a DataFrame containing ['names','logfoldchanges','pvals_adj'
], where names
stands for gene names, logfoldchanges
stands for differential expression multiplicity, pvals_adj
stands for significance p-value
data_dict={}
for i in adata.obs['leiden'].cat.categories:
data_dict[i]=sc.get.rank_genes_groups_df(adata, group=i, key='leiden_ttest',
pval_cutoff=None,log2fc_min=None)
data_dict.keys()
dict_keys(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10'])
data_dict[i].head()
names | scores | logfoldchanges | pvals | pvals_adj | |
---|---|---|---|---|---|
0 | PF4 | 92.918243 | 17.017431 | 4.248822e-15 | 7.122920e-15 |
1 | PPBP | 85.572166 | 17.316359 | 6.224002e-15 | 1.041674e-14 |
2 | SDPR | 72.907913 | 16.093300 | 5.214614e-14 | 8.662149e-14 |
3 | GNG11 | 72.813034 | 16.640303 | 5.858609e-14 | 9.723832e-14 |
4 | GPX1 | 62.821903 | 9.493515 | 2.581554e-24 | 4.875457e-24 |
For color_dict
, we require that the colour to be displayed for the current key is stored within each key.
type_color_dict=dict(zip(adata.obs['leiden'].cat.categories,
adata.uns['leiden_colors']))
type_color_dict
{'0': '#1f77b4',
'1': '#ff7f0e',
'2': '#279e68',
'3': '#d62728',
'4': '#aa40fc',
'5': '#8c564b',
'6': '#e377c2',
'7': '#b5bd61',
'8': '#17becf',
'9': '#aec7e8',
'10': '#ffbb78'}
There are a number of parameters available here for us to customise the settings. Note that when drawing stacking_vol with omicverse version less than 1.4.13, there is a bug that the vertical coordinate is constant at [-15,15], so we have added some code in this tutorial for visualisation.
data_dict
: dict, in each key, there is a dataframe with columns of ['logfoldchanges','pvals_adj','names']color_dict
: dict, in each key, there is a color for each omicpval_threshold
: float, pvalue threshold for significant geneslog2fc_threshold
: float, log2fc threshold for significant genesfigsize
: tuple, figure sizesig_color
: str, color for significant genesnormal_color
: str, color for non-significant genesplot_genes_num
: int, number of genes to plotplot_genes_fontsize
: int, fontsize for gene namesplot_genes_weight
: str, weight for gene names
fig,axes=ov.utils.stacking_vol(data_dict,type_color_dict,
pval_threshold=0.01,
log2fc_threshold=2,
figsize=(8,4),
sig_color='#a51616',
normal_color='#c7c7c7',
plot_genes_num=2,
plot_genes_fontsize=6,
plot_genes_weight='bold',
)
#The following code will be removed in future
y_min,y_max=0,0
for i in data_dict.keys():
y_min=min(y_min,data_dict[i]['logfoldchanges'].min())
y_max=max(y_max,data_dict[i]['logfoldchanges'].max())
for i in adata.obs['leiden'].cat.categories:
axes[i].set_ylim(y_min,y_max)
plt.suptitle('Stacking_vol',fontsize=12)
0 2
2 4
4 6
6 8
8 10
10 12
12 14
14 16
16 18
18 20
20 22
Text(0.5, 0.98, 'Stacking_vol')