Tutorial:Single-cell RNA-seq: Preprocessing: Preprocessing the data of scRNA-seq with omicverse-2
0
0
Entering edit mode
10 months ago
Julia Ma ▴ 120

Part-2 here: Single-cell RNA-seq: Preprocessing: Data integration and batch correction

Full article lifted from: https://omicverse.readthedocs.io/en/latest/Tutorials-single/t_preprocess/


Embedding the neighborhood graph

We suggest embedding the graph in two dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preserves trajectories. In some occasions, you might still observe disconnected clusters and similar connectivity violations. They can usually be remedied by running:

sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50,
               use_rep='scaled|original|X_pca')

computing neighbors

OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.

    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:01)

To visualize the PCA's embeddings, we use the pymde package wrapper in omicverse. This is an alternative to UMAP that is GPU-accelerated.

adata.obsm["X_mde"] = ov.utils.mde(adata.obsm["scaled|original|X_pca"])
adata

AnnData object with n_obs × n_vars = 2603 × 2000
    obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes'
    var: 'gene_ids', 'mt', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
    uns: 'scrublet', 'layers_counts', 'log1p', 'hvg', 'scaled|original|pca_var_ratios', 'scaled|original|cum_sum_eigenvalues', 'neighbors'
    obsm: 'scaled|original|X_pca', 'X_pca', 'X_mde'
    varm: 'scaled|original|pca_loadings'
    layers: 'counts', 'scaled', 'lognorm'
    obsp: 'distances', 'connectivities'

ov.utils.embedding(adata,
                basis='X_mde',
                color='CST3',
                frameon='small')

enter image description here

You also can use umap to visualize the neighborhood graph

sc.tl.umap(adata)

computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:02)

ov.utils.embedding(adata,
                basis='X_umap',
                color='CST3',
                frameon='small')

enter image description here

Clustering the neighborhood graph

As with Seurat and many other frameworks, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag et al. (2018). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.

sc.tl.leiden(adata)

running Leiden clustering
    finished: found 11 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)

We redesigned the visualisation of embedding to distinguish it from scanpy's embedding by adding the parameter fraemon='small', which causes the axes to be scaled with the colourbar

ov.utils.embedding(adata,
                basis='X_mde',
                color=['leiden', 'CST3', 'NKG7'],
                frameon='small')

enter image description here

We also provide a boundary visualisation function ov.utils.plot_ConvexHull to visualise specific clusters.

Arguments:

  • color: if None will use the color of clusters
  • alpha: default is 0.2
import matplotlib.pyplot as plt
fig,ax=plt.subplots( figsize = (4,4))

ov.utils.embedding(adata,
                basis='X_mde',
                color=['leiden'],
                frameon='small',
                show=False,
                ax=ax)

ov.utils.plot_ConvexHull(adata,
                basis='X_mde',
                cluster_key='leiden',
                hull_cluster='0',
                ax=ax)

leiden_colors

<AxesSubplot: title={'center': 'leiden'}, xlabel='X_mde1', ylabel='X_mde2'>

enter image description here

If you have too many labels, e.g. too many cell types, and you are concerned about cell overlap, then consider trying the ov.utils.gen_mpl_labels function, which improves text overlap.

In addition, we make use of the patheffects function, which makes our text have outlines

  • adjust_kwargs: it could be found in package adjusttext
  • text_kwargs: it could be found in class plt.texts
from matplotlib import patheffects
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4,4))

ov.utils.embedding(adata,
                  basis='X_mde',
                  color=['leiden'],
                   show=False, legend_loc=None, add_outline=False, 
                   frameon='small',legend_fontoutline=2,ax=ax
                 )

ov.utils.gen_mpl_labels(
    adata,
    'leiden',
    exclude=("None",),  
    basis='X_mde',
    ax=ax,
    adjust_kwargs=dict(arrowprops=dict(arrowstyle='-', color='black')),
    text_kwargs=dict(fontsize= 12 ,weight='bold',
                     path_effects=[patheffects.withStroke(linewidth=2, foreground='w')] ),
)

enter image description here

marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
                'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
                'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']

sc.pl.dotplot(adata, marker_genes, groupby='leiden',
             standard_scale='var');

enter image description here

Finding marker genes

Let us compute a ranking for the highly differential genes in each cluster. For this, by default, the .raw attribute of AnnData is used in case it has been initialized before. The simplest and fastest method to do so is the t-test.

sc.tl.dendrogram(adata,'leiden',use_rep='scaled|original|X_pca')
sc.tl.rank_genes_groups(adata, 'leiden', use_rep='scaled|original|X_pca',
                        method='t-test',use_raw=False,key_added='leiden_ttest')
sc.pl.rank_genes_groups_dotplot(adata,groupby='leiden',
                                cmap='Spectral_r',key='leiden_ttest',
                                standard_scale='var',n_genes=3)

Storing dendrogram info using `.uns['dendrogram_leiden']`
ranking genes
    finished: added to `.uns['leiden_ttest']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)

enter image description here

cosg is also considered to be a better algorithm for finding marker genes. Here, omicverse provides the calculation of cosg

Paper: Accurate and fast cell marker gene identification with COSG

Code: https://github.com/genecell/COSG

sc.tl.rank_genes_groups(adata, groupby='leiden', 
                        method='t-test',use_rep='scaled|original|X_pca',)
ov.single.cosg(adata, key_added='leiden_cosg', groupby='leiden')
sc.pl.rank_genes_groups_dotplot(adata,groupby='leiden',
                                cmap='Spectral_r',key='leiden_cosg',
                                standard_scale='var',n_genes=3)

ranking genes
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)
finished identifying marker genes by COSG

enter image description here

Other plotting

Next, let's try another chart, which we call the Stacked Volcano Chart. We need to prepare two dictionaries, a data_dict and a color_dict, both of which have the same key requirements.

For data_dict. we require the contents within each key to be a DataFrame containing ['names','logfoldchanges','pvals_adj'], where names stands for gene names, logfoldchanges stands for differential expression multiplicity, pvals_adj stands for significance p-value

data_dict={}
for i in adata.obs['leiden'].cat.categories:
    data_dict[i]=sc.get.rank_genes_groups_df(adata, group=i, key='leiden_ttest',
                                            pval_cutoff=None,log2fc_min=None)

data_dict.keys()

dict_keys(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10'])

data_dict[i].head()
names scores logfoldchanges pvals pvals_adj
0 PF4 92.918243 17.017431 4.248822e-15 7.122920e-15
1 PPBP 85.572166 17.316359 6.224002e-15 1.041674e-14
2 SDPR 72.907913 16.093300 5.214614e-14 8.662149e-14
3 GNG11 72.813034 16.640303 5.858609e-14 9.723832e-14
4 GPX1 62.821903 9.493515 2.581554e-24 4.875457e-24

For color_dict, we require that the colour to be displayed for the current key is stored within each key.

type_color_dict=dict(zip(adata.obs['leiden'].cat.categories,
                         adata.uns['leiden_colors']))
type_color_dict

{'0': '#1f77b4',
 '1': '#ff7f0e',
 '2': '#279e68',
 '3': '#d62728',
 '4': '#aa40fc',
 '5': '#8c564b',
 '6': '#e377c2',
 '7': '#b5bd61',
 '8': '#17becf',
 '9': '#aec7e8',
 '10': '#ffbb78'}

There are a number of parameters available here for us to customise the settings. Note that when drawing stacking_vol with omicverse version less than 1.4.13, there is a bug that the vertical coordinate is constant at [-15,15], so we have added some code in this tutorial for visualisation.

  • data_dict: dict, in each key, there is a dataframe with columns of ['logfoldchanges','pvals_adj','names']
  • color_dict: dict, in each key, there is a color for each omic
  • pval_threshold: float, pvalue threshold for significant genes
  • log2fc_threshold: float, log2fc threshold for significant genes
  • figsize: tuple, figure size
  • sig_color: str, color for significant genes
  • normal_color: str, color for non-significant genes
  • plot_genes_num: int, number of genes to plot
  • plot_genes_fontsize: int, fontsize for gene names
  • plot_genes_weight: str, weight for gene names
fig,axes=ov.utils.stacking_vol(data_dict,type_color_dict,
            pval_threshold=0.01,
            log2fc_threshold=2,
            figsize=(8,4),
            sig_color='#a51616',
            normal_color='#c7c7c7',
            plot_genes_num=2,
            plot_genes_fontsize=6,
            plot_genes_weight='bold',
            )

#The following code will be removed in future
y_min,y_max=0,0
for i in data_dict.keys():
    y_min=min(y_min,data_dict[i]['logfoldchanges'].min())
    y_max=max(y_max,data_dict[i]['logfoldchanges'].max())
for i in adata.obs['leiden'].cat.categories:
    axes[i].set_ylim(y_min,y_max)
plt.suptitle('Stacking_vol',fontsize=12)   

0 2
2 4
4 6
6 8
8 10
10 12
12 14
14 16
16 18
18 20
20 22

Text(0.5, 0.98, 'Stacking_vol')

enter image description here

scRNA-seq • 618 views
ADD COMMENT

Login before adding your answer.

Traffic: 1651 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6