Question

How to determine n_cells_by_counts

18 months ago

dalibenam64 • 0

Hello, I followed this tutorial (https://github.com/mousepixels/sanbomics_scripts/blob/main/single_cell_analysis_complete_class.ipynb ) to analyze single-cell RNA-seq data with scanpy. For the first step, data filtering, I used this script:

import os

import scanpy as sc
import scvi

# ribo_genes must already be defined before pp() is called: the code below
# expects a DataFrame whose column 0 holds the ribosomal gene names.

def pp(csv_path):
    # Doublet detection with SOLO on the 2000 most variable genes
    adata = sc.read_csv(csv_path).T
    sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True, flavor='seurat_v3')
    scvi.model.SCVI.setup_anndata(adata)
    vae = scvi.model.SCVI(adata)
    vae.train()
    solo = scvi.external.SOLO.from_scvi_model(vae)
    solo.train()
    df = solo.predict()
    df['prediction'] = solo.predict(soft=False)
    # drop the last two characters of each index label so they match adata.obs.index
    df.index = df.index.map(lambda x: x[:-2])
    df['dif'] = df.doublet - df.singlet
    doublets = df[(df.prediction == 'doublet') & (df.dif > 1)]

    # Reload the full matrix and drop the predicted doublets
    adata = sc.read_csv(csv_path).T
    adata.obs['Sample'] = csv_path.split('_')[2]  # 'raw_counts/GSM5226574_C51ctr_raw_counts.csv' -> 'C51ctr'

    adata.obs['doublet'] = adata.obs.index.isin(doublets.index)
    adata = adata[~adata.obs.doublet].copy()  # copy so the .var assignments below do not write to a view

    #sc.pp.filter_genes(adata, min_cells=3) #get rid of genes that are found in fewer than 3 cells
    adata.var['mt'] = adata.var_names.str.startswith('mt-')  # annotate the group of mitochondrial genes as 'mt'
    adata.var['ribo'] = adata.var_names.isin(ribo_genes[0].values)
    sc.pp.calculate_qc_metrics(adata, qc_vars=['mt', 'ribo'], percent_top=None, log1p=False, inplace=True)
    return adata

out = []
for file in os.listdir('raw_counts/'):
    out.append(pp('raw_counts/' + file))

My question is how to add a column containing n_cells_by_counts. When I type adata.obs, I get a table that only includes:

Sample
doublet
n_genes_by_counts
total_counts
total_counts_mt
pct_counts_mt
total_counts_ribo
pct_counts_ribo
total_counts_hb
pct_counts_hb

If I'm not mistaken, we need both n_cells_by_counts and n_genes_by_counts for data filtering. In other words, we also have to look at the distribution of these parameters in each sample in order to apply filters such as adata.obs['n_genes_by_counts'] < xxxxx

scanpy

Shouldn't n_cells_by_counts be in adata.var? Or am I misunderstanding your question?
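To make the distinction concrete, here is a minimal sketch with plain NumPy on a made-up count matrix (the values and variable names are illustrative only, not taken from the tutorial). It assumes the usual AnnData layout, with cells as rows and genes as columns: n_genes_by_counts is one number per cell, while n_cells_by_counts is one number per gene.

```python
import numpy as np

# Toy count matrix (assumed layout: rows = cells / obs, columns = genes / var).
counts = np.array([
    [3, 0, 1, 0],
    [0, 0, 2, 5],
    [1, 0, 0, 0],
])

detected = counts > 0

# Per cell: number of genes with at least one count (belongs in adata.obs).
n_genes_by_counts = detected.sum(axis=1)

# Per gene: number of cells with at least one count (belongs in adata.var).
n_cells_by_counts = detected.sum(axis=0)

print(n_genes_by_counts)  # one value per cell
print(n_cells_by_counts)  # one value per gene
```

Since n_cells_by_counts has one entry per gene, it cannot be a column of adata.obs, which has one row per cell.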

Also, you can filter genes and cells with sc.pp.filter_genes and sc.pp.filter_cells. By removing doublets you are already filtering on n_genes_by_counts to some extent.

16 months ago by yl759 ▴ 120