Hello, I followed this tutorial (https://github.com/mousepixels/sanbomics_scripts/blob/main/single_cell_analysis_complete_class.ipynb) to analyze single-cell RNA-seq data with scanpy. For the first step, data filtering, I applied this script:
import os

import scanpy as sc
import scvi

# ribo_genes (a table of ribosomal gene names) is assumed to be loaded beforehand, as in the tutorial

def pp(csv_path):
    # Train scVI and SOLO on the 2000 most variable genes to predict doublets
    adata = sc.read_csv(csv_path).T
    sc.pp.highly_variable_genes(adata, n_top_genes = 2000, subset = True, flavor = 'seurat_v3')
    scvi.model.SCVI.setup_anndata(adata)
    vae = scvi.model.SCVI(adata)
    vae.train()
    solo = scvi.external.SOLO.from_scvi_model(vae)
    solo.train()
    df = solo.predict()
    df['prediction'] = solo.predict(soft = False)
    df.index = df.index.map(lambda x: x[:-2])
    df['dif'] = df.doublet - df.singlet
    doublets = df[(df.prediction == 'doublet') & (df.dif > 1)]

    # Reload the full matrix and drop the predicted doublets
    adata = sc.read_csv(csv_path).T
    adata.obs['Sample'] = csv_path.split('_')[2] #'raw_counts/GSM5226574_C51ctr_raw_counts.csv'
    adata.obs['doublet'] = adata.obs.index.isin(doublets.index)
    adata = adata[~adata.obs.doublet]

    #sc.pp.filter_genes(adata, min_cells=3) #get rid of genes that are found in fewer than 3 cells
    adata.var['mt'] = adata.var_names.str.startswith('mt-') # annotate the group of mitochondrial genes as 'mt'
    adata.var['ribo'] = adata.var_names.isin(ribo_genes[0].values)
    sc.pp.calculate_qc_metrics(adata, qc_vars=['mt', 'ribo'], percent_top=None, log1p=False, inplace=True)
    return adata

# Process every file in raw_counts/ and collect the resulting AnnData objects
out = []
for file in os.listdir('raw_counts/'):
    out.append(pp('raw_counts/' + file))
My question is how to add a column containing n_cells_by_counts, because when I type adata.obs I get a table that only includes information on:
Sample
doublet
n_genes_by_counts
total_counts
total_counts_mt
pct_counts_mt
total_counts_ribo
pct_counts_ribo
total_counts_hb
pct_counts_hb
If I'm not wrong, we need both n_cells_by_counts and n_genes_by_counts for data filtering. In other words, we also have to look at the distribution of these parameters in each sample in order to apply filters such as adata.obs['n_genes_by_counts'] < xxxxx
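
To make the per-sample check concrete, here is a minimal sketch that summarizes n_genes_by_counts per sample and applies a cutoff. It uses a made-up pandas table standing in for adata.obs; the sample names, the values, and the max_genes threshold are all hypothetical.

```python
import pandas as pd

# Made-up stand-in for adata.obs after combining several samples
obs = pd.DataFrame(
    {
        "Sample": ["sampleA", "sampleA", "sampleA", "sampleB", "sampleB", "sampleB"],
        "n_genes_by_counts": [800, 1200, 6500, 300, 900, 1500],
    },
    index=[f"cell{i}" for i in range(6)],
)

# Per-sample summary (count, mean, quartiles, max) to help pick a cutoff
summary = obs.groupby("Sample")["n_genes_by_counts"].describe()
print(summary)

# Hypothetical upper cutoff; choose it from the distributions above
max_genes = 5000
keep = obs["n_genes_by_counts"] < max_genes
filtered = obs[keep]
print(filtered["Sample"].value_counts())
```

On a real AnnData object the same boolean mask can be used for subsetting, for example adata[adata.obs['n_genes_by_counts'] < max_genes].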
Shouldn't n_cells_by_counts be in adata.var? Or am I misunderstanding your question? Also, you can filter genes and cells with sc.pp.filter_genes and sc.pp.filter_cells. You're kind of filtering n_genes_by_counts by removing doublets already.